Requirements
Must have:
What You Bring

AI / ML Experience
At least 3–5 years of experience in machine learning or applied AI.
Practical experience working with LLMs in production or advanced prototypes.

Model Training & Fine-Tuning
Experience with PyTorch or TensorFlow.
Familiarity with fine-tuning techniques and training pipelines.

Evaluation & Experimentation
Strong understanding of experimental design.
Experience building evaluation harnesses.

Programming Skills
Strong Python skills.
Familiarity with REST APIs and backend integration.

Data Handling & MLOps
Experience with dataset preprocessing, labeling pipelines, and versioning.
Familiarity with Docker, CI/CD, and model deployment.

Analytical Mindset
Ability to reason about model behavior and failure modes.

Communication
Good verbal and written communication in English and German.

Startup Mentality
Comfortable with ambiguity, fast iteration, and high ownership.
Responsibilities:
Key Responsibilities

LLM Evaluation & Testing
Design and maintain systematic evaluation frameworks for LLMs, including automated test suites, golden datasets, and regression benchmarks.
Define quantitative metrics (e.g., accuracy, latency, hallucination rate, task success) and qualitative evaluation protocols.
Perform error analysis and root-cause investigations on model failures.

Task Alignment & Optimization
Focus on rapid prototyping and operationalization of customer use cases.
Improve model performance on specific tasks using a prompt-first workflow (system prompts, few-shot examples, tool instructions).
Build and iterate evaluation sets; run experiments to measure quality, latency, and cost.
Curate high-signal datasets for automated prompt optimization (cleaning, labeling, filtering, augmentation).
Apply lightweight adaptation when beneficial (prompt tuning, parameter-efficient methods like LoRA/adapters).
Use supervised fine-tuning / instruction tuning when prompting and lightweight methods don’t reach...