assert response == expected — doesn’t work with LLMs. The answer is different every time. We need a new testing paradigm.
New Approaches¶
Property-based testing: Test properties, not exact output. Metamorphic testing: A small change in input must not change the facts. LLM-as-judge: GPT-4 evaluates based on a rubric.
Evaluation Pipeline¶
- Golden dataset: 100+ pairs
- Automatic run on every PR
- Metrics: faithfulness, relevance, toxicity
- Regression detection: alert on >5% drop
Red Teaming¶
Automated adversarial testing: prompt injection, jailbreak, PII leakage. In CI, not as a one-off.
AI Testing Is Software Testing 2.0¶
Property-based tests + LLM-as-judge + evaluation pipeline = production-ready.
Need help with implementation?
Our experts can help with design, implementation, and operations. From architecture to production.
Contact us