8 harnesses
Evaluation framework designed specifically to measure the quality of RAG pipelines.
Microsoft's toolkit for designing, evaluating, and deploying LLM-powered workflows.
Industry-standard benchmark harness for evaluating language models across hundreds of tasks.
Pytest-style evaluation framework for LLM apps with metrics for hallucinations, faithfulness, and more.
Open-source evaluation and observability framework for LLM apps and RAG pipelines.
OpenAI's HumanEval benchmark for measuring code generation correctness.
Stanford's holistic LLM evaluation framework spanning accuracy, robustness, fairness, and more.
Rigorous code benchmark that extends HumanEval and MBPP with significantly more test cases.