Explore Harnesses

8 harnesses

explodinggradients · Verified · Apache-2.0

ragas

Evaluation framework designed to measure the quality of RAG pipelines.

Eval Harnesses · python, rag, evaluation, faithfulness

14.2k stars · 34.5k downloads · v0.2.6

microsoft · Verified · MIT

promptflow

Microsoft's toolkit for designing, evaluating, and deploying LLM-powered workflows.

Eval Harnesses · python, azure, workflow, visual

11.1k stars · 7.3k downloads · v1.16.0

EleutherAI · Verified · MIT

lm-evaluation-harness

Industry-standard benchmark harness for evaluating language models across hundreds of tasks.

Eval Harnesses · python, academic, leaderboard, huggingface

8.9k stars · 15.7k downloads · v0.4.5

confident-ai · Verified · Apache-2.0

deepeval

Pytest-style evaluation framework for LLM apps, with metrics for hallucination, faithfulness, and more.

Eval Harnesses · python, testing, pytest, metrics

4.8k stars · 11.2k downloads · v2.0.9

truera · Verified · MIT

trulens

Open-source evaluation and observability framework for LLM apps and RAG pipelines.

Eval Harnesses · python, observability, tracing, rag-evaluation

3.4k stars · 4.1k downloads · v1.2.5

openai · Verified · MIT

human-eval

OpenAI's HumanEval benchmark for measuring code generation correctness.

Eval Harnesses · python, benchmark, code-generation, humaneval

3.2k stars · 15.2k downloads · v1.0.0

stanford-crfm · Verified · Apache-2.0

HELM

Stanford's holistic LLM evaluation framework spanning accuracy, robustness, fairness, and more.

Eval Harnesses · python, benchmark, holistic, stanford

1.9k stars · 9.4k downloads · v0.5.0

evalplus · Verified · Apache-2.0

EvalPlus

Rigorous code-generation benchmark that extends HumanEval and MBPP with significantly more test cases.

Eval Harnesses · python, benchmark, humaneval, mbpp

1.6k stars · 6.8k downloads · v0.3.1
