HarnessHub

The npm of AI Agent Harnesses.

© 2026 HarnessHub · Built for the AI agent community.

Explore Harnesses

8 harnesses

explodinggradients · Verified · Apache-2.0

ragas

Evaluation framework designed specifically to measure the quality of RAG pipelines.

Eval Harnesses · python · rag · evaluation · faithfulness
14.2k stars · 34.5k downloads · v0.2.6

microsoft · Verified · MIT

promptflow

Microsoft's toolkit for designing, evaluating, and deploying LLM-powered workflows.

Eval Harnesses · python · azure · workflow · visual
11.1k stars · 7.3k downloads · v1.16.0

EleutherAI · Verified · MIT

lm-evaluation-harness

Industry-standard benchmark harness for evaluating language models across hundreds of tasks.

Eval Harnesses · python · academic · leaderboard · huggingface
8.9k stars · 15.7k downloads · v0.4.5

confident-ai · Verified · Apache-2.0

deepeval

Pytest-style evaluation framework for LLM apps with metrics for hallucinations, faithfulness, and more.

Eval Harnesses · python · testing · pytest · metrics
4.8k stars · 11.2k downloads · v2.0.9

truera · Verified · MIT

trulens

Open-source evaluation and observability framework for LLM apps and RAG pipelines.

Eval Harnesses · python · observability · tracing · rag-evaluation
3.4k stars · 4.1k downloads · v1.2.5

openai · Verified · MIT

human-eval

OpenAI's HumanEval benchmark for measuring code generation correctness.

Eval Harnesses · python · benchmark · code-generation · humaneval
3.2k stars · 15.2k downloads · v1.0.0

stanford-crfm · Verified · Apache-2.0

HELM

Stanford's holistic LLM evaluation framework spanning accuracy, robustness, fairness, and more.

Eval Harnesses · python · benchmark · holistic · stanford
1.9k stars · 9.4k downloads · v0.5.0

evalplus · Verified · Apache-2.0

EvalPlus

Rigorous code benchmark that extends HumanEval and MBPP with significantly more test cases.

Eval Harnesses · python · benchmark · humaneval · mbpp
1.6k stars · 6.8k downloads · v0.3.1