HarnessHub

The npm of AI Agent Harnesses.

© 2026 HarnessHub · Built for the AI agent community.

Benchmark Leaderboard

Compare AI agent harnesses by real-world benchmark scores.

| Rank | Organization | Harness | Benchmark | Category | Score | Model | Date |
|---|---|---|---|---|---|---|---|
| 1 | EleutherAI | lm-evaluation-harness | GSM8K | Eval Harnesses | 9420.0% | claude-sonnet-4-5 | 2024-10 |
| 2 | vllm-project | vLLM | MT-Bench | Data Pipeline | 8860.0% | gpt-4o | 2024-10 |
| 3 | microsoft | AutoGen | HumanEval | Multi-Agent | 8850.0% | gpt-4o | 2024-07 |
| 4 | run-llama | llama_index | RAGAS | RAG Frameworks | 7840.0% | gpt-4o | 2024-09 |
| 5 | openai | OpenAI Agents SDK | AgentBench | Multi-Agent | 7580.0% | gpt-4o | 2024-11 |
| 6 | microsoft | AutoGen | AgentBench | Multi-Agent | 7240.0% | gpt-4o | 2024-07 |
| 7 | ollama | Ollama | MT-Bench | Tool-Use Wrappers | 7210.0% | llama-3.1-70b | 2024-09 |
| 8 | langchain-ai | langchain | ToolBench | RAG Frameworks | 7020.0% | claude-sonnet-4-5 | 2024-11 |
| 9 | Aider-AI | aider | Aider Polyglot | Coding Agents | 6540.0% | claude-sonnet-4-5 | 2024-12 |
| 10 | langchain-ai | langchain | AgentBench | RAG Frameworks | 6450.0% | gpt-4o | 2024-08 |
| 11 | browser-use | browser-use | Mind2Web | Browser Agents | 6280.0% | claude-sonnet-4-5 | 2024-12 |
| 12 | browser-use | browser-use | WebArena | Browser Agents | 5530.0% | claude-sonnet-4-5 | 2025-01 |
| 13 | run-llama | llama_index | BEIR | RAG Frameworks | 5270.0% | gpt-4o | 2024-07 |
| 14 | browser-use | browser-use | WebArena | Browser Agents | 4720.0% | gpt-4o | 2024-10 |
| 15 | openai | OpenAI Agents SDK | SWE-bench Verified | Multi-Agent | 2850.0% | gpt-4o | 2024-12 |
| 16 | cline | cline | SWE-bench Lite | Coding Agents | 2580.0% | claude-opus-4 | 2025-01 |
| 17 | princeton-nlp | SWE-agent | SWE-bench Verified | Coding Agents | 2370.0% | claude-sonnet-4-5 | 2024-10 |
| 18 | cline | cline | SWE-bench Verified | Coding Agents | 2140.0% | claude-sonnet-4-5 | 2024-12 |
| 19 | Aider-AI | aider | SWE-bench Verified | Coding Agents | 1890.0% | claude-sonnet-4-5 | 2024-11 |
| 20 | EleutherAI | lm-evaluation-harness | HellaSwag | Eval Harnesses | 87.3% | llama-3-70b | 2025-02 |
| 21 | continuedev | continue | HumanEval | Coding Agents | 81.2% | claude-sonnet-4-6 | 2025-03 |
| 22 | EleutherAI | lm-evaluation-harness | MMLU | Eval Harnesses | 79.2% | llama-3-70b | 2025-02 |
| 23 | princeton-nlp | SWE-agent | SWE-bench Verified | Coding Agents | 67.8% | claude-sonnet-4-6 | 2026-04 |
| 24 | gpt-engineer-org | gpt-engineer | HumanEval | Coding Agents | 67.4% | gpt-4o | 2025-02 |
| 25 | Aider-AI | aider | SWE-bench Verified | Coding Agents | 49.1% | claude-sonnet-4-6 | 2026-03 |
| 26 | assafelovic | gpt-researcher | GAIA | Research Agents | 42.1% | gpt-4o | 2025-03 |
| 27 | princeton-nlp | SWE-agent | SWE-bench Lite | Coding Agents | 41.6% | claude-sonnet-4-6 | 2026-04 |
| 28 | OpenBMB | XAgent | AgentBench | Research Agents | 29.8% | gpt-4o | 2025-02 |
| 29 | lavague-ai | LaVague | WebArena | Browser Agents | 21.8% | gpt-4o | 2025-04 |