Benchmark Leaderboard
Compare AI agent harnesses by real-world benchmark scores.
| Rank | Harness | Benchmark | Category | Score | Model | Date |
|---|---|---|---|---|---|---|
| 1 | EleutherAI/lm-evaluation-harness | GSM8K | Eval Harnesses | 9420.0% | claude-sonnet-4-5 | 2024-10 |
| 2 | vllm-project/vLLM | MT-Bench | Data Pipeline | 8860.0% | gpt-4o | 2024-10 |
| 3 | microsoft/AutoGen | HumanEval | Multi-Agent | 8850.0% | gpt-4o | 2024-07 |
| 4 | run-llama/llama_index | RAGAS | RAG Frameworks | 7840.0% | gpt-4o | 2024-09 |
| 5 | openai/OpenAI Agents SDK | AgentBench | Multi-Agent | 7580.0% | gpt-4o | 2024-11 |
| 6 | microsoft/AutoGen | AgentBench | Multi-Agent | 7240.0% | gpt-4o | 2024-07 |
| 7 | ollama/Ollama | MT-Bench | Tool-Use Wrappers | 7210.0% | llama-3.1-70b | 2024-09 |
| 8 | langchain-ai/langchain | ToolBench | RAG Frameworks | 7020.0% | claude-sonnet-4-5 | 2024-11 |
| 9 | Aider-AI/aider | Aider Polyglot | Coding Agents | 6540.0% | claude-sonnet-4-5 | 2024-12 |
| 10 | langchain-ai/langchain | AgentBench | RAG Frameworks | 6450.0% | gpt-4o | 2024-08 |
| 11 | browser-use/browser-use | Mind2Web | Browser Agents | 6280.0% | claude-sonnet-4-5 | 2024-12 |
| 12 | browser-use/browser-use | WebArena | Browser Agents | 5530.0% | claude-sonnet-4-5 | 2025-01 |
| 13 | run-llama/llama_index | BEIR | RAG Frameworks | 5270.0% | gpt-4o | 2024-07 |
| 14 | browser-use/browser-use | WebArena | Browser Agents | 4720.0% | gpt-4o | 2024-10 |
| 15 | openai/OpenAI Agents SDK | SWE-bench Verified | Multi-Agent | 2850.0% | gpt-4o | 2024-12 |
| 16 | cline/cline | SWE-bench Lite | Coding Agents | 2580.0% | claude-opus-4 | 2025-01 |
| 17 | princeton-nlp/SWE-agent | SWE-bench Verified | Coding Agents | 2370.0% | claude-sonnet-4-5 | 2024-10 |
| 18 | cline/cline | SWE-bench Verified | Coding Agents | 2140.0% | claude-sonnet-4-5 | 2024-12 |
| 19 | Aider-AI/aider | SWE-bench Verified | Coding Agents | 1890.0% | claude-sonnet-4-5 | 2024-11 |
| 20 | EleutherAI/lm-evaluation-harness | HellaSwag | Eval Harnesses | 87.3% | llama-3-70b | 2025-02 |
| 21 | continuedev/continue | HumanEval | Coding Agents | 81.2% | claude-sonnet-4-6 | 2025-03 |
| 22 | EleutherAI/lm-evaluation-harness | MMLU | Eval Harnesses | 79.2% | llama-3-70b | 2025-02 |
| 23 | princeton-nlp/SWE-agent | SWE-bench Verified | Coding Agents | 67.8% | claude-sonnet-4-6 | 2026-04 |
| 24 | gpt-engineer-org/gpt-engineer | HumanEval | Coding Agents | 67.4% | gpt-4o | 2025-02 |
| 25 | Aider-AI/aider | SWE-bench Verified | Coding Agents | 49.1% | claude-sonnet-4-6 | 2026-03 |
| 26 | assafelovic/gpt-researcher | GAIA | Research Agents | 42.1% | gpt-4o | 2025-03 |
| 27 | princeton-nlp/SWE-agent | SWE-bench Lite | Coding Agents | 41.6% | claude-sonnet-4-6 | 2026-04 |
| 28 | OpenBMB/XAgent | AgentBench | Research Agents | 29.8% | gpt-4o | 2025-02 |
| 29 | lavague-ai/LaVague | WebArena | Browser Agents | 21.8% | gpt-4o | 2025-04 |