
Introducing EthIQ

March 5, 2026 · 5 min read

If you ask an AI model about consensus constants, EVM execution, or state transitions, does it actually know what it's talking about? We built EthIQ, a benchmark that measures how well AI models understand Ethereum protocol internals.

Models are tested in two modes: API (direct API calls with a system prompt, no tools) and Agentic (CLI tools like Claude Code and Codex running in sandboxed Docker containers with bash, file I/O, and Node.js).

Motivation

LLMs are now a daily tool at ethPandaOps, and we needed a way to evaluate them in our specific context. EthIQ gives us a quick signal when a new model is released, and a meaningful eval suite to drive improvements as we invest more in agentic tooling (prompt optimization, fine-tuning, workflows, etc.).

Questions

The question set is designed to stress two distinct model capabilities: world knowledge (tested by categories like Constants) and raw reasoning (tested by auto-generated categories like EVM Execution). World knowledge questions are intentional. We want to explicitly probe what Ethereum protocol knowledge made it into a model's training data, since recall of these values is genuinely useful.

To prevent memorization in the raw reasoning tasks, auto-generated questions start from official Ethereum spec test fixtures but mutate the inputs (balances, storage values, calldata, deposit amounts, etc.) using a randomized seed. Ground-truth answers are then re-derived from the mutated inputs using the Python reference implementations. If a model merely memorized the original fixtures during training, its recalled answers won't match the mutated questions. New forks get a fresh seed, keeping the benchmark honest over time.
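The mutate-then-re-derive idea can be sketched roughly as follows. This is a minimal illustration, not EthIQ's actual generator: the fixture layout, the `mutate_fixture` helper, and the `reference_answer` stand-in for the Python reference implementation are all hypothetical.

```python
import copy
import random

GWEI_CAP = 32 * 10**9  # illustrative upper bound for mutated amounts (assumption)


def mutate_fixture(fixture: dict, seed: int) -> dict:
    """Return a copy of `fixture` whose numeric inputs are replaced by seeded random values."""
    rng = random.Random(seed)  # same seed -> same mutated question set
    mutated = copy.deepcopy(fixture)
    mutated["balances"] = [rng.randrange(1, GWEI_CAP) for _ in fixture["balances"]]
    mutated["deposit_amount"] = rng.randrange(1, GWEI_CAP)
    return mutated


def reference_answer(fixture: dict) -> int:
    """Hypothetical stand-in for the reference implementation: credit a deposit to validator 0."""
    return fixture["balances"][0] + fixture["deposit_amount"]


# Hypothetical "original" fixture that a model might have seen during training.
original = {"balances": [32 * 10**9, 31 * 10**9], "deposit_amount": 10**9}

mutated = mutate_fixture(original, seed=2026)
question = {"inputs": mutated, "expected": reference_answer(mutated)}
```

Because the expected answer is computed from the mutated inputs rather than copied from the fixture, recalling the original fixture's answer does not help. Reusing the same seed reproduces the same question set.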

Questions are organized into datasets tied to Ethereum forks. The first dataset is fusaka, with 325 questions across these categories:

Category | What it asks | Example
--- | --- | ---
Constants | Exact protocol constant values | What is the value of SLOTS_PER_EPOCH on Mainnet?
EVM execution | Trace through bytecode, report the outcome | Given 0x..., what's in storage slot 0x1?
Consensus state transitions | Apply slashings, deposits, etc. and compute resulting state | After this attester slashing, what is validator 6's balance?
Consensus epoch processing | Calculate rewards, penalties, and balance deltas | What are the reward/penalty deltas after processing this epoch?
Consensus fork choice | Replay block trees and determine the canonical head | After these attestations, which block is the head?
Consensus shuffling | Compute validator committee assignments | Which validators are in committee index 2 at slot 4?
Calculations | Multi-step arithmetic using protocol constants | How many seconds are in one Ethereum epoch?
Conceptual | Open-ended explanations graded by LLM rubric | Explain how RANDAO bias works in validator shuffling
Cross-fork | What changed between forks | What changed about max effective balance in Electra?
EIP interactions | How specific EIPs interact with each other | How do EIP-4844 blob commitments appear in beacon blocks?
Trick | Questions with wrong premises | "When did Verkle Trees ship on mainnet?" (They didn't.)
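As a worked instance of the Calculations category, the epoch-length example combines recall of two mainnet constants with one multiplication. The helper name below is purely illustrative; the constant values are the mainnet ones.

```python
# Mainnet consensus constants (world knowledge the model has to recall).
SLOTS_PER_EPOCH = 32
SECONDS_PER_SLOT = 12


def seconds_per_epoch(slots_per_epoch: int = SLOTS_PER_EPOCH,
                      seconds_per_slot: int = SECONDS_PER_SLOT) -> int:
    """Multiply the recalled constants to get the epoch duration in seconds."""
    return slots_per_epoch * seconds_per_slot


print(seconds_per_epoch())  # 384 seconds, i.e. 6.4 minutes
```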

The Results

EthIQ performance chart showing model pass rates over time

Figure 1: Model performance by release date on the fusaka dataset (325 questions). Whiskers show 95% confidence intervals.
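For readers who want to reproduce error bars like these, one common choice for a pass rate is the Wilson score interval. The post does not say which method the chart uses, so treat this as an assumed sketch rather than EthIQ's actual calculation; the 252-of-325 figure is only an illustrative count close to a 77.5% pass rate.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Illustrative only: 252 passes out of 325 questions.
low, high = wilson_interval(252, 325)
print(f"{low:.3f} to {high:.3f}")
```

With only 325 questions, an interval like this spans several percentage points, so small gaps between neighbouring models should be read with care.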

Frontier

In API mode, Anthropic's claude-opus-4.6-high leads at 77.5%, followed by OpenAI's gpt-5.3-codex-xhigh at 76.3% and Google's gemini-3.1-pro-preview-high at 72.3%. claude-sonnet-4.6-high is still in progress.

When Agentic runs are included, gpt-5.3-codex-xhigh takes the lead at 83.7% via the Codex CLI.

We observed some models (minimax-m2.5-high) suffering degraded performance as their reasoning_effort increased. After investigation, we found that these models were exhausting their maximum token output allocation, effectively thinking themselves to death.

EVM execution questions in API mode are a particularly good vibe check for model capability. Watching Kimi K2.5 step through EVM execution in its thinking traces is quite an experience (read: concerning!). We felt bad asking llama-3.2-1b to do the same. In general, we weren't expecting models to be so capable at executing the EVM in-context. Fortunately, we capped the difficulty of the EVM questions at generation time (based on a few heuristics), so once this dataset begins to saturate we can raise these limits and unleash a new, much harder class of questions.
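To give a feel for what "executing the EVM in-context" involves, here is a deliberately tiny interpreter covering five opcodes. It is a sketch under heavy simplifications (no gas, memory, or jumps), and the bytecode is a made-up example rather than an EthIQ question. In API mode a model has to do the equivalent bookkeeping by hand.

```python
WORD = 2**256  # EVM words wrap around at 256 bits


def run(bytecode: bytes) -> dict[int, int]:
    """Execute a tiny opcode subset (STOP, ADD, MUL, PUSH1, SSTORE) and return storage."""
    stack: list[int] = []
    storage: dict[int, int] = {}
    pc = 0
    while pc < len(bytecode):
        op = bytecode[pc]
        pc += 1
        if op == 0x00:  # STOP
            break
        elif op == 0x01:  # ADD
            stack.append((stack.pop() + stack.pop()) % WORD)
        elif op == 0x02:  # MUL
            stack.append((stack.pop() * stack.pop()) % WORD)
        elif op == 0x60:  # PUSH1: push the next byte (zero if the code ends early)
            stack.append(bytecode[pc] if pc < len(bytecode) else 0)
            pc += 1
        elif op == 0x55:  # SSTORE: key is on top of the stack, value below it
            key = stack.pop()
            value = stack.pop()
            storage[key] = value
        else:
            raise NotImplementedError(f"opcode 0x{op:02x} not supported in this sketch")
    return storage


# PUSH1 2, PUSH1 3, ADD, PUSH1 1, SSTORE  ->  storage slot 0x1 holds 5
print(run(bytes.fromhex("6002600301600155")))
```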

Open Weights

EthIQ performance chart showing model pass rates over time for only open weight models

Figure 2: Open Weights model performance by release date on the fusaka dataset (325 questions). Whiskers show 95% confidence intervals.

Ethereum and open weights models go hand-in-hand, so we added the ability to show just open weight models. Kimi's k2.5-high is the standout among the open weight models, scoring 60.6%.

minimax-m2.5-high was disappointing, landing at 37.4%. The bulk of this disparity is in Consensus Constants, with k2.5-high at 94.7% compared to minimax-m2.5-high at 58.9%. k2.5 is a much larger model, at 1 trillion parameters versus 230 billion for minimax-m2.5. World knowledge is an important factor when using an LLM for Ethereum!

The Canary 🐦

Thankfully, every API-mode evaluation failed the Consensus shuffling questions. Answering them correctly would require a model to compute SHA256 in-context. If this canary dies you'll find the ethPandaOps team in a remote location far away from any electricity.
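To see why shuffling depends on SHA256, here is a plain-Python sketch of the swap-or-not index shuffle, adapted from `compute_shuffled_index` in the consensus specs. It is only one building block of committee assignment and not EthIQ's question code. The all-zero seed is a placeholder; real seeds are derived from RANDAO.

```python
import hashlib

SHUFFLE_ROUND_COUNT = 90  # mainnet preset


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def compute_shuffled_index(index: int, index_count: int, seed: bytes) -> int:
    """Swap-or-not shuffle: map `index` to its shuffled position for a 32-byte seed."""
    assert index < index_count
    for current_round in range(SHUFFLE_ROUND_COUNT):
        round_byte = current_round.to_bytes(1, "little")
        # First hash of the round picks the pivot.
        pivot = int.from_bytes(sha256(seed + round_byte)[:8], "little") % index_count
        flip = (pivot + index_count - index) % index_count
        position = max(index, flip)
        # Second hash of the round supplies the bit that decides whether to swap.
        source = sha256(seed + round_byte + (position // 256).to_bytes(4, "little"))
        byte = source[(position % 256) // 8]
        bit = (byte >> (position % 8)) % 2
        index = flip if bit else index
    return index


seed = bytes(32)  # placeholder seed (assumption)
shuffled = [compute_shuffled_index(i, 10, seed) for i in range(10)]
print(shuffled)
```

Each round makes two SHA256 calls, so shuffling a single index costs 180 hashes across the 90 rounds. That is well beyond what a model can be expected to compute reliably without tools.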

Try it

Browse the full results at ethiq.ethpandaops.io. You can filter by question category, difficulty, and evaluation mode.

Keep an eye out as we'll be updating this as new forks ship and models are released!
