
Open Sourcing Xatu Data

Xatu Data Mainnet Sepolia Holesky
samcm
DevOps Engineer
savid
DevOps Engineer
The Ethereum Foundation is running a Data Challenge for EIP4844! Click here for more info.

Introduction

We’re thrilled to share that the EthPandaOps Xatu dataset is now open source!

The dataset contains a wealth of information about the Ethereum network, including detailed data on beacon chain events, mempool activity, and canonical chain events.

ethpandaops/xatu-data


Summary:

  • The data is licensed under CC BY 4.0
  • The entire schema is available here
  • Data is partitioned by hour or day in Apache Parquet files
  • We’ve already published the last 60 days of data for Mainnet, Holesky & Sepolia
    • We’ll publish everything we have for Mainnet over the following weeks – it’s just a lot of data!
    • Testnets will have the last 90 days of data published
    • Check here for detailed data availability
  • We’ll continue to update tables with new data, as well as add new tables

Why are we open sourcing the data?

We aim to empower researchers, developers, and enthusiasts to explore the Ethereum network in depth and contribute to its ongoing evolution. We believe that open access to high-quality data is essential for the success of the Ethereum ecosystem.

By making this data publicly available we hope to foster greater understanding of the Ethereum network and drive advancements in areas such as protocol development, monitoring and performance optimization.

What is Xatu?

Xatu is a tool we’ve been building for some time that collects data from different components of the Ethereum network. Since its initial release in December 2022, we’ve been running Xatu internally to monitor the Ethereum network, storing the data in Clickhouse.

In-house we’re using it for monitoring, analysis and incident response. Notably it was used for the Big Blocks Test on Goerli/Mainnet in 2023 to help decide the EIP4844 blob parameters. It has also been the first port of call for analysing how Dencun performed through the fork lifecycle of Devnets -> Testnets -> Mainnet.

What’s in the dataset?

Xatu has its fingers in a lot of pies, so we categorize the data into a few different types. Check here for the complete schema and data availability date ranges.

Beacon API Events

Last 60 days published

Events that are derived from the Beacon API Event Stream via Xatu Sentry from all consensus clients in multiple regions and networks. All events are annotated with additional data to help with analysis. For example, the attestation events have information about when the attestation was seen, and even the validator_index of the attestation. Mainnet data exists from June 2023.

  • 650+ billion attestation events
    • 6TiB compressed, 300TiB uncompressed 😲
  • 50+ million block events
  • 50+ million blob_sidecar events
  • Plus more!

Mempool Events

Publishing soon™

Events that are derived from Xatu Mimicry which connects to the execution p2p network. We’ll be publishing these events in the next few days. Mainnet data exists from March 2023.

  • 3+ billion transaction events

Canonical Events

Publishing soon™

We also derive events from the finalized chain which we call canonical events. Mainnet data exists from Beacon Chain genesis in December 2020.

These events are especially useful for analysis when compared to Beacon API Events and Mempool Events. For example, comparing when an attestation was seen on the network against when it was included in a beacon block, or comparing when a transaction was first seen in the mempool to when it was included in a block. We’ll be publishing these events in the coming weeks.

How do I use the data?

The data is stored in Apache Parquet files, a columnar storage format that is highly optimized for analytics. You can read these files with a variety of tools, including Python and Clickhouse. Check out the repo for more information on how to get started.
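
For readers who prefer Python, here is a minimal sketch of loading a single hourly file into a DataFrame. It is not an official client: the helper names `hourly_parquet_url` and `load_hour` are hypothetical, the path layout is inferred from the example query URLs in this post (path components without zero padding), and it assumes the table is partitioned by hour and that pandas plus a Parquet engine such as pyarrow are installed.

```python
from datetime import datetime

# Base URL and path layout are inferred from the example URLs in this post;
# treat them as assumptions and check the repo for the authoritative layout.
BASE_URL = "https://data.ethpandaops.io/xatu"


def hourly_parquet_url(network: str, table: str, hour: datetime) -> str:
    """Build the URL of one hourly Parquet partition (hypothetical helper).

    Year, month, day and hour are written without zero padding,
    matching the example URL ending in /2024/2/20/13.parquet.
    """
    return (
        f"{BASE_URL}/{network}/databases/default/{table}/"
        f"{hour.year}/{hour.month}/{hour.day}/{hour.hour}.parquet"
    )


def load_hour(network: str, table: str, hour: datetime):
    """Download one hourly partition into a pandas DataFrame.

    The whole file is fetched into memory, so this suits small tables
    or single hours rather than bulk analysis.
    """
    import pandas as pd  # lazy import: needs pandas and pyarrow (or fastparquet)

    return pd.read_parquet(hourly_parquet_url(network, table, hour))
```

Calling `load_hour("sepolia", "beacon_api_eth_v1_events_attestation", datetime(2024, 2, 20, 13))` would fetch the same file that the Clickhouse examples in this post query. For anything larger than a few files, inserting the data into Clickhouse is likely the better fit.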

Clickhouse

Using Clickhouse is the simplest way to get started. You can use the Clickhouse client to query the data directly from the Parquet files. Check out the Clickhouse documentation to get set up.

Querying directly

To count all attestation events in Sepolia during hour 13 of the 20th of February 2024, you can use the following query:

Query
SELECT COUNT(*)
FROM
    url('https://data.ethpandaops.io/xatu/sepolia/databases/default/beacon_api_eth_v1_events_attestation/2024/2/20/13.parquet', 'Parquet')
Querying Parquet directly from Clickhouse

Inserting data

You can also insert the data into a Clickhouse database to query it more easily. This is highly recommended for larger queries.

Query
INSERT INTO default.beacon_api_eth_v1_events_attestation
SELECT *
FROM url('https://data.ethpandaops.io/xatu/sepolia/databases/default/beacon_api_eth_v1_events_attestation/2024/2/20/13.parquet', 'Parquet')
Inserting directly into Clickhouse

Globbing

Clickhouse supports globbing, so you can query multiple Parquet files at once. For example, to count the entire day’s worth of attestation events in Sepolia on the 20th of February 2024, you can use the following query:

Query
SELECT COUNT(*)
FROM
    url('https://data.ethpandaops.io/xatu/sepolia/databases/default/beacon_api_eth_v1_events_attestation/2024/2/20/{0..23}.parquet', 'Parquet')
Querying Parquet directly from Clickhouse with globbing
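
If you generate these queries from a script, or use a tool that does not understand globs, the same day can be expanded by hand. The sketch below is illustrative only: `daily_glob_url`, `hourly_urls` and `count_query` are hypothetical helpers, and the URL layout is assumed from the queries above and only applies to tables partitioned by hour.

```python
from datetime import date

# Base URL and layout copied from the example queries; an assumption, not a spec.
BASE_URL = "https://data.ethpandaops.io/xatu"


def daily_glob_url(network: str, table: str, day: date) -> str:
    """Return a glob URL covering all 24 hourly files of one day."""
    return (
        f"{BASE_URL}/{network}/databases/default/{table}/"
        f"{day.year}/{day.month}/{day.day}/{{0..23}}.parquet"
    )


def hourly_urls(network: str, table: str, day: date) -> list[str]:
    """Expand the same day into 24 explicit URLs, one per hourly file."""
    prefix = daily_glob_url(network, table, day).rsplit("/", 1)[0]
    return [f"{prefix}/{hour}.parquet" for hour in range(24)]


def count_query(url: str) -> str:
    """Wrap a plain or globbed URL in the COUNT(*) query shown above."""
    return f"SELECT COUNT(*)\nFROM\n    url('{url}', 'Parquet')"
```

`count_query(daily_glob_url("sepolia", "beacon_api_eth_v1_events_attestation", date(2024, 2, 20)))` reproduces the globbed query text, while `hourly_urls(...)` yields the individual files for tools that need them listed one by one.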

Conclusion

We cannot wait to see what the community does with the data! If you cook something up, please let us know! We’d love to hear about it. If you have any questions, feel free to reach out to us on Twitter. If you notice any issues, please open an issue on the repo.

Happy querying and don’t forget the Data Challenge run by the Ethereum Foundation!

Love,

EthPandaOps Team ❤️