CBT (ClickHouse Build Tool)
CBT is a ClickHouse-focused data transformation tool that provides fast, idempotent transformations written in pure SQL or as external scripts. It is designed for building reliable data pipelines with dependency management, position tracking, and automatic retries.
Overview
CBT simplifies building data transformation pipelines on ClickHouse by providing:
- Fast Transformations: Optimized for ClickHouse with native query execution
- Idempotent Operations: Safe to re-run transformations without side effects
- Dependency Management: Automatic validation of upstream data availability
- Position Tracking: Precise interval tracking for incremental transformations
- Gap Detection: Automatically identifies and backfills missing data intervals
- Scheduled Jobs: Support for both incremental and scheduled transformations
Key Features
Incremental Transformations
Process data in ordered intervals with precise position tracking:
- Maintains exact boundaries for every processed interval
- Supports gap detection and automatic backfilling
- Validates dependency availability before processing
- Perfect for event stream processing and time-series aggregations
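The gap detection described above can be sketched as a sweep over the recorded intervals. This is an illustrative sketch only, not CBT's implementation: the half-open `(start, end)` representation and the names `find_gaps` and `split_into_tasks` are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: a processed interval is a half-open range [start, end).
Interval = Tuple[int, int]


def find_gaps(processed: List[Interval], lo: int, hi: int) -> List[Interval]:
    """Return the sub-ranges of [lo, hi) that no processed interval covers."""
    gaps: List[Interval] = []
    cursor = lo  # everything before the cursor is known to be covered or reported
    for start, end in sorted(processed):
        if end <= cursor:
            continue  # interval lies entirely behind the cursor
        if start >= hi:
            break  # interval lies entirely past the range of interest
        if start > cursor:
            gaps.append((cursor, min(start, hi)))
        cursor = max(cursor, end)
        if cursor >= hi:
            break
    if cursor < hi:
        gaps.append((cursor, hi))
    return gaps


def split_into_tasks(gap: Interval, max_size: int) -> List[Interval]:
    """Cut one gap into backfill tasks no larger than max_size."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    start, end = gap
    return [(s, min(s + max_size, end)) for s in range(start, end, max_size)]
```

Keeping exact boundaries for every processed interval is what makes this sweep possible: a gap is simply any stretch between the end of one recorded interval and the start of the next.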
Scheduled Transformations
Execute transformations on a schedule without position tracking:
- Reference data updates (exchange rates, user lists)
- System health monitoring and report generation
- Database maintenance tasks
- Any job that should run independently of data positions
Multi-Instance Architecture
CBT runs as a single binary that handles both coordination and task execution:
- Multiple instances can run for high availability
- Automatic task deduplication via Redis-backed queue
- Tag-based worker filtering for specialized processing
- Shared ClickHouse and Redis infrastructure
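To make task deduplication and tag-based filtering concrete, here is a small in-memory sketch. It uses a Python dictionary as a stand-in for the Redis-backed queue, and the `Task` and `DedupQueue` classes, the `model:start:end` identifier, and the "share at least one tag" matching rule are all assumptions made for the example rather than CBT's actual behaviour.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Task:
    model: str
    start: int
    end: int
    tags: Set[str] = field(default_factory=set)

    @property
    def unique_id(self) -> str:
        # Hypothetical key: the same model and interval always map to the same ID.
        return f"{self.model}:{self.start}:{self.end}"


class DedupQueue:
    """In-memory stand-in for a shared queue; not a Redis client."""

    def __init__(self) -> None:
        self._pending: Dict[str, Task] = {}

    def enqueue(self, task: Task) -> bool:
        """Add the task unless an identical one is already pending."""
        if task.unique_id in self._pending:
            return False  # another instance already scheduled this work
        self._pending[task.unique_id] = task
        return True

    def claim(self, worker_tags: Set[str]) -> Optional[Task]:
        """Hand out the first pending task this worker is allowed to run.

        Assumed rule: a worker with no tags accepts anything; otherwise the
        task must share at least one tag with the worker.
        """
        for uid, task in list(self._pending.items()):
            if not worker_tags or (task.tags & worker_tags):
                del self._pending[uid]
                return task
        return None
```

Because every instance derives the same identifier for the same unit of work, it does not matter which instance tries to schedule it first; the second attempt is a no-op.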
Architecture
        ┌───────────────┐
        │      CBT      │
        └───────┬───────┘
                │
       ┌────────┴────────┐
       │                 │
       ▼                 ▼
┌──────────────┐  ┌──────────────┐
│    Redis     │  │  ClickHouse  │
│              │  │              │
│ • Task Queue │  │ • Data       │
│ • Scheduling │  │ • Admin      │
└──────────────┘  └──────────────┘
Use Cases
Data Pipeline Engineering
- Build complex transformation pipelines with dependency management
- Transform raw Ethereum data into analytics-ready tables
- Aggregate metrics across multiple data sources with automatic validation
Analytics Platform
- Power dashboards and reporting with transformed data
- Maintain materialized views and aggregation tables
- Ensure data consistency across dependent transformations
Real-time Processing
- Process streaming data in ordered intervals
- Handle late-arriving data with gap detection
- Maintain data freshness with scheduled updates
How It Works
Model Definition
Each model is defined in a single file that pairs a YAML header (delimited by `---` lines) with a SQL body. Together they specify:
- Transformation type (incremental or scheduled)
- Dependencies on external data sources or other transformations
- Processing intervals and schedules
- SQL transformation logic or external script execution
External Models
Define source data boundaries:
---
table: beacon_blocks
interval:
  type: slot
---
SELECT
    min(slot) as min,
    max(slot) as max
FROM ethereum.beacon_blocks
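The `min` and `max` values returned by an external model are what make dependency validation possible. The sketch below shows one plausible way to use them, assuming every dependency must fully cover the requested interval; the function names and the second table, `ethereum.example_table`, are hypothetical, and CBT's real rules may be richer than this.

```python
from typing import Dict, Optional, Tuple

# Hypothetical: bounds are the inclusive (min, max) positions a source reports.
Bounds = Tuple[int, int]


def available_range(deps: Dict[str, Bounds]) -> Optional[Bounds]:
    """Intersect the bounds of all dependencies; None if they do not overlap."""
    if not deps:
        return None
    lo = max(b[0] for b in deps.values())
    hi = min(b[1] for b in deps.values())
    return (lo, hi) if lo <= hi else None


def can_process(start: int, end: int, deps: Dict[str, Bounds]) -> bool:
    """True when every dependency has data for the whole of [start, end]."""
    rng = available_range(deps)
    return rng is not None and rng[0] <= start and end <= rng[1]


# Example bounds, as if returned by two external models.
deps = {
    "ethereum.beacon_blocks": (100, 500),
    "ethereum.example_table": (200, 450),  # hypothetical second source
}
```

With these bounds, an interval starting at 100 would be held back, because the second source only has data from position 200 onward.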
Transformation Models
Process data with dependency validation:
---
type: incremental
table: block_stats
interval:
  max: 3600
dependencies:
  - ethereum.beacon_blocks
schedules:
  forwardfill: "@every 5m"
  backfill: "@every 1h"
---
INSERT INTO analytics.block_stats
SELECT
    slot,
    COUNT(*) as block_count,
    AVG(gas_used) as avg_gas
FROM ethereum.beacon_blocks
WHERE slot BETWEEN {{ .bounds.start }} AND {{ .bounds.end }}
GROUP BY slot;
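The `{{ .bounds.start }}` and `{{ .bounds.end }}` placeholders are filled in with the interval being processed before the query runs. The placeholder syntax resembles Go templates; the Python function below is only a minimal stand-in that shows the substitution step, and the name `render` and the nested-dictionary context are assumptions for the example.

```python
import re
from typing import Any, Dict

# Matches placeholders such as "{{ .bounds.start }}" and captures "bounds.start".
_PLACEHOLDER = re.compile(r"\{\{\s*\.([A-Za-z_][\w.]*)\s*\}\}")


def render(sql: str, context: Dict[str, Any]) -> str:
    """Replace each dotted placeholder with its value from a nested dict.

    Raises KeyError if a placeholder refers to a value that is not present.
    Values are inserted with str(), so this sketch is only suitable for
    trusted numeric values such as interval bounds.
    """
    def lookup(match: "re.Match[str]") -> str:
        value: Any = context
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return _PLACEHOLDER.sub(lookup, sql)


query = render(
    "WHERE slot BETWEEN {{ .bounds.start }} AND {{ .bounds.end }}",
    {"bounds": {"start": 7200, "end": 10799}},
)
```

Because the bounds fully determine which rows are read, re-running the same interval produces the same query, which is one ingredient of idempotent re-processing.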
Integration with ethPandaOps Stack
CBT powers data transformation in the ethPandaOps infrastructure:
- Xatu Data: Transforms raw Xatu network data into analytics tables
- The Lab: Provides transformed data for visualization and analysis
- Public Data: Powers the public datasets available to the community
Additional Features
Frontend UI
CBT includes a web-based frontend for:
- Real-time visualization of transformation pipelines
- Model dependency graphs
- Transformation status monitoring
- Interactive exploration of data models
REST API
Query model metadata and transformation state through the REST API:
- List all models with filtering by type and database
- Get detailed model information including dependencies
- Query transformation status and progress
- OpenAPI specification included
Resources
Related Tools
- CBT API: Automatic REST API generator for ClickHouse databases managed with CBT
- ClickHouse Proto Gen: Generate Protocol Buffer schemas from ClickHouse tables
- Xatu: Collect Ethereum network data that can be transformed with CBT
- The Lab: Visualize data transformed by CBT
Community
Need help or want to contribute?
- Report issues on GitHub
- Join us on the Ethereum R&D Discord
- Check out related tools in the ethPandaOps ecosystem