CHUG: High-Performance ETL That Actually Works in Production
Data migration between databases is painful. I've seen too many projects where "just copy the data over" turns into weeks of debugging schema mismatches, connection timeouts, and memory overflows.
CHUG was born out of frustration with existing ETL tools that either require expensive enterprise licenses or break down when you actually need them to handle real-world data volumes.
The Problem: Analytics Data Stuck in OLTP Hell
Modern applications generate massive amounts of data in PostgreSQL, but analytics workloads need the columnar performance of ClickHouse. The gap between these systems creates a bottleneck that kills data-driven decision making.
Traditional ETL tools are either:
- Too complex (looking at you, Airflow with 20-step DAGs)
- Too slow (batching everything overnight)
- Too expensive (enterprise solutions that cost more than your servers)
- Too unreliable (fails silently with partial data)
I wanted something that just works: fast, reliable, and simple enough to deploy and forget.
Architecture: Streaming ETL with Go's Performance
CHUG is designed around three core principles:
Streaming Over Batching: Instead of loading entire tables into memory, CHUG streams data in configurable chunks. This means you can migrate 100GB tables on a 2GB server without breaking a sweat.
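Here's roughly what that chunked streaming loop looks like in Go. This is a simplified sketch, not CHUG's actual code: the table, columns, and the `flush` callback are placeholders, and I'm assuming a standard `database/sql` driver is registered elsewhere.

```go
package etl

import "database/sql"

// streamInChunks reads rows from the source and hands them to flush in
// fixed-size batches, so memory use is bounded by batchSize, not table size.
// Table, columns, and the flush callback are placeholders for illustration.
func streamInChunks(db *sql.DB, batchSize int, flush func([][]any) error) error {
	rows, err := db.Query(`SELECT id, payload FROM user_events`)
	if err != nil {
		return err
	}
	defer rows.Close()

	batch := make([][]any, 0, batchSize)
	for rows.Next() {
		var id int64
		var payload []byte
		if err := rows.Scan(&id, &payload); err != nil {
			return err
		}
		batch = append(batch, []any{id, payload})
		if len(batch) == batchSize {
			if err := flush(batch); err != nil {
				return err
			}
			batch = batch[:0] // reuse the backing array; memory stays flat
		}
	}
	if len(batch) > 0 {
		if err := flush(batch); err != nil {
			return err
		}
	}
	return rows.Err()
}
```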
Smart Schema Mapping: PostgreSQL and ClickHouse have different type systems. CHUG handles the translation automatically - UUIDs, timestamps, arrays, and JSON all get mapped correctly without manual intervention.
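To give a flavor of what that translation involves, here's an illustrative subset of a Postgres-to-ClickHouse type map. The mappings shown are common choices, not necessarily the exact ones CHUG ships with, and nullability handling is simplified.

```go
package etl

import "fmt"

// mapType translates a Postgres type name into a ClickHouse column type.
// Illustrative subset only; arrays and composite types are omitted here.
func mapType(pgType string, nullable bool) (string, error) {
	var ch string
	switch pgType {
	case "uuid":
		ch = "UUID"
	case "timestamp", "timestamptz":
		ch = "DateTime64(6)"
	case "integer", "int4":
		ch = "Int32"
	case "bigint", "int8":
		ch = "Int64"
	case "text", "varchar", "jsonb":
		ch = "String"
	case "boolean":
		ch = "UInt8"
	default:
		return "", fmt.Errorf("unmapped postgres type %q", pgType)
	}
	if nullable {
		ch = "Nullable(" + ch + ")"
	}
	return ch, nil
}
```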
Resilient by Design: Network hiccups and transient failures are facts of life. CHUG implements exponential backoff with jitter, parameterized queries to prevent SQL injection, and comprehensive error logging.
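The retry logic boils down to something like this. It's a minimal sketch; the base delay, cap, and attempt count are illustrative defaults rather than CHUG's tuned values.

```go
package etl

import (
	"math/rand"
	"time"
)

// retryWithBackoff retries op with exponential backoff and full jitter.
// Base delay, cap, and attempt count are illustrative defaults.
func retryWithBackoff(attempts int, op func() error) error {
	base := 500 * time.Millisecond
	maxDelay := 30 * time.Second

	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		delay := base << i // 500ms, 1s, 2s, 4s, ...
		if delay > maxDelay {
			delay = maxDelay
		}
		// Full jitter: sleep a random duration in [0, delay) so retries
		// from many workers don't synchronize.
		time.Sleep(time.Duration(rand.Int63n(int64(delay))))
	}
	return err
}
```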
Key Technical Features
CLI-First Experience: No complex config files required for simple migrations. Want to copy a table? One command:
```bash
chug ingest --pg-url "postgres://user:pass@host/db" \
  --ch-url "clickhouse-host:9000" \
  --table "user_events"
```
YAML Configuration for Complex Setups: When you need more control, CHUG supports full YAML configuration with polling intervals, custom batch sizes, and multiple table definitions.
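To give a feel for how such a config could be loaded, here's a sketch of a Go struct unmarshalled with `gopkg.in/yaml.v3`. The field names and YAML keys are illustrative, not CHUG's documented schema.

```go
package etl

import (
	"os"

	"gopkg.in/yaml.v3" // assumed YAML library; CHUG's actual choice may differ
)

// Config mirrors the rough shape such a YAML file might take.
// Field names here are illustrative, not CHUG's documented schema.
type Config struct {
	PostgresURL   string        `yaml:"pg_url"`
	ClickHouseURL string        `yaml:"ch_url"`
	PollInterval  string        `yaml:"poll_interval"`
	BatchSize     int           `yaml:"batch_size"`
	Tables        []TableConfig `yaml:"tables"`
}

type TableConfig struct {
	Name      string `yaml:"name"`
	CursorCol string `yaml:"cursor_column"`
}

func loadConfig(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}
```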
Change Data Capture (CDC): CHUG can monitor PostgreSQL tables for changes and sync only the deltas. Perfect for keeping analytics data fresh without full table reloads.
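Conceptually, a polling-based delta sync looks like the sketch below. It assumes the source table has a monotonically increasing `updated_at` column; how CHUG tracks changes internally may differ, and a real implementation would record the max cursor value seen rather than the wall clock.

```go
package etl

import (
	"context"
	"database/sql"
	"time"
)

// pollDeltas is a simplified polling-style CDC loop: every interval it pulls
// rows changed since the last pass and hands them to sync.
func pollDeltas(ctx context.Context, db *sql.DB, interval time.Duration,
	sync func(*sql.Rows) error) error {

	lastSeen := time.Time{}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			rows, err := db.QueryContext(ctx,
				`SELECT id, payload, updated_at FROM user_events WHERE updated_at > $1 ORDER BY updated_at`,
				lastSeen)
			if err != nil {
				return err
			}
			if err := sync(rows); err != nil {
				rows.Close()
				return err
			}
			rows.Close()
			// Simplification: production code would advance the cursor to the
			// largest updated_at actually read, not the wall clock.
			lastSeen = time.Now()
		}
	}
}
```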
Security First: All queries use proper parameterization, table names are validated and quoted, and sensitive data never appears in logs.
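As an illustration of that pattern, identifier validation plus parameterization can look like this sketch (not CHUG's exact rules):

```go
package etl

import (
	"database/sql"
	"fmt"
	"regexp"
	"time"
)

// Only plain identifiers pass; everything else is rejected before it ever
// reaches a query string. CHUG's actual rules may be stricter.
var identPattern = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

func quoteIdent(name string) (string, error) {
	if !identPattern.MatchString(name) {
		return "", fmt.Errorf("invalid identifier: %q", name)
	}
	return `"` + name + `"`, nil
}

// selectSince shows the pattern: identifiers are validated and quoted,
// values always travel as bind parameters, never by string concatenation.
func selectSince(db *sql.DB, table string, since time.Time) (*sql.Rows, error) {
	quoted, err := quoteIdent(table)
	if err != nil {
		return nil, err
	}
	query := fmt.Sprintf(`SELECT * FROM %s WHERE updated_at > $1`, quoted)
	return db.Query(query, since)
}
```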
Real-World Performance
In production testing:
- Migrated 50M+ records with 5000-row batches in under 30 minutes
- Memory usage stays constant regardless of table size
- Handles connection drops gracefully with automatic retry
- Zero data loss with proper acknowledgment patterns
The secret sauce is in the implementation details:
- Streaming extraction from PostgreSQL with proper cursors
- Batched inserts to ClickHouse with configurable sizing (see the sketch after this list)
- Parallel processing where safe, sequential where necessary
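To illustrate the insert path, here's what a batched ClickHouse write looks like with the clickhouse-go v2 driver. The driver choice, table, and columns are assumptions for the example, not necessarily what CHUG uses internally.

```go
package etl

import (
	"context"

	"github.com/ClickHouse/clickhouse-go/v2" // assumed driver; CHUG's choice may differ
)

// insertBatch ships a slice of rows to ClickHouse as a single batch.
// Column layout and table name are illustrative.
func insertBatch(ctx context.Context, addr string, rows [][]any) error {
	conn, err := clickhouse.Open(&clickhouse.Options{Addr: []string{addr}})
	if err != nil {
		return err
	}
	defer conn.Close()

	batch, err := conn.PrepareBatch(ctx, "INSERT INTO user_events (id, payload)")
	if err != nil {
		return err
	}
	for _, r := range rows {
		if err := batch.Append(r...); err != nil {
			return err
		}
	}
	// Send ships the whole batch in one round trip; nothing is written until it succeeds.
	return batch.Send()
}
```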
Production Deployment Strategy
CHUG is designed to run anywhere:
Development: Docker Compose setup includes PostgreSQL, ClickHouse, and management UIs. Perfect for testing migrations locally.
Production: Single binary with no external dependencies. Deploy in a container, set up a cron job, or run as a daemon with polling enabled.
Monitoring: Structured logging with Zap provides detailed progress tracking and error reporting. Easy integration with your existing log aggregation.
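A minimal Zap setup looks like this; the field names are illustrative, not CHUG's actual log schema.

```go
package main

import "go.uber.org/zap"

func main() {
	// The production config emits JSON, which log aggregators ingest directly.
	logger, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	// Field names are illustrative, not CHUG's actual log schema.
	logger.Info("batch committed",
		zap.String("table", "user_events"),
		zap.Int("rows", 5000),
		zap.Int64("elapsed_ms", 412),
	)
}
```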
Why Go for ETL?
Go might seem like an unusual choice for ETL, but it's perfect for this use case:
- Performance: Near-C++ speed for data processing
- Memory Safety: No segfaults or memory leaks during long-running migrations
- Concurrency: Goroutines make parallel processing trivial
- Single Binary: Deploy anywhere without runtime dependencies
- Strong Typing: Catch schema mismatches at compile time
Future Roadmap
CHUG is production-ready today, but the work isn't done. I'm actively working on:
- Parquet export for data lake integration
- Prometheus metrics for better observability
- Conflict resolution strategies for bidirectional sync
- Built-in data validation and integrity checks
The goal is to build the ETL tool I wish existed when I was dealing with data pipeline headaches at scale.