Bitcoin ETL dataset modernization requested by Google

5 min read
Olha Diachuk
August 29, 2025

We were engaged to take an aging, open‑source blockchain ETL pipeline and operationalize a Bitcoin‑only dataset for an internal Google customer. The work was primarily operational systems engineering: reuse the existing Blockchain‑ETL components, package them for reliable operation on GCP, harden the streaming and daily reconciliation setup so the dataset remains both fresh and consistent, and hand over a reproducible, supportable deployment.

The client and context

Google reached out for a maintained replacement for an abandoned open‑source Bitcoin ETL dataset, backed by long‑term operational support. That dataset is the end product of another open‑source project, Blockchain ETL.

Dysnix was involved with the Blockchain ETL components at the very beginning of that project, so we know it from A to Z.
The Blockchain ETL architecture | Source

Why not a total rewrite? The original ETL provided the required features and semantics. The shortest path to reliable production was to adapt and operationalize, not rewrite — less risk, faster time‑to‑value.

The deployed solution

Architecture diagram of the Bitcoin ETL solution

Architecture

Here’s how the whole solution works:

  1. Bitcoin nodes (GKE) provide internal‑only RPC endpoints to the ETL.
  2. Blockchain‑ETL workers (Airflow jobs) parse blocks and write artifacts to Google Cloud Storage (GCS).
  3. Batch/streaming layer:
    • Batch: a daily Airflow job processes the previous day and reconciles the dataset.
    • Streaming: new blocks/events → Pub/Sub → Dataflow streaming pipelines → incremental updates to BigQuery (see the sketch below).
  4. BigQuery: the final analytical dataset is served to the Google client.
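
To make the streaming leg concrete, here is a minimal Apache Beam sketch of a Dataflow‑style pipeline that decodes block messages from Pub/Sub and appends them to BigQuery. The topic and table names are placeholders rather than the project's actual identifiers, and the real pipelines carry richer schemas and error handling.

```python
# Illustrative sketch only: a streaming Beam pipeline that moves parsed block
# messages from Pub/Sub into BigQuery. Topic and table names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions(streaming=True)  # run as a streaming (Dataflow) job
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadBlocks" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/bitcoin-blocks"  # placeholder
            )
            | "Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:bitcoin.blocks",  # placeholder table
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```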

An overview of the other architecture layers:

Layer | Technologies | Responsibility
Compute & Orchestration | GKE (Kubernetes), Helm, Flux CD | Run nodes, ETL, and Airflow workers; GitOps deploys
Storage & Analytics | GCS, BigQuery | Artifacts; final dataset
Streaming | Pub/Sub, Dataflow | Timely updates
Orchestration & ETL | Airflow, Blockchain-ETL | Daily backfill + incremental parsing
Infra as Code | Terraform, Terragrunt | Provisioning GCP resources; bootstrapping Flux CD
Observability | VictoriaMetrics, Grafana, Alertmanager | Metrics, dashboards, alerts

Comments on technical decisions

There are a few general points worth mentioning about this architecture: 

  • We made the Bitcoin node RPC endpoints internal to the cluster. The nodes are not exposed to the public internet and serve only the dataset pipeline.
  • We adhered to a single‑zone deployment per the client's constraints but used application‑level redundancy (multiple node pods) to improve resilience. If a multi‑zone setup is ever needed, it can be added quickly.
  • We avoided over‑engineering caching or complex hot‑path optimizations because the workload is not latency‑sensitive (the pipeline is primarily for analytics).

The paragraphs below provide more details on the technical implementation.

  • Streaming with daily reconciliation

Our ingestion pipeline balances freshness with stability. Streaming updates flow into BigQuery almost in real time through Pub/Sub and Dataflow, keeping the dataset close to the chain tip. But streaming alone can be fragile—short blockchain reorganizations may introduce temporary inconsistencies. To resolve this, a daily Airflow job reconciles the previous day’s data, correcting any transient errors and ensuring long‑term consistency.
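
As an illustration of the reconciliation side, below is a hypothetical Airflow DAG that re‑derives the previous day and overwrites the streamed rows. The table names, SQL, and schedule are assumptions made for the sketch; the production pipeline reuses the Blockchain‑ETL Airflow DAGs.

```python
# A minimal, hypothetical sketch of a daily reconciliation DAG.
# Table names, the SQL, and the schedule are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

RECONCILE_SQL = """
-- Replace the streamed rows for the processed day with a freshly parsed copy.
DELETE FROM `my-project.bitcoin.transactions`
WHERE DATE(block_timestamp) = '{{ ds }}';

INSERT INTO `my-project.bitcoin.transactions`
SELECT * FROM `my-project.bitcoin.transactions_batch`
WHERE DATE(block_timestamp) = '{{ ds }}';
"""

with DAG(
    dag_id="bitcoin_daily_reconciliation",
    start_date=datetime(2025, 1, 1),
    schedule="0 4 * * *",  # a few hours after midnight UTC, once the chain has settled
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=15)},
) as dag:
    reconcile_previous_day = BigQueryInsertJobOperator(
        task_id="reconcile_previous_day",
        configuration={"query": {"query": RECONCILE_SQL, "useLegacySql": False}},
    )
```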

  • Reorg protection by design

To further reduce risk, we never parse the very latest blocks immediately. Instead, we deliberately hold back the most recent 2–4 blocks and parse them only after more confirmations arrive. This simple delay guards against orphaned data. The daily job then finalizes the day once the chain has stabilized (typically within 3–4 hours), resulting in a consistent dataset even in the face of reorgs.
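
A minimal sketch of this idea, assuming a bitcoind JSON‑RPC endpoint and a hypothetical lag constant, might look like this:

```python
# Illustrative helper: only parse blocks that are at least REORG_LAG blocks
# behind the chain tip. The endpoint, credentials, and lag value are assumptions.
import requests

REORG_LAG = 4  # hold back the most recent blocks; short reorgs rarely go deeper


def get_chain_tip(rpc_url: str, rpc_auth: tuple[str, str]) -> int:
    """Return the current block height from the (internal) Bitcoin node RPC."""
    payload = {"jsonrpc": "1.0", "id": "etl", "method": "getblockcount", "params": []}
    response = requests.post(rpc_url, json=payload, auth=rpc_auth, timeout=10)
    response.raise_for_status()
    return response.json()["result"]


def safe_parse_range(last_parsed: int, rpc_url: str, rpc_auth: tuple[str, str]) -> range:
    """Blocks that are safe to parse now: everything up to tip - REORG_LAG."""
    tip = get_chain_tip(rpc_url, rpc_auth)
    return range(last_parsed + 1, max(last_parsed + 1, tip - REORG_LAG + 1))
```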

  • Handling historic backfill

Catching up with the entire Bitcoin history, from 2009 onwards, is a separate operational mode. We parallelized the workload into many Airflow jobs and let the cluster autoscale only during that window. As a result, we caught up in less than 24 hours after the Bitcoin nodes were synced.
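
Conceptually, the backfill boils down to splitting the chain history into independent block ranges that can be exported in parallel. The chunk size and tip height below are illustrative assumptions, not the values used in production:

```python
# Illustrative sketch of partitioning the historic backfill into independent
# block ranges, each of which can become its own parallel Airflow job.
def backfill_partitions(start_block: int, end_block: int, chunk_size: int = 50_000):
    """Yield (start, end) block ranges that can be exported in parallel."""
    for chunk_start in range(start_block, end_block + 1, chunk_size):
        yield chunk_start, min(chunk_start + chunk_size - 1, end_block)


# Example: partition the whole chain history up to an assumed tip height.
partitions = list(backfill_partitions(0, 860_000))
```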

  • Deployment and operations model

For infrastructure, we relied on Terraform and Terragrunt, while runtime deployments followed a GitOps model with Flux CD and Helm charts. Treating Kubernetes configuration as code made every change auditable and every environment reproducible—the cluster state always aligns with what’s in Git.

Observability, metrics & operational features

Metrics stack:

  • VictoriaMetrics for long‑term metric storage, 
  • Grafana dashboards, 
  • Alertmanager for alerting.
VictoriaMetrics dashboard 

Typical alerts we track continuously:

  • Airflow job failures (daily job fails).
  • Pod restart storms (liveness causing flapping).
  • Streaming pipeline lag/issues (Pub/Sub/Dataflow errors).

Tuning alerting properly takes effort and testing time. In the end, we got a system that generates only meaningful alerts requiring an engineer's attention and handles everything else by following the runbooks. Before we reached that point, we ran into some classic situations, such as:

Kubernetes liveness probes were too aggressive for rare long intervals between blocks (we saw restarts when block mining paused for ~2h). We tuned the liveness/readiness timeouts to practical thresholds (increased initial probe timeouts) so nodes are not restarted unnecessarily. This reduced flapping and false incidents.
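
The takeaway generalizes: a node's liveness check should verify that its RPC responds, not that a new block arrived recently, because long gaps between blocks are normal on Bitcoin. A hypothetical probe script in that spirit (the endpoint and credentials are placeholders) could look like this:

```python
# Hypothetical liveness-style check: fail only if the node's RPC is unreachable,
# never because the chain tip is "stale". Endpoint and credentials are placeholders.
import sys

import requests

RPC_URL = "http://127.0.0.1:8332"            # assumed in-pod RPC endpoint
RPC_AUTH = ("probe-user", "probe-password")  # placeholder credentials


def main() -> int:
    payload = {"jsonrpc": "1.0", "id": "probe", "method": "getblockchaininfo", "params": []}
    try:
        response = requests.post(RPC_URL, json=payload, auth=RPC_AUTH, timeout=5)
        response.raise_for_status()
    except requests.RequestException:
        return 1  # RPC unreachable: let Kubernetes restart the pod
    # Deliberately do NOT fail on a stale block height here.
    return 0


if __name__ == "__main__":
    sys.exit(main())
```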

Health checks:

  • Kubernetes liveness/readiness for node pods and ETL pods.
  • Blockchain‑ETL provides multiple dataset health checks (stream health vs dataset health).

A short glimpse of the failure runbooks we developed:

Failure mode | Likely cause | Mitigation / Runbook
Airflow's daily job fails | Data race with streaming / transient errors | Restart the job after the streaming backlog clears; investigate logs
Pod flapping (nodes) | Aggressive liveness probes + low block frequency | Increase liveness timeout; add redundancy (2+ node pods)
Reorg-induced incorrect entries | Parsing too-recent blocks | Delay parsing of the last N blocks; daily reconciliation
Heavy backfill overload | Resource limits during parallel backfill | Autoscale the cluster for backfill; throttle parallelism if RPC is overloaded

Custom dashboards:

  • Node sync status and block height drift (a metric sketch follows this list).
  • Airflow job success rates and durations.
  • Pub/Sub backlog and Dataflow processing lag.
Bitcoin node dashboard
  • Centralized log collection (log volume for Bitcoin ETL is modest), searchable via the Grafana/observability endpoint.
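
As a taste of how such a dashboard metric could be produced, here is a hypothetical sketch that compares the node's tip height with the latest block in BigQuery and exposes the gap as a gauge for VictoriaMetrics to scrape. The table, endpoint, and credentials are placeholders, not the project's actual names.

```python
# Hypothetical "block height drift" exporter: node tip height minus the latest
# height present in BigQuery, published as a Prometheus-style gauge.
import time

import requests
from google.cloud import bigquery
from prometheus_client import Gauge, start_http_server

BLOCK_DRIFT = Gauge(
    "bitcoin_etl_block_height_drift",
    "Node tip height minus the latest block height in BigQuery",
)


def node_tip(rpc_url: str, auth: tuple[str, str]) -> int:
    payload = {"jsonrpc": "1.0", "id": "drift", "method": "getblockcount", "params": []}
    return requests.post(rpc_url, json=payload, auth=auth, timeout=10).json()["result"]


def dataset_tip(client: bigquery.Client) -> int:
    query = "SELECT MAX(number) AS tip FROM `my-project.bitcoin.blocks`"  # placeholder
    return next(iter(client.query(query).result())).tip


if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for VictoriaMetrics to scrape
    bq = bigquery.Client()
    while True:
        drift = node_tip("http://bitcoin-node:8332", ("user", "pass")) - dataset_tip(bq)
        BLOCK_DRIFT.set(drift)
        time.sleep(60)
```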

Lessons learned & engineering takeaways

When tackling projects like this, we found a few key principles consistently paid off:

  • Leverage existing, stable components. 

If the core ETL logic is sound and its semantics are stable, reusing mature, well-understood components is often the most efficient path. This approach saves significant time and reduces implementation risk compared to a full rewrite.

  • Design for blockchain reorganizations (reorgs). 

For any Bitcoin-like blockchain analytics pipeline, anticipating and mitigating reorgs is crucial. Our strategy combined a short, intentional parsing delay for recent blocks with periodic, full reconciliation. This dual approach ensures data accuracy even when the chain experiences temporary forks.

  • Separate operational modes. 

It's vital to distinguish between steady-state streaming and heavy historical backfill operations. We treated backfill as a controlled, autoscaled process, preventing it from overburdening the continuous streaming pipeline. This separation maintains performance and stability for both, while keeping the dataset current through near-real-time updates.

  • Embrace IaC and GitOps for maintainability. 

IaC with Terraform/Terragrunt, combined with a GitOps approach using Flux CD, proved invaluable for long-term support and operational transparency. It makes the entire deployment process readable, auditable, and repeatable, simplifying future maintenance and handovers.

What was dropped / out of scope

  • Mempool and confirmed‑data strategies, deterministic parsing, and expensive hot‑path caching or avoidance were not relevant for this project, given the characteristics of the Bitcoin network.
  • Extensive cost or performance engineering was not conducted. The service runs in the background and does not require ultra‑low latency, so additional optimization was unnecessary; we did, however, perform the optimizations needed for the initial backfill stage.
  • The case study did not cover broader external security, privacy, and compliance considerations, since all the source and derived data are public. We nevertheless applied the appropriate access controls and security best practices within the project.

Deliverables and future support

The project took one Dysnix DevOps engineer a month to complete, and it now continues in support mode. Here's the set of deliverables we provided:

  • Fully operational GCP deployment (GKE + GCS + Pub/Sub + Dataflow + BigQuery).
  • Helm charts + Flux CD configuration + Terraform modules (IaC).
  • Airflow orchestration configured for:
    • Steady streaming ingestion,
    • Daily reconciliation,
    • Parallelized backfill.
  • Monitoring dashboards, alerts, and runbooks for the top incident classes.
  • Documentation and a deployment manual (to be published per client’s requirements).

We delivered a low‑risk, operationally robust pipeline that balances freshness and correctness and is reproducible via IaC and GitOps. The outcome gives the client a supported dataset that can be consumed reliably, with clear runbooks and observability to handle real‑world edge cases.

And as the Bitcoin dataset is expected to run continuously, the real value comes from maintaining its stability, reacting to operational issues, and scaling improvements as needed. 

Beyond initial delivery, our team will continue to provide operational support, monitoring, and incremental improvements to ensure the dataset performs reliably. This includes troubleshooting incidents, adapting the stack to evolving Google Cloud features, and keeping the deployment aligned with best practices over time.

Olha Diachuk
Writer at Dysnix
10+ years in tech writing. Trained researcher and tech enthusiast.