BQ Bulletproof
Bulletproof every BigQuery transformation, ingestion, and read so surprise bills are physically impossible and every table has tests, lineage, and ownership.
Why
Started 2026-05-03. Two threads converged:
-
Cost surprise. Feb 4-5 2026 incident: a per-row
WHERE email_id = @idlookup loop scanned 1.4 TB / ~$9 in 2 days because BQ bills at least 10 MB per query. No project-level guardrail caught it. Audit revealed zero per-query caps in the wrapper code, no billing budget on the API (one existed at billing-account level but was uncatalogued). -
Transformation sprawl. ~80 transformation views across 7 datasets (
pos_advanced24,pos_digest17,pos_staging11,pos_analytics9,pos_coaching8,master_data6,pos_dashboard5) plus ~5 pure-SQL transform CFs. Created ad-hoc, no version control, no tests, no enforced tier discipline. "Why is this number off?" has no audit trail.
Bulletproof = caps stop bad queries, every transform is version-controlled with assertions, every gold table has a freshness watchdog, and there's exactly one source of truth per metric.
Current state
- Status: in-flight — Phase 2 CLOSED 2026-05-20 (cutover 5/18 + 48h watch + teardown + Dataform-tf port all shipped). Phases 3-10 not started; ASK before Phase 3 (24 views = real volume).
- Last update: 2026-05-20 (MR !148 teardown + MR !150 tfvars-bump apply + MR !151 stranded-tf port + GitLab #4 closed won't-fix-because-eliminated)
- Phase 0 (safety nets): closed. 10 GB cap live, $50 budget kept, clusters applied to
dealbot_processed_emails+fub_people, DealBot batched lookup shipped, daily cost report writing topos_raw.daily_cost_report. - Phase 1 (inventory): done, re-verified 2026-05-18 against live BQ; reporting now 7 BASE TABLES (was 8 —
rpt_dealhub_summarydropped 2026-05-20). Catalogued 80 views; live is now 81 — one view added since the 2026-05-03 catalog:master_data.v_function_health(a real transform overpos_raw.function_executions; master_data 6→7). Nothing removed or renamed in the view set; the other 6 datasets are byte-exact to catalog. Corrected counts: pos_advanced 24, pos_digest 17, pos_staging 11, pos_analytics 9, pos_coaching 8, pos_dashboard 5, master_data 7 = 81.v_function_healthis a Phase-4 port target (master_data, silver). - Phase 2 (port
reporting_refresh.py): CLOSED 2026-05-20. Three MRs shipped: - MR !148 — teardown deletes + BQ drops + doc scrubs (40 files, +59/-2594). Removed
reporting_refresh.py+ tests, the/reporting-refreshroute on pos-pipeline, both/reporting-parityroutes on pos-monitoring,reporting_parity.pyhandler,parity_harness/(10 files + tests), both heartbeats fromhealth_check.py, thepos-monitoring-reporting-parityscheduler block, doc refs across 10 files. Dropped BQ:reporting.rpt_dealhub_summary,reporting_dataform.*(7 tables + dataset),pos_raw.reporting_parity. - MR !150 — manual tfvars-bump per
feedback_monorepo_no_tfvars_bump(monorepo CI builds images on merge but doesn't auto-bump tfvars). Live revisions now: pos-pipeline00020-jvm+ pos-monitoring00045-cfr, both on image:0fdcb619. B4 auto-apply correctly refused the scheduler-destroy without human review (safety gate working); manualterraform applyon pos-monitoring hard-deleted the now-paused scheduler. - MR !151 — ported the stranded Dataform terraform from
_source/platform/projects/bq-bulletproof/terraform/into liveinfra/terraform/bq-bulletproof-dataform/. State atgs://pospj-480915-tf-state/bq-bulletproof/survived the platform-repo archival;terraform state rmfor the 3 shadow-dataset resources (cloud already dropped) →terraform planreturns "No changes. Your infrastructure matches the configuration." - Live verification: Dataform writes
reporting.rpt_*directly daily @ 04:30 CT (release_config compile @ 03:00 UTC → workflow_config invoke); 5/19 09:30Z + 5/20 09:30Z runs landed clean; schedulerpos-monitoring-reporting-parityreturnsNOT_FOUND;terraform planfrom the new monorepo tf root no-op. - GitLab issue #4 closed won't-fix-because-eliminated: the "manual mirror-sync before 03:00 UTC cron" framing was killed by the 2026-05-15 monorepo consolidation. Canonical Dataform source IS the GitLab repo
gitlab.com/rylobasic/bq-bulletproof-dataform(verified byte-identical to the frozen archive); no monorepo subtree exists; nothing to sync. - Residual acknowledged tech debt (kept as-is for 0-change tf port):
release_config.code_compilation_config.default_schema = "reporting_dataform"is a now-orphan fallback. Every.sqlxoverrides via its ownconfig.schema. Cleanup is a behavior-change MR, separate from the port. - History (pre-cutover findings) retained below:
- Phase 2 (port
reporting_refresh.py) — pre-cutover history: RESOLVED 2026-05-18 — harness rebuilt, root cause proven, cutover greenlit. - Outcome: the golden frozen-input parity harness (
infra/bq-bulletproof/parity_harness/) was built subagent-driven (8 tasks, per-task spec+quality review, final holistic review, 51 hermetic tests), merged tomain(MR !133, merge commit18a1e6be, pipeline green), and re-run from a clean checkout per the README → all 7 tables equivalent on identical frozen input, OVERALL PASS, exit 0, self-cleaning. Independently corroborated 3 ways (per-task runs, opus reviewer's separate SQL formulation, clean-checkout). Conclusion proven: legacy ≡ Dataform; the old daily "drift" was 100% timing skew, not port bugs. Phase 2 cutover is greenlit. - Go/no-go re-run 2026-05-18 ~19:22Z: PASS (4th confirmation). Ran
python -m parity_harness.mainagainst fresh frozen input — all 7 tables 0 diffs (rpt_agent_boxscore29=29,rpt_agent_risk30=30,rpt_consult_metrics476=476,rpt_consults_not_booked55=55,rpt_pipeline_health287=287,rpt_transaction_pipeline_summary2=2,rpt_weekly_wins11=11;onlyL=0 onlyR=0each), OVERALL PASS, exit 0, self-cleaned in ~95s. Gate is GREEN. (Frozen-input parity is time-invariant, but re-run once more if the cutover lands days later AND the harness or Dataform models change in between.) - The next action is the cutover itself (see Next actions). The harness is one-shot — retire it (
rm -rf infra/bq-bulletproof/parity_harness/) after cutover completes. - History (why the rebuild was needed) — retained below:
- The old
reporting_parity.py"7-day zero-drift clock" was FALSE. Audit ofpos_raw.reporting_parityon 2026-05-18 showeddrift_detected=trueon every daily run 5/8 → 5/17; only one clean row ever (2026-05-07 13:54); zero clean days since the 5/13 restart. - The 5/13 single-theory fix is disproven. Theory was FLOAT64 AVG LSB noise (
479.37vs479.38) fixed by widening parity rounding 2 → 1 decimal (CCPJ !677). Deployed revision ispos-monitoring-00044-dvc(commit19f72c4e, 5/16) — the fix and 3+ revisions past it are live, so this is real divergence, not a deploy lag and not sub-decimal noise. - TESTED ROOT CAUSE (single, not 3 classes): the parity harness is unsatisfiable by construction. It compares
reporting.*(legacy CF, refreshed off the Looker email) againstreporting_dataform.*(Dataform, refreshed 04:30 CT) and demands byte-equality. The two sides refresh on independent schedules — observed skew on 5/17 was ~21.5h (reporting.*last_refreshed 2026-05-17 12:04 vsreporting_dataform.*2026-05-18 09:30).STALENESS_HOURS=26is so loose it permits a ~21h skew while calling both sides "fresh." For any table whose inputs mutate (YTD counters, pipeline membership,MAX(report_date)self-filter), two snapshots taken hours apart can never byte-match — regardless of Dataform model correctness. - No Dataform port bug found in any of the 5 drifting tables. Diffs are real activity in the skew window:
rpt_agent_boxscore/rpt_agent_riskYTD counters accumulate;rpt_pipeline_healthcontacts move stages (count flaps ±1-2);rpt_transaction_pipeline_summaryaggregates shift;rpt_weekly_winsWHERE report_date = MAX(report_date)picks a different daily MM-escalations partition by run time (legacy + Dataform SQL is byte-identical here — verified against both source definitions). - Falsified hypotheses (tested, not assumed): (a)
tenure_daysmonotonic drift — delta=0, computed from fixed enrollment date; (b)rpt_weekly_winsas a distinct port bug — SQL byte-identical, same skew cause viaMAX(report_date). Earlier "5 broken view ports / 3 failure classes" framing was a Phase-1 hypothesis, now falsified. - Control group confirms: the only 2 clean tables (
rpt_consult_metrics,rpt_consults_not_booked) are keyed on immutable historical events. Every drifting table is a live/append aggregate. Exactly the predicted split. - Phase 3 (port
pos_advanced, 24 views): moved 2026-05-20 to Trustworthy Monorepo (umbrella plan) as item #3 — same code path as GitLab hardening #5 ("converge bridge-applied SQL views into Dataform"). The two were always the same problem from different angles; consolidated. - Phases 4-10: out of scope for the umbrella; remain in BQ Bulletproof as future Dataform-reliability work (assertions, lineage, dataset caps, cost monitoring, etc.) once Phase 3 closes via the umbrella.
- Blocked on: Phase 3 scheduling is gated on umbrella sequencing.
Next 3 actions
- Phase 3 scheduling — tracked in the umbrella, not here. Pick a slot in the umbrella's priority queue.
- Optional Phase 2 micro-cleanup: retire the now-orphan
release_config.code_compilation_config.default_schema = "reporting_dataform"fallback ininfra/terraform/bq-bulletproof-dataform/dataform.tf. Separate behavior-change MR. Low priority. Not part of the umbrella. - When Phase 3 closes via the umbrella: restart BQ Bulletproof here for Phases 4-10 (Dataform reliability — assertions, lineage, dataset caps, freshness watchdog, cost monitoring).
Decisions log
- 2026-05-03 — Tool: Dataform, not dbt. All-Google stack reduces auth/billing/debug surface for a one-cook kitchen with Claude as the engineer. Honest tradeoff: Dataform's monitoring story is thinner than dbt+Elementary; mitigation is a custom freshness watchdog (Phase 9). See
dataform/README.md. - 2026-05-03 — Migration is full, not lazy. Every transform gets ported, not "when we touch it." Lazy migration leaves architectural debt indefinitely; bulletproof requires complete coverage.
- 2026-05-03 — Repo location:
~/platform/, not CCPJ. This is infrastructure. Repo Routing rule in~/.claude/CLAUDE.mdupdated to enforce. - 2026-05-18 — Dataform location: eliminate the source-of-truth/mirror split. The pre-monorepo "edit in platform subtree, manually sync to mirror repo" was tech debt; the subtree is now frozen in the archive. Resolution: the GitLab repo
gitlab.com/rylobasic/bq-bulletproof-dataformis the single canonical source (it's what the Dataform service clones); local working copy is a standalone clone at~/rylobasic-bq-bulletproof-dataform/. Verified the frozen archive and mirror are byte-identical first (nothing stranded). Closed the "CI mirror-sync" debt by deleting the split, not building the sync. - 2026-05-18 — Parity methodology: golden frozen-input parity. Root-cause investigation (tested) proved the old harness byte-compared two prod refreshes taken ~21h apart — unsatisfiable for any mutating-input table, never a port bug. The "7-day zero-drift clock" was an unsatisfiable predicate from the start. New harness runs both legacy SQL and Dataform against a single pinned source snapshot and compares outputs; time-invariant, proves the actual goal (logical equivalence). Rejected: co-trigger+skew-gate (still probabilistic, doesn't fix the concept) and split byte/tolerance (weakens "bulletproof" to "close enough").
Phase 11 — SUPERSEDED 2026-05-15 (added 2026-05-03)
Phase 11 originally coordinated the CCPJ→platform strangler-fig migration so no bq-bulletproof code got lost in the sweep. That strangler-fig never ran — the 2026-05-15 monorepo consolidation big-banged all 5 source repos into ~/rylobasic/ instead. lib/bigquery_client.py and all bq-bulletproof code already live under ~/rylobasic/; ~/CCPJ + ~/platform are archived read-only. Phase 11 is closed by supersession, its only live consequence being the removal of the Phase 2 cutover timing gate (see Blocked-on / Next actions). CHECKLIST.md Phase 11 rewritten accordingly (MR !135).
Files in this project
Canonical location since 2026-05-18: ~/rylobasic/infra/bq-bulletproof/ (relocated from the now-frozen ~/_archived-platform-2026-05-15/projects/bq-bulletproof/; old ~/platform/... paths are dead).
~/rylobasic/infra/bq-bulletproof/CONTEXT.md— the why + principles + locked decisions (longer form than this hub)~/rylobasic/infra/bq-bulletproof/CHECKLIST.md— every phase as checkboxes; the working surface (Phase 2 CLOSED 2026-05-20)~/rylobasic/infra/bq-bulletproof/INVENTORY.md— Phase 1 catalog (re-verified 2026-05-18: 81 views; reporting now 7 base tables)~/rylobasic/infra/bq-bulletproof/FINDINGS.md— parity / verification findings log (includes 2026-05-20 Phase 2 closure section)~/rylobasic/infra/terraform/bq-bulletproof-dataform/— live monorepo terraform for the Dataform Repository + nightly release/workflow configs + IAM (ported 2026-05-20 via MR !151; state atgs://pospj-480915-tf-state/bq-bulletproof/)- Dataform models: canonical = GitLab repo
gitlab.com/rylobasic/bq-bulletproof-dataform; local working clone at~/rylobasic-bq-bulletproof-dataform/(standalone, outside the monorepo by necessity — Dataform needs root-levelpackage.json/workflow_settings.yaml). Edit→commit→push there; no more source-of-truth/mirror split.
Open issues
- GitLab #4 (bq-bulletproof Dataform mirror-sync) — closed 2026-05-20 won't-fix-because-eliminated (the manual-sync problem was killed by the 2026-05-15 monorepo consolidation, not by automation).
- GitLab #5 (Converge bridge-applied SQL views into Dataform) — open. Overlaps with Phase 3 path; decide before starting Phase 3.
- Per-phase issues: open as work begins (suggested label:
project/bq-bulletproof).
References
~/CCPJ/projects/pos-cloud-functions/functions/pipeline/BUILD-LOG.md— Layer 1 cap entry (2026-05-03)~/platform/projects/BigQuery-Data-Atlas/— existing lineage docs to be auto-generated from Dataform in Phase 10~/CCPJ/APIs/BigQuery-API-Implementation/docs/BigQuery-API-Reference.md— full BQ endpoint catalog + 11-capability audit that surfaced the gap~/.claude/projects/-Users-rycolston/memory/project_gcp_enhancement_candidates.md— Dataform was item #1 in the 2026-04-26 GCP audit~/.claude/projects/-Users-rycolston/memory/feedback_no_time_estimates.md— locked preference shaping the plan format~/.claude/CLAUDE.mdRepo Routing section (updated 2026-05-03 to canonicalize 5-repo decision tree)
Memory pointer
~/.claude/projects/-Users-rycolston/memory/project_bq_bulletproof.md