posthog.tracks — source & the incremental MERGE model

posthog.tracks is the consumer-facing event table (~3.1B rows). It is built by the posthog_tracks dbt incremental model from posthog.raw_tracks, which the hourly posthog-tracks DAG loads from PostHog batch-export parquet in GCS.
Last update: 2026-06-12

TL;DR. Pipeline is PostHog batch export → GCS parquet → load_posthog_tracks.pyposthog.raw_tracks → dbt incremental MERGE (unique_key=event_id) → posthog.tracks. The table is currently degraded / under review: ~28.8M excess rows from duplicated event_ids, surfaced by a source unique test. See 2026-06-12 posthog.tracks duplicate event_id.

Source pipeline

LayerObjectFile / owner
Export (PostHog)batch export → GCS parquetgs://data-platform-dataflow/{env}/posthog_batch/events/{date}/{hour}/*.parquet.sz
Sensor (Airflow)wait_for_posthog_batchdag.py · GCSUploadSessionCompleteSensor (deferrable, 30-min inactivity, 65-min timeout)
Load (GKE Python)posthog.raw_tracksload_posthog_tracks.py · parquet → Pydantic (RawPostHogEvent) → BQ; DELETE-before-APPEND on window_start
Freshness (dbt)source:posthog.raw_tracksdag.py · dbt source freshness (warn 2h / error 4h on bq_ingested_timestamp)
Transform (dbt)posthog_tracksposthog.tracksmodels/posthog_tracks.sql · incremental MERGE, unique_key=event_id ← the consumer table
Test (dbt)source & model tests on tracksdbt/sources/sources.yml + models/posthog_tracks.yml · twice daily (06:00 / 18:00 UTC)

Lineage (amber = degraded / under review)

graph LR bucket["GCS parquet.sz
posthog_batch/events"] loader["load_posthog_tracks.py
→ raw_tracks"] raw["posthog.raw_tracks
2026-03-12 → present"] model["posthog_tracks (dbt)
incremental MERGE on event_id"] tracks["posthog.tracks
~3.1B rows, all history"] backfill["posthog.backfill_tracks
2025-05 → 2026-03-11"] mergebf["merge_backfill_to_tracks.py
day-by-day, insert-only"] bucket --> loader --> raw --> model --> tracks backfill --> mergebf --> tracks classDef partial fill:#fff2cc,stroke:#c90,color:#000 classDef frozen fill:#eaeef2,stroke:#8c959f,color:#57606a class tracks,model partial class backfill,mergebf frozen

The posthog_tracks incremental model

models/posthog_tracks.sql is an incremental merge with unique_key='event_id', full_refresh=false, partitioned by created_at (DAY) and clustered by event_name, user_id. It only processes rows with created_at >= 2026-03-04 (the cutover date when the legacy Dataflow pipeline stopped); earlier history already lives in tracks from the completed backfill.

Columns

ColumnTypeDescription
event_idSTRINGPostHog event UUID, stable across retries. unique_key. Re-emitted by PostHog for months-old events.
created_atTIMESTAMPEvent timestamp (UTC). Partition key. Re-written on re-delivery today (secondary bug; fix = merge_exclude_columns=['created_at']).
event_nameSTRINGPostHog event name. Cluster key.
user_idSTRINGPostHog distinct_id mapped to Moises user. Cluster key.
person_idSTRINGPostHog person UUID.
session_idSTRINGFrom session_id or $session_id.
active_feature_flagsJSONFlags active at event time ($active_feature_flags).
feature_flags_valuesJSONFlag evaluations from $feature/* properties.
lib_valuesJSONPostHog library metadata ($lib*). Growth derives platform from $lib.
device_model / device_type / device_osSTRINGDevice attributes from event properties.
ip_addressSTRINGClient IP ($ip).
clientSTRINGClient/source marker (web / ios / android).
geoip_*STRINGGeoIP-enriched city / continent / country / time zone.
app_version / localeSTRINGFrom $app_version / $locale.
event_payloadJSONFull PostHog properties payload, preserved for downstream analysis.
bq_ingested_timestampTIMESTAMPLoader stamp — one datetime.now() per loader run, copied to every row. MERGE dedup tiebreaker. Not an insert time.
event_inserted_atTIMESTAMPUpstream insert timestamp from the export payload (_inserted_at / inserted_at).

Data tests

Two layers of tests reference tracks:

WhereTestScopeStatus
dbt/sources/sources.ymlunique on event_idcreated_at >= 2026-03-04FAIL — pulled into bq-marketing via +tag:marketing on 2026-06-11.
dbt/sources/sources.ymlnot_null on event_id, created_atcreated_at >= 2026-03-04PASS
dbt/sources/sources.ymlelementary volume anomalies (incl. purchase_flow)14-day training
models/posthog_tracks.ymlmodel-level unique on event_idn/aDisabled — commented out; will be re-enabled (scoped >= 2026-03-04) after dedup.
models/posthog_tracks.ymlelementary volume / dimension / freshness anomalies28-day training

Because the owning model's unique test is commented out, the posthog-tracks DAG never caught the duplicates; the bq-marketing DAG did, once the marketing model added a source('posthog','tracks') dependency. Re-enabling the model-level test (gated on dedup) puts the alarm back on the owning DAG.

Repo files

FilePurpose
dag.pyHourly DAG: sensor → loader → freshness → dbt run → (gated) dbt test.
load_posthog_tracks.pyCore loader: GCS parquet → Pydantic → raw_tracks.
common.pyDAG-layer constants and CLI arg builders; gcs_prefix_template.
models/posthog_tracks.sql / .ymlIncremental MERGE model + tests for tracks.
merge_backfill_to_tracks.pyAd-hoc day-by-day insert-only MERGE of backfill_tracks into tracks (legacy dup contributor).
dedup_tracks.pyOne-time dedup (snapshot + atomic CREATE OR REPLACE). Lives on branch NOREF/posthog-tracks-dedup.
dbt/sources/sources.ymlposthog source declaration + event_id tests.

See the full investigation: 2026-06-12 posthog.tracks duplicate event_id · PostHog Tracks home