Vault¶

Vault is the analytics pipeline. It collects two things — consent events (every banner impression, dialog interaction, and consent decision the widget records) and session data (every HTTP request to the customer's site that the widget served on, derived from CDN logs) — normalises them into ClickHouse, and serves aggregated metrics back to the dashboard.

Vault has a strict write/read split:

vault-ingest (repo vault-ingest, ECS service vault-etl) — write side. Pulls from CDN logs and SQS, normalises, writes ClickHouse.
vault-api — read side. Stateless HTTP API that queries pre-aggregated ClickHouse tables and serves async CSV exports. It does not ingest anything.

Architecture¶

flowchart LR
    Widget[Widget JS] -->|consent events| Edge[edge-api]
    Edge -->|SQS| Ingest[vault-ingest]
    Bunny[Bunny CDN] -->|logging API| Ingest
    CF[CloudFront] -->|S3 logs| Ingest
    Ingest -->|bronze| S3[(S3)]
    Ingest -->|silver| CH[(ClickHouse)]
    CH -->|materialized views| CH
    API[vault-api] -->|reads aggregates| CH
    API -->|exports| PG[(PostgreSQL)]
    API -->|presigned downloads| S3
    Dash[Dashboard / core-api] --> API
    Grafana --> CH

Component	Purpose
vault-api	Hono HTTP API on Node.js. Read-only. Queries ClickHouse aggregate tables, runs the export worker, serves presigned S3 URLs for download. Single static API key (`X-API-Key`) for all callers.
vault-ingest	Long-running TypeScript service. Multiple independent task loops, each polling for work and tracking progress in PostgreSQL. ECS service name is `vault-etl`.
ClickHouse	Holds both silver (append-only ReplacingMergeTree raw tables — `sessions`, `consent_events`) and gold (SummingMergeTree daily aggregates fed by materialized views — `*_daily`). Everything the dashboard reads is in ClickHouse.
S3	Bronze layer + export storage. CDN logs and consent payloads land here as the source of truth; everything else is recomputable from bronze.
PostgreSQL	Operational state only — vault-ingest's per-file ingest progress and the vault-api exports job queue. Not a data layer; no analytics data is stored here.
Grafana	Reads ClickHouse directly for ops dashboards. Does not go through vault-api.

Previous generations deprecated

Earlier architectures used Docker Compose + Portainer, then Docker Swarm + Trino + Iceberg. Both have been fully replaced by the current ClickHouse + ECS setup.

Repository vs. service naming¶

The write-side repo is vault-ingest. The ECS service running it is named vault-etl for historical reasons — Terraform module, ECR repo, SSM prefix, and CloudWatch log groups all use vault-etl. They refer to the same thing. The legacy _node server-setup runbook references the same service.

There is also a vault-worker.yaml Kubernetes manifest in the vault-ingest repo (Scaleway/SCW). It is a legacy deployment artifact — vault-ingest now runs on ECS in eu-central-1.

Data layers (bronze / silver / gold)¶

vault-ingest is built around a deliberate layered model — read README.md and AI_CONTEXT.md in the repo for the design principles.

Layer	Where	Mutability	Purpose
Bronze	S3	Immutable, never rewritten	Source of truth. Raw CDN logs and raw consent payloads. Everything downstream is recomputable from bronze.
Silver	ClickHouse — `sessions`, `consent_events` (ReplacingMergeTree)	Append-only, dedup on merge	Normalised, typed, parsed raw rows. Bots not yet filtered.
Gold	ClickHouse — `*_daily` SummingMergeTree tables (`consent_status_daily`, `consent_dim_daily`, `sessions_daily_filtered`, …)	Append-only, fed by materialized views off the silver tables	What the dashboard actually queries. Pre-rolled-up by day, code, dimension; bot filtering applied at this layer for the `*_filtered` family.

PostgreSQL sits beside this stack as operational state, not as a data layer — it holds vault-ingest's per-file ingest progress and the vault-api export job queue. Nothing analytical lives there.

Backfills are a first-class operation. Drop a partition or reset status='pending' on a row in cdn_ingestion_files/consent_ingestion, and the next ingest loop replays it. ReplacingMergeTree collapses duplicates on silver, and the materialized views re-fire to rebuild gold.

Ingest pipelines¶

vault-ingest is one container running a single task loop, selected by the TASK env var. Each ECS task definition pins a different TASK value, so the cluster is really N copies of the same image doing different jobs. Tasks track progress in PostgreSQL (cdn_ingestion_files, consent_ingestion, bunny_api_pull_state) so they are crash-safe and resumable.

flowchart LR
    subgraph Sessions["Session pipeline"]
        BunnyAPI[Bunny logging API] -->|"bunny-api-logs"| S3a[(S3 bronze<br/>CDN logs)]
        CFOrig[CloudFront] -->|"cloudfront-log-index"| S3a
        S3a -->|"clickhouse-ingest"| CHa[(ClickHouse<br/>sessions)]
    end
    subgraph Consent["Consent pipeline"]
        EdgeAPI[edge-api] -->|"SQS"| Q[(SQS<br/>eu + aws)]
        Q -->|"consent-sqs-ingest"| S3b[(S3 bronze<br/>consent payloads)]
        S3b -->|"consent-clickhouse-ingest"| CHb[(ClickHouse<br/>consent_events)]
    end

Session pipeline (CDN logs → `sessions`)¶

Three tasks cooperate:

bunny-api-logs — pulls Bunny CDN HTTP logs by zone and date via the Bunny Logging API. 25k-line batches by line offset, state in bunny_api_pull_state (offset, complete flag, retry backoff). Only processes past completed days, with a 4 h cooldown after midnight UTC. Writes raw to S3 bronze. Drops Bunny origin-shield entries (HTTP scheme + FR edge location) before upload — they are internal routing, not real traffic.
cloudfront-log-index — every 5 min, lists yesterday/ and today/ prefixes in the CloudFront S3 bucket, copies new files to cloudfront/{YYYY/MM/DD}/ in the bronze bucket, and registers each in cdn_ingestion_files.
clickhouse-ingest — pulls files marked pending from cdn_ingestion_files, processes them line by line, batches 10k rows, inserts into default.sessions. Resumable: tracks last_line per file. Up to 10 retries with exponential backoff (capped at 60 s) before marking permanently failed.

For each line:

Parse — provider-aware. Bunny is tab-separated with a fixed schema; CloudFront has a #Fields: header that must be re-parsed per file because field order varies. Hard-coded field indexes will break.
Filter — origin-shield duplicates again (defence in depth), invalid URIs.
Geo-enrich — IP → country, ASN.
Anonymise — IPv6 truncated to /56. Geo-lookup happens before anonymisation.
UA-parse — ua-parser-js for device / OS / browser / browser context.
Insert — append to default.sessions (silver).

Consent pipeline (SQS → `consent_events`)¶

Two tasks, with SQS as the durable buffer between the edge and storage:

consent-sqs-ingest — long-poll SQS (default 20 s, 60 s visibility timeout), buffer up to CONSENT_BATCH_SIZE (default 1000) messages or CONSENT_FLUSH_AGE_MS (default 30 s), gzip, write to S3 bronze, then delete from SQS. One task instance per queue — there are separate queues per region (eu, aws) and the env var CONSENT_SQS_NAME labels which one this instance owns. The S3 prefix and DB rows include the label so they don't collide.
consent-clickhouse-ingest — same pattern as session ingest: poll consent_ingestion for pending files, parse, transform, insert into default.consent_events.

The consent parser supports three payload formats, detected at parse time:

Version	Format	Source	Notes
V1	flat CSV	legacy widget	Historical only. CSV streamed via `csv-parse`.
V2	JSON `{payload: {a, d, u, s}, headers}`	legacy edge-api	Carries a numeric `domainId` instead of the 8-char `code`. The mapper resolves it via a domain-lookup service.
V3	JSON with `event_type`, `user_token`, `code` at the root	current edge-api	The format produced by edge-api `/v3` endpoints.

V2 and V3 payloads also carry IAB compliance strings (tc_string, gpp_string), category/vendor/service consent + disclosure arrays, and the parsed user-agent components.

Storage layer¶

Silver (ClickHouse — raw)¶

Table	Engine	Partition	Order	TTL	Source
`default.sessions`	ReplacingMergeTree(`imported_at`)	`request_date`	`(code, request_date, ...)`	90 days	`clickhouse-ingest`
`default.consent_events`	ReplacingMergeTree()	`toYYYYMM(event_time)`	`(code, user_token, event_time)`	—	`consent-clickhouse-ingest`

DDL lives in vault-ingest/sql/. ClickHouse's ReplacingMergeTree merge collapses rows with identical sort keys, so re-ingesting the same line on retry is safe.

Gold (ClickHouse — daily rollups)¶

The dashboard never reads raw consent_events — it queries pre-aggregated *_daily tables fed by materialized views attached to the silver tables. This is what makes vault-api fast: the work happens at write-time inside ClickHouse, not at read-time inside vault-api.

Table	Grain	Fed by	Used for
`consent_status_daily`	day × code × hostname × path × country	`consent_status_daily_mv`	`/consent/status` (all-allowed / all-denied / partial)
`consent_category_daily`	day × code × ... × category	`consent_category_*_mv`	`/consent/categories`
`consent_dim_daily`	day × code × dimension × value	8 dimension MVs	`/consent/dimensions`
`consent_dim_category_daily`	day × code × dimension × value × category	8 MVs	`/consent/dimension-categories`
`consent_funnel_daily`	day × code × event_type	`consent_funnel_daily_mv`	`/consent/funnel`
`sessions_daily`	day × code	upstream	`/sessions/overview` (pre-cutover)
`sessions_daily_filtered`	day × code	upstream + bot filter	`/sessions/overview` (post-cutover)
`sessions_dim_daily` / `_filtered`	day × code × dimension	upstream	`/sessions/timeline` distributions

All aggregate tables are SummingMergeTree, so concurrent inserts from materialized views are safe — duplicate (sort-key, partition) rows sum on merge. Percentages are never computed in SQL; vault-api derives them in app code via round4() after fetching raw counts. Missing days are zero-filled by LEFT JOIN numbers().

The "all-allowed / all-denied / partial" classification rule lives identically in every relevant MV: necessary is excluded, then arrayAll / arrayExists over category_consented decides the bucket.

The *_daily aggregate tables are populated by ClickHouse materialized views attached to the silver tables — they fire on every insert, with no separate scheduling. (vault-ingest does not own this DDL today; the views were created by the upstream consent-context system. They feed off the same consent_events rows vault-ingest writes.)

Bot filtering¶

Bot filtering applies only to the session side. Consent events are not bot-filtered today.

Two signals combine, both in ClickHouse:

ASN dictionary — ip_asn table loaded from MaxMind GeoLite2-ASN, classified into hosting / proxy / vpn / isp / business / education / unknown via the manually-curated asn_classification override table, then promoted into ip_asn_dict (an IP_TRIE dictionary for O(1) CIDR lookup). Refreshed monthly by the refresh-asn task — a one-shot, requires MAXMIND_LICENSE_KEY, schedule it as an ECS scheduled task. See vault-ingest/_readme/maxmind.md.
UA dominance heuristic — bot_ua_flags flags any user-agent that accounts for ≥30 % of a (code, day)'s traffic, with a 100-session circuit breaker to ignore tiny domains. Promoted to bot_ua_dict for fast joins from MVs.

A session is excluded from billing if either filter matches.

The 2026-05-01 cutover¶

vault-api endpoints that compute session counts switch tables based on the date range:

Range entirely before 2026-05-01 → sessions_daily (unfiltered).
Range entirely after → sessions_daily_filtered (bot-excluded).
Range straddling the cutoff → UNION ALL across both tables, with date predicates carving the split.

This is in vault-api/src/routes/sessions.ts. If you're staring at a session-count discrepancy across the cutoff, this is almost always why.

vault-api: read side¶

Authentication¶

Single static X-API-Key (env var VAULT_API_KEY). All callers — dashboard, core-api, internal tools — share the same key. There is no per-customer authentication: scoping is purely by the code parameter in each request. A client with the API key can read any customer's data.

The exceptions are /health, /docs, and /openapi.json, which are unauthenticated. The ALB target group health check uses /health, which pings ClickHouse with SELECT 1.

Query pattern¶

Every analytics endpoint follows the same shape:

Validate request via Zod (date range ≤ 90 days, code matches ^[a-zA-Z0-9]{8}$).
Build a parameterised ClickHouse SQL query against the relevant *_daily aggregate. Filters (hostname, path_segment, country) become WHERE clauses. Use {name:Type} parameter binding — no string interpolation.
Run two queries in parallel: a totals roll-up and a daily breakdown.
Zero-fill missing days via LEFT JOIN numbers().
Compute percentages in Node, return JSON.
No caching layer in front of ClickHouse — every request hits the cluster.

CLICKHOUSE_MAX_EXECUTION_TIME (default 10 s) caps individual query runtime; on timeout the API returns 503 database_error.

Async exports¶

CSV exports of raw consent_events for legal / GDPR requests are too slow for a sync HTTP request, so they're queued. The flow:

sequenceDiagram
    participant C as Client
    participant API as vault-api
    participant PG as PostgreSQL
    participant W as export worker
    participant CH as ClickHouse
    participant S3 as S3

    C->>API: POST /exports {code, from, to}
    API->>PG: INSERT consent_exports status=pending
    API-->>C: { id, status: "pending" }
    loop every EXPORTS_WORKER_INTERVAL_MS
        W->>PG: claim oldest pending job
        loop for each day in range
            W->>CH: INSERT INTO FUNCTION s3(daily.csv) SELECT ...
        end
        W->>S3: stream-zip per-day CSVs into final ZIP
        W->>S3: delete temp CSVs
        W->>PG: status=ready, size_bytes, entries
    end
    C->>API: GET /exports/{id}
    API-->>C: { status: "ready" }
    C->>API: POST /exports/{id}/download
    API->>S3: presign GET (TTL 900s)
    API-->>C: { url }

Job state lives in the PostgreSQL consent_exports table (status 0=pending, 1=running, 4=ready, 9=failed). The worker is part of the vault-api process — gated by EXPORTS_WORKER_ENABLED=true so you can run a no-worker replica for read traffic only. Up to EXPORTS_WORKER_MAX_ATTEMPTS (default 3) retries before giving up. The CSV is built ClickHouse-side via INSERT INTO FUNCTION s3(...) so vault-api never streams large result sets through Node.

Ranges are capped at 31 days for export jobs vs. 90 days for analytics queries.

HTTP API surface (vault-api)¶

All routes require X-API-Key unless noted.

Group	Endpoint	Purpose
Health	`GET /health` (no auth)	ALB health check; pings ClickHouse
Health	`GET /docs`, `GET /openapi.json` (no auth)	Auto-generated OpenAPI 3.1 from Zod schemas
Sessions	`POST /sessions/overview`	Daily session counts for one or many codes
Sessions	`POST /sessions/timeline`	Daily timeline + distributions across 5 dimensions
Consent	`POST /consent/status`	All-allowed / all-denied / partial breakdown
Consent	`POST /consent/categories`	Per-category disclosed vs consented + consent_rate
Consent	`POST /consent/dimensions`	Single-dimension breakdown
Consent	`POST /consent/dimension-categories`	Per-category rates per dimension value
Consent	`POST /consent/funnel`	Event funnel by `event_type`
Consent	`POST /consent/churn`	Users who changed consent over time
Consent	`POST /consent/matrix`	Cross-tab of two dimensions
Consent	`POST /consent/compliance`	Disclosure audit, GPP signal impact
Consent (raw)	`POST /consent/token-lookup`	Raw events for a `user_token` (legal/GDPR lookup)
Consent (raw)	`POST /consent/events`	Paginated raw events, up to 50k per page
Consent (raw)	`POST /consent/latest`	Most recent N raw events for a code
Exports	`POST /exports`	Queue an export job (≤31 days)
Exports	`GET /exports/{id}`	Poll export status
Exports	`POST /exports/{id}/download`	Get presigned S3 URL (900 s TTL)

Endpoints under "Consent (raw)" are the only ones that hit consent_events FINAL directly. Everything else queries the daily aggregate tables.

Error responses¶

Status	Code	When
400	`invalid_request`	Zod validation failed (`details[]` per field)
401	`unauthorized`	Missing or wrong `X-API-Key`
404	`not_found`	Unknown route
503	`database_error`	ClickHouse query failed or timed out
500	`internal_error`	Unhandled exception

All log lines are JSON to stdout (request_id, timing_ms, status_code, endpoint, plus per-route metrics). No external APM in either repo — operability is via stdout + ClickHouse logs + PostgreSQL state tables.

Infrastructure¶

Production¶

Component	Details
ECS cluster	`prod-euc1-vault-ecs-cluster` (c7i.xlarge)
vault-api	ECS service on internal ALB (`vault-api.internal.cookiehub.net`)
vault-etl	ECS tasks (one per `TASK` value: `clickhouse-ingest`, `bunny-api-logs`, `cloudfront-log-index`, `consent-sqs-ingest` × N queues, `consent-clickhouse-ingest`). Long-running, plus `refresh-asn` as a scheduled task.
ClickHouse	EC2 m7i.xlarge with 1 TB EBS data volume
Grafana	ECS task, port 3000, accessible via port forwarding
Database	PostgreSQL RDS (`prod-euc1-core-postgres`) — used for ingest state + export queue, not analytics
ALB	Internal only — not publicly accessible

Stage¶

Component	Details
Database	PostgreSQL RDS (`stage-euc1-core-postgres`)

Access¶

Production vault services are behind the internal ALB — accessible only via VPN or SSM port forwarding.

Grafana: see Port forwarding runbook
ClickHouse: see Port forwarding runbook

Secrets¶

Stored in SSM under /{group}/{env}/vault-api/ and /{group}/{env}/etl/ (the ETL prefix is the legacy name; same as vault-ingest):

Parameter	Description
`clickhouse_url`	ClickHouse HTTP endpoint
`clickhouse_username` / `clickhouse_password`	ClickHouse auth
`vault_api_key`	Static `X-API-Key` shared by all callers
`vault_ingest_config`	JSON blob merged into vault-ingest's env at startup. Per-instance task env vars take precedence — this is for shared defaults.

Operational notes¶

No global retries on permanent ingest failures. Files past MAX_ERROR_COUNT (10) sit in status=failed until someone manually resets them.
The Bunny wrap-around guard is in-memory only. vault-ingest detects Bunny logging-API repeats by comparing the first line of the next batch to the last batch's first line — but that map is per-process. After a restart, the first batch can re-upload tail-of-day data. ReplacingMergeTree dedupes it, so it's an efficiency loss not a correctness bug.
ASN refresh is monthly, manual to schedule. Run TASK=refresh-asn as a one-shot. If it stops happening, the ASN dictionary stales and bot filtering quality drifts.
Backfills. To rebuild a day: drop the relevant partition from the silver and gold tables in ClickHouse, then reset status='pending' on the relevant rows of cdn_ingestion_files or consent_ingestion. The next ingest loop replays silver, and the materialized views rebuild gold from there. ReplacingMergeTree handles the silver dedup.
Domain-code resolution for V2 consent. V2 payloads carry domainId, not code. The mapper resolves it via vault-ingest/src/tasks/consentClickhouseIngest/domainLookup.ts. If that lookup ever returns nothing for a domainId, the row is skipped — check the lookup service first when V2 events disappear.

Dependencies¶

edge-api — produces consent events into the SQS queues vault-ingest reads.
Bunny CDN logging API + CloudFront S3 logs — session source data.
MaxMind GeoLite2-ASN — bot-filter reference data, refreshed monthly.
ClickHouse — primary analytics store.
PostgreSQL RDS — ingest progress + export job queue.
core-api / dashboard — only known consumers of vault-api.