Skip to content

Vault

Vault is the analytics pipeline. It collects two things — consent events (every banner impression, dialog interaction, and consent decision the widget records) and session data (every HTTP request to the customer's site that the widget served on, derived from CDN logs) — normalises them into ClickHouse, and serves aggregated metrics back to the dashboard.

Vault has a strict write/read split:

  • vault-ingest (repo vault-ingest, ECS service vault-etl) — write side. Pulls from CDN logs and SQS, normalises, writes ClickHouse.
  • vault-api — read side. Stateless HTTP API that queries pre-aggregated ClickHouse tables and serves async CSV exports. It does not ingest anything.

Architecture

flowchart LR
    Widget[Widget JS] -->|consent events| Edge[edge-api]
    Edge -->|SQS| Ingest[vault-ingest]
    Bunny[Bunny CDN] -->|logging API| Ingest
    CF[CloudFront] -->|S3 logs| Ingest
    Ingest -->|bronze| S3[(S3)]
    Ingest -->|silver| CH[(ClickHouse)]
    CH -->|materialized views| CH
    API[vault-api] -->|reads aggregates| CH
    API -->|exports| PG[(PostgreSQL)]
    API -->|presigned downloads| S3
    Dash[Dashboard / core-api] --> API
    Grafana --> CH
Component Purpose
vault-api Hono HTTP API on Node.js. Read-only. Queries ClickHouse aggregate tables, runs the export worker, serves presigned S3 URLs for download. Single static API key (X-API-Key) for all callers.
vault-ingest Long-running TypeScript service. Multiple independent task loops, each polling for work and tracking progress in PostgreSQL. ECS service name is vault-etl.
ClickHouse Holds both silver (append-only ReplacingMergeTree raw tables — sessions, consent_events) and gold (SummingMergeTree daily aggregates fed by materialized views — *_daily). Everything the dashboard reads is in ClickHouse.
S3 Bronze layer + export storage. CDN logs and consent payloads land here as the source of truth; everything else is recomputable from bronze.
PostgreSQL Operational state only — vault-ingest's per-file ingest progress and the vault-api exports job queue. Not a data layer; no analytics data is stored here.
Grafana Reads ClickHouse directly for ops dashboards. Does not go through vault-api.

Previous generations deprecated

Earlier architectures used Docker Compose + Portainer, then Docker Swarm + Trino + Iceberg. Both have been fully replaced by the current ClickHouse + ECS setup.

Repository vs. service naming

The write-side repo is vault-ingest. The ECS service running it is named vault-etl for historical reasons — Terraform module, ECR repo, SSM prefix, and CloudWatch log groups all use vault-etl. They refer to the same thing. The legacy _node server-setup runbook references the same service.

There is also a vault-worker.yaml Kubernetes manifest in the vault-ingest repo (Scaleway/SCW). It is a legacy deployment artifact — vault-ingest now runs on ECS in eu-central-1.

Data layers (bronze / silver / gold)

vault-ingest is built around a deliberate layered model — read README.md and AI_CONTEXT.md in the repo for the design principles.

Layer Where Mutability Purpose
Bronze S3 Immutable, never rewritten Source of truth. Raw CDN logs and raw consent payloads. Everything downstream is recomputable from bronze.
Silver ClickHouse — sessions, consent_events (ReplacingMergeTree) Append-only, dedup on merge Normalised, typed, parsed raw rows. Bots not yet filtered.
Gold ClickHouse — *_daily SummingMergeTree tables (consent_status_daily, consent_dim_daily, sessions_daily_filtered, …) Append-only, fed by materialized views off the silver tables What the dashboard actually queries. Pre-rolled-up by day, code, dimension; bot filtering applied at this layer for the *_filtered family.

PostgreSQL sits beside this stack as operational state, not as a data layer — it holds vault-ingest's per-file ingest progress and the vault-api export job queue. Nothing analytical lives there.

Backfills are a first-class operation. Drop a partition or reset status='pending' on a row in cdn_ingestion_files/consent_ingestion, and the next ingest loop replays it. ReplacingMergeTree collapses duplicates on silver, and the materialized views re-fire to rebuild gold.

Ingest pipelines

vault-ingest is one container running a single task loop, selected by the TASK env var. Each ECS task definition pins a different TASK value, so the cluster is really N copies of the same image doing different jobs. Tasks track progress in PostgreSQL (cdn_ingestion_files, consent_ingestion, bunny_api_pull_state) so they are crash-safe and resumable.

flowchart LR
    subgraph Sessions["Session pipeline"]
        BunnyAPI[Bunny logging API] -->|"bunny-api-logs"| S3a[(S3 bronze<br/>CDN logs)]
        CFOrig[CloudFront] -->|"cloudfront-log-index"| S3a
        S3a -->|"clickhouse-ingest"| CHa[(ClickHouse<br/>sessions)]
    end
    subgraph Consent["Consent pipeline"]
        EdgeAPI[edge-api] -->|"SQS"| Q[(SQS<br/>eu + aws)]
        Q -->|"consent-sqs-ingest"| S3b[(S3 bronze<br/>consent payloads)]
        S3b -->|"consent-clickhouse-ingest"| CHb[(ClickHouse<br/>consent_events)]
    end

Session pipeline (CDN logs → sessions)

Three tasks cooperate:

  • bunny-api-logs — pulls Bunny CDN HTTP logs by zone and date via the Bunny Logging API. 25k-line batches by line offset, state in bunny_api_pull_state (offset, complete flag, retry backoff). Only processes past completed days, with a 4 h cooldown after midnight UTC. Writes raw to S3 bronze. Drops Bunny origin-shield entries (HTTP scheme + FR edge location) before upload — they are internal routing, not real traffic.
  • cloudfront-log-index — every 5 min, lists yesterday/ and today/ prefixes in the CloudFront S3 bucket, copies new files to cloudfront/{YYYY/MM/DD}/ in the bronze bucket, and registers each in cdn_ingestion_files.
  • clickhouse-ingest — pulls files marked pending from cdn_ingestion_files, processes them line by line, batches 10k rows, inserts into default.sessions. Resumable: tracks last_line per file. Up to 10 retries with exponential backoff (capped at 60 s) before marking permanently failed.

For each line:

  1. Parse — provider-aware. Bunny is tab-separated with a fixed schema; CloudFront has a #Fields: header that must be re-parsed per file because field order varies. Hard-coded field indexes will break.
  2. Filter — origin-shield duplicates again (defence in depth), invalid URIs.
  3. Geo-enrich — IP → country, ASN.
  4. Anonymise — IPv6 truncated to /56. Geo-lookup happens before anonymisation.
  5. UA-parse — ua-parser-js for device / OS / browser / browser context.
  6. Insert — append to default.sessions (silver).

Two tasks, with SQS as the durable buffer between the edge and storage:

  • consent-sqs-ingest — long-poll SQS (default 20 s, 60 s visibility timeout), buffer up to CONSENT_BATCH_SIZE (default 1000) messages or CONSENT_FLUSH_AGE_MS (default 30 s), gzip, write to S3 bronze, then delete from SQS. One task instance per queue — there are separate queues per region (eu, aws) and the env var CONSENT_SQS_NAME labels which one this instance owns. The S3 prefix and DB rows include the label so they don't collide.
  • consent-clickhouse-ingest — same pattern as session ingest: poll consent_ingestion for pending files, parse, transform, insert into default.consent_events.

The consent parser supports three payload formats, detected at parse time:

Version Format Source Notes
V1 flat CSV legacy widget Historical only. CSV streamed via csv-parse.
V2 JSON {payload: {a, d, u, s}, headers} legacy edge-api Carries a numeric domainId instead of the 8-char code. The mapper resolves it via a domain-lookup service.
V3 JSON with event_type, user_token, code at the root current edge-api The format produced by edge-api /v3 endpoints.

V2 and V3 payloads also carry IAB compliance strings (tc_string, gpp_string), category/vendor/service consent + disclosure arrays, and the parsed user-agent components.

Storage layer

Silver (ClickHouse — raw)

Table Engine Partition Order TTL Source
default.sessions ReplacingMergeTree(imported_at) request_date (code, request_date, ...) 90 days clickhouse-ingest
default.consent_events ReplacingMergeTree() toYYYYMM(event_time) (code, user_token, event_time) consent-clickhouse-ingest

DDL lives in vault-ingest/sql/. ClickHouse's ReplacingMergeTree merge collapses rows with identical sort keys, so re-ingesting the same line on retry is safe.

Gold (ClickHouse — daily rollups)

The dashboard never reads raw consent_events — it queries pre-aggregated *_daily tables fed by materialized views attached to the silver tables. This is what makes vault-api fast: the work happens at write-time inside ClickHouse, not at read-time inside vault-api.

Table Grain Fed by Used for
consent_status_daily day × code × hostname × path × country consent_status_daily_mv /consent/status (all-allowed / all-denied / partial)
consent_category_daily day × code × ... × category consent_category_*_mv /consent/categories
consent_dim_daily day × code × dimension × value 8 dimension MVs /consent/dimensions
consent_dim_category_daily day × code × dimension × value × category 8 MVs /consent/dimension-categories
consent_funnel_daily day × code × event_type consent_funnel_daily_mv /consent/funnel
sessions_daily day × code upstream /sessions/overview (pre-cutover)
sessions_daily_filtered day × code upstream + bot filter /sessions/overview (post-cutover)
sessions_dim_daily / _filtered day × code × dimension upstream /sessions/timeline distributions

All aggregate tables are SummingMergeTree, so concurrent inserts from materialized views are safe — duplicate (sort-key, partition) rows sum on merge. Percentages are never computed in SQL; vault-api derives them in app code via round4() after fetching raw counts. Missing days are zero-filled by LEFT JOIN numbers().

The "all-allowed / all-denied / partial" classification rule lives identically in every relevant MV: necessary is excluded, then arrayAll / arrayExists over category_consented decides the bucket.

The *_daily aggregate tables are populated by ClickHouse materialized views attached to the silver tables — they fire on every insert, with no separate scheduling. (vault-ingest does not own this DDL today; the views were created by the upstream consent-context system. They feed off the same consent_events rows vault-ingest writes.)

Bot filtering

Bot filtering applies only to the session side. Consent events are not bot-filtered today.

Two signals combine, both in ClickHouse:

  1. ASN dictionaryip_asn table loaded from MaxMind GeoLite2-ASN, classified into hosting / proxy / vpn / isp / business / education / unknown via the manually-curated asn_classification override table, then promoted into ip_asn_dict (an IP_TRIE dictionary for O(1) CIDR lookup). Refreshed monthly by the refresh-asn task — a one-shot, requires MAXMIND_LICENSE_KEY, schedule it as an ECS scheduled task. See vault-ingest/_readme/maxmind.md.
  2. UA dominance heuristicbot_ua_flags flags any user-agent that accounts for ≥30 % of a (code, day)'s traffic, with a 100-session circuit breaker to ignore tiny domains. Promoted to bot_ua_dict for fast joins from MVs.

A session is excluded from billing if either filter matches.

The 2026-05-01 cutover

vault-api endpoints that compute session counts switch tables based on the date range:

  • Range entirely before 2026-05-01 → sessions_daily (unfiltered).
  • Range entirely after → sessions_daily_filtered (bot-excluded).
  • Range straddling the cutoff → UNION ALL across both tables, with date predicates carving the split.

This is in vault-api/src/routes/sessions.ts. If you're staring at a session-count discrepancy across the cutoff, this is almost always why.

vault-api: read side

Authentication

Single static X-API-Key (env var VAULT_API_KEY). All callers — dashboard, core-api, internal tools — share the same key. There is no per-customer authentication: scoping is purely by the code parameter in each request. A client with the API key can read any customer's data.

The exceptions are /health, /docs, and /openapi.json, which are unauthenticated. The ALB target group health check uses /health, which pings ClickHouse with SELECT 1.

Query pattern

Every analytics endpoint follows the same shape:

  1. Validate request via Zod (date range ≤ 90 days, code matches ^[a-zA-Z0-9]{8}$).
  2. Build a parameterised ClickHouse SQL query against the relevant *_daily aggregate. Filters (hostname, path_segment, country) become WHERE clauses. Use {name:Type} parameter binding — no string interpolation.
  3. Run two queries in parallel: a totals roll-up and a daily breakdown.
  4. Zero-fill missing days via LEFT JOIN numbers().
  5. Compute percentages in Node, return JSON.
  6. No caching layer in front of ClickHouse — every request hits the cluster.

CLICKHOUSE_MAX_EXECUTION_TIME (default 10 s) caps individual query runtime; on timeout the API returns 503 database_error.

Async exports

CSV exports of raw consent_events for legal / GDPR requests are too slow for a sync HTTP request, so they're queued. The flow:

sequenceDiagram
    participant C as Client
    participant API as vault-api
    participant PG as PostgreSQL
    participant W as export worker
    participant CH as ClickHouse
    participant S3 as S3

    C->>API: POST /exports {code, from, to}
    API->>PG: INSERT consent_exports status=pending
    API-->>C: { id, status: "pending" }
    loop every EXPORTS_WORKER_INTERVAL_MS
        W->>PG: claim oldest pending job
        loop for each day in range
            W->>CH: INSERT INTO FUNCTION s3(daily.csv) SELECT ...
        end
        W->>S3: stream-zip per-day CSVs into final ZIP
        W->>S3: delete temp CSVs
        W->>PG: status=ready, size_bytes, entries
    end
    C->>API: GET /exports/{id}
    API-->>C: { status: "ready" }
    C->>API: POST /exports/{id}/download
    API->>S3: presign GET (TTL 900s)
    API-->>C: { url }

Job state lives in the PostgreSQL consent_exports table (status 0=pending, 1=running, 4=ready, 9=failed). The worker is part of the vault-api process — gated by EXPORTS_WORKER_ENABLED=true so you can run a no-worker replica for read traffic only. Up to EXPORTS_WORKER_MAX_ATTEMPTS (default 3) retries before giving up. The CSV is built ClickHouse-side via INSERT INTO FUNCTION s3(...) so vault-api never streams large result sets through Node.

Ranges are capped at 31 days for export jobs vs. 90 days for analytics queries.

HTTP API surface (vault-api)

All routes require X-API-Key unless noted.

Group Endpoint Purpose
Health GET /health (no auth) ALB health check; pings ClickHouse
Health GET /docs, GET /openapi.json (no auth) Auto-generated OpenAPI 3.1 from Zod schemas
Sessions POST /sessions/overview Daily session counts for one or many codes
Sessions POST /sessions/timeline Daily timeline + distributions across 5 dimensions
Consent POST /consent/status All-allowed / all-denied / partial breakdown
Consent POST /consent/categories Per-category disclosed vs consented + consent_rate
Consent POST /consent/dimensions Single-dimension breakdown
Consent POST /consent/dimension-categories Per-category rates per dimension value
Consent POST /consent/funnel Event funnel by event_type
Consent POST /consent/churn Users who changed consent over time
Consent POST /consent/matrix Cross-tab of two dimensions
Consent POST /consent/compliance Disclosure audit, GPP signal impact
Consent (raw) POST /consent/token-lookup Raw events for a user_token (legal/GDPR lookup)
Consent (raw) POST /consent/events Paginated raw events, up to 50k per page
Consent (raw) POST /consent/latest Most recent N raw events for a code
Exports POST /exports Queue an export job (≤31 days)
Exports GET /exports/{id} Poll export status
Exports POST /exports/{id}/download Get presigned S3 URL (900 s TTL)

Endpoints under "Consent (raw)" are the only ones that hit consent_events FINAL directly. Everything else queries the daily aggregate tables.

Error responses

Status Code When
400 invalid_request Zod validation failed (details[] per field)
401 unauthorized Missing or wrong X-API-Key
404 not_found Unknown route
503 database_error ClickHouse query failed or timed out
500 internal_error Unhandled exception

All log lines are JSON to stdout (request_id, timing_ms, status_code, endpoint, plus per-route metrics). No external APM in either repo — operability is via stdout + ClickHouse logs + PostgreSQL state tables.

Infrastructure

Production

Component Details
ECS cluster prod-euc1-vault-ecs-cluster (c7i.xlarge)
vault-api ECS service on internal ALB (vault-api.internal.cookiehub.net)
vault-etl ECS tasks (one per TASK value: clickhouse-ingest, bunny-api-logs, cloudfront-log-index, consent-sqs-ingest × N queues, consent-clickhouse-ingest). Long-running, plus refresh-asn as a scheduled task.
ClickHouse EC2 m7i.xlarge with 1 TB EBS data volume
Grafana ECS task, port 3000, accessible via port forwarding
Database PostgreSQL RDS (prod-euc1-core-postgres) — used for ingest state + export queue, not analytics
ALB Internal only — not publicly accessible

Stage

Component Details
Database PostgreSQL RDS (stage-euc1-core-postgres)

Access

Production vault services are behind the internal ALB — accessible only via VPN or SSM port forwarding.

Secrets

Stored in SSM under /{group}/{env}/vault-api/ and /{group}/{env}/etl/ (the ETL prefix is the legacy name; same as vault-ingest):

Parameter Description
clickhouse_url ClickHouse HTTP endpoint
clickhouse_username / clickhouse_password ClickHouse auth
vault_api_key Static X-API-Key shared by all callers
vault_ingest_config JSON blob merged into vault-ingest's env at startup. Per-instance task env vars take precedence — this is for shared defaults.

Operational notes

  • No global retries on permanent ingest failures. Files past MAX_ERROR_COUNT (10) sit in status=failed until someone manually resets them.
  • The Bunny wrap-around guard is in-memory only. vault-ingest detects Bunny logging-API repeats by comparing the first line of the next batch to the last batch's first line — but that map is per-process. After a restart, the first batch can re-upload tail-of-day data. ReplacingMergeTree dedupes it, so it's an efficiency loss not a correctness bug.
  • ASN refresh is monthly, manual to schedule. Run TASK=refresh-asn as a one-shot. If it stops happening, the ASN dictionary stales and bot filtering quality drifts.
  • Backfills. To rebuild a day: drop the relevant partition from the silver and gold tables in ClickHouse, then reset status='pending' on the relevant rows of cdn_ingestion_files or consent_ingestion. The next ingest loop replays silver, and the materialized views rebuild gold from there. ReplacingMergeTree handles the silver dedup.
  • Domain-code resolution for V2 consent. V2 payloads carry domainId, not code. The mapper resolves it via vault-ingest/src/tasks/consentClickhouseIngest/domainLookup.ts. If that lookup ever returns nothing for a domainId, the row is skipped — check the lookup service first when V2 events disappear.

Dependencies

  • edge-api — produces consent events into the SQS queues vault-ingest reads.
  • Bunny CDN logging API + CloudFront S3 logs — session source data.
  • MaxMind GeoLite2-ASN — bot-filter reference data, refreshed monthly.
  • ClickHouse — primary analytics store.
  • PostgreSQL RDS — ingest progress + export job queue.
  • core-api / dashboard — only known consumers of vault-api.