Atrax

Atrax is the web scanner — it crawls customer domains to discover cookies, scripts, and tracking technologies. It splits along a strong security boundary: the API controller runs on the trusted core network, while the browser workers run in their own region, in their own VPC, with their own egress IPs.

Architecture

flowchart LR
    subgraph euc1[eu-central-1]
        API[atrax-api]
        DB[(PostgreSQL RDS)]
    end
    subgraph euw1[eu-west-1]
        Node[atrax-node]
    end
    Core[core / dashboard] -->|REST| API
    Node -->|polls jobs / heartbeat| API
    Node -->|browser| Site[Customer site]
    Node -->|POST report + screenshot| API
    API --> DB
    API -->|writes reports + screenshots| S3[(S3)]
    Core -->|fetches reports| API

Component	Region	Purpose
atrax-api	eu-central-1	Hono HTTP controller on Node.js. Receives scan requests, hands jobs to nodes that poll for them, persists results to PostgreSQL, and uploads node-supplied report JSON + screenshots to S3 using its own IAM user.
atrax-node	eu-west-1	Headless Puppeteer worker. Polls atrax-api for jobs, runs untrusted website code in Chromium, POSTs results back to atrax-api (no direct S3 access).

Why the region split

atrax-node executes arbitrary website JavaScript inside Chromium — it's the highest-risk surface in the network. Putting it in a dedicated VPC in a different region (no peering, no shared subnets) means a compromised crawler can't reach the production data plane. The eu-west-1 location is also where Atrax has run historically, so the EIPs assigned to crawler hosts are the ones customers have already added to their bot allow-lists.

Cluster + EC2 shape (atrax-node)

The atrax-node cluster is not ASG-backed. It's a fixed set of EC2 instances in public subnets with one Elastic IP per host attached directly. EIPs are declared in the module but imported per environment; terraform never allocates a new one, because doing so would change the egress IP and break customer allow-lists. Replacing an instance via terraform taint keeps the EIP attached.

The ECS service uses placement_strategy { type = "spread", field = "instanceId" } so when there are multiple hosts and multiple tasks, traffic fans out across all EIPs. CPU and memory limits are set at the container level rather than the task level — this lets a single task burst into the full host capacity when alone, while memoryReservation (the soft limit, used for placement) lets multiple tasks pack onto one host.

How a scan flows through the system

A scan is a long-lived job that bounces between the dashboard, atrax-api, and a node over many minutes. The whole flow is pull-based: atrax-api never opens an outbound connection to a node. Nodes ask for work, report progress, and push results back over HTTPS to the public atrax-api hostname (atrax-api.stage.cookiehub.net on stage, atrax-api.cookiehub.net in production).

sequenceDiagram
    participant Dash as Dashboard / Core
    participant API as atrax-api
    participant DB as PostgreSQL
    participant Node as atrax-node
    participant Site as Customer site
    participant S3

    Dash->>API: POST /api/job  (domain, settings, webhook)
    API->>DB: insert job, status=Pending
    API-->>Dash: { newJob: { id, ... } }

    loop every ~500ms
        Node->>API: GET /api/job/next/{nodeId}
        API->>DB: pick eligible Pending job (load-balanced)
        API-->>Node: { job } or { job: null }
    end

    Node->>API: PATCH /api/job/{id}  status=Crawling
    loop for each page (up to maxPages / maxSeconds)
        Node->>Site: navigate, capture cookies + network
        Note over Node: every 1000 pages → POST /api/job/{id}/report (checkpoint)
    end
    Node->>API: PATCH /api/job/{id}  status=CrawlCompleted

    loop verify pages flagged during crawl
        Node->>Site: replay with consent, check cookie persistence
    end
    Node->>API: PATCH /api/job/{id}  status=VerifyCompleted

    Node->>API: POST /api/job/{id}/screenshot  (binary PNG)
    API->>S3: PUT screenshots/{domain}/{id}.png
    Node->>API: POST /api/job/{id}/report  (full JobStore JSON)
    API->>S3: PUT {express|extended}/{domain}/{id}.json
    Node->>API: PATCH /api/job/{id}  status=Completed, report=summary
    Node->>API: POST /api/notification  (fire customer webhook)

    Dash->>API: GET /api/report/{id}
    API->>S3: GET report
    API-->>Dash: { job, report }

Job lifecycle

Status is a single integer column on the Job row. atrax-api sets only the first two values: Pending when the job is created and Claimed when a node picks up a Pending job. Every later transition is made by the node via PATCH /api/job/:id. atrax-api never advances a job's status beyond that: even when a node goes stale, the API only nulls out nodeId and lets the job be re-claimed at its current status (see Node liveness).

Value	Name	Driver	Meaning
`0`	`Pending`	API	Created by `POST /api/job`. No node assigned.
`1`	`Claimed`	API	A node picked up the job via `GET /api/job/next/:nodeId`. `nodeId` is set.
`2`	`Crawling`	Node	Identification stage — discovering and visiting pages.
`3`	`CrawlCompleted`	Node	Identification finished, verification queued.
`4`	`VerifyCompleted`	Node	Verification finished, finalization (screenshot + report upload) running.
`5`	`Completed`	Node	Terminal success. Full report is in S3, summary is on the job row.
`6`	`Failed`	Node	Terminal failure. Partial report may still be uploaded.

stateDiagram-v2
    [*] --> Pending
    Pending --> Claimed: node calls /job/next
    Claimed --> Crawling: node starts identification
    Crawling --> CrawlCompleted
    CrawlCompleted --> VerifyCompleted: verification pass done
    VerifyCompleted --> Completed: screenshot + report uploaded
    Crawling --> Failed: ≥10 page errors or 100% failure rate
    CrawlCompleted --> Failed
    VerifyCompleted --> Failed
    Completed --> [*]
    Failed --> [*]

The values are defined in atrax-api/src/types/index.ts and atrax/src/types/jobStatus.ts — they must stay in sync between the two repos.
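
As an illustration of the table and state diagram above, the statuses could be modelled as follows. This is a sketch, not the contents of either types file: the helper names `isActive`, `NEXT`, and `canTransition` are hypothetical, and the transition map only encodes the arrows drawn in the diagram.

```typescript
// Status values copied from the table above; keys mirror the "Name" column.
const JobStatus = {
  Pending: 0,
  Claimed: 1,
  Crawling: 2,
  CrawlCompleted: 3,
  VerifyCompleted: 4,
  Completed: 5,
  Failed: 6,
} as const;
type JobStatus = (typeof JobStatus)[keyof typeof JobStatus];

// "Active" as used by dispatch and the sweeper: anything below Completed.
// Failed (6) sorts above Completed (5), so it is not active either.
function isActive(s: JobStatus): boolean {
  return s < JobStatus.Completed;
}

// Forward transitions drawn in the state diagram (hypothetical helper).
const NEXT: Record<JobStatus, readonly JobStatus[]> = {
  [JobStatus.Pending]: [JobStatus.Claimed],
  [JobStatus.Claimed]: [JobStatus.Crawling],
  [JobStatus.Crawling]: [JobStatus.CrawlCompleted, JobStatus.Failed],
  [JobStatus.CrawlCompleted]: [JobStatus.VerifyCompleted, JobStatus.Failed],
  [JobStatus.VerifyCompleted]: [JobStatus.Completed, JobStatus.Failed],
  [JobStatus.Completed]: [],
  [JobStatus.Failed]: [],
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return NEXT[from].includes(to);
}
```

Because the ordering of the integers carries meaning (`status < Completed` selects active jobs), any new status has to be slotted in with that comparison in mind.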

Dispatch and load balancing

atrax-api does not push work — nodes pull it. The dispatch endpoint is GET /api/job/next/:nodeId (atrax-api/src/routes/job.ts), and the algorithm is intentionally simple:

1. Count this node's active jobs (status < Completed).
2. Count active jobs on every other Active node.
3. Compute skip = number of nodes strictly less busy than this one.
4. Find the first job with status < Completed AND nodeId IS NULL, ordered by status DESC, priority DESC, and OFFSET skip.
5. Only assign it if this node has capacity:
   - < 4 active jobs → take any priority
   - 4..7 active jobs → only take jobs with priority >= 7
   - >= 8 → never take

The skip trick is a poor person's load balancer: when the node calling in is busier than its peers, it walks past the first N waiting jobs and picks one further down the queue, leaving the head for less-loaded nodes to grab on their next poll. There is no row-level locking — the assignment is prisma.job.update, last write wins. A simultaneous double-claim is theoretically possible but unlikely given the 500 ms poll cadence and OFFSET skip divergence.

The status DESC ordering is deliberate: jobs already in flight (Crawling, etc.) but orphaned by a stale-node sweep get re-claimed before brand-new Pending work, so half-finished crawls preempt new ones.
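
The dispatch steps can be sketched as a pure function over in-memory data. This is an illustration under assumptions, not the code in atrax-api/src/routes/job.ts: the real endpoint queries PostgreSQL through Prisma, and `QueuedJob`, `computeSkip`, `mayTake`, and `pickNext` are hypothetical names.

```typescript
interface QueuedJob {
  id: string;
  status: number; // 0..4 for active jobs, 5 = Completed, 6 = Failed
  priority: number;
  nodeId: string | null;
}

const COMPLETED = 5;

// skip = number of peer nodes strictly less busy than the caller.
function computeSkip(myActive: number, peersActive: number[]): number {
  return peersActive.filter((n) => n < myActive).length;
}

// Capacity rule: under 4 jobs take anything, 4..7 only priority >= 7, 8+ nothing.
function mayTake(myActive: number, priority: number): boolean {
  if (myActive < 4) return true;
  if (myActive < 8) return priority >= 7;
  return false;
}

function pickNext(
  myActive: number,
  peersActive: number[],
  jobs: QueuedJob[],
): QueuedJob | null {
  const skip = computeSkip(myActive, peersActive);
  // Equivalent of: WHERE status < Completed AND nodeId IS NULL
  //                ORDER BY status DESC, priority DESC OFFSET skip LIMIT 1
  const candidates = jobs
    .filter((j) => j.status < COMPLETED && j.nodeId === null)
    .sort((a, b) => b.status - a.status || b.priority - a.priority);
  const job = candidates[skip];
  if (!job) return null; // fewer unassigned jobs than `skip`
  // The capacity check applies to the one selected job; there is no search
  // for a different job that would pass the priority threshold.
  return mayTake(myActive, job.priority) ? job : null;
}
```

Note the consequence of checking capacity after selection: a node with 4..7 active jobs gets nothing if the job at its offset has priority below 7, even when a higher-priority job sits further down the queue.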

Nodes also call GET /api/job/running/:nodeId?max=4 once on startup to recover any jobs they were working on before a restart. The API filters on status > Pending AND status < Completed AND nodeId = self.

What a scan actually does

The crawler is a single Puppeteer-extra (stealth plugin) process running up to 8 concurrent jobs (atrax/src/crawler/config.ts — MAX_CONCURRENT_JOBS). Each job gets its own browser instance and its own profile directory under ./profiles/{jobId} so cookies and storage don't bleed across jobs.

Two-stage crawl

A job runs in two stages, with the verification pass triggered as a phase change rather than a separate API job:

Identification (status = Crawling). Visits each entry URL, auto-consents to common CMPs, captures cookies/storage/network/screenshot, and walks <a> hrefs that match settings.hosts and the include/exclude filters. Stops when one of maxPages, maxSeconds, or "no new pages discovered" is hit. The first page (in production, non-implementation jobs) is screenshot-captured to ./.temp/{jobId}.png.
Verification (status = CrawlCompleted → VerifyCompleted). Replays only pages flagged verificationNeeded during the first pass. The user-agent flips from CookieHubScan/3.0 to CookieHubVerify/3.0. The crawler grants full CMP consent up front and re-checks whether cookies still appear — those still present after consent is granted are flagged notCompliant.

What gets captured per page

HTTP cookies — via CDP Network.getAllCookies (name, domain, path, expires, httpOnly, secure, session, value).
Storage — localStorage and sessionStorage if settings.storage is true. Values are truncated to 30 chars + ... before leaving the browser, so we never persist full session payloads.
Network — Network.requestWillBeSent and Network.responseReceivedExtraInfo events, capped at 5000 each per page (MAX_REQUESTS / MAX_RESPONSES). Older entries are dropped.
Page metadata — title, meta description, OG tags, documentElement.lang, top 20 distinct non-black/white computed colors, full HTML via page.content().
Tech stack — detectStack() runs against captured HTML + request URLs to label e.g. next.js, wordpress, cookiebot.
Cookiehub-specific — detects /c2/{hash}.js (and legacy /cc/{hash}.js) loads, whether they fired before or after GTM, and Google Consent Mode state (gcd= parameter, GCM v1 vs v2).
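
Two of the capture limits above are simple enough to sketch. The following is illustrative only: `truncateStorageValue` and `pushCapped` are hypothetical names, and it assumes that values at or under 30 characters pass through unchanged (the source only states that values are cut to 30 chars plus `...`).

```typescript
const STORAGE_VALUE_MAX = 30; // documented storage value limit
const MAX_REQUESTS = 5000; // documented per-page cap for network events

// Assumption: only values longer than the limit get the "..." suffix.
function truncateStorageValue(value: string): string {
  return value.length > STORAGE_VALUE_MAX
    ? value.slice(0, STORAGE_VALUE_MAX) + "..."
    : value;
}

// Append to a per-page buffer, dropping the oldest entries once the cap is exceeded.
function pushCapped<T>(buffer: T[], item: T, cap: number = MAX_REQUESTS): void {
  buffer.push(item);
  while (buffer.length > cap) buffer.shift();
}
```

Truncating inside the browser context, before the value is serialised back to the node, is what keeps full session payloads out of the report.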

On the entry URL of stage 1, the node calls into the page to grant consent through whichever CMP it detects: Cookiebot, OneTrust, Didomi, Iubenda, UserCentrics, and ~8 others (atrax/src/crawler/services/browser/handlers/). This is intentional: we want to see what tracking the site fires after a user clicks "accept all", which is when most regulators care about the cookie inventory. Stage 2 also grants consent and uses persistence as the compliance signal.

The crawler does not click visual banner buttons — it calls the JS APIs directly. If a site uses a custom CMP we don't recognise, we crawl in the un-consented state, which can hide cookies that only fire post-consent.

Per-page failure handling

Single-page navigation has two timeouts that fight each other: a 25 s wrapper around page.goto and Puppeteer's own 40 s internal timeout. The wrapper wins in practice. On entry-URL failure only, the node retries https://www.X and then http://X before giving up; intermediate-page failures are not retried.
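
A minimal sketch of the wrapper-plus-fallback pattern described above, assuming the wrapper is a `Promise.race` against a timer. The names `withTimeout` and `entryUrlCandidates` are hypothetical, and the sketch assumes the first attempt is `https://X` before the two documented retries.

```typescript
const NAV_WRAPPER_TIMEOUT_MS = 25_000;

// Reject if `work` has not settled within `ms`.
// The underlying operation is not cancelled; it simply loses the race.
function withTimeout<T>(work: Promise<T>, ms: number = NAV_WRAPPER_TIMEOUT_MS): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`navigation timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Entry-URL attempts in order. Assumption: the first attempt is https://X,
// followed by the two documented retries.
function entryUrlCandidates(host: string): string[] {
  return [`https://${host}`, `https://www.${host}`, `http://${host}`];
}
```

With Puppeteer this would wrap a call such as `withTimeout(page.goto(url, { timeout: 40_000 }))`. Since the 25 s timer always fires first, the 40 s inner timeout acts only as a backstop.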

A whole job is marked Failed only when:

≥ 10 pages have timed out or returned 5xx and the current page also times out or returns 5xx, or
100 % of attempted pages errored.

Any other page failure is just logged on the page record. Bugsnag receives the top-level exceptions; per-page errors are not reported to it.
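
The two job-level failure conditions can be written as predicates. This is a sketch under assumptions: the function and field names are hypothetical, and the source does not say when the 100 % rule is evaluated, so the second predicate is assumed to run once the crawl stops.

```typescript
interface CrawlStats {
  attempted: number; // pages attempted so far
  errored: number; // of those, pages that timed out or returned 5xx
}

const ERROR_PAGE_THRESHOLD = 10;

// Rule 1: called when the current page has just timed out or returned 5xx.
// `erroredBeforeThisPage` counts earlier failures, not the current one.
function failsOnPageError(erroredBeforeThisPage: number): boolean {
  return erroredBeforeThisPage >= ERROR_PAGE_THRESHOLD;
}

// Rule 2: every attempted page errored.
// Assumption: evaluated when the crawl stops, not after each page.
function failsOnTotalFailure(stats: CrawlStats): boolean {
  return stats.attempted > 0 && stats.errored === stats.attempted;
}
```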

Resource and content limits

Limit	Value	Source
Concurrent jobs per node	8	`MAX_CONCURRENT_JOBS`
Startup-time job recovery	up to 4	`GET /api/job/running/:nodeId?max=4`
Page navigation timeout	25 s wrapper / 40 s puppeteer	`navigationHandler.ts`
Storage value capture	first 30 chars + `...`	`cookieHandler.ts`
Network requests stored	5000 / page	`MAX_REQUESTS`
Network responses stored	5000 / page	`MAX_RESPONSES`
Link length filter	255 chars	`cdpHandler.ts`
Detected colours	top 20	`metaHandler.ts`
Checkpoint upload cadence	every 1000 pages	`storageHandler.ts`
Profile cleanup (legacy)	every 4 h via `mycron`	container cron, profiles > 1 day old

Results and storage

The node never talks to S3 directly. It POSTs results to atrax-api and atrax-api uses its {env}-euc1-atrax-api-s3 IAM user to PUT to S3. This is the security boundary: the eu-west-1 crawler hosts run untrusted website JS but hold no cloud credentials beyond their controller bearer token.

Endpoint	Bucket	Key	Trigger
`POST /api/job/:id/report` (JSON body)	express or extended bucket (`atrax-stage-{express,extended}` on stage, `cookiehub-atrax-{express,extended}` on prod)	`{domain}/{jobId}.json`	Checkpoint every 1000 pages, plus once on completion
`POST /api/job/:id/screenshot` (binary, `Content-Type: image/png`)	screenshots bucket (`atrax-stage-screenshots` on stage, `cookiehub-atrax-screenshots` on prod)	`{domain}/{jobId}.png`	After verification stage, production non-implementation jobs only
`PATCH /api/job/:id` `{ status, report, progress }`	(PostgreSQL only)	—	Phase transitions and the final summary

Bucket selection is by settings.type: jobs with type === "extended" go to the extended bucket, anything else (including express) goes to express. The extended report retains the full per-page detail; the express report is the smaller summary used by the public-facing scan widget.
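
The routing rule is small enough to state as code. This is a sketch, not the atrax-api implementation: `reportKind` and `reportKey` are hypothetical names, and mapping the result to a concrete bucket name per environment is left out.

```typescript
type ReportKind = "express" | "extended";

// Only the literal "extended" selects the extended bucket;
// a missing or unknown type falls through to express.
function reportKind(settingsType: string | undefined): ReportKind {
  return settingsType === "extended" ? "extended" : "express";
}

// Object key layout from the table above: {domain}/{jobId}.json
function reportKey(domain: string, jobId: string): string {
  return `${domain}/${jobId}.json`;
}
```

Defaulting to express means a job with a misspelled or absent type lands in the bucket that holds the smaller summary reports.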

The summary stored on the Job.report JSONB column (separate from the S3 file) carries the data the dashboard uses for the result list — risk score, cookie counts per category, GCM/TCF state, screenshot URL, detected stack — so the dashboard can render the index view without an S3 round-trip. Full per-page detail still requires GET /api/report/:id, which fetches from S3.

The screenshot bucket is public-read (it backs <img> tags in the result UI); the express and extended buckets are private with KMS encryption. Lifecycle policies (prod) are listed under S3 bucket lifecycle.

Node liveness and stale cleanup

Each node calls POST /api/node/:id/heartbeat every 30 s (HEARTBEAT_INTERVAL_MS). The endpoint sets seenAt = now and status = Active on the node row.

atrax-api runs an internal sweeper on a setInterval (atrax-api/src/scheduler.ts → handleStaleNodes) every 5 minutes. For any Active node whose seenAt is older than NODE_STALE_MS (default 1 hour), the sweeper:

Sets node.status = InActive.
Sets nodeId = null on every job belonging to that node where status < Completed.

The job's status is left as-is. So a node that died mid-Crawling leaves the job at status Crawling, nodeId=null, and on the next dispatch round, that job gets picked first (because ORDER BY status DESC puts in-flight statuses ahead of Pending). The new node then continues from where the previous one left off — possible because the previous node had been writing JobStore checkpoints to S3 every 1000 pages, which the new node fetches via GET /api/job/:id/checkpoint.
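
The sweep can be sketched over in-memory rows. This is an illustration, not the code in atrax-api/src/scheduler.ts: the real sweeper updates PostgreSQL, and `NodeRow`, `JobRow`, and `sweepStaleNodes` are hypothetical names.

```typescript
interface NodeRow {
  id: string;
  status: "Active" | "InActive";
  seenAt: number; // epoch milliseconds of the last heartbeat
}

interface JobRow {
  id: string;
  status: number; // 5 = Completed, 6 = Failed
  nodeId: string | null;
}

const NODE_STALE_MS = 60 * 60 * 1000; // documented default: 1 hour
const COMPLETED_STATUS = 5;

// Marks stale nodes InActive and orphans their unfinished jobs.
// Returns the ids of the nodes that were swept.
function sweepStaleNodes(nodes: NodeRow[], jobs: JobRow[], now: number): string[] {
  const stale = nodes.filter(
    (n) => n.status === "Active" && now - n.seenAt > NODE_STALE_MS,
  );
  for (const node of stale) {
    node.status = "InActive";
    for (const job of jobs) {
      if (job.nodeId === node.id && job.status < COMPLETED_STATUS) {
        job.nodeId = null; // status is deliberately left untouched
      }
    }
  }
  return stale.map((n) => n.id);
}
```

Leaving `status` alone is what lets the dispatch ordering put the orphaned job ahead of Pending work on the next poll.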

There is no global job timeout. A job that gets wedged on a live node will stay there until the node itself fails or is force-cycled.

Cookie taxonomy

atrax-api owns the cookie taxonomy that classifies raw crawl output into Necessary / Preferences / Analytics / Marketing / Uncategorized. The data lives in two Prisma models, Category and Cookie, and is synced daily from https://cdn.cookiehub.eu/db/cookies.json by the same scheduler that runs the stale-node sweep (atrax-api/src/tasks/sync-cookies.ts). The CDN is the source of truth: local DB rows are upserted by externalId.

Two endpoints expose this data:

GET /api/cookies — full list, used by the dashboard.
GET /api/cookies/prefix — only entries flagged prefix: true, used by the node to handle prefix-matched cookies (e.g. _ga_*). The node caches this list for 10 minutes.

If the daily sync fails, the next attempt is 24 hours later; there is no retry-on-failure logic.
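
On the node side, the 10-minute cache and the prefix match might look like the sketch below. Everything here is an assumption for illustration: the entry shape (`name` holding the prefix text such as `_ga_`, plus a `category`), the class and function names, and the synchronous loader, which stands in for the HTTP call to GET /api/cookies/prefix.

```typescript
const PREFIX_CACHE_TTL_MS = 10 * 60 * 1000; // documented node-side cache lifetime

// Hypothetical entry shape; the real records carry more fields.
interface PrefixCookie {
  name: string; // the prefix text, e.g. "_ga_"
  category: string;
}

class PrefixCache {
  private entries: PrefixCookie[] = [];
  private fetchedAt = Number.NEGATIVE_INFINITY;
  private readonly load: () => PrefixCookie[];

  constructor(load: () => PrefixCookie[]) {
    this.load = load;
  }

  // Reloads when the cached list is at least TTL old; `now` is epoch milliseconds.
  get(now: number): PrefixCookie[] {
    if (now - this.fetchedAt >= PREFIX_CACHE_TTL_MS) {
      this.entries = this.load();
      this.fetchedAt = now;
    }
    return this.entries;
  }
}

// Returns the first entry whose prefix the cookie name starts with.
function matchPrefix(cookieName: string, entries: PrefixCookie[]): PrefixCookie | undefined {
  return entries.find((e) => cookieName.startsWith(e.name));
}
```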

HTTP API surface

All routes require Authorization: Bearer <token>, where <token> matches one of the comma-separated values in the controller_auth SSM parameter. The only unauthenticated path is the health check used by the ALB target group (GET /health in the deployment tables).
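
The token check can be sketched as a pure function, independent of the HTTP framework. This is an illustration, not the atrax-api middleware: `parseTokens` and `isAuthorized` are hypothetical names, and trimming whitespace around tokens is an assumption.

```typescript
// `controllerAuth` is the raw SSM value: comma-separated tokens.
// Assumption: surrounding whitespace and empty segments are ignored.
function parseTokens(controllerAuth: string): string[] {
  return controllerAuth
    .split(",")
    .map((t) => t.trim())
    .filter((t) => t.length > 0);
}

// True when the header is "Bearer <token>" and <token> is one of the configured values.
function isAuthorized(
  authorizationHeader: string | undefined,
  controllerAuth: string,
): boolean {
  if (!authorizationHeader) return false;
  const match = /^Bearer (.+)$/.exec(authorizationHeader);
  if (!match) return false;
  const presented = match[1];
  return presented !== undefined && parseTokens(controllerAuth).includes(presented);
}
```

Issuing one token per caller means a single caller's token can be rotated by editing the list without touching the others. A production check may also want a constant-time comparison, which this sketch omits.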

Job lifecycle (called by dashboard / core)

Method	Path	Purpose
`POST`	`/api/job`	Create a scan. Validates the domain via DNS, rejects localhost. Returns 429 if the caller IP already has a running `express` job.
`GET`	`/api/job/:id`	Fetch a single job row.
`GET`	`/api/report/:id`	Fetch job + full S3 report (only if `Completed` or `Failed`).
`GET`	`/api/express-report/:id?key=<sha256>`	Public-friendly summary, gated by HMAC of `jobId + webhook.key`. Powers the embeddable scan widget.
`DELETE`	`/api/job/:id`	Hard delete.

Job lifecycle (called by node)

Method	Path	Purpose
`GET`	`/api/job/next/:nodeId`	Claim the next eligible job. Implements the load-balancing skip logic. Returns `{ job: null }` when there's no work or the node is at capacity.
`GET`	`/api/job/running/:nodeId?max=N`	Recover up to N in-flight jobs assigned to this node, used on node startup.
`GET`	`/api/job/unassigned`	Active jobs with `nodeId IS NULL` — used for monitoring.
`PATCH`	`/api/job/:id`	Advance status, push progress counters, or update the summary `report`. Drives the state machine.
`POST`	`/api/job/:id/report`	Upload full `JobStore` JSON. atrax-api routes it to express vs extended bucket by `settings.type`.
`POST`	`/api/job/:id/screenshot`	Upload PNG bytes. Returns the public CDN URL.
`GET`	`/api/job/:id/checkpoint`	Re-download the last `JobStore` JSON for crash recovery.
`POST`	`/api/notification`	Fire the customer webhook configured in `settings.webhook` and record the delivery attempt (no retries).

Node management

Method	Path	Purpose
`GET`	`/api/node`	Resolve a node's id from its source IP. Used by a node before it knows its id.
`POST`	`/api/node`	Register / re-register by `(hostname, ipAddress)`. Returns the node id.
`PATCH`	`/api/node/:id`	Update node status.
`GET`	`/api/node/active`	List all active nodes with their job counts (powers the ops dashboard).
`POST`	`/api/node/:id/heartbeat`	Liveness ping. Touches `seenAt` and re-marks the node `Active`.
`POST`	`/api/node/:id/deactivate`	Manual drain — sets the node `InActive` and orphans its jobs.

Reference data

Method	Path	Purpose
`GET`	`/api/cookies`	Full taxonomy.
`GET`	`/api/cookies/prefix`	Prefix-match entries only (consumed by nodes, cached 10 min).

Stage Deployment

	atrax-api	atrax-node
Region	eu-central-1	eu-west-1
URL / Hostname	`https://atrax-api.stage.cookiehub.net`	(no inbound)
VPC	shared default VPC (`172.31.0.0/16`)	dedicated `172.30.128.0/17`
ECS cluster	`stage-euc1-core-ecs-cluster` (shared core)	`stage-euw1-atrax-ecs-cluster`
EC2 host	shared core (`t3.small`)	1× `t3.medium`
EIP	n/a	`eipalloc-08277e7e15ba1ecbf`
Task sizing	256 CPU / 512 MB hard / 256 MB soft, `desired_count = 1`	1024 CPU / 1024 MB hard / 512 MB soft, `desired_count = 1`
Health check	`GET /health`	n/a
ECR	`atrax-api` (eu-central-1)	`atrax-node` (eu-west-1)
S3 buckets	`atrax-stage-{express,extended,screenshots}` (pre-Terraform)	no direct S3 access — POSTs to atrax-api
IAM user (S3)	`stage-euc1-atrax-api-s3` (terraform-managed)	n/a — atrax-api owns the only S3 credentials
DB	PostgreSQL RDS, database `atrax`, user `atraxstage` (owns the DB)	n/a

Production Deployment

	atrax-api	atrax-node
Region	eu-central-1	eu-west-1
URL / Hostname	`https://atrax-api.cookiehub.net`	(no inbound)
VPC	shared default VPC (`172.31.0.0/16`)	dedicated `172.30.0.0/17`
ECS cluster	`prod-euc1-core-ecs-cluster` (shared core)	`prod-euw1-atrax-ecs-cluster`
EC2 host	shared core (`t3.small`)	1× `r6a.xlarge` (4 vCPU, 32 GB) running 2 tasks
EIP	n/a	`eipalloc-0f948e1ce9cf2609d`
Task sizing	512 CPU / 1024 MB hard / 512 MB soft, `desired_count = 1`	1024 CPU (weight) / 16384 MB hard / 8000 MB soft, `desired_count = 2`
Health check	`GET /health`	n/a
ECR	`atrax-api` (eu-central-1)	`atrax-node` (eu-west-1)
S3 buckets	`cookiehub-atrax-{express,extended,screenshots}` (terraform-managed)	no direct S3 access — POSTs to atrax-api
IAM user (S3)	`prod-euc1-atrax-api-s3` (terraform-managed)	n/a
DB	PostgreSQL RDS, database `atrax`, user `atraxprod` (owns the DB)	n/a

S3 bucket lifecycle (prod only)

Bucket	Current versions	Non-current versions
`cookiehub-atrax-screenshots`	365 days	30 days
`cookiehub-atrax-express`	30 days	30 days
`cookiehub-atrax-extended`	365 days	365 days

screenshots is the only public bucket (AES256, public-read for static asset delivery). The other two are private with KMS encryption.

Secrets

Stored in SSM under /atrax/{env}/atrax-api/ and /atrax/{env}/atrax-node/:

Parameter	Component	Description
`database_url`	atrax-api	PostgreSQL connection string. Must include `?uselibpqcompat=true&sslmode=require` (see DB notes). Password must be URL-encoded.
`controller_auth`	atrax-api	Bearer tokens accepted by atrax-api (comma-separated, one per caller).
`s3_access_key_id`	atrax-api	Access key from the terraform-managed `{env}-euc1-atrax-api-s3` IAM user.
`s3_secret_access_key`	atrax-api	Secret key for the same IAM user.
`controller_url`	atrax-node	URL of atrax-api (e.g. `https://atrax-api.cookiehub.net`).
`controller_auth`	atrax-node	Bearer token used to authenticate with atrax-api. Must match one of the api's `controller_auth` tokens.
`bugsnag_api_key`	atrax-node	Error tracking.

The two S3 SSM values are populated from terraform outputs after terraform apply:

# from environments/{env}/eu-central-1
aws ssm put-parameter --type SecureString --overwrite \
  --name /atrax/{env}/atrax-api/s3_access_key_id \
  --value "$(terraform output -raw atrax_api_s3_access_key_id)" \
  --region eu-central-1
aws ssm put-parameter --type SecureString --overwrite \
  --name /atrax/{env}/atrax-api/s3_secret_access_key \
  --value "$(terraform output -raw atrax_api_s3_secret_access_key)" \
  --region eu-central-1

The other secrets are still set manually after first apply (placeholder values are written by terraform with lifecycle { ignore_changes = [value] }). See Secrets management.

Database notes

atrax-api uses Prisma 7 with @prisma/adapter-pg. Two non-obvious requirements when setting database_url:

Connection-string flags. Append ?uselibpqcompat=true&sslmode=require. Without uselibpqcompat=true, queries fail with DriverAdapterError: DatabaseAccessDenied (Prisma error P1010) even when authentication and permissions are fine — the adapter handles connection initialization differently from libpq, and RDS expects libpq behavior. sslmode=require enables TLS, which RDS supports natively.
URL-encode the password. The random_password.postgres resource generates passwords with ?, #, &, +, [, ], etc. — all reserved in URL syntax. Strict URL parsers (Prisma's Rust-based one) reject them; lenient ones (pg, used by core-api / vault-api) tolerate them. Use encodeURIComponent on the raw password before substituting into the URL.
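
Both requirements can be combined in a small helper that assembles the connection string. This is a sketch: `DbParts` and `buildDatabaseUrl` are hypothetical names, and the host in the usage example is a placeholder rather than a real endpoint.

```typescript
interface DbParts {
  user: string;
  password: string; // raw password, exactly as generated
  host: string;
  port: number;
  database: string;
}

// Percent-encodes the credentials and appends the two required flags.
function buildDatabaseUrl(p: DbParts): string {
  const user = encodeURIComponent(p.user);
  const password = encodeURIComponent(p.password);
  return (
    `postgresql://${user}:${password}@${p.host}:${p.port}/${p.database}` +
    `?uselibpqcompat=true&sslmode=require`
  );
}

// Usage with a placeholder host and a password containing reserved characters.
const exampleUrl = buildDatabaseUrl({
  user: "atraxstage",
  password: "p?#&+[]",
  host: "db.example.internal",
  port: 5432,
  database: "atrax",
});
```

Encoding must be applied to the raw password only. Running `encodeURIComponent` over an already-encoded value would double-encode the `%` signs.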

Each environment has a dedicated postgres user (atraxstage / atraxprod) that owns the database and the public schema — set with ALTER DATABASE atrax OWNER TO … plus REASSIGN OWNED BY cookmin TO … after importing any schema dump as the master user. Owning everything avoids per-table grant maintenance.

Deployment

Deploy by pushing the image to ECR, then running aws ecs update-service --force-new-deployment:

# atrax-api (eu-central-1)
aws ecs update-service --cluster {env}-euc1-core-ecs-cluster \
  --service atrax-api --force-new-deployment --region eu-central-1

# atrax-node (eu-west-1)
aws ecs update-service --cluster {env}-euw1-atrax-ecs-cluster \
  --service atrax-node --force-new-deployment --region eu-west-1

Force-new-deployment is also the way to make a running task pick up updated SSM values, since the ECS agent only reads secrets at task launch.

Dependencies

PostgreSQL RDS (eu-central-1) — scan jobs, results, node registrations
S3 — crawl reports (express, extended) and screenshots
Core API / Dashboard — triggers scans via REST and reads results
No cross-region peering between the eu-central-1 and eu-west-1 VPCs: atrax-node reaches atrax-api only via the public ALB (atrax-api.stage.cookiehub.net on stage, atrax-api.cookiehub.net in production)

Scaling

The atrax-node cluster is sized horizontally (number of EIPs available to import) × vertically (instance type + tasks-per-host). To add a second host:

1. Allocate a new EIP in the prod account, eu-west-1.
2. Bump node_count = 2 in environments/prod/eu-west-1/atrax.tf.
3. Run terraform import 'module.atrax_ecs.aws_eip.node["b"]' eipalloc-… before running plan.
4. Run terraform apply.

To add another worker on an existing host: bump desired_count on the atrax-node service.