Atrax¶
Atrax is the web scanner — it crawls customer domains to discover cookies, scripts, and tracking technologies. It splits along a strong security boundary: the API controller runs on the trusted core network, while the browser workers run in their own region, in their own VPC, with their own egress IPs.
Architecture¶
flowchart LR
subgraph euc1[eu-central-1]
API[atrax-api]
DB[(PostgreSQL RDS)]
end
subgraph euw1[eu-west-1]
Node[atrax-node]
end
Core[core / dashboard] -->|REST| API
Node -->|polls jobs / heartbeat| API
Node -->|browser| Site[Customer site]
Node -->|POST report + screenshot| API
API --> DB
API -->|writes reports + screenshots| S3[(S3)]
Core -->|fetches reports| API
| Component | Region | Purpose |
|---|---|---|
| atrax-api | eu-central-1 | Hono HTTP controller on Node.js. Receives scan requests, hands jobs to nodes that poll for them, persists results to PostgreSQL, and uploads node-supplied report JSON + screenshots to S3 using its own IAM user. |
| atrax-node | eu-west-1 | Headless Puppeteer worker. Polls atrax-api for jobs, runs untrusted website code in Chromium, POSTs results back to atrax-api (no direct S3 access). |
Why the region split¶
atrax-node executes arbitrary website JavaScript inside Chromium — it's the highest-risk surface in the network. Putting it in a dedicated VPC in a different region (no peering, no shared subnets) means a compromised crawler can't reach the production data plane. The eu-west-1 location is also where Atrax has run historically, so the EIPs assigned to crawler hosts are the ones customers have already added to their bot allow-lists.
Cluster + EC2 shape (atrax-node)¶
The atrax-node cluster is not ASG-backed. It's a fixed set of EC2 instances in public subnets with one Elastic IP per host attached directly. EIPs are declared in the module but imported per environment. Terraform never allocates a new one, because doing so would change the egress IP and break customer allow-lists. Replacing an instance via terraform taint keeps the EIP attached.
The ECS service uses placement_strategy { type = "spread", field = "instanceId" } so when there are multiple hosts and multiple tasks, traffic fans out across all EIPs. CPU and memory limits are set at the container level rather than the task level — this lets a single task burst into the full host capacity when alone, while memoryReservation (the soft limit, used for placement) lets multiple tasks pack onto one host.
How a scan flows through the system¶
A scan is a long-lived job that bounces between the dashboard, atrax-api, and a node over many minutes. The whole flow is pull-based — atrax-api never opens an outbound connection to a node. Nodes ask for work, report progress, and push results back over HTTPS to atrax-api.{env}.cookiehub.net.
sequenceDiagram
participant Dash as Dashboard / Core
participant API as atrax-api
participant DB as PostgreSQL
participant Node as atrax-node
participant Site as Customer site
participant S3
Dash->>API: POST /api/job (domain, settings, webhook)
API->>DB: insert job, status=Pending
API-->>Dash: { newJob: { id, ... } }
loop every ~500ms
Node->>API: GET /api/job/next/{nodeId}
API->>DB: pick eligible Pending job (load-balanced)
API-->>Node: { job } or { job: null }
end
Node->>API: PATCH /api/job/{id} status=Crawling
loop for each page (up to maxPages / maxSeconds)
Node->>Site: navigate, capture cookies + network
Note over Node: every 1000 pages → POST /api/job/{id}/report (checkpoint)
end
Node->>API: PATCH /api/job/{id} status=CrawlCompleted
loop verify pages flagged during crawl
Node->>Site: replay with consent, check cookie persistence
end
Node->>API: PATCH /api/job/{id} status=VerifyCompleted
Node->>API: POST /api/job/{id}/screenshot (binary PNG)
API->>S3: PUT screenshots/{domain}/{id}.png
Node->>API: POST /api/job/{id}/report (full JobStore JSON)
API->>S3: PUT {express|extended}/{domain}/{id}.json
Node->>API: PATCH /api/job/{id} status=Completed, report=summary
Node->>API: POST /api/notification (fire customer webhook)
Dash->>API: GET /api/report/{id}
API->>S3: GET report
API-->>Dash: { job, report }
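The claim step at the top of the diagram can be sketched as a small polling loop. The endpoint path, the bearer header, and the ~500 ms cadence come from this page; `HttpGet`, `pollForJob`, the `maxPolls` bound, and the two-field `Job` shape are hypothetical names introduced for illustration, not the node's actual code.

```typescript
// Hypothetical minimal job shape; the real Job row carries more fields.
interface Job {
  id: string;
  status: number;
}

// Injected HTTP GET so the sketch does not depend on a particular client library.
type HttpGet = (
  url: string,
  headers: Record<string, string>,
) => Promise<{ job: Job | null }>;

const POLL_INTERVAL_MS = 500; // "every ~500ms" in the diagram

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Polls GET /api/job/next/{nodeId} until a job is handed out or maxPolls is reached.
async function pollForJob(
  get: HttpGet,
  controllerUrl: string,
  nodeId: string,
  token: string,
  maxPolls: number,
  intervalMs: number = POLL_INTERVAL_MS,
): Promise<Job | null> {
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    const { job } = await get(`${controllerUrl}/api/job/next/${nodeId}`, {
      Authorization: `Bearer ${token}`,
    });
    if (job !== null) return job; // claimed; the node would now PATCH status=Crawling
    await sleep(intervalMs); // { job: null }: no work, or this node is at capacity
  }
  return null;
}
```

Because the node always initiates the request, the crawler hosts need no inbound listener, which matches the "(no inbound)" entries in the deployment tables.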
Job lifecycle¶
Status is a single integer column on the Job row. atrax-api sets only the first two values (Pending when the job is created, Claimed when a node picks it up); every later transition is made by the node via PATCH /api/job/:id. Beyond that, the API never moves a job forward on its own: even when a node goes stale, it only nulls out nodeId and lets the job be re-claimed at its current status (see Node liveness).
| Value | Name | Driver | Meaning |
|---|---|---|---|
| 0 | Pending | API | Created by POST /api/job. No node assigned. |
| 1 | Claimed | API | A node picked up the job via GET /api/job/next/:nodeId. nodeId is set. |
| 2 | Crawling | Node | Identification stage: discovering and visiting pages. |
| 3 | CrawlCompleted | Node | Identification finished, verification queued. |
| 4 | VerifyCompleted | Node | Verification finished, finalization (screenshot + report upload) running. |
| 5 | Completed | Node | Terminal success. Full report is in S3, summary is on the job row. |
| 6 | Failed | Node | Terminal failure. Partial report may still be uploaded. |
stateDiagram-v2
[*] --> Pending
Pending --> Claimed: node calls /job/next
Claimed --> Crawling: node starts identification
Crawling --> CrawlCompleted
CrawlCompleted --> VerifyCompleted: verification pass done
VerifyCompleted --> Completed: screenshot + report uploaded
Crawling --> Failed: ≥10 page errors or 100% failure rate
CrawlCompleted --> Failed
VerifyCompleted --> Failed
Completed --> [*]
Failed --> [*]
The values are defined in atrax-api/src/types/index.ts and atrax/src/types/jobStatus.ts — they must stay in sync between the two repos.
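For reference, the status values from the table can be written as a numeric enum. This is a sketch only: whether the two repos use an enum, a const object, or plain numbers is not stated here, and the helper names `isActive` and `isTerminal` are hypothetical.

```typescript
// Values mirror the status table above; both repos must agree on them.
enum JobStatus {
  Pending = 0,
  Claimed = 1,
  Crawling = 2,
  CrawlCompleted = 3,
  VerifyCompleted = 4,
  Completed = 5,
  Failed = 6,
}

// "Active" in the dispatch and sweeper sense: any status below Completed.
function isActive(status: JobStatus): boolean {
  return status < JobStatus.Completed;
}

// Completed and Failed are the two terminal states of the state diagram.
function isTerminal(status: JobStatus): boolean {
  return status === JobStatus.Completed || status === JobStatus.Failed;
}
```

Note that Failed sorts above Completed numerically, so a `status < Completed` filter excludes both terminal states.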
Dispatch and load balancing¶
atrax-api does not push work — nodes pull it. The dispatch endpoint is GET /api/job/next/:nodeId (atrax-api/src/routes/job.ts), and the algorithm is intentionally simple:
- Count this node's active jobs (status < Completed).
- Count active jobs on every other Active node.
- Compute skip = number of nodes strictly less busy than this one.
- Find the first job with status < Completed AND nodeId IS NULL, ordered by status DESC, priority DESC, and OFFSET skip.
- Only assign it if this node has capacity:
  - < 4 active jobs → take any priority
  - 4..7 active jobs → only take jobs with priority >= 7
  - >= 8 → never take
The skip trick is a poor person's load balancer: when the node calling in is busier than its peers, it walks past the first N waiting jobs and picks one further down the queue, leaving the head for less-loaded nodes to grab on their next poll. There is no row-level locking — the assignment is prisma.job.update, last write wins. A simultaneous double-claim is theoretically possible but unlikely given the 500 ms poll cadence and OFFSET skip divergence.
The status DESC ordering is deliberate: jobs already in flight (Crawling, etc.) but orphaned by a stale-node sweep get re-claimed before brand-new Pending work, so half-finished crawls preempt new ones.
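The dispatch rules described in this section can be sketched in memory as follows. This is an illustration under stated assumptions, not the code in atrax-api/src/routes/job.ts: the real endpoint runs a Prisma query, while `pickJob` here sorts an array, and the names `computeSkip`, `canTake`, `pickJob`, and `QueuedJob` are hypothetical.

```typescript
interface QueuedJob {
  id: string;
  status: number; // 0..6 as in the status table; 5 is Completed
  priority: number;
  nodeId: string | null;
}

// Number of peers strictly less busy than the calling node; used as the OFFSET.
function computeSkip(myActiveJobs: number, peerActiveJobs: number[]): number {
  return peerActiveJobs.filter((count) => count < myActiveJobs).length;
}

// Capacity gate: fewer than 4 takes anything, 4..7 only priority >= 7, 8 or more nothing.
function canTake(myActiveJobs: number, jobPriority: number): boolean {
  if (myActiveJobs < 4) return true;
  if (myActiveJobs < 8) return jobPriority >= 7;
  return false;
}

// In-memory stand-in for the query: unassigned active jobs,
// ordered by status DESC, priority DESC, then OFFSET skip.
function pickJob(
  queue: QueuedJob[],
  myActiveJobs: number,
  peerActiveJobs: number[],
): QueuedJob | null {
  const skip = computeSkip(myActiveJobs, peerActiveJobs);
  const eligible = queue
    .filter((job) => job.status < 5 && job.nodeId === null)
    .sort((a, b) => b.status - a.status || b.priority - a.priority);
  const candidate = eligible[skip];
  if (candidate === undefined) return null;
  return canTake(myActiveJobs, candidate.priority) ? candidate : null;
}
```

With a queue holding one orphaned Crawling job and two Pending jobs, an idle node gets the orphaned job first, while a node busier than two peers skips two entries and takes the lowest-ranked job instead.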
Nodes also call GET /api/job/running/:nodeId?max=4 once on startup to recover any jobs they were working on before a restart. The API filters on status > Pending AND status < Completed AND nodeId = self.
What a scan actually does¶
The crawler is a single Puppeteer-extra (stealth plugin) process running up to 8 concurrent jobs (atrax/src/crawler/config.ts — MAX_CONCURRENT_JOBS). Each job gets its own browser instance and its own profile directory under ./profiles/{jobId} so cookies and storage don't bleed across jobs.
Two-stage crawl¶
A job crawls in two passes; the verification pass is a phase change within the same job rather than a separate API job:
- Identification (status = Crawling). Visits each entry URL, auto-consents to common CMPs, captures cookies/storage/network/screenshot, and walks <a> hrefs that match settings.hosts and the include/exclude filters. Stops when one of maxPages, maxSeconds, or "no new pages discovered" is hit. The first page (in production, non-implementation jobs) is screenshot-captured to ./.temp/{jobId}.png.
- Verification (status = CrawlCompleted → VerifyCompleted). Replays only pages flagged verificationNeeded during the first pass. The user-agent flips from CookieHubScan/3.0 to CookieHubVerify/3.0. The crawler grants full CMP consent up front and re-checks whether cookies still appear; those still present after consent is granted are flagged notCompliant.
What gets captured per page¶
- HTTP cookies: via CDP Network.getAllCookies (name, domain, path, expires, httpOnly, secure, session, value).
- Storage: localStorage and sessionStorage if settings.storage is true. Values are truncated to 30 chars + ... before leaving the browser, so we never persist full session payloads.
- Network: Network.requestWillBeSent and Network.responseReceivedExtraInfo events, capped at 5000 each per page (MAX_REQUESTS / MAX_RESPONSES). Older entries are dropped.
- Page metadata: title, meta description, OG tags, documentElement.lang, top 20 distinct non-black/white computed colors, full HTML via page.content().
- Tech stack: detectStack() runs against captured HTML + request URLs to label e.g. next.js, wordpress, cookiebot.
- Cookiehub-specific: detects /c2/{hash}.js (and legacy /cc/{hash}.js) loads, whether they fired before or after GTM, and Google Consent Mode state (gcd= parameter, GCM v1 vs v2).
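Two of the limits above, storage-value truncation and the per-page network cap, are simple enough to sketch. The function names are hypothetical, and one detail is an assumption: the text does not say whether values of 30 characters or fewer also receive the "..." suffix, so the sketch leaves them unchanged.

```typescript
const STORAGE_VALUE_MAX = 30; // chars kept before the "..." suffix
const MAX_REQUESTS = 5000; // per-page cap named in the list above

// Keep the first 30 chars and append "..."; shorter values pass through (assumption).
function truncateStorageValue(value: string, max: number = STORAGE_VALUE_MAX): string {
  return value.length > max ? `${value.slice(0, max)}...` : value;
}

// Append to a per-page buffer, dropping the oldest entries once the cap is exceeded.
function pushCapped<T>(buffer: T[], entry: T, cap: number = MAX_REQUESTS): void {
  buffer.push(entry);
  if (buffer.length > cap) buffer.splice(0, buffer.length - cap);
}
```

Truncating inside the browser context, before the value is serialized back to the node, is what keeps full session payloads out of the report.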
CMP auto-consent¶
On the entry URL of stage 1, the node calls into the page to grant consent through whichever CMP it detects: Cookiebot, OneTrust, Didomi, Iubenda, UserCentrics, and ~8 others (atrax/src/crawler/services/browser/handlers/). This is intentional: we want to see what tracking the site fires after a user clicks "accept all", which is when most regulators care about the cookie inventory. Stage 2 also grants consent and uses persistence as the compliance signal.
The crawler does not click visual banner buttons — it calls the JS APIs directly. If a site uses a custom CMP we don't recognise, we crawl in the un-consented state, which can hide cookies that only fire post-consent.
Per-page failure handling¶
Single-page navigation has two timeouts that fight each other: a 25 s wrapper around page.goto and Puppeteer's own 40 s internal timeout. The wrapper wins in practice. On entry-URL failure only, the node retries https://www.X and then http://X before giving up; intermediate-page failures are not retried.
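A minimal sketch of the outer timeout and the entry-URL fallback order, assuming the wrapper is a plain Promise.race; the actual code in navigationHandler.ts may be structured differently. `withTimeout` and `entryUrlCandidates` are hypothetical names, and the first candidate (`https://` plus the bare host) is an assumed default, since the text only names the two fallbacks.

```typescript
const NAV_WRAPPER_TIMEOUT_MS = 25_000; // outer wrapper; Puppeteer's own timeout is 40 s

// Rejects if `work` has not settled within `ms`. The underlying work is not
// cancelled, it is only no longer awaited. The timer is cleared either way.
function withTimeout<T>(work: Promise<T>, ms: number = NAV_WRAPPER_TIMEOUT_MS): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`navigation timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Entry-URL attempts in order: assumed default first, then the two documented fallbacks.
function entryUrlCandidates(host: string): string[] {
  return [`https://${host}`, `https://www.${host}`, `http://${host}`];
}
```

Since 25 s is shorter than 40 s, the outer rejection always fires first, which is why the wrapper "wins" in practice.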
A whole job is marked Failed only when:
- ≥ 10 pages have timed out or returned 5xx and the current page also times out / 5xxs, or
- 100 % of attempted pages errored.
Any other page failure is just logged on the page record. Bugsnag receives top-level exceptions only; per-page errors are not reported to it.
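The two job-level failure rules can be expressed as predicates. This is a sketch under assumptions the text leaves open: the threshold rule is read as "10 or more earlier failures, and the current page fails too", and the 100 % rule is treated as a separate check over the final counters. All names are hypothetical.

```typescript
interface CrawlCounters {
  attempted: number; // pages attempted, including the current one
  failed: number; // pages that timed out or returned 5xx
}

const FAILURE_THRESHOLD = 10;

// Rule 1: at least 10 earlier failures and the current page fails as well.
function failsOnThreshold(priorFailures: number, currentPageFailed: boolean): boolean {
  return currentPageFailed && priorFailures >= FAILURE_THRESHOLD;
}

// Rule 2: every attempted page errored (assumed to be checked on final counters).
function failsOnTotalLoss(counters: CrawlCounters): boolean {
  return counters.attempted > 0 && counters.failed === counters.attempted;
}
```

Under this reading, a site with ten broken pages but a healthy current page keeps crawling; the job only fails when the errors are still occurring.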
Resource and content limits¶
| Limit | Value | Source |
|---|---|---|
| Concurrent jobs per node | 8 | MAX_CONCURRENT_JOBS |
| Startup-time job recovery | up to 4 | GET /api/job/running/:nodeId?max=4 |
| Page navigation timeout | 25 s wrapper / 40 s puppeteer | navigationHandler.ts |
| Storage value capture | first 30 chars + ... | cookieHandler.ts |
| Network requests stored | 5000 / page | MAX_REQUESTS |
| Network responses stored | 5000 / page | MAX_RESPONSES |
| Link length filter | 255 chars | cdpHandler.ts |
| Detected colours | top 20 | metaHandler.ts |
| Checkpoint upload cadence | every 1000 pages | storageHandler.ts |
| Profile cleanup (legacy) | every 4 h via mycron | container cron, profiles > 1 day old |
Results and storage¶
The node never talks to S3 directly. It POSTs results to atrax-api and atrax-api uses its {env}-euc1-atrax-api-s3 IAM user to PUT to S3. This is the security boundary: the eu-west-1 crawler hosts run untrusted website JS but hold no cloud credentials beyond their controller bearer token.
| Endpoint | Bucket | Key | Trigger |
|---|---|---|---|
| POST /api/job/:id/report (JSON body) | atrax-{env}-express or atrax-{env}-extended | {domain}/{jobId}.json | Checkpoint every 1000 pages, plus once on completion |
| POST /api/job/:id/screenshot (binary, Content-Type: image/png) | atrax-{env}-screenshots | {domain}/{jobId}.png | After verification stage, production non-implementation jobs only |
| PATCH /api/job/:id { status, report, progress } | (PostgreSQL only) | n/a | Phase transitions and the final summary |
Bucket selection is by settings.type: jobs with type === "extended" go to the extended bucket; anything else (including express) goes to express. The extended report retains the full per-page detail, while the express report is the smaller summary used by the public-facing scan widget.
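The routing rule is small enough to show directly. A minimal sketch, with hypothetical function names; it returns the bucket kind rather than a full bucket name, since the naming differs between environments.

```typescript
type ReportBucket = "express" | "extended";

// Only the literal "extended" selects the extended bucket; everything else,
// including a missing type, falls back to express.
function reportBucketFor(type: string | undefined): ReportBucket {
  return type === "extended" ? "extended" : "express";
}

// Object key layout from the table above: {domain}/{jobId}.json
function reportKey(domain: string, jobId: string): string {
  return `${domain}/${jobId}.json`;
}
```

Defaulting to express means an unknown or misspelled type lands in the bucket with the shorter retention rather than the long-lived one.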
The summary stored on the Job.report JSONB column (separate from the S3 file) carries the data the dashboard uses for the result list — risk score, cookie counts per category, GCM/TCF state, screenshot URL, detected stack — so the dashboard can render the index view without an S3 round-trip. Full per-page detail still requires GET /api/report/:id, which fetches from S3.
The screenshot bucket is public-read (it backs <img> tags in the result UI); the express and extended buckets are private with KMS encryption. Lifecycle policies (prod) are listed under S3 bucket lifecycle.
Node liveness and stale cleanup¶
Each node calls POST /api/node/:id/heartbeat every 30 s (HEARTBEAT_INTERVAL_MS). The endpoint sets seenAt = now and status = Active on the node row.
atrax-api runs an internal sweeper on a setInterval (atrax-api/src/scheduler.ts → handleStaleNodes) every 5 minutes. For any Active node whose seenAt is older than NODE_STALE_MS (default 1 hour), the sweeper:
- Sets node.status = InActive.
- Sets nodeId = null on every job belonging to that node where status < Completed.
The job's status is left as-is. So a node that died mid-Crawling leaves the job at status Crawling, nodeId=null, and on the next dispatch round, that job gets picked first (because ORDER BY status DESC puts in-flight statuses ahead of Pending). The new node then continues from where the previous one left off — possible because the previous node had been writing JobStore checkpoints to S3 every 1000 pages, which the new node fetches via GET /api/job/:id/checkpoint.
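The sweep can be sketched against in-memory rows. The real handleStaleNodes works through Prisma; the row shapes, the epoch-millisecond `seenAt`, and the function name below are assumptions for illustration.

```typescript
interface NodeRow {
  id: string;
  status: "Active" | "InActive";
  seenAt: number; // epoch ms (assumed representation)
}

interface JobRow {
  id: string;
  status: number; // 0..6 as in the status table
  nodeId: string | null;
}

const NODE_STALE_MS = 60 * 60 * 1000; // default 1 hour
const COMPLETED = 5;

// Marks stale Active nodes InActive and orphans their unfinished jobs.
// Job status is deliberately left untouched. Returns the ids of the swept nodes.
function sweepStaleNodes(
  nodes: NodeRow[],
  jobs: JobRow[],
  now: number,
  staleMs: number = NODE_STALE_MS,
): string[] {
  const staleIds = nodes
    .filter((node) => node.status === "Active" && now - node.seenAt > staleMs)
    .map((node) => node.id);
  for (const node of nodes) {
    if (staleIds.includes(node.id)) node.status = "InActive";
  }
  for (const job of jobs) {
    if (job.nodeId !== null && staleIds.includes(job.nodeId) && job.status < COMPLETED) {
      job.nodeId = null;
    }
  }
  return staleIds;
}
```

With a 30 s heartbeat and a 1 hour threshold, a node has to miss roughly 120 consecutive heartbeats before its jobs are released.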
There is no global job timeout. A job that gets wedged on a live node will stay there until the node itself fails or is force-cycled.
Cookie reference data¶
atrax-api owns the cookie taxonomy that classifies raw crawl output into Necessary / Preferences / Analytics / Marketing / Uncategorized. The data lives in two Prisma models, Category and Cookie, and is synced daily from https://cdn.cookiehub.eu/db/cookies.json by the same scheduler that runs the stale-node sweep (atrax-api/src/tasks/sync-cookies.ts). The CDN is the source of truth — local DB rows are upserted by externalId.
Two endpoints expose this data:
- GET /api/cookies: full list, used by the dashboard.
- GET /api/cookies/prefix: only entries flagged prefix: true, used by the node to handle prefix-matched cookies (e.g. _ga_*). The node caches this list for 10 minutes.
If the daily sync fails, the next attempt is 24 hours later; there is no retry-on-failure logic.
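The node-side handling of the prefix list can be sketched as a time-bounded cache plus a startsWith match. Several things here are assumptions: the real loader is an HTTP call to GET /api/cookies/prefix (the sketch uses a synchronous loader for brevity), prefixes are assumed to be stored without the trailing `*`, and `createTtlCache` and `matchPrefix` are hypothetical names.

```typescript
const PREFIX_CACHE_TTL_MS = 10 * 60 * 1000; // 10 minutes

// Returns a getter that calls `load` on first use and again once the TTL has elapsed.
function createTtlCache<T>(
  load: () => T,
  ttlMs: number = PREFIX_CACHE_TTL_MS,
  now: () => number = Date.now,
): () => T {
  let entry: { value: T; loadedAt: number } | null = null;
  return () => {
    const t = now();
    if (entry === null || t - entry.loadedAt >= ttlMs) {
      entry = { value: load(), loadedAt: t };
    }
    return entry.value;
  };
}

// A prefix entry such as "_ga_" matches any cookie name that starts with it.
function matchPrefix(cookieName: string, prefixes: string[]): string | undefined {
  return prefixes.find((prefix) => cookieName.startsWith(prefix));
}
```

A 10 minute cache on the node means a taxonomy change from the daily sync reaches running crawlers within 10 minutes without a restart.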
HTTP API surface¶
All routes require Authorization: Bearer <token>, where <token> matches one of the comma-separated values in the controller_auth SSM parameter. The only unauthenticated path is GET / (health check used by the ALB target group).
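A minimal sketch of the token check, independent of any HTTP framework. The function names are hypothetical, and trimming whitespace around each comma-separated token is an assumption about how controller_auth is parsed.

```typescript
// Parse the comma-separated controller_auth value into a set of accepted tokens.
function parseControllerAuth(raw: string): Set<string> {
  return new Set(
    raw
      .split(",")
      .map((token) => token.trim())
      .filter((token) => token.length > 0),
  );
}

// True when the header is "Bearer <token>" and <token> is one of the accepted values.
function isAuthorized(header: string | undefined, accepted: Set<string>): boolean {
  if (header === undefined) return false;
  const match = /^Bearer (.+)$/.exec(header);
  return match !== null && accepted.has(match[1]);
}
```

Keeping one token per caller in the list allows a single caller's token to be rotated without touching the others.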
Job lifecycle (called by dashboard / core)¶
| Method | Path | Purpose |
|---|---|---|
| POST | /api/job | Create a scan. Validates the domain via DNS, rejects localhost. Returns 429 if the caller IP already has a running express job. |
| GET | /api/job/:id | Fetch a single job row. |
| GET | /api/report/:id | Fetch job + full S3 report (only if Completed or Failed). |
| GET | /api/express-report/:id?key=<sha256> | Public-friendly summary, gated by HMAC of jobId + webhook.key. Powers the embeddable scan widget. |
| DELETE | /api/job/:id | Hard delete. |
Job lifecycle (called by node)¶
| Method | Path | Purpose |
|---|---|---|
| GET | /api/job/next/:nodeId | Claim the next eligible job. Implements the load-balancing skip logic. Returns { job: null } when there's no work or the node is at capacity. |
| GET | /api/job/running/:nodeId?max=N | Recover up to N in-flight jobs assigned to this node, used on node startup. |
| GET | /api/job/unassigned | Active jobs with nodeId IS NULL, used for monitoring. |
| PATCH | /api/job/:id | Advance status, push progress counters, or update the summary report. Drives the state machine. |
| POST | /api/job/:id/report | Upload full JobStore JSON. atrax-api routes it to express vs extended bucket by settings.type. |
| POST | /api/job/:id/screenshot | Upload PNG bytes. Returns the public CDN URL. |
| GET | /api/job/:id/checkpoint | Re-download the last JobStore JSON for crash recovery. |
| POST | /api/notification | Fire the customer webhook configured in settings.webhook and record the delivery attempt (no retries). |
Node management¶
| Method | Path | Purpose |
|---|---|---|
| GET | /api/node | Resolve a node's id from its source IP. Used by a node before it knows its id. |
| POST | /api/node | Register / re-register by (hostname, ipAddress). Returns the node id. |
| PATCH | /api/node/:id | Update node status. |
| GET | /api/node/active | List all active nodes with their job counts (powers the ops dashboard). |
| POST | /api/node/:id/heartbeat | Liveness ping. Touches seenAt and re-marks the node Active. |
| POST | /api/node/:id/deactivate | Manual drain: sets the node InActive and orphans its jobs. |
Reference data¶
| Method | Path | Purpose |
|---|---|---|
| GET | /api/cookies | Full taxonomy. |
| GET | /api/cookies/prefix | Prefix-match entries only (consumed by nodes, cached 10 min). |
Stage Deployment¶
| | atrax-api | atrax-node |
|---|---|---|
| Region | eu-central-1 | eu-west-1 |
| URL / Hostname | https://atrax-api.stage.cookiehub.net | (no inbound) |
| VPC | shared default VPC (172.31.0.0/16) | dedicated 172.30.128.0/17 |
| ECS cluster | stage-euc1-core-ecs-cluster (shared core) | stage-euw1-atrax-ecs-cluster |
| EC2 host | shared core (t3.small) | 1× t3.medium |
| EIP | n/a | eipalloc-08277e7e15ba1ecbf |
| Task sizing | 256 CPU / 512 MB hard / 256 MB soft, desired_count = 1 | 1024 CPU / 1024 MB hard / 512 MB soft, desired_count = 1 |
| Health check | GET /health | n/a |
| ECR | atrax-api (eu-central-1) | atrax-node (eu-west-1) |
| S3 buckets | atrax-stage-{express,extended,screenshots} (pre-Terraform) | no direct S3 access; POSTs to atrax-api |
| IAM user (S3) | stage-euc1-atrax-api-s3 (terraform-managed) | n/a; atrax-api owns the only S3 credentials |
| DB | PostgreSQL RDS, database atrax, user atraxstage (owns the DB) | n/a |
Production Deployment¶
| | atrax-api | atrax-node |
|---|---|---|
| Region | eu-central-1 | eu-west-1 |
| URL / Hostname | https://atrax-api.cookiehub.net | (no inbound) |
| VPC | shared default VPC (172.31.0.0/16) | dedicated 172.30.0.0/17 |
| ECS cluster | prod-euc1-core-ecs-cluster (shared core) | prod-euw1-atrax-ecs-cluster |
| EC2 host | shared core (t3.small) | 1× r6a.xlarge (4 vCPU, 32 GB) running 2 tasks |
| EIP | n/a | eipalloc-0f948e1ce9cf2609d |
| Task sizing | 512 CPU / 1024 MB hard / 512 MB soft, desired_count = 1 | 1024 CPU (weight) / 16384 MB hard / 8000 MB soft, desired_count = 2 |
| Health check | GET /health | n/a |
| ECR | atrax-api (eu-central-1) | atrax-node (eu-west-1) |
| S3 buckets | cookiehub-atrax-{express,extended,screenshots} (terraform-managed) | no direct S3 access; POSTs to atrax-api |
| IAM user (S3) | prod-euc1-atrax-api-s3 (terraform-managed) | n/a |
| DB | PostgreSQL RDS, database atrax, user atraxprod (owns the DB) | n/a |
S3 bucket lifecycle (prod only)¶
| Bucket | Current versions | Non-current versions |
|---|---|---|
| cookiehub-atrax-screenshots | 365 days | 30 days |
| cookiehub-atrax-express | 30 days | 30 days |
| cookiehub-atrax-extended | 365 days | 365 days |
screenshots is the only public bucket (AES256, public-read for static asset delivery). The other two are private with KMS encryption.
Secrets¶
Stored in SSM under /atrax/{env}/atrax-api/ and /atrax/{env}/atrax-node/:
| Parameter | Component | Description |
|---|---|---|
| database_url | atrax-api | PostgreSQL connection string. Must include ?uselibpqcompat=true&sslmode=require (see DB notes). Password must be URL-encoded. |
| controller_auth | atrax-api | Bearer tokens accepted by atrax-api (comma-separated, one per caller). |
| s3_access_key_id | atrax-api | Access key from the terraform-managed {env}-euc1-atrax-api-s3 IAM user. |
| s3_secret_access_key | atrax-api | Secret key for the same IAM user. |
| controller_url | atrax-node | URL of atrax-api (e.g. https://atrax-api.cookiehub.net). |
| controller_auth | atrax-node | Bearer token used to authenticate with atrax-api. Must match one of the api's controller_auth tokens. |
| bugsnag_api_key | atrax-node | Error tracking. |
The two S3 SSM values are populated from terraform outputs after terraform apply:
# from environments/{env}/eu-central-1
aws ssm put-parameter --type SecureString --overwrite \
--name /atrax/{env}/atrax-api/s3_access_key_id \
--value "$(terraform output -raw atrax_api_s3_access_key_id)" \
--region eu-central-1
aws ssm put-parameter --type SecureString --overwrite \
--name /atrax/{env}/atrax-api/s3_secret_access_key \
--value "$(terraform output -raw atrax_api_s3_secret_access_key)" \
--region eu-central-1
The other secrets are still set manually after first apply (placeholder values are written by terraform with lifecycle { ignore_changes = [value] }). See Secrets management.
Database notes¶
atrax-api uses Prisma 7 with @prisma/adapter-pg. Two non-obvious requirements when setting database_url:
- Connection-string flags. Append ?uselibpqcompat=true&sslmode=require. Without uselibpqcompat=true, queries fail with DriverAdapterError: DatabaseAccessDenied (Prisma error P1010) even when authentication and permissions are fine: the adapter handles connection initialization differently from libpq, and RDS expects libpq behavior. sslmode=require enables TLS, which RDS supports natively.
- URL-encode the password. The random_password.postgres resource generates passwords with ?, #, &, +, [, ], etc., all of which are reserved in URL syntax. Strict URL parsers (Prisma's Rust-based one) reject them; lenient ones (pg, used by core-api / vault-api) tolerate them. Use encodeURIComponent on the raw password before substituting into the URL.
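Both requirements can be combined in one small helper when preparing the database_url value. This is a sketch: `buildDatabaseUrl` and `DbParams` are hypothetical names, and the host and password in the usage example are placeholders.

```typescript
interface DbParams {
  user: string;
  password: string; // raw password, exactly as generated
  host: string;
  port: number;
  database: string;
}

// Builds a connection string with a percent-encoded password and the two required flags.
function buildDatabaseUrl(p: DbParams): string {
  const user = encodeURIComponent(p.user);
  const password = encodeURIComponent(p.password);
  return (
    `postgresql://${user}:${password}@${p.host}:${p.port}/${p.database}` +
    `?uselibpqcompat=true&sslmode=require`
  );
}

// Placeholder values for illustration only.
const exampleUrl = buildDatabaseUrl({
  user: "atraxstage",
  password: "p?a#s&s+[w]",
  host: "db.example.internal",
  port: 5432,
  database: "atrax",
});
```

In this example the password segment becomes `p%3Fa%23s%26s%2B%5Bw%5D`, so none of the reserved characters can be mistaken for the start of the query string or fragment.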
Each environment has a dedicated postgres user (atraxstage / atraxprod) that owns the database and the public schema — set with ALTER DATABASE atrax OWNER TO … plus REASSIGN OWNED BY cookmin TO … after importing any schema dump as the master user. Owning everything avoids per-table grant maintenance.
Deployment¶
Deploy by pushing the image to ECR, then forcing a new deployment with aws ecs update-service --force-new-deployment:
# atrax-api (eu-central-1)
aws ecs update-service --cluster {env}-euc1-core-ecs-cluster \
--service atrax-api --force-new-deployment --region eu-central-1
# atrax-node (eu-west-1)
aws ecs update-service --cluster {env}-euw1-atrax-ecs-cluster \
--service atrax-node --force-new-deployment --region eu-west-1
Force-new-deployment is also the way to make a running task pick up updated SSM values, since the ECS agent only reads secrets at task launch.
Dependencies¶
- PostgreSQL RDS (eu-central-1): scan jobs, results, node registrations
- S3: crawl reports (express, extended) and screenshots
- Core API / Dashboard: triggers scans via REST and reads results
- No cross-region peering between the eu-central-1 and eu-west-1 VPCs; atrax-node reaches atrax-api only via the public ALB (atrax-api.{env}.cookiehub.net)
Scaling¶
The atrax-node cluster is sized horizontally (number of EIPs available to import) × vertically (instance type + tasks-per-host). To add a second host:
- Allocate a new EIP in the prod account, eu-west-1.
- Bump node_count = 2 in environments/prod/eu-west-1/atrax.tf.
- terraform import 'module.atrax_ecs.aws_eip.node["b"]' eipalloc-… before running plan.
- terraform apply.
To add another worker on an existing host: bump desired_count on the atrax-node service.