Atrax

Atrax is the web scanner — it crawls customer domains to discover cookies, scripts, and tracking technologies. It splits along a strong security boundary: the API controller runs on the trusted core network, while the browser workers run in their own region, in their own VPC, with their own egress IPs.

Architecture

flowchart LR
    subgraph euc1[eu-central-1]
        API[atrax-api]
        DB[(PostgreSQL RDS)]
    end
    subgraph euw1[eu-west-1]
        Node[atrax-node]
    end
    Core[core / dashboard] -->|REST| API
    Node -->|polls jobs / heartbeat| API
    Node -->|browser| Site[Customer site]
    Node -->|POST report + screenshot| API
    API --> DB
    API -->|writes reports + screenshots| S3[(S3)]
    Core -->|fetches reports| API

| Component | Region | Purpose |
| --- | --- | --- |
| atrax-api | eu-central-1 | Hono HTTP controller on Node.js. Receives scan requests, hands jobs to nodes that poll for them, persists results to PostgreSQL, and uploads node-supplied report JSON + screenshots to S3 using its own IAM user. |
| atrax-node | eu-west-1 | Headless Puppeteer worker. Polls atrax-api for jobs, runs untrusted website code in Chromium, POSTs results back to atrax-api (no direct S3 access). |

Why the region split

atrax-node executes arbitrary website JavaScript inside Chromium — it's the highest-risk surface in the network. Putting it in a dedicated VPC in a different region (no peering, no shared subnets) means a compromised crawler can't reach the production data plane. The eu-west-1 location is also where Atrax has run historically, so the EIPs assigned to crawler hosts are the ones customers have already added to their bot allow-lists.

Cluster + EC2 shape (atrax-node)

The atrax-node cluster is not ASG-backed. It's a fixed set of EC2 instances in public subnets, each with one Elastic IP attached directly. EIPs are declared in the module but imported per environment: Terraform never allocates a new one, because doing so would change the egress IP and break customer allow-lists. Replacing an instance via terraform taint keeps the EIP attached.

The ECS service uses placement_strategy { type = "spread", field = "instanceId" } so when there are multiple hosts and multiple tasks, traffic fans out across all EIPs. CPU and memory limits are set at the container level rather than the task level — this lets a single task burst into the full host capacity when alone, while memoryReservation (the soft limit, used for placement) lets multiple tasks pack onto one host.

How a scan flows through the system

A scan is a long-lived job that bounces between the dashboard, atrax-api, and a node over many minutes. The whole flow is pull-based — atrax-api never opens an outbound connection to a node. Nodes ask for work, report progress, and push results back over HTTPS to atrax-api.{env}.cookiehub.net.

sequenceDiagram
    participant Dash as Dashboard / Core
    participant API as atrax-api
    participant DB as PostgreSQL
    participant Node as atrax-node
    participant Site as Customer site
    participant S3

    Dash->>API: POST /api/job  (domain, settings, webhook)
    API->>DB: insert job, status=Pending
    API-->>Dash: { newJob: { id, ... } }

    loop every ~500ms
        Node->>API: GET /api/job/next/{nodeId}
        API->>DB: pick eligible Pending job (load-balanced)
        API-->>Node: { job } or { job: null }
    end

    Node->>API: PATCH /api/job/{id}  status=Crawling
    loop for each page (up to maxPages / maxSeconds)
        Node->>Site: navigate, capture cookies + network
        Note over Node: every 1000 pages → POST /api/job/{id}/report (checkpoint)
    end
    Node->>API: PATCH /api/job/{id}  status=CrawlCompleted

    loop verify pages flagged during crawl
        Node->>Site: replay with consent, check cookie persistence
    end
    Node->>API: PATCH /api/job/{id}  status=VerifyCompleted

    Node->>API: POST /api/job/{id}/screenshot  (binary PNG)
    API->>S3: PUT screenshots/{domain}/{id}.png
    Node->>API: POST /api/job/{id}/report  (full JobStore JSON)
    API->>S3: PUT {express|extended}/{domain}/{id}.json
    Node->>API: PATCH /api/job/{id}  status=Completed, report=summary
    Node->>API: POST /api/notification  (fire customer webhook)

    Dash->>API: GET /api/report/{id}
    API->>S3: GET report
    API-->>Dash: { job, report }

Job lifecycle

Status is a single integer column on the Job row. atrax-api sets only the first two values (Pending on creation, Claimed on dispatch); every later transition is made by the node via PATCH /api/job/:id. atrax-api never advances an in-flight job on its own: even when a node goes stale, the API only nulls out nodeId and lets the job be re-claimed at its current status (see Node liveness).

| Value | Name | Driver | Meaning |
| --- | --- | --- | --- |
| 0 | Pending | API | Created by POST /api/job. No node assigned. |
| 1 | Claimed | API | A node picked up the job via GET /api/job/next/:nodeId. nodeId is set. |
| 2 | Crawling | Node | Identification stage: discovering and visiting pages. |
| 3 | CrawlCompleted | Node | Identification finished, verification queued. |
| 4 | VerifyCompleted | Node | Verification finished, finalization (screenshot + report upload) running. |
| 5 | Completed | Node | Terminal success. Full report is in S3, summary is on the job row. |
| 6 | Failed | Node | Terminal failure. Partial report may still be uploaded. |

stateDiagram-v2
    [*] --> Pending
    Pending --> Claimed: node calls /job/next
    Claimed --> Crawling: node starts identification
    Crawling --> CrawlCompleted
    CrawlCompleted --> VerifyCompleted: verification pass done
    VerifyCompleted --> Completed: screenshot + report uploaded
    Crawling --> Failed: ≥10 page errors or 100% failure rate
    CrawlCompleted --> Failed
    VerifyCompleted --> Failed
    Completed --> [*]
    Failed --> [*]

The values are defined in atrax-api/src/types/index.ts and atrax/src/types/jobStatus.ts — they must stay in sync between the two repos.
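As an illustration of what has to stay in sync, the status values and the transitions from the state diagram can be written down as a small lookup. This is a sketch only: the numeric values come from the table above, but the names `TRANSITIONS`, `canTransition`, and `isActive` are hypothetical and are not taken from either repo.

```typescript
// Numeric values copied from the lifecycle table; both repos must agree on them.
const JobStatus = {
  Pending: 0,
  Claimed: 1,
  Crawling: 2,
  CrawlCompleted: 3,
  VerifyCompleted: 4,
  Completed: 5,
  Failed: 6,
} as const;
type JobStatus = (typeof JobStatus)[keyof typeof JobStatus];

// Allowed transitions, transcribed from the state diagram (hypothetical helper).
const TRANSITIONS: Record<JobStatus, readonly JobStatus[]> = {
  [JobStatus.Pending]: [JobStatus.Claimed],
  [JobStatus.Claimed]: [JobStatus.Crawling],
  [JobStatus.Crawling]: [JobStatus.CrawlCompleted, JobStatus.Failed],
  [JobStatus.CrawlCompleted]: [JobStatus.VerifyCompleted, JobStatus.Failed],
  [JobStatus.VerifyCompleted]: [JobStatus.Completed, JobStatus.Failed],
  [JobStatus.Completed]: [],
  [JobStatus.Failed]: [],
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// "Active" in the sense used by dispatch and the stale sweep: status < Completed.
// Failed (6) is numerically above Completed (5), so it is excluded as well.
function isActive(status: JobStatus): boolean {
  return status < JobStatus.Completed;
}
```

Because the `status < Completed` comparisons rely on the numeric ordering, inserting a new status in the middle of the range would change the meaning of every such filter.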

Dispatch and load balancing

atrax-api does not push work — nodes pull it. The dispatch endpoint is GET /api/job/next/:nodeId (atrax-api/src/routes/job.ts), and the algorithm is intentionally simple:

  1. Count this node's active jobs (status < Completed).
  2. Count active jobs on every other Active node.
  3. Compute skip = number of nodes strictly less busy than this one.
  4. Find the first job with status < Completed AND nodeId IS NULL, ordered by status DESC, priority DESC, with OFFSET skip.
  5. Only assign it if this node has capacity:
       • < 4 active jobs → take any priority
       • 4..7 active jobs → only take jobs with priority >= 7
       • >= 8 active jobs → never take

The skip trick is a poor person's load balancer: when the node calling in is busier than its peers, it walks past the first N waiting jobs and picks one further down the queue, leaving the head for less-loaded nodes to grab on their next poll. There is no row-level locking — the assignment is prisma.job.update, last write wins. A simultaneous double-claim is theoretically possible but unlikely given the 500 ms poll cadence and OFFSET skip divergence.

The status DESC ordering is deliberate: jobs already in flight (Crawling, etc.) but orphaned by a stale-node sweep get re-claimed before brand-new Pending work, so half-finished crawls preempt new ones.
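The selection logic described above can be sketched as pure functions over an in-memory job list. This is not the code in atrax-api/src/routes/job.ts (which runs the query through Prisma); the `QueuedJob` shape and the function names are assumptions made for illustration.

```typescript
// Hypothetical in-memory shape; the real rows live in PostgreSQL.
interface QueuedJob {
  id: string;
  status: number;
  priority: number;
  nodeId: string | null;
}

const COMPLETED = 5;

// skip = number of other Active nodes that are strictly less busy than the caller.
function computeSkip(ownActive: number, otherNodesActive: number[]): number {
  return otherNodesActive.filter((n) => n < ownActive).length;
}

// Capacity tiers: < 4 takes anything, 4..7 takes priority >= 7 only, >= 8 takes nothing.
function hasCapacity(ownActive: number, jobPriority: number): boolean {
  if (ownActive < 4) return true;
  if (ownActive < 8) return jobPriority >= 7;
  return false;
}

function pickNextJob(
  jobs: QueuedJob[],
  ownActive: number,
  otherNodesActive: number[],
): QueuedJob | null {
  const skip = computeSkip(ownActive, otherNodesActive);
  // ORDER BY status DESC, priority DESC: orphaned in-flight jobs come first.
  const candidates = jobs
    .filter((j) => j.status < COMPLETED && j.nodeId === null)
    .sort((a, b) => b.status - a.status || b.priority - a.priority);
  const job = candidates[skip]; // OFFSET skip, LIMIT 1
  if (job === undefined) return null;
  // The job at that offset is checked against capacity; no later job is tried.
  return hasCapacity(ownActive, job.priority) ? job : null;
}
```

Note that a busy node whose offset lands on a low-priority job returns nothing on that poll, even if a high-priority job sits further down the queue; it simply tries again on the next poll.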

Nodes also call GET /api/job/running/:nodeId?max=4 once on startup to recover any jobs they were working on before a restart. The API filters on status > Pending AND status < Completed AND nodeId = self.

What a scan actually does

The crawler is a single Puppeteer-extra (stealth plugin) process running up to 8 concurrent jobs (MAX_CONCURRENT_JOBS in atrax/src/crawler/config.ts). Each job gets its own browser instance and its own profile directory under ./profiles/{jobId} so cookies and storage don't bleed across jobs.

Two-stage crawl

A job runs the page list twice, with the verification pass triggered as a phase change rather than a separate API job:

  1. Identification (status = Crawling). Visits each entry URL, auto-consents to common CMPs, captures cookies/storage/network/screenshot, and walks <a> hrefs that match settings.hosts and the include/exclude filters. Stops when one of maxPages, maxSeconds, or "no new pages discovered" is hit. The first page (in production, non-implementation jobs) is screenshot-captured to ./.temp/{jobId}.png.
  2. Verification (status = CrawlCompleted → VerifyCompleted). Replays only pages flagged verificationNeeded during the first pass. The user-agent flips from CookieHubScan/3.0 to CookieHubVerify/3.0. The crawler grants full CMP consent up front and re-checks whether cookies still appear; those still present after consent is granted are flagged notCompliant.

What gets captured per page

  • HTTP cookies: via CDP Network.getAllCookies (name, domain, path, expires, httpOnly, secure, session, value).
  • Storage: localStorage and sessionStorage if settings.storage is true. Values are truncated to 30 chars + ... before leaving the browser, so we never persist full session payloads.
  • Network: Network.requestWillBeSent and Network.responseReceivedExtraInfo events, capped at 5000 each per page (MAX_REQUESTS / MAX_RESPONSES). Older entries are dropped.
  • Page metadata: title, meta description, OG tags, documentElement.lang, top 20 distinct non-black/white computed colors, full HTML via page.content().
  • Tech stack: detectStack() runs against captured HTML + request URLs to label e.g. next.js, wordpress, cookiebot.
  • Cookiehub-specific: detects /c2/{hash}.js (and legacy /cc/{hash}.js) loads, whether they fired before or after GTM, and Google Consent Mode state (gcd= parameter, GCM v1 vs v2).
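The two capture limits in the list above (storage truncation and the per-page event cap) could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, and it assumes the `...` suffix is only appended when a value is actually longer than 30 characters, which the text does not state explicitly.

```typescript
const STORAGE_VALUE_LIMIT = 30;
const MAX_REQUESTS = 5000;

// Assumption: values at or under the limit are stored unchanged.
function truncateStorageValue(value: string): string {
  return value.length > STORAGE_VALUE_LIMIT
    ? value.slice(0, STORAGE_VALUE_LIMIT) + "..."
    : value;
}

// Bounded buffer that drops the oldest entry once the cap is exceeded.
class CappedBuffer<T> {
  private readonly limit: number;
  private items: T[] = [];

  constructor(limit: number) {
    this.limit = limit;
  }

  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.limit) {
      this.items.shift(); // older entries are dropped first
    }
  }

  toArray(): T[] {
    return this.items.slice();
  }
}

// One buffer per page for request events (a second one would hold responses).
const requests = new CappedBuffer<string>(MAX_REQUESTS);
requests.push("https://example.com/script.js");
```

Truncating inside the page context, before the value crosses into the Node.js process, is what keeps full session payloads out of the report entirely rather than relying on later redaction.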

On the entry URL of stage 1, the node calls into the page to grant consent through whichever CMP it detects: Cookiebot, OneTrust, Didomi, Iubenda, UserCentrics, and ~8 others (atrax/src/crawler/services/browser/handlers/). This is intentional: we want to see what tracking the site fires after a user clicks "accept all", which is when most regulators care about the cookie inventory. Stage 2 also grants consent and uses persistence as the compliance signal.

The crawler does not click visual banner buttons — it calls the JS APIs directly. If a site uses a custom CMP we don't recognise, we crawl in the un-consented state, which can hide cookies that only fire post-consent.

Per-page failure handling

Single-page navigation has two timeouts that fight each other: a 25 s wrapper around page.goto and Puppeteer's own 40 s internal timeout. The wrapper wins in practice. On entry-URL failure only, the node retries https://www.X and then http://X before giving up; intermediate-page failures are not retried.
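A minimal sketch of why the outer wrapper wins: whichever timer fires first settles the promise, and the 25 s wrapper always fires before the 40 s inner timeout. `withTimeout` and `entryUrlFallbacks` are hypothetical names, not the helpers in navigationHandler.ts.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Fallback order for a failed entry URL, as described above.
// Intermediate pages get no fallbacks.
function entryUrlFallbacks(domain: string): string[] {
  return [`https://www.${domain}`, `http://${domain}`];
}
```

With Puppeteer this would be used along the lines of `withTimeout(page.goto(url, { timeout: 40_000 }), 25_000, "page.goto")`. The wrapper only stops waiting; it does not cancel the underlying navigation, so the inner 40 s timeout still matters for releasing the page.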

A whole job is marked Failed only when:

  • ≥ 10 pages have timed out or returned 5xx and the current page also times out / 5xxs, or
  • 100 % of attempted pages errored.

Any other page failure is just logged on the page record. Bugsnag receives the top-level exceptions; per-page errors do not.
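The two job-level failure conditions can be expressed as predicates. This is a sketch with hypothetical names; it assumes the "≥ 10" count refers to pages that errored before the current one, and it leaves open when the 100 % rule is evaluated, since the text does not say.

```typescript
const PAGE_ERROR_THRESHOLD = 10;

// Rule 1: at least 10 earlier pages errored (timeout or 5xx) and the current page errors too.
function tooManyPageErrors(erroredBefore: number, currentPageErrored: boolean): boolean {
  return currentPageErrored && erroredBefore >= PAGE_ERROR_THRESHOLD;
}

// Rule 2: every attempted page errored. Zero attempts is not treated as a failure here.
function allPagesFailed(attempted: number, errored: number): boolean {
  return attempted > 0 && errored === attempted;
}
```

Under rule 1 a crawl that hits 10 errors early but then recovers keeps running: the job only fails if a further page errors after the threshold has been reached.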

Resource and content limits

| Limit | Value | Source |
| --- | --- | --- |
| Concurrent jobs per node | 8 | MAX_CONCURRENT_JOBS |
| Startup-time job recovery | up to 4 | GET /api/job/running/:nodeId?max=4 |
| Page navigation timeout | 25 s wrapper / 40 s puppeteer | navigationHandler.ts |
| Storage value capture | first 30 chars + ... | cookieHandler.ts |
| Network requests stored | 5000 / page | MAX_REQUESTS |
| Network responses stored | 5000 / page | MAX_RESPONSES |
| Link length filter | 255 chars | cdpHandler.ts |
| Detected colours | top 20 | metaHandler.ts |
| Checkpoint upload cadence | every 1000 pages | storageHandler.ts |
| Profile cleanup (legacy) | every 4 h via mycron container | cron, profiles > 1 day old |

Results and storage

The node never talks to S3 directly. It POSTs results to atrax-api and atrax-api uses its {env}-euc1-atrax-api-s3 IAM user to PUT to S3. This is the security boundary: the eu-west-1 crawler hosts run untrusted website JS but hold no cloud credentials beyond their controller bearer token.

| Endpoint | Bucket | Key | Trigger |
| --- | --- | --- | --- |
| POST /api/job/:id/report (JSON body) | atrax-{env}-express or atrax-{env}-extended | {domain}/{jobId}.json | Checkpoint every 1000 pages, plus once on completion |
| POST /api/job/:id/screenshot (binary, Content-Type: image/png) | atrax-{env}-screenshots | {domain}/{jobId}.png | After verification stage, production non-implementation jobs only |
| PATCH /api/job/:id { status, report, progress } | (PostgreSQL only) | | Phase transitions and the final summary |

Bucket selection is by settings.type: jobs with type === "extended" go to the extended bucket, anything else (including express) goes to express. The extended report retains the full per-page detail, the express report is the smaller summary used by the public-facing scan widget.
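The routing rule is small enough to state as code. This is an illustrative sketch with hypothetical function names; the actual bucket names differ per environment and are deliberately left out.

```typescript
type ReportBucketKind = "express" | "extended";

// Only an exact "extended" goes to the extended bucket; everything else,
// including a missing type, falls through to express.
function reportBucketKind(settingsType: string | undefined): ReportBucketKind {
  return settingsType === "extended" ? "extended" : "express";
}

// Object key layout shared by both report buckets: {domain}/{jobId}.json
function reportObjectKey(domain: string, jobId: string): string {
  return `${domain}/${jobId}.json`;
}
```

Defaulting to express means an unknown or misspelled type ends up in the bucket with the shorter retention rather than the longer one.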

The summary stored on the Job.report JSONB column (separate from the S3 file) carries the data the dashboard uses for the result list — risk score, cookie counts per category, GCM/TCF state, screenshot URL, detected stack — so the dashboard can render the index view without an S3 round-trip. Full per-page detail still requires GET /api/report/:id, which fetches from S3.

The screenshot bucket is public-read (it backs <img> tags in the result UI); the express and extended buckets are private with KMS encryption. Lifecycle policies (prod) are listed under S3 bucket lifecycle.

Node liveness and stale cleanup

Each node calls POST /api/node/:id/heartbeat every 30 s (HEARTBEAT_INTERVAL_MS). The endpoint sets seenAt = now and status = Active on the node row.

atrax-api runs an internal sweeper on a setInterval (handleStaleNodes in atrax-api/src/scheduler.ts) every 5 minutes. For any Active node whose seenAt is older than NODE_STALE_MS (default 1 hour), the sweeper:

  1. Sets node.status = InActive.
  2. Sets nodeId = null on every job belonging to that node where status < Completed.

The job's status is left as-is. So a node that died mid-Crawling leaves the job at status Crawling, nodeId=null, and on the next dispatch round, that job gets picked first (because ORDER BY status DESC puts in-flight statuses ahead of Pending). The new node then continues from where the previous one left off — possible because the previous node had been writing JobStore checkpoints to S3 every 1000 pages, which the new node fetches via GET /api/job/:id/checkpoint.
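The sweep can be sketched against in-memory rows to show which fields change and which do not. This is not the handleStaleNodes implementation; the row shapes and the function name are assumptions.

```typescript
// Hypothetical row shapes; the real data lives in PostgreSQL via Prisma.
interface NodeRow {
  id: string;
  status: "Active" | "InActive";
  seenAt: number; // epoch milliseconds of the last heartbeat
}
interface JobRow {
  id: string;
  status: number;
  nodeId: string | null;
}

const NODE_STALE_MS = 60 * 60 * 1000; // default: 1 hour
const COMPLETED = 5;

// Marks stale nodes InActive and orphans their unfinished jobs. Returns the stale node ids.
function sweepStaleNodes(
  nodes: NodeRow[],
  jobs: JobRow[],
  now: number,
  staleMs: number = NODE_STALE_MS,
): string[] {
  const stale: string[] = [];
  for (const node of nodes) {
    if (node.status === "Active" && now - node.seenAt > staleMs) {
      node.status = "InActive";
      stale.push(node.id);
    }
  }
  for (const job of jobs) {
    if (job.nodeId !== null && stale.indexOf(job.nodeId) !== -1 && job.status < COMPLETED) {
      job.nodeId = null; // status is intentionally left untouched
    }
  }
  return stale;
}
```

With a 30 s heartbeat and a 1 hour threshold, a dead node's jobs can sit unclaimed for up to roughly an hour plus one 5 minute sweep interval before another node picks them up.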

There is no global job timeout. A job that gets wedged on a live node will stay there until the node itself fails or is force-cycled.

atrax-api owns the cookie taxonomy that classifies raw crawl output into Necessary / Preferences / Analytics / Marketing / Uncategorized. The data lives in two Prisma models, Category and Cookie, and is synced daily from https://cdn.cookiehub.eu/db/cookies.json by the same scheduler that runs the stale-node sweep (atrax-api/src/tasks/sync-cookies.ts). The CDN is the source of truth — local DB rows are upserted by externalId.

Two endpoints expose this data:

  • GET /api/cookies — full list, used by the dashboard.
  • GET /api/cookies/prefix — only entries flagged prefix: true, used by the node to handle prefix-matched cookies (e.g. _ga_*). The node caches this list for 10 minutes.

If the daily sync fails, the next attempt is 24 hours later; there is no retry-on-failure logic.
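On the node side, the 10 minute cache and the prefix match could be sketched as follows. The class and function names are hypothetical, the loader is synchronous to keep the example short (the real one would call GET /api/cookies/prefix), and the longest-prefix-wins tie-break is an assumption.

```typescript
const PREFIX_CACHE_TTL_MS = 10 * 60 * 1000;

// Caches a single value and reloads it once the TTL has elapsed.
class TtlCache<T> {
  private readonly ttlMs: number;
  private readonly now: () => number;
  private entry: { value: T; at: number } | undefined;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, so expiry can be tested without waiting
  }

  get(load: () => T): T {
    const t = this.now();
    if (this.entry === undefined || t - this.entry.at >= this.ttlMs) {
      this.entry = { value: load(), at: t };
    }
    return this.entry.value;
  }
}

// Returns the matching prefix for a cookie name, e.g. "_ga_" for "_ga_ABC123".
// Assumption: when several prefixes match, the longest one wins.
function findPrefixMatch(cookieName: string, prefixes: string[]): string | undefined {
  let best: string | undefined;
  for (const prefix of prefixes) {
    if (cookieName.startsWith(prefix) && (best === undefined || prefix.length > best.length)) {
      best = prefix;
    }
  }
  return best;
}
```

A 10 minute TTL means a taxonomy change from the daily sync reaches every node within 10 minutes without any push from atrax-api.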

HTTP API surface

All routes require Authorization: Bearer <token>, where <token> matches one of the comma-separated values in the controller_auth SSM parameter. The only unauthenticated path is GET / (health check used by the ALB target group).
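A minimal sketch of the token check, assuming the SSM value is split on commas and surrounding whitespace is ignored (the trimming is an assumption). The function names are hypothetical and this is not the Hono middleware itself.

```typescript
// Splits the comma-separated controller_auth value into individual tokens.
function parseTokens(controllerAuth: string): string[] {
  return controllerAuth
    .split(",")
    .map((t) => t.trim())
    .filter((t) => t.length > 0);
}

// True when the header is "Bearer <token>" and <token> is one of the configured values.
function isAuthorized(authorizationHeader: string | undefined, controllerAuth: string): boolean {
  if (authorizationHeader === undefined) return false;
  const match = /^Bearer (.+)$/.exec(authorizationHeader);
  if (match === null) return false;
  return parseTokens(controllerAuth).indexOf(match[1]) !== -1;
}
```

Keeping one token per caller in the list lets a single caller's token be rotated or revoked without touching the others.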

Job lifecycle (called by dashboard / core)

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/job | Create a scan. Validates the domain via DNS, rejects localhost. Returns 429 if the caller IP already has a running express job. |
| GET | /api/job/:id | Fetch a single job row. |
| GET | /api/report/:id | Fetch job + full S3 report (only if Completed or Failed). |
| GET | /api/express-report/:id?key=<sha256> | Public-friendly summary, gated by HMAC of jobId + webhook.key. Powers the embeddable scan widget. |
| DELETE | /api/job/:id | Hard delete. |

Job lifecycle (called by node)

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /api/job/next/:nodeId | Claim the next eligible job. Implements the load-balancing skip logic. Returns { job: null } when there's no work or the node is at capacity. |
| GET | /api/job/running/:nodeId?max=N | Recover up to N in-flight jobs assigned to this node, used on node startup. |
| GET | /api/job/unassigned | Active jobs with nodeId IS NULL, used for monitoring. |
| PATCH | /api/job/:id | Advance status, push progress counters, or update the summary report. Drives the state machine. |
| POST | /api/job/:id/report | Upload full JobStore JSON. atrax-api routes it to express vs extended bucket by settings.type. |
| POST | /api/job/:id/screenshot | Upload PNG bytes. Returns the public CDN URL. |
| GET | /api/job/:id/checkpoint | Re-download the last JobStore JSON for crash recovery. |
| POST | /api/notification | Fire the customer webhook configured in settings.webhook and record the delivery attempt (no retries). |

Node management

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /api/node | Resolve a node's id from its source IP. Used by a node before it knows its id. |
| POST | /api/node | Register / re-register by (hostname, ipAddress). Returns the node id. |
| PATCH | /api/node/:id | Update node status. |
| GET | /api/node/active | List all active nodes with their job counts (powers the ops dashboard). |
| POST | /api/node/:id/heartbeat | Liveness ping. Touches seenAt and re-marks the node Active. |
| POST | /api/node/:id/deactivate | Manual drain: sets the node InActive and orphans its jobs. |

Reference data

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /api/cookies | Full taxonomy. |
| GET | /api/cookies/prefix | Prefix-match entries only (consumed by nodes, cached 10 min). |

Stage Deployment

| | atrax-api | atrax-node |
| --- | --- | --- |
| Region | eu-central-1 | eu-west-1 |
| URL / Hostname | https://atrax-api.stage.cookiehub.net | (no inbound) |
| VPC | shared default VPC (172.31.0.0/16) | dedicated 172.30.128.0/17 |
| ECS cluster | stage-euc1-core-ecs-cluster (shared core) | stage-euw1-atrax-ecs-cluster |
| EC2 host | shared core (t3.small) | t3.medium |
| EIP | n/a | eipalloc-08277e7e15ba1ecbf |
| Task sizing | 256 CPU / 512 MB hard / 256 MB soft, desired_count = 1 | 1024 CPU / 1024 MB hard / 512 MB soft, desired_count = 1 |
| Health check | GET /health | n/a |
| ECR | atrax-api (eu-central-1) | atrax-node (eu-west-1) |
| S3 buckets | atrax-stage-{express,extended,screenshots} (pre-Terraform) | no direct S3 access; POSTs to atrax-api |
| IAM user (S3) | stage-euc1-atrax-api-s3 (terraform-managed) | n/a (atrax-api owns the only S3 credentials) |
| DB | PostgreSQL RDS, database atrax, user atraxstage (owns the DB) | n/a |

Production Deployment

| | atrax-api | atrax-node |
| --- | --- | --- |
| Region | eu-central-1 | eu-west-1 |
| URL / Hostname | https://atrax-api.cookiehub.net | (no inbound) |
| VPC | shared default VPC (172.31.0.0/16) | dedicated 172.30.0.0/17 |
| ECS cluster | prod-euc1-core-ecs-cluster (shared core) | prod-euw1-atrax-ecs-cluster |
| EC2 host | shared core (t3.small) | r6a.xlarge (4 vCPU, 32 GB) running 2 tasks |
| EIP | n/a | eipalloc-0f948e1ce9cf2609d |
| Task sizing | 512 CPU / 1024 MB hard / 512 MB soft, desired_count = 1 | 1024 CPU (weight) / 16384 MB hard / 8000 MB soft, desired_count = 2 |
| Health check | GET /health | n/a |
| ECR | atrax-api (eu-central-1) | atrax-node (eu-west-1) |
| S3 buckets | cookiehub-atrax-{express,extended,screenshots} (terraform-managed) | no direct S3 access; POSTs to atrax-api |
| IAM user (S3) | prod-euc1-atrax-api-s3 (terraform-managed) | n/a |
| DB | PostgreSQL RDS, database atrax, user atraxprod (owns the DB) | n/a |

S3 bucket lifecycle (prod only)

| Bucket | Current versions | Non-current versions |
| --- | --- | --- |
| cookiehub-atrax-screenshots | 365 days | 30 days |
| cookiehub-atrax-express | 30 days | 30 days |
| cookiehub-atrax-extended | 365 days | 365 days |

screenshots is the only public bucket (AES256, public-read for static asset delivery). The other two are private with KMS encryption.

Secrets

Stored in SSM under /atrax/{env}/atrax-api/ and /atrax/{env}/atrax-node/:

| Parameter | Component | Description |
| --- | --- | --- |
| database_url | atrax-api | PostgreSQL connection string. Must include ?uselibpqcompat=true&sslmode=require (see DB notes). Password must be URL-encoded. |
| controller_auth | atrax-api | Bearer tokens accepted by atrax-api (comma-separated, one per caller). |
| s3_access_key_id | atrax-api | Access key from the terraform-managed {env}-euc1-atrax-api-s3 IAM user. |
| s3_secret_access_key | atrax-api | Secret key for the same IAM user. |
| controller_url | atrax-node | URL of atrax-api (e.g. https://atrax-api.cookiehub.net). |
| controller_auth | atrax-node | Bearer token used to authenticate with atrax-api. Must match one of the api's controller_auth tokens. |
| bugsnag_api_key | atrax-node | Error tracking. |

The two S3 SSM values are populated from terraform outputs after terraform apply:

# from environments/{env}/eu-central-1
aws ssm put-parameter --type SecureString --overwrite \
  --name /atrax/{env}/atrax-api/s3_access_key_id \
  --value "$(terraform output -raw atrax_api_s3_access_key_id)" \
  --region eu-central-1
aws ssm put-parameter --type SecureString --overwrite \
  --name /atrax/{env}/atrax-api/s3_secret_access_key \
  --value "$(terraform output -raw atrax_api_s3_secret_access_key)" \
  --region eu-central-1

The other secrets are still set manually after first apply (placeholder values are written by terraform with lifecycle { ignore_changes = [value] }). See Secrets management.

Database notes

atrax-api uses Prisma 7 with @prisma/adapter-pg. Two non-obvious requirements when setting database_url:

  1. Connection-string flags. Append ?uselibpqcompat=true&sslmode=require. Without uselibpqcompat=true, queries fail with DriverAdapterError: DatabaseAccessDenied (Prisma error P1010) even when authentication and permissions are fine — the adapter handles connection initialization differently from libpq, and RDS expects libpq behavior. sslmode=require enables TLS, which RDS supports natively.
  2. URL-encode the password. The random_password.postgres resource generates passwords with ?, #, &, +, [, ], etc. — all reserved in URL syntax. Strict URL parsers (Prisma's Rust-based one) reject them; lenient ones (pg, used by core-api / vault-api) tolerate them. Use encodeURIComponent on the raw password before substituting into the URL.
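Both requirements can be applied in one place when building the value for database_url. This is a sketch: the `DbParams` shape, the function name, and the example host are placeholders, not values from the infrastructure.

```typescript
interface DbParams {
  user: string;
  password: string; // raw password, exactly as generated
  host: string;
  port: number;
  database: string;
}

// Percent-encodes the credentials and appends the two required flags.
function buildDatabaseUrl(p: DbParams): string {
  const user = encodeURIComponent(p.user);
  const password = encodeURIComponent(p.password);
  return (
    `postgresql://${user}:${password}@${p.host}:${p.port}/${p.database}` +
    `?uselibpqcompat=true&sslmode=require`
  );
}

// Example with a password containing reserved characters (placeholder host).
const url = buildDatabaseUrl({
  user: "atraxstage",
  password: "p?#&+[]",
  host: "db.example.internal",
  port: 5432,
  database: "atrax",
});
```

Without the encoding, the first `?` or `#` in the password would be read as the start of the query string or fragment, so a strict parser would see a truncated password and a malformed host.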

Each environment has a dedicated postgres user (atraxstage / atraxprod) that owns the database and the public schema — set with ALTER DATABASE atrax OWNER TO … plus REASSIGN OWNED BY cookmin TO … after importing any schema dump as the master user. Owning everything avoids per-table grant maintenance.

Deployment

Deploy by pushing the image to ECR, then forcing a new deployment with aws ecs update-service --force-new-deployment:

# atrax-api (eu-central-1)
aws ecs update-service --cluster {env}-euc1-core-ecs-cluster \
  --service atrax-api --force-new-deployment --region eu-central-1

# atrax-node (eu-west-1)
aws ecs update-service --cluster {env}-euw1-atrax-ecs-cluster \
  --service atrax-node --force-new-deployment --region eu-west-1

Force-new-deployment is also the way to make a running task pick up updated SSM values, since the ECS agent only reads secrets at task launch.

Dependencies

  • PostgreSQL RDS (eu-central-1) — scan jobs, results, node registrations
  • S3 — crawl reports (express, extended) and screenshots
  • Core API / Dashboard — triggers scans via REST and reads results
  • No cross-region peering between the eu-central-1 and eu-west-1 VPCs — atrax-node reaches atrax-api only via the public ALB (atrax-api.{env}.cookiehub.net)

Scaling

The atrax-node cluster is sized horizontally (number of EIPs available to import) × vertically (instance type + tasks-per-host). To add a second host:

  1. Allocate a new EIP in the prod account, eu-west-1.
  2. Bump node_count = 2 in environments/prod/eu-west-1/atrax.tf.
  3. terraform import 'module.atrax_ecs.aws_eip.node["b"]' eipalloc-… before running plan.
  4. terraform apply.

To add another worker on an existing host: bump desired_count on the atrax-node service.