
1.4 Million Users and a Raspberry Pi at My Parents' House

Nathan Atherton · Staff Software Engineer · May 16, 2026 · 8 min read

I help run a Discord bot. It does one thing: looks up game stats for people. You type a slash command, it goes and fetches your numbers, draws a card, sends it back. Simple.

It has been doing this since January 2021. Five and a bit years. Last weekend I sent a signal to the cluster manager and it wrote a JSON file to /tmp with the current totals across all four clusters:

{
  "totals": {
    "guilds": 11036,
    "members": 1476565
  }
}

1.4 million users. Across eleven thousand servers. On a free side project nobody runs as a business. The bot has no marketing, no website worth speaking of, and the codebase is exactly what you would expect from something that started in January 2021 and has been patched on evenings ever since.

I want to write this up because the bot is the side project I have the most complicated feelings about. It mostly just runs. I forget it exists for months at a time. And then one Saturday afternoon it falls over in a way that takes the rest of the day to fix, and I am reminded that 1.4 million people are very real even when the abstraction is just guild count goes up.

What "It Just Runs" Actually Means

For long stretches, looking after the bot looks like nothing. The host is a VPS in Europe somewhere. It runs a cluster manager, four worker clusters, and a separate API service - all Node.js, all under systemd, all restarting themselves cleanly if anything dies. Mongo lives on the same box. There is a nodemon in front of the workers because at some point in 2021 somebody wanted live reload and we never took it out.
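
The supervision layer is nothing exotic. A minimal sketch of what one of those units looks like - the service name is borrowed from the journalctl command later in this post, and the paths are illustrative, not copied from the box:

[Unit]
Description=rltrack cluster manager
After=network-online.target

[Service]
WorkingDirectory=/opt/rltrack
EnvironmentFile=/opt/rltrack/.env
ExecStart=/usr/bin/node cluster-manager.js
# This one line is most of the ops story
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target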

I check in on it roughly never. The Top.gg server count auto-updates. Users invite the bot to new servers, use it, leave servers, and we find out about it from logs nobody is reading. Every now and then somebody opens an issue on the repo and I fix it on a Sunday.

That arrangement has held up under a workload that has, frankly, no business running on this little. Several hundred slash commands a day. Eleven thousand guilds. Sharding done by discord-hybrid-sharding, which has been a quietly excellent piece of software for years now. discord.js v13 - yes, still v13 - happily handling the gateway. The hardest part of the stack, by far, is the part that scrapes upstream data, because that is the part that lives at the mercy of someone else's anti-bot policy.
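
For anyone who has not used it, the manager side of discord-hybrid-sharding is pleasantly small. A rough sketch of how a setup like ours gets wired up - the file names and shards-per-cluster count are illustrative, and the ClusterManager shape shown is the library's current API, which may not match a 2021-vintage install:

// cluster-manager.js - sketch, not the production file
const { ClusterManager } = require('discord-hybrid-sharding');

const manager = new ClusterManager('./bot.js', {
  totalShards: 'auto',   // let Discord recommend a shard count
  shardsPerClusters: 2,  // group shards into worker processes
  mode: 'process',       // one OS process per cluster
  token: process.env.DISCORD_TOKEN,
});

manager.on('clusterCreate', (cluster) => console.log(`Launched cluster ${cluster.id}`));
manager.spawn({ timeout: -1 });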

The Saturday Incident

I got a message that the bot was returning API ERROR on every rank lookup. I could not reach the hosted logs directly, so I cloned the repo, copied the production env vars onto my Mac (swapping in the test bot's token), started a local Mongo in Docker, and ran the bot. It worked perfectly. I ran /rank on myself in a test server and got a clean card back.

So the bug was not in the code. It was on the host.

This is the bit I want hiring managers to read carefully. Always reproduce before you fix. If I had assumed it was a parse error and gone digging through the HTML scraper, I could have spent hours. Instead, two minutes of npm start on my laptop told me the code was fine. The next place to look was upstream.

I finally got SSH access to the production box and tailed journalctl -u rltrack.service. The errors were repetitive and unambiguous:

Access denied | api.[upstream] used Cloudflare to restrict access

Cloudflare was blocking every request from the VPS's IP range. The bot was rotating user agents - desktop Chrome, Firefox, Edge, macOS, Windows - and every attempt was refused. That meant it was not a user-agent problem. It was an IP reputation problem. Sometime recently, the VPS's datacentre range had been flagged.
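
For context, the rotation is nothing clever. Something like the sketch below, with the real UA strings elided - setUserAgent is standard Puppeteer, the list and helper are my reconstruction:

// Pick a random user agent per page; actual strings trimmed for brevity
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',        // desktop Chrome
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',  // macOS variants
];
const randomUA = () => USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

// Applied before navigation:
// await page.setUserAgent(randomUA());

When every entry in that list gets the same refusal, the only variable left is the IP.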

Dead Ends, In Order

I want to be honest about the dead ends, because the dead ends are the part that does not show up in the diff.

  • Switching data sources. The bot supports two upstream providers, swappable by env var. I flipped it. The other one was also Cloudflare-protected. Same wall.
  • Cloudflare Workers as a proxy. Free tier, no install needed. The catch is that the Worker hits the upstream from Cloudflare's own IP ranges. Sometimes that bypasses the block, sometimes it does not. I did not love the variance. (A sketch of the Worker is below, after this list.)
  • Residential proxy services. $3-5/month would solve this in five minutes. But this is a free bot. I do not want it to have a monthly bill. Mentally filed as plan C.
  • Reverse SSH tunnel from a Pi I have at my parents' house. Their home IP is residential. Cloudflare loves residential IPs. But the VPS's hosting provider has IP allowlisting on port 22, and my parents' IP is not on the list, and the VPS owner was AFK that day.
  • Tailscale on the VPS. Would work in five minutes. But the VPS owner had to approve the install, and again, AFK.
  • Running a headless browser on the Pi. Briefly considered. The Pi in question is a Pi Zero 2 W with 425MB of RAM. Chromium would not even start.
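
For reference, the Workers dead end really was only a few lines, which is exactly why it was tempting. A minimal sketch of the pass-through proxy, with a placeholder upstream hostname:

// worker.js - Cloudflare Worker pass-through proxy (sketch)
export default {
  async fetch(request) {
    const url = new URL(request.url);
    url.hostname = 'api.upstream.example'; // placeholder for the real upstream
    // Forward the original method, headers, and body unchanged
    return fetch(new Request(url, request));
  },
};

The catch, as noted in the list: the outbound fetch leaves from Cloudflare's own egress IPs, so whether the upstream's rules let it through is out of your hands.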

The thing that made this fun rather than frustrating is that at no point was I out of options. Every dead end pruned the tree. By dinner time I had a clear answer: route the bot's outbound traffic through my parents' house, with the Pi as the relay, but somehow without needing inbound ports on either side.

The Fix

Cloudflare Tunnel. The version that handles arbitrary TCP, not just HTTP origins. Both ends dial outbound to Cloudflare's edge. Neither end opens an inbound port. The tunnel carries TCP between them.

The architecture ended up looking like this:

  • Pi: tinyproxy listening on 127.0.0.1:8888 - a tiny HTTP proxy, sub-second startup, single config file. cloudflared running as a systemd service, registered as a named tunnel with a TCP service ingress pointing at port 8888.
  • Cloudflare: a hostname I owned (one of my personal domains, moved onto Cloudflare DNS for this purpose). The tunnel routes by hostname.
  • VPS: cloudflared access tcp as a systemd service, listening on 127.0.0.1:8888, forwarding the local TCP stream through the tunnel to the Pi.
  • Bot: a one-line change in puppeteer.js to pass --proxy-server to Chromium when the PROXY_URL env var is set.
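
Concretely, the two ends are a handful of lines each. A sketch assuming a hostname like proxy.example.com - the real tunnel name, hostname, and credentials path are whatever your cloudflared setup produced:

# Pi: /etc/cloudflared/config.yml - named tunnel with a TCP ingress
tunnel: rltrack-proxy
credentials-file: /etc/cloudflared/rltrack-proxy.json
ingress:
  - hostname: proxy.example.com
    service: tcp://127.0.0.1:8888   # tinyproxy
  - service: http_status:404        # required catch-all rule

# VPS: the client side of the same tunnel, run under systemd
# cloudflared access tcp --hostname proxy.example.com --url 127.0.0.1:8888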

Once it was wired up, every headless browser instance the bot launched on the VPS would speak HTTP CONNECT to local port 8888, which tunnelled through Cloudflare's edge, which forwarded to the Pi, which forwarded to the actual upstream. The upstream saw a residential UK IP and a real Chromium fingerprint and let the request through. The challenge that the VPS could not pass, the Pi-fronted browser passed on the first try.
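
The bot-side change really is one line of substance. A sketch of the shape it took in puppeteer.js - the variable and function names here are assumptions, not the production code:

// Route Chromium through the local tunnel entrance when PROXY_URL is set
const puppeteer = require('puppeteer');

async function launchBrowser() {
  const args = [];
  if (process.env.PROXY_URL) {
    // e.g. PROXY_URL=http://127.0.0.1:8888 - the cloudflared access listener
    args.push(`--proxy-server=${process.env.PROXY_URL}`);
  }
  return puppeteer.launch({ args });
}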

The diagnostic that confirmed it was satisfying. From the VPS:

$ curl -x http://127.0.0.1:8888 https://api.ipify.org
90.252.154.138

That's my parents' home IP. Showing up as the source of a request issued from a server in Europe.

The Numbers Bit

While I was in there, I added a SIGUSR2 handler to the cluster manager process. kill -SIGUSR2 the manager and it broadcasts an eval to every shard, collects the per-cluster guild and member counts, and dumps the result as JSON to /tmp/rltrack-stats.json. No new HTTP endpoint, no new Discord command, nothing exposed publicly. Just a quiet local-only debug capability that lives in the running process.
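
The handler is small enough to sketch in full. Assume manager is the cluster manager from earlier; the eval'd function runs inside each cluster, where it receives the discord.js client:

const fs = require('fs');

process.on('SIGUSR2', async () => {
  // Ask every cluster for its own counts over IPC
  const perCluster = await manager.broadcastEval((client) => ({
    guilds: client.guilds.cache.size,
    members: client.guilds.cache.reduce((sum, g) => sum + g.memberCount, 0),
  }));

  // Sum in the manager and dump to tmpfs - no endpoint, nothing public
  const totals = perCluster.reduce(
    (acc, c) => ({ guilds: acc.guilds + c.guilds, members: acc.members + c.members }),
    { guilds: 0, members: 0 }
  );
  fs.writeFileSync('/tmp/rltrack-stats.json', JSON.stringify({ perCluster, totals }, null, 2));
});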

The first time I fired it I got back numbers that felt low - half the clusters were reporting zero. Looked at the journal. The bot had just restarted from a deploy and only two of the four clusters were READY. Waited 90 seconds, fired again, got the real number. 11,036 guilds, 1,476,565 members - guilds nicely balanced across clusters, members much less so. The largest cluster has more members than the next two combined, which means a small number of huge servers are doing a lot of the work. That checks out for how Discord population usually distributes.

That moment - one signal, two seconds, a million and a half people - was the part of the day I want to remember. Not because the number is impressive in some absolute sense. Lots of things have more users. But because the entire mechanism - broadcastEval across cluster IPC, summed in Node, written to a tmpfs - has been quietly working in some form for years, and I had simply never asked it the question.

What This Has Taught Me

I do this job for a living and I read engineering blogs full of advice. The bot has been my private contradiction of a lot of that advice. A few things I have actually learned from running it for five years:

  • Old dependencies are usually fine. discord.js v13 is years out of date. It works. The migration cost would be real and the value would be invisible to users. I will move when something forces me to.
  • Process supervision is enough. systemd with Restart=always handles 95% of failure modes you actually see in production for this kind of workload. Kubernetes was never going to be in the picture.
  • Reproduce before you fix. The Saturday incident took maybe 20 minutes to localise to "external block" because the first thing I did was run the bot locally. Most of the wasted time in my career has been people skipping that step.
  • Free side projects have real users. Every time the bot breaks I am reminded that 1.4 million people did not consent to my Saturday plans, and that "it's just a hobby" stops applying somewhere between zero and a million.
  • The cheapest solution is often the most elegant. A Pi I had been using for something else, a domain I already owned, and a Cloudflare free-tier tunnel. Total marginal cost: zero. Total marginal infrastructure: zero. The bot's annual operating budget is still functionally £0.

It is still running. I checked. 11,041 guilds now, five up since this morning. Somebody is using /rank in a server I will never see, on a Pi that is two hundred miles from me, routed through a tunnel I built between dinner and bedtime on a Saturday. I love this job sometimes.