Asking Claude What's on My Driveway
I asked Claude a question this evening: "any cars on the drive?"
About thirty seconds later I had an answer. Not "yes" or "no" - a description. Make, colour, where each one was parked. Plus a follow-up I hadn't asked for: the cars had been there throughout the last ten minutes and hadn't moved.
The thing that answered me wasn't a search engine, a smart-home plugin, or a cloud service. It was a small Bun process running on my MacBook with a tiny tool-set wired into Claude Code via MCP. The cameras feeding it are the ones already bolted to my house. The whole stack came together in one evening.
I want to write this up because the moment Claude described a car it had never been told existed - and then volunteered that the car was still there - I had the same feeling I had the first time I saw GitHub Copilot autocomplete a function. The substrate had shifted under my feet a little.
What I Actually Built
I have a Reolink NVR with five cameras around the house. It's a normal consumer setup, accessed via a Raspberry Pi Zero 2W acting as a Tailscale subnet router because the cameras live on an isolated VLAN. Useful for keeping them off the open internet. Annoying for building anything on top of them.
The end product is a single page in a browser. You type a question. A pill at the top shows what Claude is doing (listing cameras, looking at Front Gate, scanning Main Drive in the last 10 minutes...) with a live progress bar as it extracts frames. Thumbnails of the actual frames Claude looked at fade into a strip below. Then the answer streams in. It's a chat, so you can follow up. "What colour was the first car?" "Did anyone walk past the gate between 6 and 7?" Each follow-up resumes the same session, so Claude doesn't redo work it's already done. Follow-ups cost about a tenth of the first turn thanks to prompt caching.
Under the hood, three pieces:
- A recorder that pulls the five low-res sub-streams off the NVR via RTSP and writes 60-second MP4 segments to disk, with 48 hours of rolling retention. About 18GB on disk at steady state. Runs as a launchd agent so it's always on.
- An MCP server with five tools Claude can call: list cameras, get a live frame from one, get a range of frames from one, get frames across all cameras, and report on recorder health.
- A Bun HTTP server that runs claude -p per question, parses its stream-json output, and forwards the interesting bits to the browser as Server-Sent Events (sketched below).
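That third piece is small enough to sketch. This is a minimal version, not the real server: the port, the /ask-style query parameter, and the choice to forward raw JSON lines (rather than parse and filter them) are all illustrative, and the exact claude flags are what I'd reach for rather than gospel.

```ts
// Minimal sketch: one question in, one `claude -p` run, SSE lines out.
Bun.serve({
  port: 3000,
  idleTimeout: 255, // seconds - see the timeout bug later in this post
  fetch(req) {
    const question = new URL(req.url).searchParams.get("q") ?? "";

    // -p is print mode; stream-json emits one JSON object per line
    // (print mode's stream-json output also wants --verbose).
    const proc = Bun.spawn(
      ["claude", "-p", question, "--output-format", "stream-json", "--verbose"],
      { stdout: "pipe" },
    );

    const stream = new ReadableStream({
      async start(controller) {
        const enc = new TextEncoder();
        const dec = new TextDecoder();
        let carry = ""; // chunks can split a JSON line; buffer the remainder
        for await (const chunk of proc.stdout) {
          carry += dec.decode(chunk, { stream: true });
          const lines = carry.split("\n");
          carry = lines.pop() ?? "";
          for (const line of lines) {
            if (line.trim()) controller.enqueue(enc.encode(`data: ${line}\n\n`));
          }
        }
        controller.close();
      },
    });

    return new Response(stream, {
      headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
    });
  },
});
```

Session resumption rides on the same mechanism: the first stream-json event carries a session id, and passing it back via --resume on the next question is what makes follow-ups cheap.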
The Constraint That Shaped It
I started out planning to use the NVR's own recordings. They're already there - the box has a 1.85TB hard drive and keeps a week of footage. Why duplicate that?
Because the firmware doesn't expose a usable HTTP playback API. The Search endpoint will happily tell you which recordings exist for a given time window. The Playback endpoint, on this firmware, returns 403 or 404 for every variation I tried. There's no clean way to pull an MP4 of "Front Gate between 18:00 and 18:15" out of the NVR via HTTP. It exists. You just can't get at it.
So I built a parallel ring buffer. Five ffmpeg processes, one per camera, each muxing the sub-stream into rolling 60-second MP4 segments with timestamped filenames in UTC. A supervisor restarts crashed children with exponential backoff. A GC loop deletes anything older than 48 hours. Total CPU load on my M4 is negligible. The whole recorder fits in about 200 lines of TypeScript.
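The core of those 200 lines looks roughly like this. Camera names and RTSP paths are illustrative (the h264Preview_XX_sub paths are Reolink's sub-stream convention), and the real supervisor logic is elided to a comment:

```ts
// Sketch of the recorder: one ffmpeg per camera, 60 s segments, 48 h GC.
import { mkdirSync, readdirSync, statSync, unlinkSync } from "node:fs";

const CAMERAS: Record<string, string> = {
  front_gate: "rtsp://nvr.local:554/h264Preview_01_sub",
  main_drive: "rtsp://nvr.local:554/h264Preview_02_sub",
  // ...one entry per camera
};
const RETENTION_MS = 48 * 60 * 60 * 1000;

for (const [name, url] of Object.entries(CAMERAS)) {
  mkdirSync(`segments/${name}`, { recursive: true });
  // -c copy remuxes without re-encoding, which is why CPU cost stays tiny.
  // The segment muxer cuts 60-second MP4s; -strftime stamps each filename
  // with its start time, and TZ=UTC keeps those timestamps in UTC.
  Bun.spawn(
    ["ffmpeg", "-rtsp_transport", "tcp", "-i", url,
     "-c", "copy", "-f", "segment", "-segment_time", "60",
     "-reset_timestamps", "1", "-strftime", "1",
     `segments/${name}/%Y%m%dT%H%M%SZ.mp4`],
    { env: { ...process.env, TZ: "UTC" } },
  );
  // the real supervisor awaits proc.exited and respawns with exponential backoff
}

// GC loop: delete any segment older than the retention window.
setInterval(() => {
  for (const name of Object.keys(CAMERAS)) {
    for (const file of readdirSync(`segments/${name}`)) {
      const path = `segments/${name}/${file}`;
      if (Date.now() - statSync(path).mtimeMs > RETENTION_MS) unlinkSync(path);
    }
  }
}, 60_000);
```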
Constraints producing cleaner architecture is one of my favourite genres of feeling. The NVR's own retention is now my backup, not my source. Day-to-day, I own the bits.
MCP Is The Product Surface
If you haven't used Anthropic's Model Context Protocol yet, the one-line summary: it's a standard way to expose tools, resources, and prompts to an LLM client. You write a small server. The client (Claude Code, here) registers your tools alongside its built-ins. When the model decides it needs to "look at Front Gate", it calls your function. You return frames as base64-encoded images in the response. Claude sees them.
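To make that concrete, here's roughly what one of the five tools looks like using the official TypeScript SDK. The tool name, camera map, and frame-grabbing helper are illustrative stand-ins for mine:

```ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const RTSP: Record<string, string> = {
  front_gate: "rtsp://nvr.local:554/h264Preview_01_sub", // illustrative
};

// Hypothetical helper: pull a single JPEG frame off a camera's sub-stream.
async function grabFrameBase64(camera: string): Promise<string> {
  const proc = Bun.spawn(
    ["ffmpeg", "-rtsp_transport", "tcp", "-i", RTSP[camera],
     "-frames:v", "1", "-f", "image2pipe", "-vcodec", "mjpeg", "pipe:1"],
    { stdout: "pipe", stderr: "ignore" },
  );
  const bytes = await new Response(proc.stdout).arrayBuffer();
  return Buffer.from(bytes).toString("base64");
}

const server = new McpServer({ name: "cctv", version: "1.0.0" });

server.tool(
  "get_live_frame",
  "Fetch the current frame from one camera as a JPEG.",
  { camera: z.string().describe("Camera name from list_cameras") },
  async ({ camera }) => ({
    // MCP image content: base64 data plus a MIME type. Claude sees the frame.
    content: [{ type: "image", data: await grabFrameBase64(camera), mimeType: "image/jpeg" }],
  }),
);

await server.connect(new StdioServerTransport());
```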
The thing that's worth saying out loud is that MCP is where the productisation happens. The LLM is generic. The tools you give it - and the descriptions you write for those tools - are the entire product. My get_frames_all_cameras tool isn't just "extract frames from a time window." Its description is a small piece of operator wisdom:
IMPORTANT - frame density: brief events (delivery, passing car, person at door) may last only 3-10 seconds. Sample densely enough to catch them. Aim for one frame every 10-15 seconds per camera. For a 5-min window use ~30, 10-min use ~50-60, 30-min use ~90.
That paragraph is more important than the implementation. It teaches Claude how to use the tool in a way that matches real-world cadence. The first version of the tool defaulted to 6 frames per camera over 10 minutes - one frame every 100 seconds. A delivery person could absolutely be on camera for 8 seconds and never appear. Bumping the defaults and the descriptions was the difference between "sometimes misses things" and "catches the things you actually care about." None of that lives in the model. All of it lives in my MCP server.
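You can also make the defaults track the description instead of hoping Claude reads it. A hypothetical helper, tuned to the same numbers as the guidance above:

```ts
// One frame every ~10 s per camera, clamped so long windows stay affordable.
function defaultFrameCount(windowSeconds: number, maxFrames = 90): number {
  return Math.min(maxFrames, Math.max(6, Math.ceil(windowSeconds / 10)));
}

defaultFrameCount(5 * 60);  // 30 - a 5-minute window
defaultFrameCount(10 * 60); // 60 - a 10-minute window
defaultFrameCount(30 * 60); // 90 - clamped for a 30-minute window
```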
Two Bugs I'll Save You Some Time On
If you build something similar, two specific things will burn you.
Bun.serve's idleTimeout: 0 doesn't disable the idle timeout. I had set it to 0, intending to disable it. The server quietly fell back to the 10-second default and killed every SSE connection mid-stream. The only sign was a single log line: [Bun.serve]: request timed out after 10 seconds. Fix: set it explicitly to 255 (the documented max) and emit an SSE comment as a keepalive every five seconds for safety.
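The minimal shape of both halves of the fix, stripped down to a self-contained repro:

```ts
Bun.serve({
  port: 3000,
  idleTimeout: 255, // seconds - the documented max; 0 silently falls back to 10
  fetch() {
    let keepalive: ReturnType<typeof setInterval>;
    const stream = new ReadableStream({
      start(controller) {
        const enc = new TextEncoder();
        // A line starting with ":" is an SSE comment: EventSource ignores it,
        // but it counts as traffic and resets the idle clock.
        keepalive = setInterval(
          () => controller.enqueue(enc.encode(": keepalive\n\n")),
          5_000,
        );
      },
      cancel() {
        clearInterval(keepalive);
      },
    });
    return new Response(stream, { headers: { "Content-Type": "text/event-stream" } });
  },
});
```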
Streamed text_delta partials can drop on the wire. Claude Code's stream-json output sends incremental text deltas as the model writes. They're fine 99% of the time. Occasionally, an answer truncates mid-sentence in the UI. The fix isn't smarter partial handling - it's also reading the canonical final assistant message (which contains the complete text in a single block) and emitting that as a text_full event that replaces whatever was streamed. Belt and braces. The partials give you the live typewriter effect; the final replace guarantees the rendered answer is always complete.
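In stream-json terms: the partials arrive as stream_event wrappers around content_block_delta events (when partial streaming is enabled), and the canonical message arrives as an assistant event with complete text blocks. A sketch of the belt-and-braces handling, with hypothetical browser-facing event names:

```ts
// Hypothetical event shapes sent to the browser over SSE.
type SseEvent = { type: "text_delta"; text: string } | { type: "text_full"; text: string };

function handleClaudeLine(line: string, emit: (e: SseEvent) => void) {
  const msg = JSON.parse(line);

  // Partial deltas: drive the live typewriter effect, may occasionally drop.
  if (msg.type === "stream_event" && msg.event?.delta?.type === "text_delta") {
    emit({ type: "text_delta", text: msg.event.delta.text });
  }

  // Canonical assistant message: the complete text, emitted as a replacement.
  if (msg.type === "assistant") {
    const full = (msg.message.content as Array<{ type: string; text?: string }>)
      .filter((block) => block.type === "text")
      .map((block) => block.text)
      .join("");
    if (full) emit({ type: "text_full", text: full });
  }
}
```

On the client, text_full simply overwrites whatever the deltas had rendered, so a dropped partial can never leave a truncated answer on screen.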
Why This Matters
I've been building software professionally since 2019. Most of what I shipped this evening would have been a multi-week project five years ago. The ring buffer, sure - that's just ffmpeg. The UI, sure - that's just React. The networking via Tailscale, also routine.
The new thing - the thing that changes the equation - is that I didn't have to write a single line of computer-vision code. No YOLO. No model fine-tuning. No "person detection" pipeline I'd have to maintain. I extract frames; Claude looks. The whole "understand what's in the image" layer, which used to be the hardest part of a project like this, is now the thing I get for free by handing Claude a JPEG.
What that means, practically: vertical AI products you'd never have bothered with five years ago are now an evening of work. Pick something specific to your life. Get the data into a place where you can hand frames or rows or documents to an LLM. Wrap it in an MCP server with thoughtful tool descriptions. Put a small UI in front of it.
This pattern is going to eat a lot of "should we build that?" decisions in the next year. The cost of "yes, let's just try" has collapsed. The remaining hard part - the part this evening didn't address but the next one will - is closing the loop. Right now my watcher answers when asked. The interesting v2 is the one that also watches, runs a cron, notices what's worth surfacing, and pings me unprompted. That's a different post.
For tonight: the recorder is running as a launchd agent, the dashboard is at localhost:5173, and I can ask my house questions. It's a small thing. It's also the kind of small thing that wasn't possible last year.