honestly this is a pretty natural move; mobile access to your coding agent makes sense when you just want to check on a long-running task or review something quickly.
I've been building something similar: basically a way to run your full dev environment on your Mac and connect to it from iOS, with the terminal, files, and AI agent all talking to the same session. The tricky part is honestly just keeping state in sync when you switch devices.
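On the sync point, the simplest framing I've found is a monotonic revision counter with last-writer-wins merging. A toy sketch (the names `SessionState` and `merge` are made up for illustration, not actual code from the app):

```python
# Hypothetical last-writer-wins sketch for cross-device session sync.
from dataclasses import dataclass

@dataclass
class SessionState:
    rev: int    # monotonic revision counter, bumped on every local edit
    data: dict  # open files, terminal scrollback, agent transcript, ...

def merge(local: SessionState, remote: SessionState) -> SessionState:
    # the device that wrote last wins; on a tie, keep the local copy
    return remote if remote.rev > local.rev else local
```

It punts on concurrent edits entirely, which is usually fine when one device is active at a time, but real conflict resolution gets hairier fast.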
yeah the cost thing is real, though I think it's moving fast enough that the trajectory matters more than today's price.
like a year ago running an agent on a non-trivial task was painful and expensive. now it's annoying and somewhat expensive. tbh at the rate efficiency is improving I wouldn't bet on "still more expensive than a human" holding for long.
the reviewer/worker pipeline is honestly the part I'm most curious about.
like, how do you handle disagreements between agents? Does the reviewer just block while the worker retries, or is there a loop with a hard cutoff?
the failure mode I'd worry about most is cascading context drift, where each agent in the chain slightly misunderstands the task and by the time you get to the test agent it's validating the wrong thing entirely.
fwiw I think the LanceDB memory is the right call for this kind of setup, keeping shared context grounded is probably what prevents most of those drift issues.
The worker-reviewer pipeline typically runs 1–2 self-revision iterations. In my experience, agents handle most tasks fine, but they tend to miss quality gates — docstrings, minor business logic edge cases, that kind of thing. The reviewer catches what slips through on the code quality side.
This is all based on observed behavior from daily Claude Code CLI usage, where I've added hooks specifically to catch systematic failure patterns. OpenSwarm is essentially a productized version of those scaffoldings from my actual workflow — packaged into a more reusable architecture.
On context drift — good call, and yeah, that's exactly why the shared memory layer matters. LanceDB keeps the grounding consistent across the chain so each agent isn't just working off its own drifting interpretation.
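To make the grounding idea concrete, here's a toy in-memory stand-in for what the shared memory layer does. OpenSwarm uses LanceDB; this is just the shape of the idea with a plain cosine-similarity search, not LanceDB's actual API:

```python
# Toy stand-in for a shared memory layer: one store, queried by every agent.
import math

class SharedMemory:
    def __init__(self):
        self.entries = []  # (embedding, text) pairs shared by every agent

    def add(self, embedding, text):
        self.entries.append((embedding, text))

    def ground(self, query, k=3):
        # All agents query the same store, so their context stays anchored
        # to one source of truth instead of each agent's drifting summary.
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self.entries, key=lambda e: cos(query, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

The point is that each agent retrieves from the shared store before acting, rather than passing its own paraphrase of the task down the chain.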
As for disagreements: right now the reviewer blocks and the worker retries with feedback, with a hard cutoff to prevent infinite loops. It's simple but it works — the revision depth rarely needs to go beyond 2 rounds. And when it does fail, that's actually the useful signal — especially when you're triaging larger projects, the points where agents break down are exactly where a human engineer needs to step in.
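The control flow is basically the sketch below (simplified; `worker` and `reviewer` are stand-in callables, not the real interfaces):

```python
# Minimal sketch of a blocking-reviewer loop with a hard cutoff.
MAX_ROUNDS = 2  # revision depth rarely needs to go beyond 2

def run_pipeline(task, worker, reviewer, max_rounds=MAX_ROUNDS):
    feedback = None
    for _ in range(max_rounds + 1):  # initial attempt + max_rounds revisions
        draft = worker(task, feedback)
        approved, feedback = reviewer(task, draft)
        if approved:
            return draft  # reviewer signed off
    # Hard cutoff: surface the failure for human triage instead of looping.
    raise RuntimeError(f"review still failing after {max_rounds} revisions: {feedback}")
```

Raising on cutoff is deliberate: a hard failure is exactly the signal that a human needs to step in.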
At this point, what OpenSwarm really needs is broader testing from other users to validate these patterns outside my own workflow.
The NAT traversal angle is honestly the most compelling part here; WebRTC's ICE/STUN/TURN stack handles weird network topologies pretty gracefully without you having to think about it.
That said I think spzb's point is real, the signaling server is still a trust boundary even if the p2p channel itself is encrypted. Tailscale sidesteps a lot of that by having a more battle-tested auth model tbh.
On the "why would you even want this on a phone" thing though, I guess I disagree a bit. I've been building something related (Xtro: terminal + source control + AI agent on iPhone, connected to a Mac), and the use case is surprisingly legit once the layout is actually designed for mobile: emergency deploys, quick fixes, that kind of thing.
Aider's `--yes` flag combined with a git-based loop honestly works better than I expected for this; it'll just commit and you review the diff.
I've tried Pi headless and it's fine, but you kinda have to wire up the exit conditions yourself since it's so minimal by design.
Fwiw the janky `claude -p` approach you described is actually pretty solid once you stop fighting it; the simplicity is the feature, I think.
It's nice to have, I guess. But it's still not as good as just using the CLI in the terminal inside VS Code or another fork, where you can glance at source control from time to time.
Of course it's not nearly as good, but that's not the point: it's meant to supplement normal development, not compete with it. The idea that one can be nearly as productive on a mobile phone as on a PC is a fairy tale. The best example is the GitHub app, which might be functionally OK but is unusable for, e.g., looking at the source code of a repo in any meaningful way (IMO).
There are plenty of situations where you don't want to sit at the PC waiting for the AI to finish its thing. Now we can just go about our lives and check in from the phone. IMO a great feature. I would've used it many times in the past but didn't want to be bothered with some wrapper around CC that perhaps already did it.
It's not really an argument about whether you can be more productive on a phone or a desktop. Some people (like myself) simply don't have much time to dedicate at a desk, so we have to build workflows that let us be at least reasonably productive from our phones.
I'm super happy Anthropic finally released this tool. It's a starting point and I hope they'll improve it. I did a comparison of its features/capabilities here: https://yepanywhere.com/claude-code-remote-control.html
I get your point, but just out of curiosity, what is "reasonably productive" in that case? E.g., compared to the speed/efficiency/ease of coding/developing/researching on a PC, would you say you're at 20% of that on your phone? I reckon for me the number is <10%. Just typing code on a phone is a chore. Having browsers open on another screen, split terminals, SSH tunnels, and so many other things make using a mobile phone for what I use my PC for a literal mental pain, so I don't do it. I'd be better off spending an extra 5 minutes on my PC than 50 minutes on my phone (and I have a foldable one lol).
I know everyone is different, thus my curiosity about other people's experiences!
It looks like the Claude tokenizer handles markdown pretty efficiently; a ## header is like 1-2 tokens. So I think the actual token savings are smaller than the byte-level numbers suggest. Where this actually matters is maybe if you're on Haiku with big codebases, where the system prompt and context fight for space. On Opus? Probably not worth making your files unreadable for humans.
Good callout on the tokenizer — I was measuring character reduction, not actual token savings, so the real gains are smaller than the headline numbers suggest. You're right that this matters most at the edges: Haiku, large codebases, or heavy memory systems where the context budget is genuinely tight. On Opus with a simple project CLAUDE.md it's probably not worth the readability hit.
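Rough arithmetic on the char-vs-token gap (the token counts below are assumed for illustration, not measured with the real Claude tokenizer):

```python
# Why byte-level savings can overstate token savings: markdown syntax that
# looks expensive in bytes often collapses into very few tokens anyway.
header = "## Setup\n"    # 9 bytes with the markdown header syntax
stripped = "Setup\n"     # 6 bytes with the header syntax removed
byte_saving = 1 - len(stripped) / len(header)

# assumed token counts for both variants (illustrative, not measured)
tokens_header, tokens_stripped = 2, 2
token_saving = 1 - tokens_stripped / tokens_header

print(f"byte saving: {byte_saving:.0%}, token saving: {token_saving:.0%}")
# prints: byte saving: 33%, token saving: 0%
```

In the worst case a 33% byte reduction buys you nothing in tokens, which is why measuring with the actual tokenizer matters before sacrificing readability.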