The Bug, in Plain English
OpenClaw is the open source AI agent runtime that OpenClaw Consult helps clients deploy. When an OpenClaw agent talks to a model like Claude Sonnet and the connection silently stalls mid-stream, the runtime is supposed to time out and try again. That is the right behavior. The bug was that the retry would hit the same wedged connection, time out again, retry, time out, forever. Every single one of those retries is a paid API call.
This was reported as openclaw/openclaw#76293. I read the issue, found the right place in the runtime, wrote the patch, opened openclaw/openclaw#76345, iterated through four review cycles with the project's AI review pipeline, and Peter Steinberger merged it into core (the main branch of openclaw/openclaw) on May 3, 2026.
The Numbers Behind the Wreck
The reporter on the original issue had logging in place, which is how the scale of this bug came to light. Two separate incidents in their production deployment:
- Incident 1, April 25, 2026 at 13:02 ET. A single heartbeat firing timed out, generating 761 paid Claude Sonnet 4.6 API calls within 60 seconds.
- Incident 2, April 29, 2026 at 01:02 UTC. Same triggering condition, 1,384 paid calls within 60 seconds.
At Anthropic's list pricing for Claude Sonnet 4.6 at the heartbeat-context size in the report, that is $20 to $30 burned in a single minute. Worse, the reporter had auto-recharge enabled on their provider account, which is the recommended setting for reliability. Auto-recharge silently masks cost spikes at the time of the incident, so users only discover the damage at the next billing cycle.
This bug was provider-agnostic too. The reporter reproduced it on Ollama with qwen3:14b, confirming the loop lives in OpenClaw's retry layer above the provider abstraction, not in any particular vendor's SDK.
Why a Single Stalled Connection Cascaded
OpenClaw's embedded agent runner has a clean architectural separation that is normally a strength: an outer retry loop that handles attempt-level failures and per-attempt sessions that handle in-flight model calls. When a model stream times out, the per-attempt session aborts and the outer loop kicks off a new attempt with a fresh session and a fresh idle-timeout wrapper.
The problem was that any state for "have we already wasted N attempts on this same wedged provider" lived inside the per-attempt wrapper, not at the outer-loop level. So every fresh attempt got a fresh counter starting from zero. Combined with profile and model failover (which intentionally retries the same request against multiple configured providers), a single stalled upstream could fan paid calls out across the entire fallback chain before the broad MAX_RUN_LOOP_ITERATIONS backstop kicked in, 160 attempts later.
By 160 attempts, the damage was done.
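To make the failure mode concrete, here is a minimal sketch of the buggy shape. All names (`runAttempt`, `buggyRunLoop`, the result fields) are illustrative, not OpenClaw's actual identifiers; only `MAX_RUN_LOOP_ITERATIONS` comes from the runtime as described above.

```typescript
// Hypothetical sketch of the bug: the idle-timeout counter lives inside
// state that is recreated on every retry, so the cap can never accumulate.

type AttemptResult = { timedOut: boolean; outputTokens: number };

// Simulates a wedged upstream: every attempt idles out with zero output.
function runAttempt(): AttemptResult {
  return { timedOut: true, outputTokens: 0 };
}

const MAX_RUN_LOOP_ITERATIONS = 160;

function buggyRunLoop(): number {
  let paidCalls = 0;
  for (let i = 0; i < MAX_RUN_LOOP_ITERATIONS; i++) {
    // BUG: the per-attempt wrapper (and its counter) is rebuilt each
    // iteration, so the counter restarts from zero on every retry.
    let consecutiveIdleTimeouts = 0;
    const result = runAttempt();
    paidCalls++; // every retry is a paid API call
    if (result.timedOut && result.outputTokens === 0) {
      consecutiveIdleTimeouts++;
      if (consecutiveIdleTimeouts >= 5) break; // never reached: always 1
    }
  }
  return paidCalls;
}

console.log(buggyRunLoop()); // 160 paid calls against a single wedged provider
```

Because the counter is scoped to the attempt rather than the run, the `break` is unreachable on exactly the path that matters, and only the broad 160-iteration backstop ends the loop.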
The Fix: A Circuit Breaker at the Outer Run Loop
The fix lives in src/agents/pi-embedded-runner/run.ts and a new pure helper module at src/agents/pi-embedded-runner/run/idle-timeout-breaker.ts. The breaker tracks consecutive idle timeouts where the model produced zero output tokens. After 5 such attempts in a row, the run loop refuses to start another attempt and surfaces a distinct error to the caller.
Two design choices that mattered for getting the fix accepted:
The state had to live at the outer-loop level. An earlier version of the patch put the counter inside the per-attempt timeout wrapper. The project's automated review pipeline (clawsweeper, powered by GPT-5.5) caught this with high confidence and explained why: the per-attempt wrapper is recreated on every iteration, so a wrapper-local counter resets on every retry and the cap never trips on the real path. The reviewer was right. The revised patch puts the counter alongside the existing sameModelIdleTimeoutRetries in the outer loop, where it survives across attempt boundaries and across profile and model failover.
The reset condition keys on output tokens, not just on whether the attempt succeeded. A slow-but-responsive stream that produced 200 tokens before timing out is not a wedged provider. The reset is gated on attemptUsage.output > 0, so legitimate slow streams keep retrying and only a genuinely wedged, zero-output connection counts toward tripping the breaker. This avoided breaking deployments where the model is just slow at the tail of its turn.
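Putting the two design choices together, here is a minimal sketch of what such a breaker helper might look like. The names and shapes (`IdleTimeoutBreaker`, `record`, `tripped`) are hypothetical, not the merged implementation in idle-timeout-breaker.ts; the default cap of 5 and the output-token-gated reset are from the PR as described above.

```typescript
// Hypothetical sketch of a pure, unit-testable idle-timeout circuit breaker.
// Instantiated once per run, OUTSIDE the per-attempt wrapper, so its state
// survives across attempt boundaries and profile/model failover.

interface BreakerOptions {
  maxConsecutiveZeroOutputTimeouts?: number; // default cap from the fix: 5
}

class IdleTimeoutBreaker {
  private consecutive = 0;
  private readonly cap: number;

  constructor(opts: BreakerOptions = {}) {
    this.cap = opts.maxConsecutiveZeroOutputTimeouts ?? 5;
  }

  // Called once per attempt from the outer run loop.
  record(attempt: { idleTimedOut: boolean; outputTokens: number }): void {
    if (attempt.idleTimedOut && attempt.outputTokens === 0) {
      // Zero output on an idle timeout: likely a wedged provider.
      this.consecutive++;
    } else if (attempt.outputTokens > 0) {
      // A slow-but-responsive stream is not a wedged provider: reset.
      this.consecutive = 0;
    }
  }

  tripped(): boolean {
    return this.consecutive >= this.cap;
  }
}
```

In this shape, the outer loop would check `tripped()` before starting each new attempt and surface a distinct error instead of paying for another call, while any attempt that produced output resets the count and keeps normal retry semantics intact.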
Worst case under the new default cap of 5 is roughly $0.10 to $0.30 per incident at the same context size that was producing $20 to $30 incidents before. Routing semantics are otherwise unchanged, so existing deployments do not need any config changes.
What It Took to Get the Fix Merged
OpenClaw is one of the strictest open source projects for contribution review. At the time of this fix the repository had 367,000 GitHub stars, around 41,000 unique authors who had ever opened a pull request, and only ~6,900 PRs that had ever merged into core. Roughly a 1-in-6 hit rate against a deliberately high bar. Refactor-only PRs are auto-rejected. Most feature requests get pushed to third-party plugins. Each contributor is capped at 10 open PRs at a time. Every PR runs through a multi-pass AI code review pipeline (clawsweeper) before a human maintainer ever looks at it.
This particular PR went through four review cycles. The first version put the counter in the wrong place (the wrapper, not the outer loop). The second fixed the placement but used a test setup that could not reach the assertion path. The third extracted the breaker into a pure unit-testable helper module, satisfying both the architectural critique and the test-coverage gap. The fourth was a tiny test fixture fix where the reporter's redacted phone-shaped value did not match the new digit-only validation regex.
Peter Steinberger reviewed the final version and merged it on May 3, 2026.
What This Means for Production OpenClaw
If you run OpenClaw in production, especially with cron-driven heartbeats or with multiple configured fallback profiles, this fix is now live. There is no config you need to change. The circuit breaker default of 5 consecutive idle timeouts protects against the cost-runaway pattern the reporter hit, while leaving normal slow-but-responsive streams completely unaffected.
It also means the kind of consultant who can do this for you exists. When something breaks in your OpenClaw deployment, you want the person who can read the runtime, find the cause, and either patch it locally or push the fix upstream so every other deployment benefits. That is what production OpenClaw work actually looks like once you scale past the demo stage.
Frequently Asked Questions
How do I know my OpenClaw deployment is affected?
If you are on OpenClaw 2026.4.10 through any version before the merge of openclaw/openclaw#76345, you are exposed. The bug triggers any time an LLM connection silently stalls and your configured agent timeout allows the loop to keep retrying. Cron-driven heartbeats and multi-profile failover deployments are the highest-risk shapes.
What was the actual API cost in the reporter's incidents?
$20 to $30 burned in a single 60-second window in each of the two confirmed incidents. The reporter had auto-recharge enabled on their Anthropic account, which masked the spike at the time and surfaced it only at billing.
Did this require a new config option?
No. The breaker default of 5 consecutive zero-output idle timeouts is applied automatically. The cap is configurable via the helper's options, but the default protects everyone without action. Existing deployments do not need to change anything.
Why is the fix at the run-loop level instead of in the timeout wrapper?
Because the per-attempt timeout wrapper is recreated on every outer-loop iteration. Counter state at that scope resets every retry, which means a wrapper-local breaker would never actually trip on the real cost-runaway path. The clawsweeper review caught this on the first version of the patch and was right. The merged fix puts the counter at the outer loop where it survives across attempts and profile or model failover.
Who reviewed and merged the PR?
Peter Steinberger, the creator of OpenClaw. The PR went through four review cycles with clawsweeper (the project's automated AI review pipeline) before final merge.
If You Run OpenClaw in Production
If you run OpenClaw in production and want a consultant who can do this kind of work, talk to Adhiraj Hangal. The contribution log is at openclawconsult.com/contributions and the work is verifiable on GitHub.