Binary Ninja for Agents
Introducing bn tool for agent-first reverse engineering workflows
Most reverse engineering tooling is still designed for either a human in a GUI or a headless batch script. Agents are an emerging third kind of user.
They are very good with tools, but the tools need to be shaped in a way they understand. They like shell commands; structured, or at least predictable, output they can pipe to jq or rg; short feedback loops; and ideally an escape hatch to run arbitrary scripts that one-shot the exact result they need.
MCP era: decompiling at a snail's pace
In my previous reversing project I used NSA's Ghidra, and, honestly, I never even opened the GUI or bothered to create a Ghidra project. It was all a series of scripts that I had to rerun after each pass of deciphering the symbols.
For this project I wanted to try something different. I already had my fork of binary_ninja_mcp set up, so I went with it. This time I used a proper project (.bndb), so my edits persist and propagate in seconds instead of the minutes I previously spent rerunning a full pipeline.
On paper it should have been enough: the agent could ask for decompiles, inspect types, rename things, and keep working against the live Binary Ninja session. For short, surgical questions, that mostly worked. But once it got deep into reverse engineering, the shape of the interface started to crack.
The responses often got truncated, and it kept adding an mcp_ prefix to renames. The renames were just annoying and easily solved with a patch; the truncation was harder to fix.
Codex has a hardcoded policy for tool output: it truncates the middle, keeps the start and end, and inserts a marker like "…N tokens truncated…". Obviously this doesn't work when you are decompiling a large function responsible for the core gameplay loop.
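Roughly, the policy looks like this (an illustrative sketch that counts characters, not Codex's actual token-based implementation):

```python
def truncate_middle(text: str, limit: int) -> str:
    """Keep the start and end, drop the middle once the output exceeds the limit."""
    if len(text) <= limit:
        return text
    keep = limit // 2
    dropped = len(text) - 2 * keep
    return text[:keep] + f"...{dropped} chars truncated..." + text[-keep:]

# The middle of a big decompile, often the interesting part, is exactly what gets cut.
out = truncate_middle("A" * 6000 + "CORE_GAMEPLAY_LOOP" + "B" * 6000, limit=1000)
```

For a short answer this is invisible; for a 12,000-character function body, the gameplay logic in the middle simply disappears.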
There is no easy way around it when using MCP. You can override the default 10,000-token limit in config, but that invites a different problem: a large response can choke the session and brick it by eating into the compaction buffer.
~/.codex/config.toml
tool_output_token_limit = 50000

How bn started
So bn started from a complaint. I asked Codex to analyze its rollout and dream up a better tool. Ten minutes later it was already better than the existing MCP tool.
I built bn while porting Snail Mail to Zig. The original 2004 Windows binary is fully stripped: no symbols, no debug info, just x86. Binary Ninja does the decompilation work, but the actual grind is in the naming, typing, cross-referencing, and inferring the symbols.
Over the next 36-hour session, Codex issued 1,548 bn commands: 547 decompiles, 246 xref walks, 217 searches, 95 inline Python snippets, 67 renames, and 48 struct edits. The interesting part was that it kept choosing bn for real work instead of routing around it.
After each batch it gave feedback, and the features it wished for landed a few minutes later. Later on I analyzed the transcript and cut the features it never used.
What bn actually is
It is a very opinionated shell layer over a live Binary Ninja session: essentially a socket bridge between a CLI and a Binary Ninja GUI plugin. The plugin owns the real Binary Ninja API access; the CLI talks to the already-open database in the GUI. Because everything runs through the GUI plugin, bn works with a $300/year personal license and doesn't require the $1500/year commercial license that headless mode demands.
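The bridge can be pictured as a tiny request/response protocol (a simplified sketch, not bn's actual wire format): the plugin side listens on a loopback socket and runs commands against the open database, and the CLI side sends one JSON request and reads one JSON reply.

```python
import json
import socket
import threading

# Plugin side: in bn this lives inside the Binary Ninja GUI plugin, and the
# handler would call the real API on the open database (e.g. bv.entry_point).
handlers = {"entry_point": lambda: hex(0x401000)}

srv = socket.socket()
srv.bind(("127.0.0.1", 0))   # loopback only; the OS picks a free port
srv.listen(1)
port = srv.getsockname()[1]

def serve_once():
    conn, _ = srv.accept()
    with conn, conn.makefile("rw") as f:
        req = json.loads(f.readline())
        f.write(json.dumps({"ok": True, "result": handlers[req["cmd"]]()}) + "\n")
        f.flush()

t = threading.Thread(target=serve_once)
t.start()

# CLI side: connect to the live session, send one request, print the reply.
with socket.create_connection(("127.0.0.1", port)) as c, c.makefile("rw") as f:
    f.write(json.dumps({"cmd": "entry_point"}) + "\n")
    f.flush()
    resp = json.loads(f.readline())

t.join()
srv.close()
```

The point of the split is that the CLI process stays cheap and stateless while the plugin keeps exclusive, long-lived access to the analysis state.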
bn is a coding-agent-first CLI for Binary Ninja.
First, it gives the agent stable commands for things like:
bn target list
bn function search gameplay
bn decompile update_cameraman
bn xrefs field Player.movement_state
bn types show Game
bn struct field set Player 0x308 movement_flag_selector uint32_t --preview
bn py exec --code 'print(hex(bv.entry_point))'

Second, it returns text when text is the right interface, and JSON when structure matters. Large outputs auto-spill to disk and report back token and line counts instead of detonating the model's context window.
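The spill behavior reduces to one decision, sketched here with a crude character-based token estimate and a made-up summary format (bn's real accounting and paths differ):

```python
import tempfile

def emit(output: str, token_limit: int = 10_000) -> str:
    """Small outputs go back inline; large ones spill to disk with a summary."""
    tokens = len(output) // 4   # crude stand-in: roughly 4 characters per token
    if tokens <= token_limit:
        return output
    with tempfile.NamedTemporaryFile(
        "w", prefix="bn-spill-", suffix=".txt", delete=False
    ) as f:
        f.write(output)
    lines = output.count("\n")
    return f"[output spilled to {f.name}: ~{tokens} tokens, {lines} lines]"

small = emit("int main() { return 0; }")
big = emit("mov eax, ebx\n" * 20_000)
```

The agent then decides whether to rg the spilled file, read a slice of it, or narrow the query, instead of losing the middle of the output to truncation.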
Third, writes are not blind. Preview mode applies a change, refreshes analysis, captures the decompile diff, and reverts it. Real writes only report success after reading the live Binary Ninja state back and verifying that the requested change actually landed.
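The contract can be sketched against a dict standing in for the live database (hypothetical helpers, not bn's API): a preview applies the change, diffs the affected decompile, and reverts; a commit only reports success after reading the live state back.

```python
import difflib

# A dict stands in for the live database; "decompile" output reflects its state.
db = {"field_0x308": "__offset(0x308)"}

def decompile() -> str:
    return f"player->{db['field_0x308']} = 1;"

def preview(key: str, value: str) -> list:
    """Apply the edit, capture the decompile diff, then revert it."""
    before, saved = decompile(), db[key]
    db[key] = value
    after = decompile()
    db[key] = saved   # revert: a preview must leave the live state untouched
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm=""))

def commit(key: str, value: str) -> bool:
    """Real write: only report success after reading the live state back."""
    db[key] = value
    return db[key] == value

diff = preview("field_0x308", "movement_flag_selector")
reverted = db["field_0x308"]   # still the raw offset after the preview
ok = commit("field_0x308", "movement_flag_selector")
```

The diff is what the agent reasons over: if the renamed field makes the surrounding code read better, commit; if not, nothing has changed.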
Python escape hatch
And finally, there is a Python escape hatch, bn py exec. Agents are very good at writing exploratory scripts, so why not let them run those inside Binja and get the exact result shape they want for the current question. This matters more than any polished surface command: you can build all the command coverage you want, but reverse engineering always finds the weird case. The escape hatch future-proofs against all of them.
Here is one real example:
$ bn py exec --target SnailMail.exe.bndb --stdin <<'PY'
from binaryninja import Symbol, SymbolType
pairs = {
    0x4107d0: "update_frontend_state_machine",
    0x410c40: "initialize_frontend_background_distortion_grid",
    0x410d50: "activate_frontend_background_entry",
    0x410e20: "initialize_frontend_background_renderer",
    0x410f40: "set_frontend_background_texture_target",
    0x410f90: "draw_frontend_split_background",
    0x411040: "draw_frontend_warped_background",
    0x4112f0: "update_frontend_background_renderer",
    0x449460: "initialize_bass_audio_backend",
}
renamed = []
for addr, name in pairs.items():
    func = bv.get_function_at(addr)
    if func is None:
        renamed.append({"address": hex(addr), "error": "missing function"})
        continue
    bv.define_user_symbol(Symbol(SymbolType.FunctionSymbol, addr, name))
    renamed.append({"address": hex(addr), "name": bv.get_function_at(addr).name})
bv.update_analysis_and_wait()
result = renamed
PY

The manifest had already converged on the right names, but the live Binary Ninja session had not. Codex dropped into bn py exec, batch-renamed nine functions directly inside the open database, forced reanalysis, and got a structured confirmation payload back. This design philosophy keeps the shell surface small and useful while making sure there is always one level lower to go.
Tool loop
The session did not look like one giant decompile followed by a magic answer. It looked like a reverser using a shell:
- Find a narrow entry point
- Inspect xrefs and decompile
- Form a naming or type hypothesis
- Preview the mutation
- Commit it if the diff looks right
- Reread the affected functions
- Repeat
That is the whole game.
One concrete case was struct recovery. Codex inspected Player, PathTemplate, FollowState, and related movement helpers, then started pushing type information back into the live database. It would preview a field edit, look at the affected decompile diff, commit it, and then immediately reread the relevant functions. When the decompiler presentation stayed stale and still showed raw __offset(...) fields, it forced a refresh and checked again. The loop stayed tight enough that the agent could treat type recovery as an interactive process instead of a one-way write.
That same pattern showed up everywhere else. bn function search and bn strings --query were the front door to the next question. bn xrefs was how the agent kept branching through the call graph. bn decompile never truncated, which let the agent move forward intelligently. Because the surface was shell-native, the agent could mix bn with rg, temp files, jq, and little follow-up reads.
Symbol oracles
One of the best moments in the session came after loading the Android and iOS ports of Snail Mail next to the stripped Windows build. The mobile binaries were rich in symbols.
Once those targets were open in the same environment, the workflow became obvious: use the Android and iOS builds as naming or secondary behavior oracles for the Windows one. Codex could search for a symbol-rich function in libsnailmail.so, decompile it there, compare the logic to a stripped Windows function, and then promote the Windows symbol to a real name with much higher confidence. That pattern repeated across menu flow, intro state, gameplay helpers, damage-gauge code, and subgoldy lifecycle work.
That mattered because cross-port reversing is mostly comparison work. If switching targets is awkward, the agent quietly stops doing it. If the command surface stays the same across targets, the agent uses every build as a reference source.
bn made the correct workflow as easy as --target libsnailmail.so.
Failure is not an option
This was maybe the most important property. Bad tooling says “done” and leaves the real state unchanged. Good tooling either succeeds, fails noisily, or gives you a lower-level escape hatch.
bn got there by making mutation paths verify the live post-state and by keeping bn py exec first-class. During the session, symbol rename sometimes worked cleanly and sometimes did not. In one case a preview rename threw an error; a non-preview rename looked successful but did not actually change the symbol in Binary Ninja. Codex checked the state, saw the rename had not landed, and fell back to Python inside the Binary Ninja process to define the symbol directly.
This inspired the --preview feature. A struct edit or type declaration can cascade through a lot of functions. Preview gives the agent reversible reasoning: try the change, inspect the consequences, keep it if the evidence improves, drop it if it does not.
Keep tools pristine
The other reason the session worked is that I owned the tool while the agent was using it.
Early on, Codex hit real pain points: large decompile spills writing to bad paths in the sandbox, bursty parallel reads tripping connection refused, stale __offset(...) presentation after type edits, rename paths with edge-case bugs, and rough search ergonomics on big binaries. Instead of treating those as permanent limitations, the loop was pain → patch → retry. Auto-spill moved to temp, transient connect retries were added, the accept backlog was increased, decompile JSON gained warnings when stale __offset(...) survived refresh, and querying got features like --regex.
The distance between friction and fix is short enough that the agent keeps using the tool instead of permanently routing around it.
There was another small but important piece here too: the bundled Codex skill. The repo ships a skill alongside the CLI and plugin, and that matters because a lot of tool quality is really workflow quality. Teaching the agent to start with bn doctor, bn target list, bn refresh, preview mutations first, and use --out for large artifacts removes a lot of dumb failure modes before they happen.
Why this kind of tooling works
The lesson here is not just that Binary Ninja is scriptable. We already knew that.
A small, specific CLI over a live reversing tool can outperform much grander integrations if it gets a few things right.
- It speaks the agent’s native language: commands, files, JSON, grep
- It makes writes previewable and verifiable
- It is owned closely enough that feedback turns into code instead of a wishlist
Codex used bn the way a good reverser uses a shell: probe, xref, rename, type, diff, compare builds, sync findings, repeat. This made the progress really fast.
That sets the real bar for agent-native tools. Not “can the model call tools?” but “does the tool create a tight enough loop that the model keeps choosing it for real work?”
bn is open source and free to use.