VLAPilot: a scheduling agent for any vision–language–action model
- Published: April 2026
- Code: github.com/JinghangLi/vlapilot
- License: MIT
- User
- Agent → VLA
Vision-language-action models can act, but they cannot plan. VLAPilot is a general-purpose agent that pairs a VLM planner and a separate VLM verifier with any VLA backend, turning short-horizon skills into long-horizon missions.
The planner keeps an explicit, revisable plan, so every step is traceable and every failure is recoverable. VLAPilot is MCP-native on both ends: any upstream agent can drive it, and any robot can plug in.
How a run works
A run is an outer loop driven by an upstream caller. Inside VLAPilot, a ReAct runner orchestrates two VLMs — a planner that decides what to do next, and a verifier that watches each step. The runner mediates every interaction with the VLA backend, which lives in a separate child process and owns all hardware control.
While a task is running, the verifier (a separate VLM) polls roughly every two seconds. Each call gives it the baseline frames, captured right before the task started, alongside the current frames, then asks whether the goal has been reached. The reply is one of three verdicts: completed when the task has succeeded, pending while it is still in progress, or failed when the task cannot recover.
The runner stops the VLA as soon as the verdict turns terminal — either completed or failed. A failed verdict hands control straight back to the planner, which can re-plan, retry, or skip the step. A pending reply — or one the parser cannot read — keeps the verifier polling until the task finishes or hits its timeout.
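The polling behaviour described above can be sketched in a few lines. This is an illustrative sketch, not code from the repository: `watch_task`, `parse_verdict`, `ask_verifier` and `stop_vla` are hypothetical names, the 120-second default timeout is an assumption, and so is the choice to stop the VLA and report `"timeout"` when the deadline passes.

```python
import asyncio
import time
from typing import Awaitable, Callable

VERDICTS = {"completed", "pending", "failed"}
TERMINAL = {"completed", "failed"}


def parse_verdict(reply: str) -> str:
    """Map a raw VLM reply to a verdict; an unreadable reply counts as pending."""
    word = reply.strip().lower()
    return word if word in VERDICTS else "pending"


async def watch_task(
    ask_verifier: Callable[[], Awaitable[str]],  # hypothetical: one verifier VLM call
    stop_vla: Callable[[], Awaitable[None]],     # hypothetical: halts the VLA backend
    poll_interval: float = 2.0,                  # "roughly every two seconds"
    timeout: float = 120.0,                      # assumed default, not from the source
) -> str:
    """Poll until the verdict is terminal or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        verdict = parse_verdict(await ask_verifier())
        if verdict in TERMINAL:
            await stop_vla()  # stop as soon as the verdict turns terminal
            return verdict
        await asyncio.sleep(poll_interval)
    await stop_vla()  # assumption: a timed-out task is stopped as well
    return "timeout"
```

Treating an unparseable reply as pending keeps a single malformed VLM response from ending a task early; the timeout is what bounds the loop in that case.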
Because the verifier and the planner draw from the same camera feed, they share a single view of the scene. This lets the planner stay focused on structure — producing a plan, dispatching one task, then waiting — without having to decide for itself when each step is done. And since every mission carries its own verify.md prompt, the success criteria for cleaning a desk are different from those for a kitchen task: the contract is per mission, but the loop is the same.
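To make the planner's side of the loop concrete, here is a minimal sketch of a revisable plan: dispatch one step, wait for its verdict, and let a decision callback choose between retrying, skipping, or re-planning after a failure. Everything named here (`run_plan`, `run_step`, `decide`, `max_attempts`) is hypothetical and stands in for the planner VLM and the runner; it is not the project's actual code.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

# ("retry" | "skip" | "replan", replacement steps used only for "replan")
Decision = Tuple[str, Sequence[str]]


def run_plan(
    steps: Iterable[str],
    run_step: Callable[[str], str],          # hypothetical: dispatches a step, returns its verdict
    decide: Callable[[str, int], Decision],  # hypothetical: the planner's reaction to a failure
    max_attempts: int = 3,                   # assumed retry budget per step
) -> List[Tuple[str, str]]:
    """Run steps one at a time and return a trace of (step, verdict) pairs."""
    queue = deque(steps)
    attempts: Dict[str, int] = {}
    trace: List[Tuple[str, str]] = []
    while queue:
        step = queue.popleft()
        verdict = run_step(step)
        trace.append((step, verdict))  # every step is recorded, so the run is traceable
        if verdict == "completed":
            continue
        attempts[step] = attempts.get(step, 0) + 1
        action, new_steps = decide(step, attempts[step])
        if action == "retry" and attempts[step] < max_attempts:
            queue.appendleft(step)
        elif action == "replan":
            queue = deque(new_steps)  # the remaining plan is replaced wholesale
        # "skip", or an exhausted retry budget, moves on to the next step
    return trace
```

Note that `run_plan` never inspects the scene itself: deciding whether a step is done is left entirely to whatever produces the verdict.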
Quickstart
Three steps. Requires Python 3.10 or higher.
1 · Install
git clone https://github.com/JinghangLi/vlapilot.git
cd vlapilot
pip install -e ".[examples]"
2 · Configure
Copy the template into ~/.vlapilot/, then fill in your API keys and point agent.mission_dir at a mission folder.
mkdir -p ~/.vlapilot
cp config.example.json ~/.vlapilot/config.json
# Then edit ~/.vlapilot/config.json:
# agent.planner.api_key — LLM key for the planner
# agent.verify.providers.* — VLM key for the verifier
# agent.mission_dir — path to a mission directory
3 · Run
Drive it directly from the CLI for a one-shot mission:
python scripts/run_agent.py \
--config ~/.vlapilot/config.json \
-i "Clean up everything on the desk."
Or expose it as an MCP server so any upstream agent can drive it:
vlapilot --config ~/.vlapilot/config.json
Tools exposed: start_mission · get_status · cancel_mission · list_capabilities
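For readers unfamiliar with MCP, a tool invocation is a JSON-RPC 2.0 request with the method `tools/call`, sent after the usual MCP initialization handshake. The sketch below only builds such a message. The helper name `tool_call` is hypothetical, and the `instruction` argument is an assumption made for illustration: the real argument schemas are not documented here and should be read from the server's `tools/list` response.

```python
import json
from typing import Any, Dict, Optional


def tool_call(name: str, arguments: Optional[Dict[str, Any]] = None, request_id: int = 1) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    })


# Assumed argument name; check the actual schema via tools/list.
print(tool_call("start_mission", {"instruction": "Clean up everything on the desk."}))
```

In practice an MCP client library handles framing and the handshake; the point here is only that the four tools are reached through one uniform request shape.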
Extensibility
Both ends of VLAPilot are pluggable. The robot side speaks a thin protocol; the mission side is a directory of three files. Swapping either one requires no changes to the agent's code.
Bring your own VLA
Wrap any robot in a stdio MCP server that exposes six async methods. The agent talks to it over standard input/output through MCPBackendClient — no Python coupling between agent and backend.
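from typing import Protocol


class VLABackend(Protocol):
    async def init(self) -> None: ...       # set up the backend
    async def cleanup(self) -> None: ...    # release it on shutdown
    async def observe(self, cameras) -> dict[str, str]: ...  # read the requested cameras
    async def start(self, instruction: str) -> None: ...     # begin executing an instruction
    async def stop(self) -> None: ...       # halt the running instruction
    async def is_running(self) -> bool: ...  # whether an instruction is still executing
Bring your own mission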
A mission is a directory with three files. Point agent.mission_dir at it and the planner inherits the new task vocabulary, system prompt, and success criteria.
my_mission/
├── mission.md # planner system prompt
├── tasks.yaml # task vocabulary
└── verify.md # verifier system prompt
Contributor
Yuhan Xi
席煜涵