VLAPilot: a scheduling agent for any vision–language–action model

Published: April 2026
Code: github.com/FutianLabs/VLAPilot
License: MIT

Demo videos (all played at 1x speed):

Long-horizon mission (multi-step, autonomous): “Clean up the desk”, “Wrap up the desk”
Object hand-over (single-step): “Hand me the bottle”, “Hand me the screwdriver”, “Hand me a tissue”
Object placement (single-step): “Put the pen back”, “Charge my earphones”
Direct interaction (single-step): “Wave hello”

Vision-language-action models can act, but they cannot plan. VLAPilot is a general-purpose agent that pairs a VLM planner and a separate VLM verifier with any VLA backend, turning short-horizon skills into long-horizon missions.

The planner keeps an explicit, revisable plan — every step is traceable and every failure is recoverable. MCP-native on both ends: any upstream agent can drive it, and any robot can plug in.

How a run works

A run is an outer loop driven by an upstream caller. Inside VLAPilot, a ReAct runner orchestrates two VLMs — a planner that decides what to do next, and a verifier that watches each step. The runner mediates every interaction with the VLA backend, which lives in a separate child process and owns all hardware control.

Figure: The ReAct runner mediates every interaction. The gold arrow is the verdict that closes the verify loop.

While a task is running, the verifier (a separate VLM) is polled roughly every two seconds. Each call gives it the baseline frames, captured right before the task started, alongside the current frames, and asks whether the goal has been reached. The reply is one of three verdicts: completed when the task has succeeded, pending while it is still in progress, or failed when the task cannot recover.

The runner stops the VLA as soon as the verdict turns terminal — either completed or failed. A failed verdict hands control straight back to the planner, which can re-plan, retry, or skip the step. A pending reply — or one the parser cannot read — keeps the verifier polling until the task finishes or hits its timeout.
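The polling behaviour described above can be sketched roughly as follows. This is a minimal illustration, not VLAPilot's actual code: `verify_loop`, `parse_verdict`, the three callables, and the 120-second default timeout are all hypothetical names and values chosen for the example. Only the three verdicts and the roughly two-second interval come from the description above.

```python
import asyncio
from typing import Awaitable, Callable

TERMINAL = {"completed", "failed"}
VALID = TERMINAL | {"pending"}


def parse_verdict(reply: str) -> str:
    """Map a raw verifier reply to a verdict; unreadable replies count as pending."""
    word = reply.strip().lower()
    return word if word in VALID else "pending"


async def verify_loop(
    ask_verifier: Callable[[], Awaitable[str]],  # hypothetical: one VLM call (baseline + current frames)
    is_running: Callable[[], Awaitable[bool]],   # hypothetical: is the VLA task still executing?
    stop: Callable[[], Awaitable[None]],         # hypothetical: halt the VLA
    poll_interval: float = 2.0,                  # "roughly every two seconds"
    timeout: float = 120.0,                      # assumed value, not taken from the project
) -> str:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline and await is_running():
        verdict = parse_verdict(await ask_verifier())
        if verdict in TERMINAL:
            await stop()          # stop the VLA as soon as the verdict is terminal
            return verdict        # "failed" would go back to the planner
        await asyncio.sleep(poll_interval)
    # The task finished on its own or timed out without a terminal verdict.
    # How the real runner reports this case is not specified here.
    return "pending"
```

Treating an unparseable reply as pending keeps a single malformed VLM response from ending a task early; the timeout is what bounds the loop in that case.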

Because the verifier and the planner draw from the same camera feed, they share a single view of the scene. This lets the planner stay focused on structure — producing a plan, dispatching one task, then waiting — without having to decide for itself when each step is done. And since every mission carries its own verify.md prompt, the success criteria for cleaning a desk are different from those for a kitchen task: the contract is per mission, but the loop is the same.

Quickstart

Three steps. Python 3.10 or higher.

1 · Install

git clone https://github.com/JinghangLi/vlapilot.git
cd vlapilot
pip install -e ".[examples]"

2 · Configure

Copy the template into ~/.vlapilot/, then fill in your API keys and point agent.mission_dir at a mission folder.

mkdir -p ~/.vlapilot
cp config.example.json ~/.vlapilot/config.json

# Then edit ~/.vlapilot/config.json:
#   agent.planner.api_key        — LLM key for the planner
#   agent.verify.providers.*     — VLM key for the verifier
#   agent.mission_dir            — path to a mission directory
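Before the first run, it can help to check that the three settings are actually filled in. The sketch below assumes that the dotted names in the comments above correspond to nested JSON objects in config.json; that layout is an assumption, and `lookup`, `missing_keys`, and `check_file` are hypothetical helpers, not part of VLAPilot.

```python
import json
from pathlib import Path

# Dotted paths taken from the configuration comments; the nesting is assumed.
REQUIRED_KEYS = ["agent.planner.api_key", "agent.verify.providers", "agent.mission_dir"]


def lookup(config: dict, dotted: str):
    """Follow a dotted path such as 'agent.planner.api_key'; return None if any part is missing."""
    node = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def missing_keys(config: dict) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not lookup(config, key)]


def check_file(path: Path) -> list[str]:
    """Load a JSON config file and report which required keys still need values."""
    return missing_keys(json.loads(path.read_text(encoding="utf-8")))
```

Calling `check_file(Path.home() / ".vlapilot" / "config.json")` returns an empty list once every key has a value.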

3 · Run

Drive it directly from the CLI for a one-shot mission:

python scripts/run_agent.py \
  --config ~/.vlapilot/config.json \
  -i "Clean up everything on the desk."

Or expose it as an MCP server so any upstream agent can drive it:

vlapilot --config ~/.vlapilot/config.json

Tools exposed: start_mission · get_status · cancel_mission · list_capabilities

Extensibility

Both ends of VLAPilot are pluggable. The robot side speaks a thin protocol; the mission side is a directory of three files. Swapping either one requires no changes to the agent's own code.

Bring your own VLA

Wrap any robot in a stdio MCP server that exposes six async methods. The agent talks to it over standard input/output through MCPBackendClient — no Python coupling between agent and backend.

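from typing import Protocol


class VLABackend(Protocol):
    async def init(self) -> None: ...            # set up the backend
    async def cleanup(self) -> None: ...         # release resources
    async def observe(self, cameras) -> dict[str, str]: ...  # camera frames as a str-to-str mapping
    async def start(self, instruction: str) -> None: ...     # begin executing an instruction
    async def stop(self) -> None: ...            # halt the current task
    async def is_running(self) -> bool: ...      # True while a task is executing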
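For trying out the agent's control flow without hardware, an in-memory stand-in with the same six methods can be useful. The sketch below is only an illustration: `StubBackend`, its `duration` parameter, the `"front"` camera name, and the placeholder frame strings are hypothetical. It also leaves out the stdio MCP server wrapper that a real backend would need.

```python
import asyncio
from typing import Optional


class StubBackend:
    """In-memory stand-in with the six backend methods; no hardware, no MCP transport."""

    def __init__(self, duration: float = 0.05) -> None:
        self._duration = duration                    # how long a fake task "runs", in seconds
        self._task: Optional[asyncio.Task] = None
        self.instructions: list[str] = []            # record of what was asked

    async def init(self) -> None:
        self.instructions.clear()

    async def cleanup(self) -> None:
        await self.stop()

    async def observe(self, cameras) -> dict[str, str]:
        # Placeholder payloads; the real frame encoding is not specified here.
        return {name: f"<frame from {name}>" for name in cameras}

    async def start(self, instruction: str) -> None:
        await self.stop()                            # one task at a time
        self.instructions.append(instruction)
        self._task = asyncio.create_task(asyncio.sleep(self._duration))

    async def stop(self) -> None:
        if self._task is not None and not self._task.done():
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
        self._task = None

    async def is_running(self) -> bool:
        return self._task is not None and not self._task.done()


async def demo():
    backend = StubBackend()
    await backend.init()
    frames = await backend.observe(["front"])        # "front" is a made-up camera name
    await backend.start("Hand me the bottle")
    running_before = await backend.is_running()
    await backend.stop()
    running_after = await backend.is_running()
    await backend.cleanup()
    return frames, running_before, running_after
```

Running `asyncio.run(demo())` exercises the full start and stop cycle that the runner relies on when a verdict turns terminal.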

Bring your own mission

A mission is a directory with three files. Point agent.mission_dir at it: the planner picks up the new system prompt and task vocabulary, and the verifier picks up the new success criteria.

my_mission/
├── mission.md     # planner system prompt
├── tasks.yaml     # task vocabulary
└── verify.md      # verifier system prompt
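A quick check that a mission folder has all three files might look like the sketch below. `missing_mission_files` and `demo` are hypothetical helpers written for this example; only the three file names come from the layout above, and nothing here validates the files' contents.

```python
import tempfile
from pathlib import Path

# File names from the mission layout above.
MISSION_FILES = ("mission.md", "tasks.yaml", "verify.md")


def missing_mission_files(mission_dir) -> list[str]:
    """Return the expected mission files that are not present in mission_dir."""
    root = Path(mission_dir)
    return [name for name in MISSION_FILES if not (root / name).is_file()]


def demo() -> tuple[list[str], list[str]]:
    """Build a partial mission in a temp directory, then fill in the gaps."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "my_mission"
        root.mkdir()
        (root / "mission.md").write_text("planner system prompt goes here\n", encoding="utf-8")
        before = missing_mission_files(root)
        for name in before:
            (root / name).touch()        # empty placeholders, to be written by hand
        after = missing_mission_files(root)
    return before, after
```

Running such a check before pointing agent.mission_dir at a new folder catches a forgotten verify.md early, before a mission starts without its success criteria.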

Contributors

Jinghang Li (李景行)
Qing Lian (连庆), Project Leader
Yuhan Xi (席煜涵)
Qing Jiang (蒋擎)