# jibo-llm
Give Jibo a brain again. A hotword-triggered, LLM-powered conversational agent that turns Jibo into an expressive, tool-using social robot — complete with speech, vision, web search, animations, and more.
## Overview
jibo-llm connects a Jibo robot to any OpenAI-compatible LLM (GPT-4o, Claude, local models via Ollama/LM Studio, etc.) through a real-time agent loop. When someone says "Hey Jibo", the system:
- Listens for the user's speech via Jibo's on-board microphone.
- Sends the transcript to an LLM along with a rich system prompt and tool definitions.
- Executes tool calls the LLM makes — speaking, animating, taking photos, searching the web, and more.
- Loops until the conversation naturally ends or the user triggers a new hotword.
Conversations are fully interruptible: saying "Hey Jibo" mid-conversation aborts the current exchange and starts a fresh one via AbortController.
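The interruption pattern in miniature (a hedged sketch: the `hotword` event name, `robot`, and `runConversation` are illustrative, not the actual rom-control API):

```js
// Sketch of the per-conversation abort pattern described above.
// The event name and helpers are illustrative, not the real rom-control API.
let current = null; // AbortController for the conversation in flight, if any

robot.on('hotword', async () => {
  current?.abort(); // a new "Hey Jibo" cancels the running exchange

  const controller = new AbortController();
  current = controller;
  try {
    await runConversation(controller.signal); // signal threads through LLM + tool calls
  } catch (err) {
    if (err.name !== 'AbortError') throw err; // aborts are expected, not failures
  }
});
```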
## Architecture

```
┌──────────────┐    hotword   ┌──────────────┐   tool calls  ┌───────────────┐
│  Jibo Robot  │ ───────────▶ │   index.js   │ ◀───────────▶ │  LLM (OpenAI  │
│  (rom-ctrl)  │ ◀─────────── │  Agent Loop  │               │  compatible)  │
│              │  say/listen  │              │               └───────────────┘
│  • mic       │  photo/look  │   tools.js   │   web search  ┌───────────────┐
│  • speaker   │   display    │  (executor)  │ ─────────────▶│  Brave Search │
│  • camera    │              │              │               └───────────────┘
│  • screen    │              │ esml-ref.js  │
│  • motors    │              │ (prompt ref) │
└──────────────┘              └──────────────┘
```
| File | Purpose |
|---|---|
| `index.js` | Entry point — connects to Jibo, listens for the hotword, runs the agent loop with the LLM. |
| `tools.js` | Defines all tool schemas (OpenAI function-calling format) and the `executeTool()` dispatcher. |
| `esml-reference.js` | ESML (Embodied Speech Markup Language) cheat sheet injected into the system prompt so the LLM knows how to animate Jibo expressively. |
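To make the division of labor concrete, a dispatcher like this typically reduces to a switch over tool names. A simplified sketch — only the `executeTool()` name comes from tools.js; the branch bodies and the `ctx` helpers are assumptions:

```js
// Simplified sketch of a tool dispatcher. Only the executeTool() name comes
// from tools.js; the branches and ctx helpers here are illustrative.
async function executeTool(name, args, ctx) {
  switch (name) {
    case 'say':
      return ctx.say(args.text);            // enqueue ESML speech
    case 'take_photo':
      return ctx.takePhoto();               // returns a base64 JPEG
    case 'web_search':
      return ctx.search(args.query, args);  // Brave Search call
    case 'end_conversation':
      return { done: true };
    default:
      return { error: `Unknown tool: ${name}` };
  }
}
```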
## Features
- 🗣️ Natural conversation — multi-turn dialogue with speech recognition and TTS.
- 🎭 Expressive animations — the LLM uses ESML tags to trigger emotions, dances, emojis, and sound effects inline with speech.
- 📷 Vision — Jibo can take photos and the LLM receives the image for visual understanding.
- 🔍 Web search — real-time Brave Search integration for up-to-date answers.
- 🌐 URL fetching — reads web pages (with Cloudflare Markdown for Agents support) so Jibo can summarize articles.
- 🖥️ Display control — show text, images, or restore the default eye on Jibo's screen.
- 🤖 Head movement — point Jibo's head at specific angles (yaw / pitch).
- 🔊 Volume control — adjust speaker volume on the fly.
- ⚡ Interruptible — a new hotword instantly aborts a running conversation via `AbortController`.
- 🔄 Retry logic — automatic retry with exponential backoff for transient LLM errors (429, 5xx, network); see the sketch after this list.
- 🧹 Context management — old photos are pruned from context to control token cost.
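The retry behavior can be pictured as a small wrapper around the LLM call (a sketch, not the actual implementation; `withRetry` is a hypothetical name):

```js
// Hypothetical helper illustrating retry with exponential backoff
// for 429s, 5xx responses, and network errors.
async function withRetry(fn, retries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = err.status ?? err.response?.status;
      const transient =
        status === 429 || (status >= 500 && status < 600) || status === undefined;
      if (!transient || attempt >= retries) throw err;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt)); // 1s, 2s, 4s…
    }
  }
}
```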
## Prerequisites

- Node.js ≥ 18 (for native `fetch` and `AbortController`)
- A Jibo robot running with int-developer mode enabled
- An OpenAI-compatible API endpoint (OpenAI, Anthropic via proxy, Ollama, LM Studio, etc.)
- (Optional) A Brave Search API key for the `web_search` tool
## Quick Start

### 1. Clone & install

```bash
git clone https://github.com/niceduckdev/jibo-llm.git
cd jibo-llm
npm install
```
### 2. Configure environment

```bash
cp .env.example .env
```

Edit `.env` with your values:

```bash
# Jibo robot IP address on your local network
JIBO_IP=192.168.1.217

# LLM API configuration (any OpenAI-compatible endpoint)
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_TOKEN=sk-your-api-key-here
LLM_MODEL_ID=gpt-4o

# Optional: enables the web_search tool
BRAVE_API_KEY=your-brave-api-key
```
### 3. Run

```bash
npm start
# or: node index.js
```

You'll see:

```
[jibo-llm] Connecting to Jibo at 192.168.1.217…
[jibo-llm] Connected — session abc123
[jibo-llm] Ready — listening for "Hey Jibo"…
```
Say "Hey Jibo" and start talking!
## Configuration

All configuration is done via environment variables (loaded from `.env` by `dotenv`):
| Variable | Required | Default | Description |
|---|---|---|---|
| `JIBO_IP` | No | `192.168.1.217` | Jibo's IP address on your LAN |
| `LLM_BASE_URL` | No | `https://api.openai.com/v1` | Base URL for the chat completions API |
| `LLM_API_TOKEN` | Yes | — | API key for the LLM provider |
| `LLM_MODEL_ID` | No | `gpt-4o` | Model identifier to use |
| `BRAVE_API_KEY` | No | — | Brave Search API key (enables the `web_search` tool) |
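In practice these are read once at startup, with the defaults shown above applied in code (a sketch; the actual constant names in index.js may differ):

```js
// Sketch: reading the configuration table above. Constant names are illustrative.
import 'dotenv/config';

const JIBO_IP       = process.env.JIBO_IP      ?? '192.168.1.217';
const LLM_BASE_URL  = process.env.LLM_BASE_URL ?? 'https://api.openai.com/v1';
const LLM_MODEL_ID  = process.env.LLM_MODEL_ID ?? 'gpt-4o';
const LLM_API_TOKEN = process.env.LLM_API_TOKEN;

if (!LLM_API_TOKEN) throw new Error('LLM_API_TOKEN is required');
```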
### Using alternative LLM providers
Since jibo-llm uses the OpenAI SDK, any provider with a compatible chat completions endpoint works:
```bash
# Ollama (local)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_TOKEN=ollama
LLM_MODEL_ID=llama3

# LM Studio (local)
LLM_BASE_URL=http://localhost:1234/v1
LLM_API_TOKEN=lm-studio
LLM_MODEL_ID=local-model

# OpenRouter
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_TOKEN=sk-or-...
LLM_MODEL_ID=anthropic/claude-sonnet-4
```
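Under the hood, the swap is just a different `baseURL` on the same client (a sketch using the openai SDK's standard constructor options):

```js
// The same client works against any of the endpoints above.
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: process.env.LLM_BASE_URL,
  apiKey: process.env.LLM_API_TOKEN,
});

const res = await client.chat.completions.create({
  model: process.env.LLM_MODEL_ID,
  messages: [{ role: 'user', content: 'Hello, Jibo!' }],
});
console.log(res.choices[0].message.content);
```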
## Available Tools
The LLM can call any of these tools during a conversation:
### Communication

| Tool | Description |
|---|---|
| `say` | Speak ESML-formatted text through Jibo's speaker. Queued and chained so multiple `say` calls play in order. |
| `listen` | Open the microphone and transcribe user speech. Waits for pending speech to finish first. |
| `end_conversation` | Gracefully end the conversation (no further listening). |
### Camera

| Tool | Description |
|---|---|
| `take_photo` | Capture a photo from Jibo's camera. The image is sent to the LLM as a base64 JPEG for visual understanding. |
### Display

| Tool | Description |
|---|---|
| `show_text` | Display word-wrapped text on Jibo's screen. |
| `show_image` | Display an image from a URL on Jibo's screen. |
| `show_eye` | Restore the default eye animation. |
### Movement

| Tool | Description |
|---|---|
| `look_at_angle` | Turn Jibo's head — theta (yaw, ±180°) and psi (pitch, ±30°). |
### Audio

| Tool | Description |
|---|---|
| `set_volume` | Set speaker volume from 0.0 to 1.0. |
### Web

| Tool | Description |
|---|---|
| `web_search` | Search the web via the Brave Search API. Supports result count and freshness filters. |
| `fetch_url` | Fetch and read a web page. Prefers markdown via Cloudflare content negotiation, falling back to HTML→text conversion. |
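Each tool is declared in OpenAI function-calling format. A sketch of what the `web_search` entry might look like (the parameter names here are assumptions based on the table above, not the exact schema in tools.js):

```js
// Illustrative tool schema; the real parameter names in tools.js may differ.
const webSearchTool = {
  type: 'function',
  function: {
    name: 'web_search',
    description: 'Search the web via the Brave Search API.',
    parameters: {
      type: 'object',
      properties: {
        query: { type: 'string', description: 'The search query' },
        count: { type: 'integer', description: 'Number of results to return' },
        freshness: { type: 'string', description: 'Recency filter (e.g. past day or week)' },
      },
      required: ['query'],
    },
  },
};
```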
## ESML (Embodied Speech Markup Language)

ESML is how Jibo speaks expressively. The system prompt includes a full reference (`esml-reference.js`) that teaches the LLM to use these tags inside `say` calls:
```xml
<!-- Emotional reaction (most common pattern) -->
<anim cat='happy' nonBlocking='true' endNeutral='true'/> That's great news!

<!-- Voice sound (laugh, sigh, greeting) -->
<ssa cat='laughing' nonBlocking='true'/> That's hilarious!

<!-- Sound effect -->
<sfx cat='drumroll'/> And the answer is...

<!-- Dance (always needs a filter) -->
<anim cat='dance' filter='music, rom-silly'/> Watch this!

<!-- Emoji on screen -->
<anim cat='emoji' filter='!(hf), &(heart)' nonBlocking='true'/> I love that!

<!-- Dramatic pause -->
And then... <break size='1.0'/> nothing happened.
```
A `sanitizeForTTS()` function in `tools.js` provides defense in depth, stripping markdown, LaTeX, and invalid tags before they reach Jibo's TTS engine.
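A sketch of the kind of cleanup such a function performs (the actual implementation in tools.js may differ):

```js
// Illustrative sanitizer: strip markdown/LaTeX and drop any tag that is not ESML.
function sanitizeForTTS(text) {
  return text
    .replace(/`{3}[\s\S]*?`{3}/g, '')  // fenced code blocks
    .replace(/\$\$?[^$]+\$\$?/g, '')   // inline and display LaTeX
    .replace(/[*_#`]/g, '')            // markdown punctuation
    .replace(/<(?!\/?(anim|ssa|sfx|break)\b)[^>]*>/g, '') // keep only ESML tags
    .trim();
}
```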
## How the Agent Loop Works

```
User says "Hey Jibo" ──▶ hotword event fires
            │
            ▼
Play acknowledgment animation
            │
            ▼
Listen for initial speech (15s timeout)
            │
            ▼
Build message history [system prompt, user text]
            │
            ▼
  ┌─── Agent Loop (max 25 turns) ◀───────┐
  │                                      │
  │ 1. Prune old images from context     │
  │ 2. Call LLM                          │
  │ 3. If no tool calls → done           │
  │ 4. Sort tools: say → actions → listen│
  │ 5. Execute each tool                 │
  │ 6. Push results to messages          │
  │ 7. If end_conversation → done        │
  │ 8. Loop ─────────────────────────────┘
  │
  ▼
Conversation complete
Resume hotword listening
```
Key behaviors:
- Speech chaining: Multiple
saycalls are queued via a promise chain so they play sequentially without overlap. - Tool ordering:
sayexecutes first, then actions (photo, search, etc.), thenlisten/end_conversationlast. - Graceful limits: At turn 24 of 25, a system message nudges the LLM to wrap up.
- Image pruning: Only the 2 most recent photos are kept in context to manage token usage.
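Condensed into code, one pass of the loop looks roughly like this (a sketch; `pruneOldImages` and the `ORDER` table are illustrative names, and the real loop in index.js handles aborts and errors as well):

```js
// Sketch of the agent loop. pruneOldImages and ORDER are illustrative names.
const ORDER = { say: 0, listen: 2, end_conversation: 2 }; // other tools default to 1

for (let turn = 0; turn < 25; turn++) {
  pruneOldImages(messages, 2); // keep only the 2 most recent photos

  const res = await client.chat.completions.create({ model, messages, tools });
  const msg = res.choices[0].message;
  messages.push(msg);

  const calls = msg.tool_calls ?? [];
  if (calls.length === 0) break; // plain text reply, nothing left to execute

  // say first, then actions, then listen / end_conversation
  calls.sort((a, b) => (ORDER[a.function.name] ?? 1) - (ORDER[b.function.name] ?? 1));

  for (const call of calls) {
    const result = await executeTool(call.function.name, JSON.parse(call.function.arguments));
    messages.push({ role: 'tool', tool_call_id: call.id, content: JSON.stringify(result) });
  }
  if (calls.some((c) => c.function.name === 'end_conversation')) break;
}
```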
## Project Structure

```
jibo-llm/
├── .env.example        # Template for environment variables
├── .env                # Your local config (git-ignored)
├── index.js            # Entry point: connection, hotword handling, agent loop
├── tools.js            # Tool schemas + executeTool() dispatcher
├── esml-reference.js   # ESML documentation injected into the system prompt
├── package.json        # Dependencies and scripts
└── node_modules/       # Installed dependencies
```
## Dependencies
| Package | Version | Purpose |
|---|---|---|
| rom-control | ^2.0.1 | Jibo robot control client (speech, camera, display, motors) |
| openai | ^4.73.0 | OpenAI-compatible chat completions SDK |
| dotenv | ^16.4.5 | Load .env configuration |
## License
MIT