jibo-llm

Give Jibo a brain again. A hotword-triggered, LLM-powered conversational agent that turns Jibo into an expressive, tool-using social robot — complete with speech, vision, web search, animations, and more.


Overview

jibo-llm connects a Jibo robot to any OpenAI-compatible LLM (GPT-4o, Claude, local models via Ollama/LM Studio, etc.) through a real-time agent loop. When someone says "Hey Jibo", the system:

  1. Listens for the user's speech via Jibo's on-board microphone.
  2. Sends the transcript to an LLM along with a rich system prompt and tool definitions.
  3. Executes tool calls the LLM makes — speaking, animating, taking photos, searching the web, and more.
  4. Loops until the conversation naturally ends or the user triggers a new hotword.

Conversations are fully interruptible: saying "Hey Jibo" mid-conversation aborts the current exchange and starts a fresh one via AbortController.
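
For a picture of how that works, here is a minimal sketch of the interruption pattern — illustrative names only (onHotword and runConversation are stand-ins, not the actual index.js functions):

// Each conversation owns an AbortController; a new hotword aborts the
// previous conversation before starting a fresh one.
let current = null;

async function onHotword() {
  current?.abort();                           // cancel any running conversation
  current = new AbortController();
  try {
    await runConversation(current.signal);    // hypothetical agent-loop entry point
  } catch (err) {
    if (err.name !== 'AbortError') throw err; // aborts are expected, not errors
  }
}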


Architecture

┌──────────────┐   hotword    ┌──────────────┐   tool calls   ┌───────────────┐
│  Jibo Robot  │ ──────────▶  │   index.js   │ ◀───────────▶  │  LLM (OpenAI  │
│  (rom-ctrl)  │ ◀──────────  │  Agent Loop  │                │  compatible)  │
│              │   say/listen │              │                └───────────────┘
│  • mic       │   photo/look │  tools.js    │   web search   ┌───────────────┐
│  • speaker   │   display    │  (executor)  │ ────────────▶  │  Brave Search │
│  • camera    │              │              │                └───────────────┘
│  • screen    │              │  esml-ref.js │
│  • motors    │              │  (prompt ref)│
└──────────────┘              └──────────────┘
File               Purpose
index.js           Entry point — connects to Jibo, listens for the hotword, runs the agent loop with the LLM.
tools.js           Defines all tool schemas (OpenAI function-calling format) and the executeTool() dispatcher.
esml-reference.js  ESML (Embodied Speech Markup Language) cheat sheet injected into the system prompt so the LLM knows how to animate Jibo expressively.
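
To make the tool plumbing concrete, here is a sketch of a schema plus dispatcher — the shapes follow the OpenAI function-calling format, but the names and bodies are illustrative, not the real tools.js contents:

// One tool schema in OpenAI function-calling format, plus a dispatcher.
const toolSchemas = [{
  type: 'function',
  function: {
    name: 'say',
    description: "Speak ESML-formatted text through Jibo's speaker.",
    parameters: {
      type: 'object',
      properties: { text: { type: 'string', description: 'ESML text to speak' } },
      required: ['text'],
    },
  },
}];

async function executeTool(call) {
  const args = JSON.parse(call.function.arguments); // the LLM sends arguments as a JSON string
  switch (call.function.name) {
    case 'say': return speak(args.text);            // hypothetical helper
    default:    return `Unknown tool: ${call.function.name}`;
  }
}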

Features

  • 🗣️ Natural conversation — multi-turn dialogue with speech recognition and TTS.
  • 🎭 Expressive animations — the LLM uses ESML tags to trigger emotions, dances, emojis, and sound effects inline with speech.
  • 📷 Vision — Jibo can take photos and the LLM receives the image for visual understanding.
  • 🔍 Web search — real-time Brave Search integration for up-to-date answers.
  • 🌐 URL fetching — reads web pages (with Cloudflare Markdown for Agents support) so Jibo can summarize articles.
  • 🖥️ Display control — show text, images, or restore the default eye on Jibo's screen.
  • 🤖 Head movement — point Jibo's head at specific angles (yaw / pitch).
  • 🔊 Volume control — adjust speaker volume on the fly.
  • 🛑 Interruptible — new hotword instantly aborts a running conversation via AbortController.
  • 🔄 Retry logic — automatic retry with exponential backoff for transient LLM errors (429, 5xx, network); a sketch follows this list.
  • 🧹 Context management — old photos are pruned from context to control token cost.
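
The retry behavior can be sketched as a small wrapper — illustrative code with a hypothetical isTransient() check, not the actual implementation:

// Retry with exponential backoff: 1s, 2s, 4s, ... between attempts.
async function withRetry(fn, attempts = 4) {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i >= attempts - 1 || !isTransient(err)) throw err; // hypothetical 429/5xx/network check
      await new Promise(r => setTimeout(r, 1000 * 2 ** i));
    }
  }
}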

Prerequisites

  • Node.js ≥ 18 (for native fetch and AbortController)
  • A Jibo robot running with int-developer mode enabled
  • An OpenAI-compatible API endpoint (OpenAI, Anthropic via proxy, Ollama, LM Studio, etc.)
  • (Optional) Brave Search API key for the web_search tool

Quick Start

1. Clone & install

git clone https://github.com/niceduckdev/jibo-llm.git
cd jibo-llm
npm install

2. Configure environment

cp .env.example .env

Edit .env with your values:

# Jibo robot IP address on your local network
JIBO_IP=192.168.1.217

# LLM API configuration (any OpenAI-compatible endpoint)
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_TOKEN=sk-your-api-key-here
LLM_MODEL_ID=gpt-4o

# Optional: enables the web_search tool
BRAVE_API_KEY=your-brave-api-key

3. Run

npm start
# or: node index.js

You'll see:

[jibo-llm] Connecting to Jibo at 192.168.1.217…
[jibo-llm] Connected — session abc123
[jibo-llm] Ready — listening for "Hey Jibo"…

Say "Hey Jibo" and start talking!


Configuration

All configuration is done via environment variables (loaded from .env by dotenv):

Variable       Required  Default                    Description
JIBO_IP        No        192.168.1.217              Jibo's IP address on your LAN
LLM_BASE_URL   No        https://api.openai.com/v1  Base URL for the chat completions API
LLM_API_TOKEN  Yes       —                          API key for the LLM provider
LLM_MODEL_ID   No        gpt-4o                     Model identifier to use
BRAVE_API_KEY  No        —                          Brave Search API key (enables the web_search tool)
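
Reading the table above into code boils down to something like this sketch (the defaults match the table; the env var names are the real ones):

// Illustrative config loading with the documented defaults.
import 'dotenv/config'; // loads .env into process.env

const JIBO_IP       = process.env.JIBO_IP       || '192.168.1.217';
const LLM_BASE_URL  = process.env.LLM_BASE_URL  || 'https://api.openai.com/v1';
const LLM_MODEL_ID  = process.env.LLM_MODEL_ID  || 'gpt-4o';
const LLM_API_TOKEN = process.env.LLM_API_TOKEN;  // required
if (!LLM_API_TOKEN) throw new Error('LLM_API_TOKEN is required');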

Using alternative LLM providers

Since jibo-llm uses the OpenAI SDK, any provider with a compatible chat completions endpoint works:

# Ollama (local)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_TOKEN=ollama
LLM_MODEL_ID=llama3

# LM Studio (local)
LLM_BASE_URL=http://localhost:1234/v1
LLM_API_TOKEN=lm-studio
LLM_MODEL_ID=local-model

# OpenRouter
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_TOKEN=sk-or-...
LLM_MODEL_ID=anthropic/claude-sonnet-4
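
In code, switching providers is just a matter of the client constructor — a sketch using the openai SDK's baseURL option:

import OpenAI from 'openai';

// Any OpenAI-compatible endpoint works; only the base URL and key change.
const client = new OpenAI({
  baseURL: process.env.LLM_BASE_URL, // e.g. http://localhost:11434/v1 for Ollama
  apiKey:  process.env.LLM_API_TOKEN,
});

const res = await client.chat.completions.create({
  model: process.env.LLM_MODEL_ID,
  messages: [{ role: 'user', content: 'Hello from Jibo!' }],
});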

Available Tools

The LLM can call any of these tools during a conversation:

Communication

Tool              Description
say               Speak ESML-formatted text through Jibo's speaker. Queued and chained so multiple say calls play in order (sketched below).
listen            Open the microphone and transcribe user speech. Waits for pending speech to finish first.
end_conversation  Gracefully end the conversation (no further listening).
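
The queueing behind say can be pictured as a promise chain — a sketch with a hypothetical speakOnRobot() helper standing in for the actual robot call:

// Each say call appends to the chain, so utterances never overlap.
let speechQueue = Promise.resolve();

function say(esmlText) {
  speechQueue = speechQueue.then(() => speakOnRobot(esmlText));
  return speechQueue; // resolves when this utterance has finished playing
}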

Camera

Tool        Description
take_photo  Capture a photo from Jibo's camera. The image is sent to the LLM as a base64 JPEG for visual understanding.
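
The photo reaches the model as a standard OpenAI-style image message, roughly like this sketch (jpegBase64 is assumed to hold the captured frame):

// Append the captured photo to the conversation as a data-URI image part.
messages.push({
  role: 'user',
  content: [
    { type: 'text', text: 'Photo captured by take_photo:' },
    { type: 'image_url', image_url: { url: `data:image/jpeg;base64,${jpegBase64}` } },
  ],
});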

Display

Tool        Description
show_text   Display word-wrapped text on Jibo's screen.
show_image  Display an image from a URL on Jibo's screen.
show_eye    Restore the default eye animation.

Movement

Tool           Description
look_at_angle  Turn Jibo's head — theta (yaw ±180°) and psi (pitch ±30°).

Audio

Tool        Description
set_volume  Set speaker volume from 0.0 to 1.0.

Web

Tool        Description
web_search  Search the web via the Brave Search API. Supports result count and freshness filters (sketched below).
fetch_url   Fetch and read a web page. Prefers markdown via Cloudflare content negotiation, falls back to HTML→text conversion.
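
For reference, a web_search call against the Brave Search API can be sketched like this (endpoint and header are Brave's documented ones; the surrounding code is illustrative):

// Minimal Brave Search request; returns title/url/description triples.
async function webSearch(query, count = 5) {
  const url = new URL('https://api.search.brave.com/res/v1/web/search');
  url.searchParams.set('q', query);
  url.searchParams.set('count', String(count));
  const res = await fetch(url, {
    headers: { 'X-Subscription-Token': process.env.BRAVE_API_KEY },
  });
  if (!res.ok) throw new Error(`Brave Search failed: ${res.status}`);
  const data = await res.json();
  return (data.web?.results ?? []).map(r => ({
    title: r.title, url: r.url, description: r.description,
  }));
}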

ESML (Embodied Speech Markup Language)

ESML is how Jibo speaks expressively. The system prompt includes a full reference (esml-reference.js) that teaches the LLM to use these tags inside say calls:

<!-- Emotional reaction (most common pattern) -->
<anim cat='happy' nonBlocking='true' endNeutral='true'/> That's great news!

<!-- Voice sound (laugh, sigh, greeting) -->
<ssa cat='laughing' nonBlocking='true'/> That's hilarious!

<!-- Sound effect -->
<sfx cat='drumroll'/> And the answer is...

<!-- Dance (always needs a filter) -->
<anim cat='dance' filter='music, rom-silly'/> Watch this!

<!-- Emoji on screen -->
<anim cat='emoji' filter='!(hf), &(heart)' nonBlocking='true'/> I love that!

<!-- Dramatic pause -->
And then... <break size='1.0'/> nothing happened.

A sanitizeForTTS() function in tools.js provides defense-in-depth by stripping markdown, LaTeX, and invalid tags before they reach Jibo's TTS engine.
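
A sanitizer in that spirit might look like the following sketch (illustrative, not the actual tools.js implementation):

// Stash valid ESML tags, strip everything risky, then restore the tags.
const ESML_TAG = /<(anim|ssa|sfx|break)\b[^>]*\/>/g;

function sanitizeForTTS(text) {
  const kept = [];
  text = text.replace(ESML_TAG, m => `\u0000${kept.push(m) - 1}\u0000`);
  text = text
    .replace(/[*_`#>]+/g, '')   // markdown syntax
    .replace(/\$[^$]*\$/g, '')  // inline LaTeX
    .replace(/<[^>]*>/g, '');   // any remaining (invalid) tags
  return text.replace(/\u0000(\d+)\u0000/g, (_, i) => kept[i]);
}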


How the Agent Loop Works

User says "Hey Jibo" ──▶ hotword event fires
                              │
                              ▼
                    Play acknowledgment animation
                              │
                              ▼
                    Listen for initial speech (15s timeout)
                              │
                              ▼
                    Build message history [system prompt, user text]
                              │
                              ▼
                    ┌─── Agent Loop (max 25 turns) ◀─────────┐
                    │                                        │
                    │  1. Prune old images from context      │
                    │  2. Call LLM                           │
                    │  3. If no tool calls → done            │
                    │  4. Sort tools: say → actions → listen │
                    │  5. Execute each tool                  │
                    │  6. Push results to messages           │
                    │  7. If end_conversation → done         │
                    │  8. Loop ──────────────────────────────┘
                    │
                    ▼
              Conversation complete
              Resume hotword listening

Key behaviors:

  • Speech chaining: Multiple say calls are queued via a promise chain so they play sequentially without overlap.
  • Tool ordering: say executes first, then actions (photo, search, etc.), then listen/end_conversation last.
  • Graceful limits: At turn 24 of 25, a system message nudges the LLM to wrap up.
  • Image pruning: Only the 2 most recent photos are kept in context to manage token usage.
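
Condensed to code, the loop looks roughly like this — an illustrative skeleton reusing the client sketch from the configuration section, with pruneOldImages() and sortToolCalls() as hypothetical helpers:

const MAX_TURNS = 25;

for (let turn = 0; turn < MAX_TURNS; turn++) {
  pruneOldImages(messages);            // keep only the 2 most recent photos
  if (turn === MAX_TURNS - 2) {        // turn 24 of 25: nudge the LLM to wrap up
    messages.push({ role: 'system', content: 'Please wrap up the conversation now.' });
  }
  const res = await client.chat.completions.create({ model, messages, tools });
  const msg = res.choices[0].message;
  messages.push(msg);
  if (!msg.tool_calls?.length) break;  // plain text reply → conversation done
  let ended = false;
  for (const call of sortToolCalls(msg.tool_calls)) { // say → actions → listen/end last
    const result = await executeTool(call);
    messages.push({ role: 'tool', tool_call_id: call.id, content: String(result) });
    if (call.function.name === 'end_conversation') ended = true;
  }
  if (ended) break;
}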

Project Structure

jibo-llm/
├── .env.example        # Template for environment variables
├── .env                # Your local config (git-ignored)
├── index.js            # Entry point: connection, hotword handling, agent loop
├── tools.js            # Tool schemas + executeTool() dispatcher
├── esml-reference.js   # ESML documentation injected into the system prompt
├── package.json        # Dependencies and scripts
└── node_modules/       # Installed dependencies

Dependencies

Package      Version  Purpose
rom-control  ^2.0.1   Jibo robot control client (speech, camera, display, motors)
openai       ^4.73.0  OpenAI-compatible chat completions SDK
dotenv       ^16.4.5  Load .env configuration

License

MIT