# jibo-llm
Give Jibo a brain again. A hotword-triggered, LLM-powered conversational agent that turns Jibo into an expressive, tool-using social robot — complete with speech, vision, web search, animations, and more.
## Overview
jibo-llm connects a Jibo robot to any OpenAI-compatible LLM (GPT-4o, Claude, local models via Ollama/LM Studio, etc.) through a real-time agent loop. When someone says "Hey Jibo", the system:
- Listens for the user's speech via Jibo's on-board microphone.
- Sends the transcript to an LLM along with a rich system prompt and tool definitions.
- Executes tool calls the LLM makes — speaking, animating, taking photos, searching the web, and more.
- Loops until the conversation naturally ends or the user triggers a new hotword.
Conversations are fully interruptible: saying "Hey Jibo" mid-conversation aborts the current exchange and starts a fresh one via AbortController.
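The interruption pattern in miniature (a hedged sketch: the `hotword` event name, `robot`, and `runConversation` are illustrative, not the actual rom-control API):

```js
// Sketch of the per-conversation abort pattern described above.
// The event name and helpers are illustrative, not the real rom-control API.
let current = null; // AbortController for the conversation in flight, if any

robot.on('hotword', async () => {
  current?.abort(); // a new "Hey Jibo" cancels the running exchange

  const controller = new AbortController();
  current = controller;
  try {
    await runConversation(controller.signal); // signal threads through LLM + tool calls
  } catch (err) {
    if (err.name !== 'AbortError') throw err; // aborts are expected, not failures
  }
});
```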
## Architecture

```
┌──────────────┐    hotword   ┌──────────────┐   tool calls  ┌───────────────┐
│  Jibo Robot  │ ───────────▶ │   index.js   │ ◀───────────▶ │  LLM (OpenAI  │
│  (rom-ctrl)  │ ◀─────────── │  Agent Loop  │               │  compatible)  │
│              │  say/listen  │              │               └───────────────┘
│  • mic       │  photo/look  │   tools.js   │   web search  ┌───────────────┐
│  • speaker   │   display    │  (executor)  │ ─────────────▶│  Brave Search │
│  • camera    │              │              │               └───────────────┘
│  • screen    │              │ esml-ref.js  │
│  • motors    │              │ (prompt ref) │
└──────────────┘              └──────────────┘
```
| File | Purpose |
|---|---|
| `index.js` | Entry point — connects to Jibo, listens for the hotword, runs the agent loop with the LLM. |
| `tools.js` | Defines all tool schemas (OpenAI function-calling format) and the `executeTool()` dispatcher. |
| `esml-reference.js` | ESML (Embodied Speech Markup Language) cheat sheet injected into the system prompt so the LLM knows how to animate Jibo expressively. |
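To make the division of labor concrete, a dispatcher like this typically reduces to a switch over tool names. A simplified sketch — only the `executeTool()` name comes from tools.js; the branch bodies and the `ctx` helpers are assumptions:

```js
// Simplified sketch of a tool dispatcher. Only the executeTool() name comes
// from tools.js; the branches and ctx helpers here are illustrative.
async function executeTool(name, args, ctx) {
  switch (name) {
    case 'say':
      return ctx.say(args.text);            // enqueue ESML speech
    case 'take_photo':
      return ctx.takePhoto();               // returns a base64 JPEG
    case 'web_search':
      return ctx.search(args.query, args);  // Brave Search call
    case 'end_conversation':
      return { done: true };
    default:
      return { error: `Unknown tool: ${name}` };
  }
}
```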
## Features
- 🗣️ Natural conversation — multi-turn dialogue with speech recognition and TTS.
- 🎭 Expressive animations — the LLM uses ESML tags to trigger emotions, dances, emojis, and sound effects inline with speech.
- 📷 Vision — Jibo can take photos and the LLM receives the image for visual understanding.
- 🔍 Web search — real-time Brave Search integration for up-to-date answers.
- 🌐 URL fetching — reads web pages (with Cloudflare Markdown for Agents support) so Jibo can summarize articles.
- 🖥️ Display control — show text, images, or restore the default eye on Jibo's screen.
- 🤖 Head movement — point Jibo's head at specific angles (yaw / pitch).
- 🔊 Volume control — adjust speaker volume on the fly.
- ⚡ Interruptible — a new hotword instantly aborts a running conversation via `AbortController`.
- 🔄 Retry logic — automatic retry with exponential backoff for transient LLM errors (429, 5xx, network); see the sketch after this list.
- 🧹 Context management — old photos are pruned from context to control token cost.
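The retry behavior can be pictured as a small wrapper around the LLM call (a sketch, not the actual implementation; `withRetry` is a hypothetical name):

```js
// Hypothetical helper illustrating retry with exponential backoff
// for 429s, 5xx responses, and network errors.
async function withRetry(fn, retries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = err.status ?? err.response?.status;
      const transient =
        status === 429 || (status >= 500 && status < 600) || status === undefined;
      if (!transient || attempt >= retries) throw err;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt)); // 1s, 2s, 4s…
    }
  }
}
```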
## Prerequisites

- Node.js ≥ 18 (for native `fetch` and `AbortController`)
- A Jibo robot running with int-developer mode enabled
- An OpenAI-compatible API endpoint (OpenAI, Anthropic via proxy, Ollama, LM Studio, etc.)
- (Optional) A Brave Search API key for the `web_search` tool
## Quick Start

### 1. Clone & install

```bash
git clone https://github.com/niceduckdev/jibo-llm.git
cd jibo-llm
npm install
```
### 2. Configure environment

```bash
cp .env.example .env
```

Edit `.env` with your values:

```bash
# Jibo robot IP address on your local network
JIBO_IP=192.168.1.217

# LLM API configuration (any OpenAI-compatible endpoint)
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_TOKEN=sk-your-api-key-here
LLM_MODEL_ID=gpt-4o

# Optional: enables the web_search tool
BRAVE_API_KEY=your-brave-api-key
```
### 3. Run

```bash
npm start
# or: node index.js
```

You'll see:

```
[jibo-llm] Connecting to Jibo at 192.168.1.217…
[jibo-llm] Connected — session abc123
[jibo-llm] Ready — listening for "Hey Jibo"…
```
Say "Hey Jibo" and start talking!
## Configuration

All configuration is done via environment variables (loaded from `.env` by `dotenv`):
| Variable | Required | Default | Description |
|---|---|---|---|
| `JIBO_IP` | No | `192.168.1.217` | Jibo's IP address on your LAN |
| `LLM_BASE_URL` | No | `https://api.openai.com/v1` | Base URL for the chat completions API |
| `LLM_API_TOKEN` | Yes | — | API key for the LLM provider |
| `LLM_MODEL_ID` | No | `gpt-4o` | Model identifier to use |
| `BRAVE_API_KEY` | No | — | Brave Search API key (enables the `web_search` tool) |
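In practice these are read once at startup, with the defaults shown above applied in code (a sketch; the actual constant names in index.js may differ):

```js
// Sketch: reading the configuration table above. Constant names are illustrative.
import 'dotenv/config';

const JIBO_IP       = process.env.JIBO_IP      ?? '192.168.1.217';
const LLM_BASE_URL  = process.env.LLM_BASE_URL ?? 'https://api.openai.com/v1';
const LLM_MODEL_ID  = process.env.LLM_MODEL_ID ?? 'gpt-4o';
const LLM_API_TOKEN = process.env.LLM_API_TOKEN;

if (!LLM_API_TOKEN) throw new Error('LLM_API_TOKEN is required');
```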
### Using alternative LLM providers
Since jibo-llm uses the OpenAI SDK, any provider with a compatible chat completions endpoint works:
```bash
# Ollama (local)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_TOKEN=ollama
LLM_MODEL_ID=llama3

# LM Studio (local)
LLM_BASE_URL=http://localhost:1234/v1
LLM_API_TOKEN=lm-studio
LLM_MODEL_ID=local-model

# OpenRouter
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_TOKEN=sk-or-...
LLM_MODEL_ID=anthropic/claude-sonnet-4
```
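Under the hood, the swap is just a different `baseURL` on the same client (a sketch using the openai SDK's standard constructor options):

```js
// The same client works against any of the endpoints above.
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: process.env.LLM_BASE_URL,
  apiKey: process.env.LLM_API_TOKEN,
});

const res = await client.chat.completions.create({
  model: process.env.LLM_MODEL_ID,
  messages: [{ role: 'user', content: 'Hello, Jibo!' }],
});
console.log(res.choices[0].message.content);
```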
## Available Tools
The LLM can call any of these tools during a conversation:
### Communication

| Tool | Description |
|---|---|
| `say` | Speak ESML-formatted text through Jibo's speaker. Queued and chained so multiple `say` calls play in order. |
| `listen` | Open the microphone and transcribe user speech. Waits for pending speech to finish first. |
| `end_conversation` | Gracefully end the conversation (no further listening). |
### Camera

| Tool | Description |
|---|---|
| `take_photo` | Capture a photo from Jibo's camera. The image is sent to the LLM as a base64 JPEG for visual understanding. |
### Display

| Tool | Description |
|---|---|
| `show_text` | Display word-wrapped text on Jibo's screen. |
| `show_image` | Display an image from a URL on Jibo's screen. |
| `show_eye` | Restore the default eye animation. |
### Movement

| Tool | Description |
|---|---|
| `look_at_angle` | Turn Jibo's head — theta (yaw, ±180°) and psi (pitch, ±30°). |
### Audio

| Tool | Description |
|---|---|
| `set_volume` | Set speaker volume from 0.0 to 1.0. |
### Web

| Tool | Description |
|---|---|
| `web_search` | Search the web via the Brave Search API. Supports result count and freshness filters. |
| `fetch_url` | Fetch and read a web page. Prefers markdown via Cloudflare content negotiation, falling back to HTML→text conversion. |
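Each tool is declared in OpenAI function-calling format. A sketch of what the `web_search` entry might look like (the parameter names here are assumptions based on the table above, not the exact schema in tools.js):

```js
// Illustrative tool schema; the real parameter names in tools.js may differ.
const webSearchTool = {
  type: 'function',
  function: {
    name: 'web_search',
    description: 'Search the web via the Brave Search API.',
    parameters: {
      type: 'object',
      properties: {
        query: { type: 'string', description: 'The search query' },
        count: { type: 'integer', description: 'Number of results to return' },
        freshness: { type: 'string', description: 'Recency filter (e.g. past day or week)' },
      },
      required: ['query'],
    },
  },
};
```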
## ESML (Embodied Speech Markup Language)

ESML is how Jibo speaks expressively. The system prompt includes a full reference (`esml-reference.js`) that teaches the LLM to use these tags inside `say` calls:
```xml
<!-- Emotional reaction (most common pattern) -->
<anim cat='happy' nonBlocking='true' endNeutral='true'/> That's great news!

<!-- Voice sound (laugh, sigh, greeting) -->
<ssa cat='laughing' nonBlocking='true'/> That's hilarious!

<!-- Sound effect -->
<sfx cat='drumroll'/> And the answer is...

<!-- Dance (always needs a filter) -->
<anim cat='dance' filter='music, rom-silly'/> Watch this!

<!-- Emoji on screen -->
<anim cat='emoji' filter='!(hf), &(heart)' nonBlocking='true'/> I love that!

<!-- Dramatic pause -->
And then... <break size='1.0'/> nothing happened.
```
A `sanitizeForTTS()` function in `tools.js` provides defense in depth, stripping markdown, LaTeX, and invalid tags before they reach Jibo's TTS engine.
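A sketch of the kind of cleanup such a function performs (the actual implementation in tools.js may differ):

```js
// Illustrative sanitizer: strip markdown/LaTeX and drop any tag that is not ESML.
function sanitizeForTTS(text) {
  return text
    .replace(/`{3}[\s\S]*?`{3}/g, '')  // fenced code blocks
    .replace(/\$\$?[^$]+\$\$?/g, '')   // inline and display LaTeX
    .replace(/[*_#`]/g, '')            // markdown punctuation
    .replace(/<(?!\/?(anim|ssa|sfx|break)\b)[^>]*>/g, '') // keep only ESML tags
    .trim();
}
```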
## How the Agent Loop Works

```
User says "Hey Jibo" ──▶ hotword event fires
            │
            ▼
Play acknowledgment animation
            │
            ▼
Listen for initial speech (15s timeout)
            │
            ▼
Build message history [system prompt, user text]
            │
            ▼
  ┌─── Agent Loop (max 25 turns) ◀───────┐
  │                                      │
  │ 1. Prune old images from context     │
  │ 2. Call LLM                          │
  │ 3. If no tool calls → done           │
  │ 4. Sort tools: say → actions → listen│
  │ 5. Execute each tool                 │
  │ 6. Push results to messages          │
  │ 7. If end_conversation → done        │
  │ 8. Loop ─────────────────────────────┘
  │
  ▼
Conversation complete
Resume hotword listening
```
Key behaviors:
- Speech chaining: Multiple
saycalls are queued via a promise chain so they play sequentially without overlap. - Tool ordering:
sayexecutes first, then actions (photo, search, etc.), thenlisten/end_conversationlast. - Graceful limits: At turn 24 of 25, a system message nudges the LLM to wrap up.
- Image pruning: Only the 2 most recent photos are kept in context to manage token usage.
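Condensed into code, one pass of the loop looks roughly like this (a sketch; `pruneOldImages` and the `ORDER` table are illustrative names, and the real loop in index.js handles aborts and errors as well):

```js
// Sketch of the agent loop. pruneOldImages and ORDER are illustrative names.
const ORDER = { say: 0, listen: 2, end_conversation: 2 }; // other tools default to 1

for (let turn = 0; turn < 25; turn++) {
  pruneOldImages(messages, 2); // keep only the 2 most recent photos

  const res = await client.chat.completions.create({ model, messages, tools });
  const msg = res.choices[0].message;
  messages.push(msg);

  const calls = msg.tool_calls ?? [];
  if (calls.length === 0) break; // plain text reply, nothing left to execute

  // say first, then actions, then listen / end_conversation
  calls.sort((a, b) => (ORDER[a.function.name] ?? 1) - (ORDER[b.function.name] ?? 1));

  for (const call of calls) {
    const result = await executeTool(call.function.name, JSON.parse(call.function.arguments));
    messages.push({ role: 'tool', tool_call_id: call.id, content: JSON.stringify(result) });
  }
  if (calls.some((c) => c.function.name === 'end_conversation')) break;
}
```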
## Project Structure

```
jibo-llm/
├── .env.example        # Template for environment variables
├── .env                # Your local config (git-ignored)
├── index.js            # Entry point: connection, hotword handling, agent loop
├── tools.js            # Tool schemas + executeTool() dispatcher
├── esml-reference.js   # ESML documentation injected into the system prompt
├── package.json        # Dependencies and scripts
└── node_modules/       # Installed dependencies
```
## Dependencies
| Package | Version | Purpose |
|---|---|---|
| rom-control | ^2.0.1 | Jibo robot control client (speech, camera, display, motors) |
| openai | ^4.73.0 | OpenAI-compatible chat completions SDK |
| dotenv | ^16.4.5 | Load .env configuration |
## License
MIT