# jibo-llm

> **Give Jibo a brain again.** A hotword-triggered, LLM-powered conversational agent that turns Jibo into an expressive, tool-using social robot — complete with speech, vision, web search, animations, and more.

---
## Overview

**jibo-llm** connects a Jibo robot to any OpenAI-compatible LLM (GPT-4o, Claude, local models via Ollama/LM Studio, etc.) through a real-time agent loop. When someone says **"Hey Jibo"**, the system:

1. **Listens** for the user's speech via Jibo's on-board microphone.
2. **Sends** the transcript to an LLM along with a rich system prompt and tool definitions.
3. **Executes** tool calls the LLM makes — speaking, animating, taking photos, searching the web, and more.
4. **Loops** until the conversation naturally ends or the user triggers a new hotword.

Conversations are fully interruptible: saying "Hey Jibo" mid-conversation aborts the current exchange and starts a fresh one via `AbortController`.
---

## Architecture

```
┌──────────────┐    hotword     ┌──────────────┐   tool calls   ┌───────────────┐
│  Jibo Robot  │ ──────────▶    │   index.js   │ ◀───────────▶  │  LLM (OpenAI  │
│  (rom-ctrl)  │ ◀──────────    │  Agent Loop  │                │  compatible)  │
│              │  say/listen    │              │                └───────────────┘
│  • mic       │  photo/look    │   tools.js   │   web search   ┌───────────────┐
│  • speaker   │  display       │  (executor)  │ ─────────────▶ │  Brave Search │
│  • camera    │                │              │                └───────────────┘
│  • screen    │                │ esml-ref.js  │
│  • motors    │                │ (prompt ref) │
└──────────────┘                └──────────────┘
```

| File | Purpose |
|------|---------|
| `index.js` | Entry point — connects to Jibo, listens for hotword, runs the agent loop with the LLM. |
| `tools.js` | Defines all tool schemas (OpenAI function-calling format) and the `executeTool()` dispatcher. |
| `esml-reference.js` | ESML (Embodied Speech Markup Language) cheat sheet injected into the system prompt so the LLM knows how to animate Jibo expressively. |
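To make the tools.js entry concrete, here is roughly what one schema and the dispatcher shape look like in OpenAI function-calling format — a sketch only; the field text and handler map are illustrative, not the real definitions:

```javascript
// Sketch of one tool schema in OpenAI function-calling format.
// Description text is illustrative, not copied from tools.js.
const sayTool = {
  type: 'function',
  function: {
    name: 'say',
    description: "Speak ESML-formatted text through Jibo's speaker.",
    parameters: {
      type: 'object',
      properties: {
        text: { type: 'string', description: 'ESML-formatted text to speak' },
      },
      required: ['text'],
    },
  },
};

// Dispatcher shape: map a tool name to a handler and return a string
// result the LLM reads on its next turn.
async function executeTool(name, args, handlers) {
  const handler = handlers[name];
  if (!handler) return `Unknown tool: ${name}`;
  return handler(args);
}
```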
---

## Features

- 🗣️ **Natural conversation** — multi-turn dialogue with speech recognition and TTS.
- 🎭 **Expressive animations** — the LLM uses ESML tags to trigger emotions, dances, emojis, and sound effects inline with speech.
- 📷 **Vision** — Jibo can take photos and the LLM receives the image for visual understanding.
- 🔍 **Web search** — real-time Brave Search integration for up-to-date answers.
- 🌐 **URL fetching** — reads web pages (with Cloudflare Markdown for Agents support) so Jibo can summarize articles.
- 🖥️ **Display control** — show text, images, or restore the default eye on Jibo's screen.
- 🤖 **Head movement** — point Jibo's head at specific angles (yaw / pitch).
- 🔊 **Volume control** — adjust speaker volume on the fly.
- ⚡ **Interruptible** — new hotword instantly aborts a running conversation via `AbortController`.
- 🔄 **Retry logic** — automatic retry with exponential backoff for transient LLM errors (429, 5xx, network).
- 🧹 **Context management** — old photos are pruned from context to control token cost.
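The retry behavior can be sketched as a small backoff helper — illustrative only, not the actual index.js code; the retryable set and base delay are assumptions:

```javascript
// Sketch of retry-with-exponential-backoff for transient LLM errors.
// A missing status (plain network error) is treated as retryable.
const isRetryable = (status) =>
  status === 429 || (status >= 500 && status < 600) || status === undefined;

// Delay doubles each attempt: 500ms, 1s, 2s, 4s, ...
const backoffDelay = (attempt, baseMs = 500) => baseMs * 2 ** attempt;

async function withRetry(fn, maxAttempts = 4) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on the last attempt or on non-transient errors (e.g. 400).
      if (attempt + 1 >= maxAttempts || !isRetryable(err.status)) throw err;
      await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
    }
  }
}
```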
---

## Prerequisites

- **Node.js** ≥ 18 (for native `fetch` and `AbortController`)
- **A Jibo robot** running with int-developer mode enabled
- **An OpenAI-compatible API endpoint** (OpenAI, Anthropic via proxy, Ollama, LM Studio, etc.)
- *(Optional)* **Brave Search API key** for the `web_search` tool

---
## Quick Start

### 1. Clone & install

```bash
git clone https://github.com/niceduckdev/jibo-llm.git
cd jibo-llm
npm install
```

### 2. Configure environment

```bash
cp .env.example .env
```

Edit `.env` with your values:

```env
# Jibo robot IP address on your local network
JIBO_IP=192.168.1.217

# LLM API configuration (any OpenAI-compatible endpoint)
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_TOKEN=sk-your-api-key-here
LLM_MODEL_ID=gpt-4o

# Optional: enables the web_search tool
BRAVE_API_KEY=your-brave-api-key
```

### 3. Run

```bash
npm start
# or: node index.js
```

You'll see:

```
[jibo-llm] Connecting to Jibo at 192.168.1.217…
[jibo-llm] Connected — session abc123
[jibo-llm] Ready — listening for "Hey Jibo"…
```

Say **"Hey Jibo"** and start talking!
---

## Configuration

All configuration is done via environment variables (loaded from `.env` by [dotenv](https://www.npmjs.com/package/dotenv)):

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `JIBO_IP` | No | `192.168.1.217` | Jibo's IP address on your LAN |
| `LLM_BASE_URL` | No | `https://api.openai.com/v1` | Base URL for the chat completions API |
| `LLM_API_TOKEN` | **Yes** | — | API key for the LLM provider |
| `LLM_MODEL_ID` | No | `gpt-4o` | Model identifier to use |
| `BRAVE_API_KEY` | No | — | Brave Search API key (enables `web_search` tool) |

### Using alternative LLM providers

Since jibo-llm uses the OpenAI SDK, any provider with a compatible chat completions endpoint works:
```env
# Ollama (local)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_TOKEN=ollama
LLM_MODEL_ID=llama3

# LM Studio (local)
LLM_BASE_URL=http://localhost:1234/v1
LLM_API_TOKEN=lm-studio
LLM_MODEL_ID=local-model

# OpenRouter
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_TOKEN=sk-or-...
LLM_MODEL_ID=anthropic/claude-sonnet-4
```
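Swapping providers is only a matter of which values the client is constructed with. A sketch of the config resolution — the `llmConfig` helper is hypothetical, but the defaults match the configuration table above:

```javascript
// Hypothetical helper: resolve LLM settings from environment variables,
// falling back to the documented defaults.
function llmConfig(env) {
  return {
    baseURL: env.LLM_BASE_URL ?? 'https://api.openai.com/v1',
    apiKey: env.LLM_API_TOKEN,
    model: env.LLM_MODEL_ID ?? 'gpt-4o',
  };
}

// With the OpenAI SDK this plugs in as (not executed here):
//   const client = new OpenAI({ baseURL: cfg.baseURL, apiKey: cfg.apiKey });
//   const res = await client.chat.completions.create({ model: cfg.model, messages });
```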
---

## Available Tools

The LLM can call any of these tools during a conversation:

### Communication

| Tool | Description |
|------|-------------|
| `say` | Speak ESML-formatted text through Jibo's speaker. Queued and chained so multiple `say` calls play in order. |
| `listen` | Open the microphone and transcribe user speech. Waits for pending speech to finish first. |
| `end_conversation` | Gracefully end the conversation (no further listening). |

### Camera

| Tool | Description |
|------|-------------|
| `take_photo` | Capture a photo from Jibo's camera. The image is sent to the LLM as a base64 JPEG for visual understanding. |

### Display

| Tool | Description |
|------|-------------|
| `show_text` | Display word-wrapped text on Jibo's screen. |
| `show_image` | Display an image from a URL on Jibo's screen. |
| `show_eye` | Restore the default eye animation. |

### Movement

| Tool | Description |
|------|-------------|
| `look_at_angle` | Turn Jibo's head — `theta` (yaw ±180°) and `psi` (pitch ±30°). |

### Audio

| Tool | Description |
|------|-------------|
| `set_volume` | Set speaker volume from 0.0 to 1.0. |

### Web

| Tool | Description |
|------|-------------|
| `web_search` | Search the web via Brave Search API. Supports result count and freshness filters. |
| `fetch_url` | Fetch and read a web page. Prefers markdown via Cloudflare content negotiation, falls back to HTML→text conversion. |
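The `fetch_url` negotiation can be sketched like this — a hedged sketch only: the exact `Accept` header value and the fallback conversion are assumptions, not the real tools.js code:

```javascript
// Sketch of content negotiation: ask for markdown first, and decide from
// the response Content-Type whether an HTML→text fallback is needed.
function needsHtmlFallback(contentType) {
  return !(contentType ?? '').includes('text/markdown');
}

async function fetchReadable(url) {
  const res = await fetch(url, {
    // Assumed header: servers that support markdown negotiation
    // (e.g. Cloudflare Markdown for Agents) can honor this.
    headers: { Accept: 'text/markdown, text/html;q=0.8' },
  });
  const body = await res.text();
  return needsHtmlFallback(res.headers.get('content-type'))
    ? body.replace(/<[^>]+>/g, ' ') // crude HTML→text stand-in for the sketch
    : body; // server returned markdown directly
}
```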
---

## ESML (Embodied Speech Markup Language)

ESML is how Jibo speaks expressively. The system prompt includes a full reference (`esml-reference.js`) that teaches the LLM to use these tags inside `say` calls:

```xml
<!-- Emotional reaction (most common pattern) -->
<anim cat='happy' nonBlocking='true' endNeutral='true'/> That's great news!

<!-- Voice sound (laugh, sigh, greeting) -->
<ssa cat='laughing' nonBlocking='true'/> That's hilarious!

<!-- Sound effect -->
<sfx cat='drumroll'/> And the answer is...

<!-- Dance (always needs a filter) -->
<anim cat='dance' filter='music, rom-silly'/> Watch this!

<!-- Emoji on screen -->
<anim cat='emoji' filter='!(hf), &(heart)' nonBlocking='true'/> I love that!

<!-- Dramatic pause -->
And then... <break size='1.0'/> nothing happened.
```

A `sanitizeForTTS()` function in `tools.js` provides defense-in-depth by stripping markdown, LaTeX, and invalid tags before they reach Jibo's TTS engine.
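A stripped-down sketch of that idea — the tag whitelist and regexes here are illustrative; the real `sanitizeForTTS()` in tools.js is more thorough:

```javascript
// Simplified sketch of TTS sanitization: keep known ESML tags, drop any
// other angle-bracket tags, and strip stray markdown markers.
const ESML_TAGS = /<\/?(anim|ssa|sfx|break)\b[^>]*>/i;

function sanitizeForTTS(text) {
  return text
    .split(/(<[^>]+>)/) // isolate tag-like fragments
    .filter((part) => !part.startsWith('<') || ESML_TAGS.test(part))
    .join('')
    .replace(/[*_`#]/g, '') // strip leftover markdown emphasis/heading marks
    .replace(/\s{2,}/g, ' ')
    .trim();
}
```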
---

## How the Agent Loop Works

```
User says "Hey Jibo" ──▶ hotword event fires
          │
          ▼
Play acknowledgment animation
          │
          ▼
Listen for initial speech (15s timeout)
          │
          ▼
Build message history [system prompt, user text]
          │
          ▼
┌─── Agent Loop (max 25 turns) ◀─────────┐
│                                        │
│  1. Prune old images from context      │
│  2. Call LLM                           │
│  3. If no tool calls → done            │
│  4. Sort tools: say → actions → listen │
│  5. Execute each tool                  │
│  6. Push results to messages           │
│  7. If end_conversation → done         │
│  8. Loop ──────────────────────────────┘
          │
          ▼
Conversation complete
Resume hotword listening
```

Key behaviors:

- **Speech chaining**: Multiple `say` calls are queued via a promise chain so they play sequentially without overlap.
- **Tool ordering**: `say` executes first, then actions (photo, search, etc.), then `listen`/`end_conversation` last.
- **Graceful limits**: At turn 24 of 25, a system message nudges the LLM to wrap up.
- **Image pruning**: Only the 2 most recent photos are kept in context to manage token usage.
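The speech-chaining behavior can be sketched as a promise-chain queue — illustrative only; `speak` stands in for the real rom-control call:

```javascript
// Sketch of the say-queue idea: chain each utterance onto the previous
// promise so speech never overlaps, no matter how fast say() is called.
function makeSpeechQueue(speak) {
  let tail = Promise.resolve();
  return (text) => {
    tail = tail.then(() => speak(text));
    return tail; // resolves once THIS utterance has finished
  };
}
```

Each caller can await the returned promise to know when its own utterance is done, while later `say` calls quietly line up behind it.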
---

## Project Structure

```
jibo-llm/
├── .env.example        # Template for environment variables
├── .env                # Your local config (git-ignored)
├── index.js            # Entry point: connection, hotword handling, agent loop
├── tools.js            # Tool schemas + executeTool() dispatcher
├── esml-reference.js   # ESML documentation injected into the system prompt
├── package.json        # Dependencies and scripts
└── node_modules/       # Installed dependencies
```
---

## Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| [rom-control](https://github.com/niceduckdev/rom-control) | ^2.0.1 | Jibo robot control client (speech, camera, display, motors) |
| [openai](https://www.npmjs.com/package/openai) | ^4.73.0 | OpenAI-compatible chat completions SDK |
| [dotenv](https://www.npmjs.com/package/dotenv) | ^16.4.5 | Load `.env` configuration |

---

## License

MIT