Initial commit: jibo-llm hotword-triggered agent

Hotword-triggered LLM conversation system for Jibo with a tool-calling
agent loop, ESML expressive speech, web search/fetch, and
per-conversation abort handling.
commit 8955f21ab4 by pasketti, 2026-04-26 00:05:39 -04:00
8 changed files with 2039 additions and 0 deletions

README.md (new file, 291 lines)

# jibo-llm
> **Give Jibo a brain again.** A hotword-triggered, LLM-powered conversational agent that turns Jibo into an expressive, tool-using social robot — complete with speech, vision, web search, animations, and more.
![Node.js](https://img.shields.io/badge/Node.js-18%2B-339933?logo=node.js&logoColor=white)
![License](https://img.shields.io/badge/license-MIT-blue)
---
## Overview
**jibo-llm** connects a Jibo robot to any OpenAI-compatible LLM (GPT-4o, Claude, local models via Ollama/LM Studio, etc.) through a real-time agent loop. When someone says **"Hey Jibo"**, the system:
1. **Listens** for the user's speech via Jibo's on-board microphone.
2. **Sends** the transcript to an LLM along with a rich system prompt and tool definitions.
3. **Executes** tool calls the LLM makes — speaking, animating, taking photos, searching the web, and more.
4. **Loops** until the conversation naturally ends or the user triggers a new hotword.
Conversations are fully interruptible: saying "Hey Jibo" mid-conversation aborts the current exchange and starts a fresh one via `AbortController`.
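In sketch form, that interruption wiring looks roughly like this (names such as `runConversation` and `agentTurn` are illustrative placeholders, not the actual `index.js` API):
```js
// Sketch of the per-conversation abort pattern (illustrative names).
let current = null; // AbortController for the conversation in flight

function onHotword() {
  current?.abort(); // a new "Hey Jibo" cancels the running exchange
  current = new AbortController();
  runConversation(current.signal).catch((err) => {
    if (err.name !== "AbortError") console.error(err); // aborts are expected
  });
}

async function runConversation(signal) {
  while (!signal.aborted) {
    // The signal is threaded through every LLM request and tool call,
    // so aborting unwinds the whole exchange as an AbortError.
    await agentTurn(signal); // stand-in for one turn of the agent loop
  }
}
```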
---
## Architecture
```
┌──────────────┐   hotword    ┌──────────────┐   tool calls   ┌───────────────┐
│  Jibo Robot  │ ───────────▶ │   index.js   │ ◀────────────▶ │  LLM (OpenAI  │
│  (rom-ctrl)  │ ◀─────────── │  Agent Loop  │                │  compatible)  │
│              │  say/listen  │              │                └───────────────┘
│  • mic       │  photo/look  │   tools.js   │   web search   ┌───────────────┐
│  • speaker   │   display    │  (executor)  │ ─────────────▶ │  Brave Search │
│  • camera    │              │              │                └───────────────┘
│  • screen    │              │  esml-ref.js │
│  • motors    │              │  (prompt ref)│
└──────────────┘              └──────────────┘
```
| File | Purpose |
|------|---------|
| `index.js` | Entry point — connects to Jibo, listens for hotword, runs the agent loop with the LLM. |
| `tools.js` | Defines all tool schemas (OpenAI function-calling format) and the `executeTool()` dispatcher. |
| `esml-reference.js` | ESML (Embodied Speech Markup Language) cheat sheet injected into the system prompt so the LLM knows how to animate Jibo expressively. |
---
## Features
- 🗣️ **Natural conversation** — multi-turn dialogue with speech recognition and TTS.
- 🎭 **Expressive animations** — the LLM uses ESML tags to trigger emotions, dances, emojis, and sound effects inline with speech.
- 📷 **Vision** — Jibo can take photos and the LLM receives the image for visual understanding.
- 🔍 **Web search** — real-time Brave Search integration for up-to-date answers.
- 🌐 **URL fetching** — reads web pages (with Cloudflare Markdown for Agents support) so Jibo can summarize articles.
- 🖥️ **Display control** — show text, images, or restore the default eye on Jibo's screen.
- 🤖 **Head movement** — point Jibo's head at specific angles (yaw / pitch).
- 🔊 **Volume control** — adjust speaker volume on the fly.
- **Interruptible** — a new hotword instantly aborts the running conversation via `AbortController`.
- 🔄 **Retry logic** — automatic retry with exponential backoff for transient LLM errors (429, 5xx, network); see the sketch after this list.
- 🧹 **Context management** — old photos are pruned from context to control token cost.
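
A minimal sketch of the retry behavior (the attempt count, delays, and retryable-status check are assumptions, not the exact `index.js` values):
```js
// Sketch: retry transient LLM errors with exponential backoff.
async function withRetry(fn, attempts = 4, baseMs = 500) {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      // openai SDK errors expose `status`; pure network errors have none.
      const status = err.status ?? 0;
      const transient = status === 429 || status >= 500 || status === 0;
      if (!transient || i >= attempts - 1) throw err;
      await new Promise((r) => setTimeout(r, baseMs * 2 ** i)); // 0.5s, 1s, 2s, …
    }
  }
}

// e.g. const completion = await withRetry(() => client.chat.completions.create(req));
```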
---
## Prerequisites
- **Node.js** ≥ 18 (for native `fetch` and `AbortController`)
- **A Jibo robot** running with int-developer mode enabled
- **An OpenAI-compatible API endpoint** (OpenAI, Anthropic via proxy, Ollama, LM Studio, etc.)
- *(Optional)* **Brave Search API key** for the `web_search` tool
---
## Quick Start
### 1. Clone & install
```bash
git clone https://github.com/niceduckdev/jibo-llm.git
cd jibo-llm
npm install
```
### 2. Configure environment
```bash
cp .env.example .env
```
Edit `.env` with your values:
```env
# Jibo robot IP address on your local network
JIBO_IP=192.168.1.217
# LLM API configuration (any OpenAI-compatible endpoint)
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_TOKEN=sk-your-api-key-here
LLM_MODEL_ID=gpt-4o
# Optional: enables the web_search tool
BRAVE_API_KEY=your-brave-api-key
```
### 3. Run
```bash
npm start
# or: node index.js
```
You'll see:
```
[jibo-llm] Connecting to Jibo at 192.168.1.217…
[jibo-llm] Connected — session abc123
[jibo-llm] Ready — listening for "Hey Jibo"…
```
Say **"Hey Jibo"** and start talking!
---
## Configuration
All configuration is done via environment variables (loaded from `.env` by [dotenv](https://www.npmjs.com/package/dotenv)):
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `JIBO_IP` | No | `192.168.1.217` | Jibo's IP address on your LAN |
| `LLM_BASE_URL` | No | `https://api.openai.com/v1` | Base URL for the chat completions API |
| `LLM_API_TOKEN` | **Yes** | — | API key for the LLM provider |
| `LLM_MODEL_ID` | No | `gpt-4o` | Model identifier to use |
| `BRAVE_API_KEY` | No | — | Brave Search API key (enables `web_search` tool) |
### Using alternative LLM providers
Since jibo-llm uses the OpenAI SDK, any provider with a compatible chat completions endpoint works:
```env
# Ollama (local)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_TOKEN=ollama
LLM_MODEL_ID=llama3
# LM Studio (local)
LLM_BASE_URL=http://localhost:1234/v1
LLM_API_TOKEN=lm-studio
LLM_MODEL_ID=local-model
# OpenRouter
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_TOKEN=sk-or-...
LLM_MODEL_ID=anthropic/claude-sonnet-4
```
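In code, swapping providers comes down to the client constructor. A sketch using the openai v4 SDK (assumes ESM so top-level `await` works; variable names are illustrative):
```js
// Sketch: one client works for any OpenAI-compatible endpoint.
import "dotenv/config";
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: process.env.LLM_BASE_URL ?? "https://api.openai.com/v1",
  apiKey: process.env.LLM_API_TOKEN,
});

const completion = await client.chat.completions.create({
  model: process.env.LLM_MODEL_ID ?? "gpt-4o",
  messages: [{ role: "user", content: "Say hello as Jibo would." }],
});
console.log(completion.choices[0].message.content);
```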
---
## Available Tools
The LLM can call any of these tools during a conversation:
### Communication
| Tool | Description |
|------|-------------|
| `say` | Speak ESML-formatted text through Jibo's speaker. Queued and chained so multiple `say` calls play in order. |
| `listen` | Open the microphone and transcribe user speech. Waits for pending speech to finish first. |
| `end_conversation` | Gracefully end the conversation (no further listening). |
### Camera
| Tool | Description |
|------|-------------|
| `take_photo` | Capture a photo from Jibo's camera. The image is sent to the LLM as a base64 JPEG for visual understanding. |
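
Since chat-completions `tool` messages are text-only, the photo presumably travels back as an image part on a follow-up message. A sketch of the standard OpenAI image-input shape (the exact message layout in `index.js` is an assumption):
```js
// Sketch: returning a captured photo to the LLM for visual understanding.
messages.push(
  { role: "tool", tool_call_id: call.id, content: "Photo captured." },
  {
    role: "user",
    content: [
      { type: "text", text: "Here is the photo Jibo just took:" },
      {
        type: "image_url",
        image_url: { url: `data:image/jpeg;base64,${photoBase64}` },
      },
    ],
  }
);
```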
### Display
| Tool | Description |
|------|-------------|
| `show_text` | Display word-wrapped text on Jibo's screen. |
| `show_image` | Display an image from a URL on Jibo's screen. |
| `show_eye` | Restore the default eye animation. |
### Movement
| Tool | Description |
|------|-------------|
| `look_at_angle` | Turn Jibo's head — `theta` (yaw ±180°) and `psi` (pitch ±30°). |
### Audio
| Tool | Description |
|------|-------------|
| `set_volume` | Set speaker volume from 0.0 to 1.0. |
### Web
| Tool | Description |
|------|-------------|
| `web_search` | Search the web via Brave Search API. Supports result count and freshness filters. |
| `fetch_url` | Fetch and read a web page. Prefers markdown via Cloudflare content negotiation, falls back to HTML→text conversion. |
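
As a concrete example, the `say` entry in `tools.js` plausibly looks like this (OpenAI function-calling format, per the file table above; the description text is an assumption):
```js
// Sketch: one tool schema in OpenAI function-calling format.
const sayTool = {
  type: "function",
  function: {
    name: "say",
    description:
      "Speak ESML-formatted text through Jibo's speaker. " +
      "Multiple calls are queued and play in order.",
    parameters: {
      type: "object",
      properties: {
        text: {
          type: "string",
          description: "What to say, optionally with inline ESML tags.",
        },
      },
      required: ["text"],
    },
  },
};
```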
---
## ESML (Embodied Speech Markup Language)
ESML is how Jibo speaks expressively. The system prompt includes a full reference (`esml-reference.js`) that teaches the LLM to use these tags inside `say` calls:
```xml
<!-- Emotional reaction (most common pattern) -->
<anim cat='happy' nonBlocking='true' endNeutral='true'/> That's great news!
<!-- Voice sound (laugh, sigh, greeting) -->
<ssa cat='laughing' nonBlocking='true'/> That's hilarious!
<!-- Sound effect -->
<sfx cat='drumroll'/> And the answer is...
<!-- Dance (always needs a filter) -->
<anim cat='dance' filter='music, rom-silly'/> Watch this!
<!-- Emoji on screen -->
<anim cat='emoji' filter='!(hf), &(heart)' nonBlocking='true'/> I love that!
<!-- Dramatic pause -->
And then... <break size='1.0'/> nothing happened.
```
A `sanitizeForTTS()` function in `tools.js` provides defense-in-depth by stripping markdown, LaTeX, and invalid tags before they reach Jibo's TTS engine.
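In outline, such a sanitizer might look like this (illustrative, deliberately crude regexes, not the actual `tools.js` implementation):
```js
// Sketch: strip markup that would confuse Jibo's TTS engine.
const ALLOWED_TAGS = ["anim", "ssa", "sfx", "break"]; // tags from the ESML reference

function sanitizeForTTS(text) {
  return text
    .replace(/`{3}[\s\S]*?`{3}/g, "")  // fenced code blocks
    .replace(/\$\$?[^$]+\$\$?/g, "")   // inline/display LaTeX (crude)
    .replace(/<\/?([a-zA-Z]+)[^>]*>/g, // drop any tag that isn't valid ESML
      (match, tag) => (ALLOWED_TAGS.includes(tag.toLowerCase()) ? match : ""))
    .replace(/[*_#`]/g, "")            // leftover markdown punctuation
    .replace(/\s{2,}/g, " ")
    .trim();
}
```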
---
## How the Agent Loop Works
```
User says "Hey Jibo" ──▶ hotword event fires
Play acknowledgment animation
Listen for initial speech (15s timeout)
Build message history [system prompt, user text]
┌─── Agent Loop (max 25 turns) ◀──┐
│ │
│ 1. Prune old images from context │
│ 2. Call LLM │
│ 3. If no tool calls → done │
│ 4. Sort tools: say → actions → listen │
│ 5. Execute each tool │
│ 6. Push results to messages │
│ 7. If end_conversation → done │
│ 8. Loop ─────────────────────────┘
Conversation complete
Resume hotword listening
```
Key behaviors:
- **Speech chaining**: Multiple `say` calls are queued via a promise chain so they play sequentially without overlap.
- **Tool ordering**: `say` executes first, then actions (photo, search, etc.), with `listen`/`end_conversation` last; see the sketch after this list.
- **Graceful limits**: At turn 24 of 25, a system message nudges the LLM to wrap up.
- **Image pruning**: Only the 2 most recent photos are kept in context to manage token usage.
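
The ordering step can be as small as a stable sort by rank (a sketch; the rank values are assumptions):
```js
// Sketch: run say first, listen/end_conversation last, everything else between.
const RANK = { say: 0, listen: 2, end_conversation: 2 }; // others default to 1

function orderToolCalls(toolCalls) {
  // Array.prototype.sort is stable, so calls with equal rank keep their order.
  return [...toolCalls].sort(
    (a, b) => (RANK[a.function.name] ?? 1) - (RANK[b.function.name] ?? 1)
  );
}
```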
---
## Project Structure
```
jibo-llm/
├── .env.example # Template for environment variables
├── .env # Your local config (git-ignored)
├── index.js # Entry point: connection, hotword handling, agent loop
├── tools.js # Tool schemas + executeTool() dispatcher
├── esml-reference.js # ESML documentation injected into the system prompt
├── package.json # Dependencies and scripts
└── node_modules/ # Installed dependencies
```
---
## Dependencies
| Package | Version | Purpose |
|---------|---------|---------|
| [rom-control](https://github.com/niceduckdev/rom-control) | ^2.0.1 | Jibo robot control client (speech, camera, display, motors) |
| [openai](https://www.npmjs.com/package/openai) | ^4.73.0 | OpenAI-compatible chat completions SDK |
| [dotenv](https://www.npmjs.com/package/dotenv) | ^16.4.5 | Load `.env` configuration |
---
## License
MIT