# jibo-llm

> **Give Jibo a brain again.** A hotword-triggered, LLM-powered conversational agent that turns Jibo into an expressive, tool-using social robot — complete with speech, vision, web search, animations, and more.

![Node.js](https://img.shields.io/badge/Node.js-18%2B-339933?logo=node.js&logoColor=white) ![License](https://img.shields.io/badge/license-MIT-blue)

---

## Overview

**jibo-llm** connects a Jibo robot to any OpenAI-compatible LLM (GPT-4o, Claude, local models via Ollama/LM Studio, etc.) through a real-time agent loop. When someone says **"Hey Jibo"**, the system:

1. **Listens** for the user's speech via Jibo's on-board microphone.
2. **Sends** the transcript to an LLM along with a rich system prompt and tool definitions.
3. **Executes** tool calls the LLM makes — speaking, animating, taking photos, searching the web, and more.
4. **Loops** until the conversation naturally ends or the user triggers a new hotword.

Conversations are fully interruptible: saying "Hey Jibo" mid-conversation aborts the current exchange and starts a fresh one via `AbortController`.

---

## Architecture

```
┌──────────────┐    hotword     ┌──────────────┐   tool calls   ┌───────────────┐
│  Jibo Robot  │ ─────────────▶ │   index.js   │ ◀────────────▶ │  LLM (OpenAI  │
│  (rom-ctrl)  │ ◀───────────── │  Agent Loop  │                │  compatible)  │
│              │   say/listen   │              │                └───────────────┘
│  • mic       │   photo/look   │   tools.js   │   web search   ┌───────────────┐
│  • speaker   │    display     │  (executor)  │ ─────────────▶ │ Brave Search  │
│  • camera    │                │              │                └───────────────┘
│  • screen    │                │ esml-ref.js  │
│  • motors    │                │ (prompt ref) │
└──────────────┘                └──────────────┘
```

| File | Purpose |
|------|---------|
| `index.js` | Entry point — connects to Jibo, listens for the hotword, runs the agent loop with the LLM. |
| `tools.js` | Defines all tool schemas (OpenAI function-calling format) and the `executeTool()` dispatcher. |
| `esml-reference.js` | ESML (Embodied Speech Markup Language) cheat sheet injected into the system prompt so the LLM knows how to animate Jibo expressively. |

---

## Features

- 🗣️ **Natural conversation** — multi-turn dialogue with speech recognition and TTS.
- 🎭 **Expressive animations** — the LLM uses ESML tags to trigger emotions, dances, emojis, and sound effects inline with speech.
- 📷 **Vision** — Jibo can take photos and the LLM receives the image for visual understanding.
- 🔍 **Web search** — real-time Brave Search integration for up-to-date answers.
- 🌐 **URL fetching** — reads web pages (with Cloudflare Markdown for Agents support) so Jibo can summarize articles.
- 🖥️ **Display control** — show text, images, or restore the default eye on Jibo's screen.
- 🤖 **Head movement** — point Jibo's head at specific angles (yaw / pitch).
- 🔊 **Volume control** — adjust speaker volume on the fly.
- ⚡ **Interruptible** — a new hotword instantly aborts a running conversation via `AbortController`.
- 🔄 **Retry logic** — automatic retry with exponential backoff for transient LLM errors (429, 5xx, network).
- 🧹 **Context management** — old photos are pruned from context to control token cost.

---

## Prerequisites

- **Node.js** ≥ 18 (for native `fetch` and `AbortController`)
- **A Jibo robot** with int-developer mode enabled
- **An OpenAI-compatible API endpoint** (OpenAI, Anthropic via proxy, Ollama, LM Studio, etc.)
- *(Optional)* **Brave Search API key** for the `web_search` tool

---

## Quick Start

### 1. Clone & install

```bash
git clone https://github.com/niceduckdev/jibo-llm.git
cd jibo-llm
npm install
```

### 2. Configure environment

```bash
cp .env.example .env
```

Edit `.env` with your values:

```env
# Jibo robot IP address on your local network
JIBO_IP=192.168.1.217

# LLM API configuration (any OpenAI-compatible endpoint)
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_TOKEN=sk-your-api-key-here
LLM_MODEL_ID=gpt-4o

# Optional: enables the web_search tool
BRAVE_API_KEY=your-brave-api-key
```

### 3. Run

```bash
npm start
# or: node index.js
```

You'll see:

```
[jibo-llm] Connecting to Jibo at 192.168.1.217…
[jibo-llm] Connected — session abc123
[jibo-llm] Ready — listening for "Hey Jibo"…
```

Say **"Hey Jibo"** and start talking!

---

## Configuration

All configuration is done via environment variables (loaded from `.env` by [dotenv](https://www.npmjs.com/package/dotenv)):

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `JIBO_IP` | No | `192.168.1.217` | Jibo's IP address on your LAN |
| `LLM_BASE_URL` | No | `https://api.openai.com/v1` | Base URL for the chat completions API |
| `LLM_API_TOKEN` | **Yes** | — | API key for the LLM provider |
| `LLM_MODEL_ID` | No | `gpt-4o` | Model identifier to use |
| `BRAVE_API_KEY` | No | — | Brave Search API key (enables the `web_search` tool) |

### Using alternative LLM providers

Since jibo-llm uses the OpenAI SDK, any provider with a compatible chat completions endpoint works:

```env
# Ollama (local)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_TOKEN=ollama
LLM_MODEL_ID=llama3

# LM Studio (local)
LLM_BASE_URL=http://localhost:1234/v1
LLM_API_TOKEN=lm-studio
LLM_MODEL_ID=local-model

# OpenRouter
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_TOKEN=sk-or-...
LLM_MODEL_ID=anthropic/claude-sonnet-4
```

---

## Available Tools

The LLM can call any of these tools during a conversation:

### Communication

| Tool | Description |
|------|-------------|
| `say` | Speak ESML-formatted text through Jibo's speaker. Queued and chained so multiple `say` calls play in order. |
| `listen` | Open the microphone and transcribe user speech. Waits for pending speech to finish first. |
| `end_conversation` | Gracefully end the conversation (no further listening). |

### Camera

| Tool | Description |
|------|-------------|
| `take_photo` | Capture a photo from Jibo's camera. The image is sent to the LLM as a base64 JPEG for visual understanding. |

### Display

| Tool | Description |
|------|-------------|
| `show_text` | Display word-wrapped text on Jibo's screen. |
| `show_image` | Display an image from a URL on Jibo's screen. |
| `show_eye` | Restore the default eye animation. |

### Movement

| Tool | Description |
|------|-------------|
| `look_at_angle` | Turn Jibo's head — `theta` (yaw, ±180°) and `psi` (pitch, ±30°). |

### Audio

| Tool | Description |
|------|-------------|
| `set_volume` | Set speaker volume from 0.0 to 1.0. |

### Web

| Tool | Description |
|------|-------------|
| `web_search` | Search the web via the Brave Search API. Supports result count and freshness filters. |
| `fetch_url` | Fetch and read a web page. Prefers Markdown via Cloudflare content negotiation and falls back to HTML→text conversion. |
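To make the function-calling format concrete, a single entry in `tools.js` and the matching branch of `executeTool()` might look roughly like the sketch below. The parameter names, descriptions, and the `jibo.lookAt()` call are illustrative placeholders, not the project's actual schema or the rom-control API.

```js
// Illustrative sketch only; the real schemas and dispatcher live in tools.js.
// Tool definitions follow the OpenAI function-calling format.
const lookAtAngleTool = {
  type: "function",
  function: {
    name: "look_at_angle",
    description: "Turn Jibo's head to a given yaw/pitch angle.",
    parameters: {
      type: "object",
      properties: {
        theta: { type: "number", description: "Yaw in degrees (-180 to 180)" },
        psi: { type: "number", description: "Pitch in degrees (-30 to 30)" },
      },
      required: ["theta", "psi"],
    },
  },
};

// executeTool() dispatches on the tool name and returns a string that is
// pushed back to the LLM as a `tool` message.
async function executeTool(name, args) {
  switch (name) {
    case "look_at_angle":
      await jibo.lookAt(args.theta, args.psi); // placeholder for the rom-control call
      return `Moved head to theta=${args.theta}, psi=${args.psi}.`;
    default:
      return `Unknown tool: ${name}`;
  }
}
```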
---

## ESML (Embodied Speech Markup Language)

ESML is how Jibo speaks expressively. The system prompt includes a full reference (`esml-reference.js`) that teaches the LLM to use these tags inside `say` calls:

```xml
That's great news!
That's hilarious!
And the answer is...
Watch this!
I love that!
And then... nothing happened.
```

A `sanitizeForTTS()` function in `tools.js` provides defense-in-depth by stripping markdown, LaTeX, and invalid tags before they reach Jibo's TTS engine.

---

## How the Agent Loop Works

```
User says "Hey Jibo" ──▶ hotword event fires
            │
            ▼
Play acknowledgment animation
            │
            ▼
Listen for initial speech (15s timeout)
            │
            ▼
Build message history [system prompt, user text]
            │
            ▼
┌─── Agent Loop (max 25 turns) ◀──────────┐
│                                         │
│ 1. Prune old images from context        │
│ 2. Call LLM                             │
│ 3. If no tool calls → done              │
│ 4. Sort tools: say → actions → listen   │
│ 5. Execute each tool                    │
│ 6. Push results to messages             │
│ 7. If end_conversation → done           │
│ 8. Loop ────────────────────────────────┘
            │
            ▼
Conversation complete
Resume hotword listening
```

Key behaviors:

- **Speech chaining**: Multiple `say` calls are queued via a promise chain so they play sequentially without overlap.
- **Tool ordering**: `say` executes first, then actions (photo, search, etc.), then `listen`/`end_conversation` last.
- **Graceful limits**: At turn 24 of 25, a system message nudges the LLM to wrap up.
- **Image pruning**: Only the 2 most recent photos are kept in context to manage token usage.
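For orientation, a stripped-down version of this loop using the [openai](https://www.npmjs.com/package/openai) SDK might look like the sketch below. It assumes `tools.js` exports the schema array and `executeTool()`; image pruning, speech chaining, retries, and the turn-24 nudge are omitted or reduced to comments, so treat it as a reading aid rather than the actual `index.js`.

```js
// Simplified sketch of the agent loop, not the exact code in index.js.
import OpenAI from "openai";
import { tools, executeTool } from "./tools.js"; // assumed exports

const client = new OpenAI({
  baseURL: process.env.LLM_BASE_URL,
  apiKey: process.env.LLM_API_TOKEN,
});

// say first, then actions, then listen / end_conversation last
const rank = (name) =>
  name === "say" ? 0 : name === "listen" || name === "end_conversation" ? 2 : 1;

async function runConversation(messages, signal) {
  for (let turn = 0; turn < 25; turn++) {
    // 1. (real loop) prune old photos from `messages` to control token cost

    // 2. Call the LLM; a new "Hey Jibo" aborts the in-flight request via `signal`
    const response = await client.chat.completions.create(
      { model: process.env.LLM_MODEL_ID, messages, tools },
      { signal }
    );

    const message = response.choices[0].message;
    messages.push(message);

    // 3. No tool calls means the conversation is done
    const calls = message.tool_calls ?? [];
    if (calls.length === 0) return;

    // 4–7. Sort, execute, and feed results back as `tool` messages
    calls.sort((a, b) => rank(a.function.name) - rank(b.function.name));
    for (const call of calls) {
      const args = JSON.parse(call.function.arguments);
      const result = await executeTool(call.function.name, args);
      messages.push({ role: "tool", tool_call_id: call.id, content: result });
      if (call.function.name === "end_conversation") return;
    }
  }
}
```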
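The retry logic listed under Features wraps the LLM call in exponential backoff. A generic sketch of that idea, with illustrative attempt counts and delays rather than the project's exact values:

```js
// Generic retry-with-exponential-backoff wrapper for transient LLM errors.
async function withRetry(fn, { attempts = 4, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err?.name === "AbortError") throw err; // an interrupted conversation is not retried
      const status = err?.status ?? 0; // openai SDK errors carry an HTTP status
      const transient = status === 429 || status >= 500 || status === 0; // rate limit, 5xx, network
      if (!transient || attempt >= attempts - 1) throw err;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt)); // 0.5s, 1s, 2s, …
    }
  }
}

// Usage: const response = await withRetry(() => client.chat.completions.create(body, options));
```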
---

## Project Structure

```
jibo-llm/
├── .env.example        # Template for environment variables
├── .env                # Your local config (git-ignored)
├── index.js            # Entry point: connection, hotword handling, agent loop
├── tools.js            # Tool schemas + executeTool() dispatcher
├── esml-reference.js   # ESML documentation injected into the system prompt
├── package.json        # Dependencies and scripts
└── node_modules/       # Installed dependencies
```

---

## Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| [rom-control](https://github.com/niceduckdev/rom-control) | ^2.0.1 | Jibo robot control client (speech, camera, display, motors) |
| [openai](https://www.npmjs.com/package/openai) | ^4.73.0 | OpenAI-compatible chat completions SDK |
| [dotenv](https://www.npmjs.com/package/dotenv) | ^16.4.5 | Load `.env` configuration |

---

## License

MIT