# jibo-llm

> **Give Jibo a brain again.** A hotword-triggered, LLM-powered conversational agent that turns Jibo into an expressive, tool-using social robot — complete with speech, vision, web search, animations, and more.

---
## Overview

**jibo-llm** connects a Jibo robot to any OpenAI-compatible LLM (GPT-4o, Claude, local models via Ollama/LM Studio, etc.) through a real-time agent loop. When someone says **"Hey Jibo"**, the system:

1. **Listens** for the user's speech via Jibo's on-board microphone.
2. **Sends** the transcript to an LLM along with a rich system prompt and tool definitions.
3. **Executes** tool calls the LLM makes — speaking, animating, taking photos, searching the web, and more.
4. **Loops** until the conversation naturally ends or the user triggers a new hotword.

Conversations are fully interruptible: saying "Hey Jibo" mid-conversation aborts the current exchange and starts a fresh one via `AbortController`.
---

## Architecture

```
┌──────────────┐    hotword     ┌──────────────┐   tool calls   ┌───────────────┐
│  Jibo Robot  │ ──────────▶    │   index.js   │ ◀───────────▶  │  LLM (OpenAI  │
│  (rom-ctrl)  │ ◀──────────    │  Agent Loop  │                │  compatible)  │
│              │  say/listen    │              │                └───────────────┘
│  • mic       │  photo/look    │   tools.js   │   web search   ┌───────────────┐
│  • speaker   │  display       │  (executor)  │ ─────────────▶ │  Brave Search │
│  • camera    │                │              │                └───────────────┘
│  • screen    │                │ esml-ref.js  │
│  • motors    │                │ (prompt ref) │
└──────────────┘                └──────────────┘
```

| File | Purpose |
|------|---------|
| `index.js` | Entry point — connects to Jibo, listens for hotword, runs the agent loop with the LLM. |
| `tools.js` | Defines all tool schemas (OpenAI function-calling format) and the `executeTool()` dispatcher. |
| `esml-reference.js` | ESML (Embodied Speech Markup Language) cheat sheet injected into the system prompt so the LLM knows how to animate Jibo expressively. |
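To make the tools.js entry concrete, here is roughly what one schema and the dispatcher shape look like in OpenAI function-calling format — a sketch only; the field text and handler map are illustrative, not the real definitions:

```javascript
// Sketch of one tool schema in OpenAI function-calling format.
// Description text is illustrative, not copied from tools.js.
const sayTool = {
  type: 'function',
  function: {
    name: 'say',
    description: "Speak ESML-formatted text through Jibo's speaker.",
    parameters: {
      type: 'object',
      properties: {
        text: { type: 'string', description: 'ESML-formatted text to speak' },
      },
      required: ['text'],
    },
  },
};

// Dispatcher shape: map a tool name to a handler and return a string
// result the LLM reads on its next turn.
async function executeTool(name, args, handlers) {
  const handler = handlers[name];
  if (!handler) return `Unknown tool: ${name}`;
  return handler(args);
}
```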
---

## Features

- 🗣️ **Natural conversation** — multi-turn dialogue with speech recognition and TTS.
- 🎭 **Expressive animations** — the LLM uses ESML tags to trigger emotions, dances, emojis, and sound effects inline with speech.
- 📷 **Vision** — Jibo can take photos and the LLM receives the image for visual understanding.
- 🔍 **Web search** — real-time Brave Search integration for up-to-date answers.
- 🌐 **URL fetching** — reads web pages (with Cloudflare Markdown for Agents support) so Jibo can summarize articles.
- 🖥️ **Display control** — show text, images, or restore the default eye on Jibo's screen.
- 🤖 **Head movement** — point Jibo's head at specific angles (yaw / pitch).
- 🔊 **Volume control** — adjust speaker volume on the fly.
- ⚡ **Interruptible** — new hotword instantly aborts a running conversation via `AbortController`.
- 🔄 **Retry logic** — automatic retry with exponential backoff for transient LLM errors (429, 5xx, network).
- 🧹 **Context management** — old photos are pruned from context to control token cost.
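The retry behavior can be sketched as a small backoff helper — illustrative only, not the actual index.js code; the retryable set and base delay are assumptions:

```javascript
// Sketch of retry-with-exponential-backoff for transient LLM errors.
// A missing status (plain network error) is treated as retryable.
const isRetryable = (status) =>
  status === 429 || (status >= 500 && status < 600) || status === undefined;

// Delay doubles each attempt: 500ms, 1s, 2s, 4s, ...
const backoffDelay = (attempt, baseMs = 500) => baseMs * 2 ** attempt;

async function withRetry(fn, maxAttempts = 4) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on the last attempt or on non-transient errors (e.g. 400).
      if (attempt + 1 >= maxAttempts || !isRetryable(err.status)) throw err;
      await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
    }
  }
}
```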
---

## Prerequisites

- **Node.js** ≥ 18 (for native `fetch` and `AbortController`)
- **A Jibo robot** running with int-developer mode enabled
- **An OpenAI-compatible API endpoint** (OpenAI, Anthropic via proxy, Ollama, LM Studio, etc.)
- *(Optional)* **Brave Search API key** for the `web_search` tool

---
## Quick Start

### 1. Clone & install

```bash
git clone https://github.com/niceduckdev/jibo-llm.git
cd jibo-llm
npm install
```

### 2. Configure environment

```bash
cp .env.example .env
```

Edit `.env` with your values:

```env
# Jibo robot IP address on your local network
JIBO_IP=192.168.1.217

# LLM API configuration (any OpenAI-compatible endpoint)
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_TOKEN=sk-your-api-key-here
LLM_MODEL_ID=gpt-4o

# Optional: enables the web_search tool
BRAVE_API_KEY=your-brave-api-key
```

### 3. Run

```bash
npm start
# or: node index.js
```

You'll see:

```
[jibo-llm] Connecting to Jibo at 192.168.1.217…
[jibo-llm] Connected — session abc123
[jibo-llm] Ready — listening for "Hey Jibo"…
```

Say **"Hey Jibo"** and start talking!
---

## Configuration

All configuration is done via environment variables (loaded from `.env` by [dotenv](https://www.npmjs.com/package/dotenv)):

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `JIBO_IP` | No | `192.168.1.217` | Jibo's IP address on your LAN |
| `LLM_BASE_URL` | No | `https://api.openai.com/v1` | Base URL for the chat completions API |
| `LLM_API_TOKEN` | **Yes** | — | API key for the LLM provider |
| `LLM_MODEL_ID` | No | `gpt-4o` | Model identifier to use |
| `BRAVE_API_KEY` | No | — | Brave Search API key (enables `web_search` tool) |

### Using alternative LLM providers

Since jibo-llm uses the OpenAI SDK, any provider with a compatible chat completions endpoint works:
```env
# Ollama (local)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_TOKEN=ollama
LLM_MODEL_ID=llama3

# LM Studio (local)
LLM_BASE_URL=http://localhost:1234/v1
LLM_API_TOKEN=lm-studio
LLM_MODEL_ID=local-model

# OpenRouter
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_TOKEN=sk-or-...
LLM_MODEL_ID=anthropic/claude-sonnet-4
```
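Swapping providers is only a matter of which values the client is constructed with. A sketch of the config resolution — the `llmConfig` helper is hypothetical, but the defaults match the configuration table above:

```javascript
// Hypothetical helper: resolve LLM settings from environment variables,
// falling back to the documented defaults.
function llmConfig(env) {
  return {
    baseURL: env.LLM_BASE_URL ?? 'https://api.openai.com/v1',
    apiKey: env.LLM_API_TOKEN,
    model: env.LLM_MODEL_ID ?? 'gpt-4o',
  };
}

// With the OpenAI SDK this plugs in as (not executed here):
//   const client = new OpenAI({ baseURL: cfg.baseURL, apiKey: cfg.apiKey });
//   const res = await client.chat.completions.create({ model: cfg.model, messages });
```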
---

## Available Tools

The LLM can call any of these tools during a conversation:

### Communication

| Tool | Description |
|------|-------------|
| `say` | Speak ESML-formatted text through Jibo's speaker. Queued and chained so multiple `say` calls play in order. |
| `listen` | Open the microphone and transcribe user speech. Waits for pending speech to finish first. |
| `end_conversation` | Gracefully end the conversation (no further listening). |

### Camera

| Tool | Description |
|------|-------------|
| `take_photo` | Capture a photo from Jibo's camera. The image is sent to the LLM as a base64 JPEG for visual understanding. |

### Display

| Tool | Description |
|------|-------------|
| `show_text` | Display word-wrapped text on Jibo's screen. |
| `show_image` | Display an image from a URL on Jibo's screen. |
| `show_eye` | Restore the default eye animation. |

### Movement

| Tool | Description |
|------|-------------|
| `look_at_angle` | Turn Jibo's head — `theta` (yaw ±180°) and `psi` (pitch ±30°). |

### Audio

| Tool | Description |
|------|-------------|
| `set_volume` | Set speaker volume from 0.0 to 1.0. |

### Web

| Tool | Description |
|------|-------------|
| `web_search` | Search the web via Brave Search API. Supports result count and freshness filters. |
| `fetch_url` | Fetch and read a web page. Prefers markdown via Cloudflare content negotiation, falls back to HTML→text conversion. |
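The `fetch_url` negotiation can be sketched like this — a hedged sketch only: the exact `Accept` header value and the fallback conversion are assumptions, not the real tools.js code:

```javascript
// Sketch of content negotiation: ask for markdown first, and decide from
// the response Content-Type whether an HTML→text fallback is needed.
function needsHtmlFallback(contentType) {
  return !(contentType ?? '').includes('text/markdown');
}

async function fetchReadable(url) {
  const res = await fetch(url, {
    // Assumed header: servers that support markdown negotiation
    // (e.g. Cloudflare Markdown for Agents) can honor this.
    headers: { Accept: 'text/markdown, text/html;q=0.8' },
  });
  const body = await res.text();
  return needsHtmlFallback(res.headers.get('content-type'))
    ? body.replace(/<[^>]+>/g, ' ') // crude HTML→text stand-in for the sketch
    : body; // server returned markdown directly
}
```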
---

## ESML (Embodied Speech Markup Language)

ESML is how Jibo speaks expressively. The system prompt includes a full reference (`esml-reference.js`) that teaches the LLM to use these tags inside `say` calls:

```xml
<!-- Emotional reaction (most common pattern) -->
<anim cat='happy' nonBlocking='true' endNeutral='true'/> That's great news!

<!-- Voice sound (laugh, sigh, greeting) -->
<ssa cat='laughing' nonBlocking='true'/> That's hilarious!

<!-- Sound effect -->
<sfx cat='drumroll'/> And the answer is...

<!-- Dance (always needs a filter) -->
<anim cat='dance' filter='music, rom-silly'/> Watch this!

<!-- Emoji on screen -->
<anim cat='emoji' filter='!(hf), &(heart)' nonBlocking='true'/> I love that!

<!-- Dramatic pause -->
And then... <break size='1.0'/> nothing happened.
```

A `sanitizeForTTS()` function in `tools.js` provides defense-in-depth by stripping markdown, LaTeX, and invalid tags before they reach Jibo's TTS engine.
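A stripped-down sketch of that idea — the tag whitelist and regexes here are illustrative; the real `sanitizeForTTS()` in tools.js is more thorough:

```javascript
// Simplified sketch of TTS sanitization: keep known ESML tags, drop any
// other angle-bracket tags, and strip stray markdown markers.
const ESML_TAGS = /<\/?(anim|ssa|sfx|break)\b[^>]*>/i;

function sanitizeForTTS(text) {
  return text
    .split(/(<[^>]+>)/) // isolate tag-like fragments
    .filter((part) => !part.startsWith('<') || ESML_TAGS.test(part))
    .join('')
    .replace(/[*_`#]/g, '') // strip leftover markdown emphasis/heading marks
    .replace(/\s{2,}/g, ' ')
    .trim();
}
```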
---

## How the Agent Loop Works

```
User says "Hey Jibo" ──▶ hotword event fires
          │
          ▼
Play acknowledgment animation
          │
          ▼
Listen for initial speech (15s timeout)
          │
          ▼
Build message history [system prompt, user text]
          │
          ▼
┌─── Agent Loop (max 25 turns) ◀─────────┐
│                                        │
│  1. Prune old images from context      │
│  2. Call LLM                           │
│  3. If no tool calls → done            │
│  4. Sort tools: say → actions → listen │
│  5. Execute each tool                  │
│  6. Push results to messages           │
│  7. If end_conversation → done         │
│  8. Loop ──────────────────────────────┘
          │
          ▼
Conversation complete
Resume hotword listening
```

Key behaviors:

- **Speech chaining**: Multiple `say` calls are queued via a promise chain so they play sequentially without overlap.
- **Tool ordering**: `say` executes first, then actions (photo, search, etc.), then `listen`/`end_conversation` last.
- **Graceful limits**: At turn 24 of 25, a system message nudges the LLM to wrap up.
- **Image pruning**: Only the 2 most recent photos are kept in context to manage token usage.
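The speech-chaining behavior can be sketched as a promise-chain queue — illustrative only; `speak` stands in for the real rom-control call:

```javascript
// Sketch of the say-queue idea: chain each utterance onto the previous
// promise so speech never overlaps, no matter how fast say() is called.
function makeSpeechQueue(speak) {
  let tail = Promise.resolve();
  return (text) => {
    tail = tail.then(() => speak(text));
    return tail; // resolves once THIS utterance has finished
  };
}
```

Each caller can await the returned promise to know when its own utterance is done, while later `say` calls quietly line up behind it.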
---

## Project Structure

```
jibo-llm/
├── .env.example        # Template for environment variables
├── .env                # Your local config (git-ignored)
├── index.js            # Entry point: connection, hotword handling, agent loop
├── tools.js            # Tool schemas + executeTool() dispatcher
├── esml-reference.js   # ESML documentation injected into the system prompt
├── package.json        # Dependencies and scripts
└── node_modules/       # Installed dependencies
```
---

## Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| [rom-control](https://github.com/niceduckdev/rom-control) | ^2.0.1 | Jibo robot control client (speech, camera, display, motors) |
| [openai](https://www.npmjs.com/package/openai) | ^4.73.0 | OpenAI-compatible chat completions SDK |
| [dotenv](https://www.npmjs.com/package/dotenv) | ^16.4.5 | Load `.env` configuration |

---

## License

MIT