Local ASR, TTS, and Voice Round-Trip on Jibo (Post-Cloud)
This document describes the first confirmed working voice interaction on a Jibo robot after official cloud services were discontinued.
Summary
Short version: Jibo can still have a full conversation loop locally.
We now have:
- Speech → text (STT) working locally
- Text → speech (TTS) working locally
- A working loop where Jibo hears something and responds
This is all happening without the original cloud services.
Key Findings
Here’s what we now know for sure:
- Wake word detection (hey jibo) still works locally
- Speaker ID is still running locally (even if it rejects us 😄)
- jibo-asr-service can be started and controlled manually
- ASR (speech recognition) is exposed over HTTP on port 8088
- TTS (speech output) is exposed over HTTP on port 8089
ASR Endpoints
Confirmed working endpoints:
- /asr_simple_interface
- /audio_source
- /asr_control
- /status
WebSocket Outputs
ASR results are streamed over WebSockets:
- ws://<jibo-ip>:8088/port
- ws://<jibo-ip>:8088/simple_port
Example STT Start Payload
{
  "command": "start",
  "task_id": "DEBUG:task3",
  "audio_source_id": "alsa1",
  "hotphrase": "none",
  "speech_to_text": true,
  "request_id": "stt_start3"
}
What’s Actually Happening (Architecture)
Here’s the real flow in plain English:
- We send a request to Jibo to start listening
- Jibo captures audio from its mic (ALSA)
- The ASR engine processes it
- Results come back over WebSocket
- Our app reads the transcript
- Our app decides what to say
- We send that to Jibo’s TTS
- Jibo speaks
Visual version:
HTTP POST (/asr_simple_interface)
↓
ASR service captures audio
↓
Speech recognition runs locally
↓
WebSocket emits events
↓
External app receives transcript
↓
External logic decides response
↓
HTTP POST (/tts_speak)
↓
Jibo talks
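The external half of that flow (read transcript, decide, speak) can be sketched as one small function. This is a rough sketch, not the code used in the test: `handle_turn`, `decide`, and the `speak` callback are hypothetical names, the events are assumed to arrive as already-decoded dicts, and the first utterance in a final event is simply taken as the transcript.

```python
from typing import Callable, Iterable, Optional


def handle_turn(
    events: Iterable[dict],
    decide: Callable[[str], str],
    speak: Callable[[str], None],
) -> Optional[str]:
    """Consume ASR events until the first final transcript, then speak a reply.

    Returns the transcript that was answered, or None if no final arrived.
    """
    for event in events:
        if event.get("event_type") != "speech_to_text_final":
            continue  # skip incremental results, end_of_speech, hotphrase
        utterances = event.get("utterances") or []
        if not utterances:
            continue
        # Assumption: the first listed utterance is good enough for a demo.
        transcript = utterances[0]["utterance"]
        speak(decide(transcript))
        return transcript
    return None


def decide(transcript: str) -> str:
    """Placeholder "brain"; a real bridge might call a rules engine or an LLM."""
    if "time" in transcript:
        return "It is demo time."
    return "Sorry, I did not catch that."
```

In a real bridge, `events` would be fed from the WebSocket and `speak` would POST to /tts_speak; keeping both as parameters makes the logic easy to try without a robot on the network.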
Example WebSocket Output
Here’s a trimmed real example of a final result:
{
  "event_type": "speech_to_text_final",
  "task_id": "DEBUG:task3",
  "utterances": [
    {
      "utterance": "what time is it",
      "score": 975.9
    }
  ]
}
You’ll also see:
- speech_to_text_incremental (partial results)
- end_of_speech
- hotphrase (for “hey jibo”)
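A minimal sketch of pulling a transcript out of one of these messages, assuming each WebSocket text frame is a single JSON object shaped like the example above. `parse_event` and `best_utterance` are hypothetical helper names, and treating a higher `score` as a better guess is an assumption that has not been verified against the service.

```python
import json
from typing import Optional


def parse_event(raw: str) -> dict:
    """Decode one WebSocket text frame into a dict."""
    return json.loads(raw)


def best_utterance(event: dict) -> Optional[str]:
    """Return the top-scoring transcript of a final event, else None."""
    if event.get("event_type") != "speech_to_text_final":
        return None
    utterances = event.get("utterances") or []
    if not utterances:
        return None
    # Assumption: a higher score means a more confident guess.
    top = max(utterances, key=lambda u: u.get("score", float("-inf")))
    return top.get("utterance")
```

Anything that is not a final result (incremental results, end_of_speech, hotphrase) returns None here, so the caller can keep reading frames until a transcript shows up.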
Demo Flow (How to Reproduce)
This is the important part.
1. Make sure you are in int-developer mode and the ASR service is running
Over SSH:
/usr/local/bin/jibo-asr-service -c /usr/local/etc/jibo-asr-service.json
2. Connect to WebSocket
ws://<jibo-ip>:8088/simple_port
3. Start an STT task
POST to:
http://<jibo-ip>:8088/asr_simple_interface
With:
{
  "command": "start",
  "task_id": "DEBUG:task3",
  "audio_source_id": "alsa1",
  "hotphrase": "none",
  "speech_to_text": true
}
4. Speak to Jibo
Say something like:
“what time is it”
5. Wait for final transcript
Watch for:
event_type: speech_to_text_final
6. Send response to TTS
POST to:
http://<jibo-ip>:8089/tts_speak
With something like:
{
  "text": "It is demo time."
}
7. Jibo speaks 🎉
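Steps 3 and 6 can be scripted with nothing but the Python standard library. The sketch below only builds and sends the two POST requests shown above; the helper names are hypothetical, and the `Content-Type: application/json` header is an assumption (the service may not require it). Step 2, the WebSocket connection, is left out because it needs a third-party client library.

```python
import json
import urllib.request

ASR_PORT = 8088
TTS_PORT = 8089


def json_post(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},  # assumption
    )


def start_stt_request(host: str, task_id: str = "DEBUG:task3") -> urllib.request.Request:
    """Step 3: ask the ASR service to start a speech-to-text task."""
    payload = {
        "command": "start",
        "task_id": task_id,
        "audio_source_id": "alsa1",
        "hotphrase": "none",
        "speech_to_text": True,
    }
    return json_post(f"http://{host}:{ASR_PORT}/asr_simple_interface", payload)


def tts_request(host: str, text: str) -> urllib.request.Request:
    """Step 6: ask the TTS service to speak the given text."""
    return json_post(f"http://{host}:{TTS_PORT}/tts_speak", {"text": text})


def send(request: urllib.request.Request, timeout: float = 5.0) -> bytes:
    """Send a prepared request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the robot reachable, usage would look like `send(start_stt_request("<jibo-ip>"))` followed later by `send(tts_request("<jibo-ip>", "It is demo time."))`.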
Known Behaviors / Quirks
Some things we’ve seen so far:
- WebSocket connections can drop → reconnect logic helps
- Incremental results can be messy or duplicated
- Multiple transcript guesses can show up
- Wake word (task0) runs alongside your custom task
- Saying “hey jibo” during a manual STT session can interfere
- Speaker ID often rejects (but doesn’t block STT)
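A client can work around a few of these quirks on its own side. The sketch below shows two hypothetical helpers: `backoff_delays` for spacing out WebSocket reconnect attempts, and `clean_events` for ignoring other tasks and repeated partial results. It assumes every event carries a `task_id` and that incremental events use the same `utterances` field as final ones; neither has been confirmed beyond the final-result example above.

```python
from typing import Iterable, Iterator


def backoff_delays(base: float = 0.5, cap: float = 10.0) -> Iterator[float]:
    """Yield reconnect delays in seconds: base, 2*base, 4*base, ... up to cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


def clean_events(events: Iterable[dict], task_id: str) -> Iterator[dict]:
    """Keep only events for our task and drop repeated incremental results."""
    last_partial = object()  # sentinel that never equals a real payload
    for event in events:
        if event.get("task_id") != task_id:
            continue  # e.g. the wake-word task running alongside ours
        if event.get("event_type") == "speech_to_text_incremental":
            partial = event.get("utterances")
            if partial == last_partial:
                continue  # same partial as last time, skip it
            last_partial = partial
        yield event
```

A reconnect loop would sleep for the next value from `backoff_delays()` after each dropped connection and start a fresh generator once a connection succeeds.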
Corrections to Previous Assumptions
Some things we (and others) thought before that are now clearly wrong or incomplete:
- “ASR is dead without cloud” → Not true in developer mode
- “Only wake word works locally” → Incomplete
- “No way to get transcripts” → False (WebSocket output exists)
- “Jibo can’t answer questions anymore” → Also false now 🙂
What This Means
This is a big deal:
- Jibo’s core voice pipeline is still there
- The cloud was orchestration, not the whole system
- We can now rebuild the “brain” externally
Next Steps
Where this naturally goes next:
- Hook wake word → automatically trigger STT
- Figure out how this behaves in “normal mode”
- See if Jibo tries to initiate outbound connections (old cloud flow)
- Intercept or replace those endpoints locally
- Build a simple always-on bridge service:
  - Wake word → STT → AI → TTS
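One way the bridge idea could be structured is as a tiny state machine driven by the ASR events. Nothing below exists yet; it is a sketch of the control flow only. `State`, `step`, and the action names `start_stt` and `respond` are all hypothetical, and the only thing assumed about a hotphrase event is its `event_type`.

```python
from enum import Enum
from typing import List, Tuple


class State(Enum):
    IDLE = "idle"            # waiting for the wake word
    LISTENING = "listening"  # STT task started, waiting for a final transcript


Action = Tuple[str, str]  # (action name, argument)


def step(state: State, event: dict) -> Tuple[State, List[Action]]:
    """Advance the bridge by one event and return the actions to perform."""
    kind = event.get("event_type")
    if state is State.IDLE and kind == "hotphrase":
        # Wake word heard: the caller should POST a start command for STT.
        return State.LISTENING, [("start_stt", "")]
    if state is State.LISTENING and kind == "speech_to_text_final":
        utterances = event.get("utterances") or []
        if utterances:
            # The caller passes the transcript to the AI, then to TTS.
            return State.IDLE, [("respond", utterances[0]["utterance"])]
        return State.IDLE, []
    return state, []  # ignore everything else
```

Keeping the transitions free of network calls means the same function can be driven by recorded events for testing and by the live WebSocket once the bridge runs for real.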
Final Thought
We didn’t just poke at endpoints here.
We proved Jibo can:
- hear
- understand
- and respond again
That’s a pretty great place to be.