Merge pull request 'Add STT/Voice Docs location and starter' (#3) from stt-docs into main

Reviewed-on: Kevin/JiboDocs#3 will add a link to the useful items list
2026-03-19 13:52:45 +00:00
parent 98c64f6efb 0420f4026b
commit f6aada604d
1 changed files with 264 additions and 0 deletions
--- a/Documentation/Voice/local-asr-tts-roundtrip.md
+++ b/Documentation/Voice/local-asr-tts-roundtrip.md
@@ -0,0 +1,264 @@
+# Local ASR, TTS, and Voice Round-Trip on Jibo (Post-Cloud)
+
+> This document describes the first confirmed working voice interaction on a Jibo robot after official cloud services were discontinued.
+
+---
+
+## Summary
+
+Short version: Jibo can still run a full conversation loop locally.
+
+We now have:
+
+* Speech → text (STT) working locally
+* Text → speech (TTS) working locally
+* A working loop where Jibo hears something and responds
+
+This is all happening without the original cloud services.
+
+---
+
+## Key Findings
+
+Here’s what we now know for sure:
+
+* Wake word detection (`hey jibo`) still works locally
+* Speaker ID is still running locally (even if it rejects us 😄)
+* `jibo-asr-service` can be started and controlled manually
+* ASR (speech recognition) is exposed over HTTP on port `8088`
+* TTS (speech output) is exposed over HTTP on port `8089`
+
+### ASR Endpoints
+
+Confirmed working endpoints:
+
+* `/asr_simple_interface`
+* `/audio_source`
+* `/asr_control`
+* `/status`
+
+### WebSocket Outputs
+
+ASR results are streamed over WebSockets:
+
+* `ws://<jibo-ip>:8088/port`
+* `ws://<jibo-ip>:8088/simple_port`
+
+### Example STT Start Payload
+
+```json
+{
+  "command": "start",
+  "task_id": "DEBUG:task3",
+  "audio_source_id": "alsa1",
+  "hotphrase": "none",
+  "speech_to_text": true,
+  "request_id": "stt_start3"
+}
+```
+
+---
+
+## What’s Actually Happening (Architecture)
+
+Here’s the real flow in plain English:
+
+1. We send a request to Jibo to start listening
+2. Jibo captures audio from its mic (ALSA)
+3. The ASR engine processes it
+4. Results come back over WebSocket
+5. Our app reads the transcript
+6. Our app decides what to say
+7. We send that to Jibo’s TTS
+8. Jibo speaks
+
+Visual version:
+
+```
+HTTP POST (/asr_simple_interface)
+        ↓
+ASR service captures audio
+        ↓
+Speech recognition runs locally
+        ↓
+WebSocket emits events
+        ↓
+External app receives transcript
+        ↓
+External logic decides response
+        ↓
+HTTP POST (/tts_speak)
+        ↓
+Jibo talks
+```
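+
+The "external logic decides response" step can be anything from a lookup table to an LLM call. As a minimal sketch only, here is a toy rule-based responder; the function name `decide_response` and its rules are made up for illustration and are not part of Jibo's software:
+
+```python
+from datetime import datetime
+
+
+def decide_response(transcript: str, now: datetime) -> str:
+    """Map a recognized transcript to a reply string (toy rules only)."""
+    text = transcript.strip().lower()
+    if "time" in text:
+        # Answer time questions using the clock value passed in by the caller.
+        return "It is " + now.strftime("%H:%M") + "."
+    if text in ("hello", "hi"):
+        return "Hello there."
+    # Fallback when no rule matches.
+    return "Sorry, I did not catch that."
+```
+
+Passing `now` in as an argument keeps the function easy to test; a real bridge would call it with `datetime.now()`.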
+
+---
+
+## Example WebSocket Output
+
+Here’s a trimmed real example of a final result:
+
+```json
+{
+  "event_type": "speech_to_text_final",
+  "task_id": "DEBUG:task3",
+  "utterances": [
+    {
+      "utterance": "what time is it",
+      "score": 975.9
+    }
+  ]
+}
+```
+
+You’ll also see:
+
+* `speech_to_text_incremental` (partial results)
+* `end_of_speech`
+* `hotphrase` (for "hey jibo")
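+
+To pull the transcript out of a final result like the one above, something along these lines should work. This is a sketch: `best_transcript` is a hypothetical helper name, and it assumes that a higher `score` means a better guess, which we have not confirmed.
+
+```python
+import json
+
+
+def best_transcript(message: str):
+    """Return the top utterance text from a final-result event, else None."""
+    event = json.loads(message)
+    if event.get("event_type") != "speech_to_text_final":
+        return None  # incremental, end_of_speech, hotphrase, etc.
+    utterances = event.get("utterances") or []
+    if not utterances:
+        return None
+    # Assumption: a higher score means a more confident guess.
+    best = max(utterances, key=lambda u: u.get("score", 0.0))
+    return best.get("utterance")
+```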
+
+---
+
+## Demo Flow (How to Reproduce)
+
+This is the important part.
+
+### 1. Make sure you are in `int-developer` mode and the ASR service is running
+
+From an SSH session on the robot:
+
+```
+/usr/local/bin/jibo-asr-service -c /usr/local/etc/jibo-asr-service.json
+```
+
+---
+
+### 2. Connect to WebSocket
+
+```
+ws://<jibo-ip>:8088/simple_port
+```
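+
+Since these connections can drop, a client will probably want to retry the connect step. Here is a rough sketch of retry with exponential backoff. The `connect` argument is a stand-in for whatever WebSocket client you use (it is not a Jibo API), and catching `OSError` is an assumption; your library may raise something else.
+
+```python
+import time
+
+
+def ws_url(jibo_ip: str, path: str = "simple_port") -> str:
+    """Build the ASR WebSocket URL for a given robot address."""
+    return f"ws://{jibo_ip}:8088/{path}"
+
+
+def connect_with_retry(connect, url, attempts=5, base_delay=0.5, sleep=time.sleep):
+    """Call connect(url) until it succeeds, doubling the delay after each failure."""
+    delay = base_delay
+    for attempt in range(1, attempts + 1):
+        try:
+            return connect(url)
+        except OSError:
+            # Out of attempts: let the caller see the last error.
+            if attempt == attempts:
+                raise
+            sleep(delay)
+            delay *= 2
+```
+
+The same helper can be reused when the socket drops mid-session: call it again and resume reading events.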
+
+---
+
+### 3. Start an STT task
+
+POST to:
+
+```
+http://<jibo-ip>:8088/asr_simple_interface
+```
+
+With:
+
+```json
+{
+  "command": "start",
+  "task_id": "DEBUG:task3",
+  "audio_source_id": "alsa1",
+  "hotphrase": "none",
+  "speech_to_text": true
+}
+```
+
+---
+
+### 4. Speak to Jibo
+
+Say something like:
+
+> “what time is it”
+
+---
+
+### 5. Wait for final transcript
+
+Watch for:
+
+```
+event_type: speech_to_text_final
+```
+
+---
+
+### 6. Send response to TTS
+
+POST to:
+
+```
+http://<jibo-ip>:8089/tts_speak
+```
+
+With something like:
+
+```json
+{
+  "text": "It is demo time."
+}
+```
+
+---
+
+### 7. Jibo speaks 🎉
+
+---
+
+## Known Behaviors / Quirks
+
+Some things we’ve seen so far:
+
+* WebSocket connections can drop → reconnect logic helps
+* Incremental results can be messy or duplicated
+* Multiple transcript guesses can show up for a single utterance
+* The wake word task (`task0`) keeps running alongside your custom task
+* Saying “hey jibo” during a manual STT session can interfere with it
+* Speaker ID often rejects the speaker (but that doesn’t block STT)
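+
+A small client-side filter can smooth over a couple of these quirks. The sketch below drops events that belong to other tasks (such as the wake word task) and skips an incremental result that repeats the previous one. `TranscriptFilter` is a hypothetical name, and the code assumes that incremental events carry the same `task_id` and `utterances` fields as the final result, which we have not fully confirmed.
+
+```python
+import json
+
+
+class TranscriptFilter:
+    """Keep only events for one task and drop repeated incremental results."""
+
+    def __init__(self, task_id: str):
+        self.task_id = task_id
+        self._last_partial = None
+
+    def accept(self, message: str):
+        """Return the parsed event if it is worth handling, else None."""
+        event = json.loads(message)
+        # Events from other tasks (or without a task_id) are ignored.
+        if event.get("task_id") != self.task_id:
+            return None
+        kind = event.get("event_type")
+        if kind == "speech_to_text_incremental":
+            # Assumption: partial results use the same "utterances" shape.
+            texts = [u.get("utterance") for u in event.get("utterances", [])]
+            if texts == self._last_partial:
+                return None  # same partial as last time
+            self._last_partial = texts
+        elif kind == "speech_to_text_final":
+            # A final result ends the utterance, so reset the duplicate check.
+            self._last_partial = None
+        return event
+```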
+
+---
+
+## Corrections to Previous Assumptions
+
+Some things we (and others) thought before that are now clearly wrong or incomplete:
+
+* “ASR is dead without cloud” → **Not true in developer mode**
+* “Only wake word works locally” → **Incomplete**
+* “No way to get transcripts” → **False (WebSocket output exists)**
+* “Jibo can’t answer questions anymore” → **Also false now 🙂**
+
+---
+
+## What This Means
+
+This is a big deal:
+
+* Jibo’s core voice pipeline is still there
+* The cloud was orchestration, not the whole system
+* We can now rebuild the “brain” externally
+
+---
+
+## Next Steps
+
+Where this naturally goes next:
+
+* Hook wake word → automatically trigger STT
+* Figure out how this behaves in “normal mode”
+* See if Jibo tries to initiate outbound connections (old cloud flow)
+* Intercept or replace those endpoints locally
+* Build a simple always-on bridge service:
+
+  * Wake word → STT → AI → TTS
+
+---
+
+## Final Thought
+
+We didn’t just poke at endpoints here.
+
+We proved Jibo can:
+
+* hear
+* understand
+* and respond again
+
+That’s a pretty great place to be.