Merge pull request 'Add STT/Voice Docs location and starter' (#3) from stt-docs into main
Reviewed-on: Kevin/JiboDocs#3 will add a link to the useful items list
This commit was merged in pull request #3.
Documentation/Voice/local-asr-tts-roundtrip.md (new file, 264 lines)
# Local ASR, TTS, and Voice Round-Trip on Jibo (Post-Cloud)

> This document describes the first confirmed working voice interaction on a Jibo robot after official cloud services were discontinued.

---
## Summary

Short version: Jibo can still have a full conversation loop locally.

We now have:

* Speech → text (STT) working locally
* Text → speech (TTS) working locally
* A working loop where Jibo hears something and responds

This is all happening without the original cloud services.

---
## Key Findings

Here’s what we now know for sure:

* Wake word detection (`hey jibo`) still works locally
* Speaker ID is still running locally (even if it rejects us 😄)
* `jibo-asr-service` can be started and controlled manually
* ASR (speech recognition) is exposed over HTTP on port `8088`
* TTS (speech output) is exposed over HTTP on port `8089`

### ASR Endpoints

Confirmed working endpoints:

* `/asr_simple_interface`
* `/audio_source`
* `/asr_control`
* `/status`
### WebSocket Outputs

ASR results are streamed over WebSockets:

* `ws://<jibo-ip>:8088/port`
* `ws://<jibo-ip>:8088/simple_port`

### Example STT Start Payload

```json
{
  "command": "start",
  "task_id": "DEBUG:task3",
  "audio_source_id": "alsa1",
  "hotphrase": "none",
  "speech_to_text": true,
  "request_id": "stt_start3"
}
```

---
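The start payload above can be sent with a few lines of stdlib Python. This is a sketch, not a tested client: the host `jibo.local` and the helper name `start_stt_request` are our placeholders; the port, endpoint, and payload fields come from the service.

```python
import json
import urllib.request

def start_stt_request(host, task_id="DEBUG:task3", request_id="stt_start3"):
    """Build the HTTP POST that starts an STT task (payload as documented)."""
    payload = {
        "command": "start",
        "task_id": task_id,
        "audio_source_id": "alsa1",
        "hotphrase": "none",
        "speech_to_text": True,
        "request_id": request_id,
    }
    return urllib.request.Request(
        f"http://{host}:8088/asr_simple_interface",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = start_stt_request("jibo.local")  # hypothetical hostname
# urllib.request.urlopen(req)  # uncomment when the robot is reachable
```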
## What’s Actually Happening (Architecture)

Here’s the real flow in plain English:

1. We send a request to Jibo to start listening
2. Jibo captures audio from its mic (ALSA)
3. The ASR engine processes it
4. Results come back over WebSocket
5. Our app reads the transcript
6. Our app decides what to say
7. We send that to Jibo’s TTS
8. Jibo speaks

Visual version:

```
HTTP POST (/asr_simple_interface)
        ↓
ASR service captures audio
        ↓
Speech recognition runs locally
        ↓
WebSocket emits events
        ↓
External app receives transcript
        ↓
External logic decides response
        ↓
HTTP POST (/tts_speak)
        ↓
Jibo talks
```

---
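The eight steps above can be sketched as a single loop. Everything here is a hypothetical stand-in for the HTTP/WebSocket plumbing (`start_listening`, `events`, `decide`, `speak` are our names); only the event type `speech_to_text_final` comes from the service.

```python
def round_trip(start_listening, events, decide, speak):
    """One pass through the voice loop: listen, transcribe, decide, speak."""
    start_listening()                    # step 1: POST /asr_simple_interface
    for event in events:                 # steps 2-4: mic → ASR → WebSocket
        if event.get("event_type") == "speech_to_text_final":
            text = event["utterances"][0]["utterance"]  # step 5: transcript
            speak(decide(text))          # steps 6-8: reply via /tts_speak

spoken = []
round_trip(
    start_listening=lambda: None,        # fake transport for illustration
    events=[{"event_type": "speech_to_text_final",
             "utterances": [{"utterance": "what time is it"}]}],
    decide=lambda text: "It is demo time.",
    speak=spoken.append,
)
print(spoken)  # ['It is demo time.']
```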
## Example WebSocket Output

Here’s a trimmed real example of a final result:

```json
{
  "event_type": "speech_to_text_final",
  "task_id": "DEBUG:task3",
  "utterances": [
    {
      "utterance": "what time is it",
      "score": 975.9
    }
  ]
}
```

You’ll also see:

* `speech_to_text_incremental` (partial results)
* `end_of_speech`
* `hotphrase` (for "hey jibo")

---
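Since multiple transcript guesses can arrive in one event, a parser should pick the highest-scoring one. A minimal sketch, matching the event shown above (the helper name `best_transcript` is ours):

```python
import json

def best_transcript(raw):
    """Return (utterance, score) from a final event, or None for other events."""
    event = json.loads(raw)
    if event.get("event_type") != "speech_to_text_final":
        return None  # skip incremental, end_of_speech, and hotphrase events
    guesses = event.get("utterances", [])
    if not guesses:
        return None
    # The service may return several guesses; keep the highest-scoring one.
    best = max(guesses, key=lambda g: g.get("score", 0.0))
    return best["utterance"], best.get("score")

raw = json.dumps({
    "event_type": "speech_to_text_final",
    "task_id": "DEBUG:task3",
    "utterances": [{"utterance": "what time is it", "score": 975.9}],
})
print(best_transcript(raw))  # ('what time is it', 975.9)
```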
## Demo Flow (How to Reproduce)

This is the important part.

### 1. Make sure you are in `int-developer` mode and the ASR service is running

From an SSH session:

```
/usr/local/bin/jibo-asr-service -c /usr/local/etc/jibo-asr-service.json
```

---
### 2. Connect to the WebSocket

```
ws://<jibo-ip>:8088/simple_port
```

---
### 3. Start an STT task

POST to:

```
http://<jibo-ip>:8088/asr_simple_interface
```

With:

```json
{
  "command": "start",
  "task_id": "DEBUG:task3",
  "audio_source_id": "alsa1",
  "hotphrase": "none",
  "speech_to_text": true
}
```

---
### 4. Speak to Jibo

Say something like:

> “what time is it”

---
### 5. Wait for the final transcript

Watch for:

```
event_type: speech_to_text_final
```

---
### 6. Send the response to TTS

POST to:

```
http://<jibo-ip>:8089/tts_speak
```

With something like:

```json
{
  "text": "It is demo time."
}
```

---

### 7. Jibo speaks 🎉

---
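Step 6 boils down to one small POST. A sketch with stdlib Python, assuming the endpoint and payload documented above (the host `jibo.local` and helper name `tts_request` are our placeholders):

```python
import json
import urllib.request

def tts_request(host, text):
    """Build the HTTP POST that makes Jibo speak the given text."""
    return urllib.request.Request(
        f"http://{host}:8089/tts_speak",
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = tts_request("jibo.local", "It is demo time.")  # hypothetical hostname
# urllib.request.urlopen(req)  # uncomment on the robot's network: Jibo speaks
```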
## Known Behaviors / Quirks

Some things we’ve seen so far:

* WebSocket connections can drop → reconnect logic helps
* Incremental results can be messy or duplicated
* Multiple transcript guesses can show up
* Wake word (`task0`) runs alongside your custom task
* Saying “hey jibo” during a manual STT session can interfere
* Speaker ID often rejects (but doesn’t block STT)

---
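The reconnect logic mentioned above can be as simple as retry-with-backoff. A sketch with the transport faked out; swap `connect_fn` for a real WebSocket connect call:

```python
import time

def connect_with_retry(connect_fn, attempts=5, base_delay=0.1):
    """Retry a flaky connect with exponential backoff; re-raise on final failure."""
    for attempt in range(attempts):
        try:
            return connect_fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Fake transport that drops twice, then connects, to exercise the helper.
state = {"calls": 0}
def flaky_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("socket dropped")
    return "connected"

print(connect_with_retry(flaky_connect, base_delay=0.0))  # connected
```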
## Corrections to Previous Assumptions

Some things we (and others) thought before that are now clearly wrong or incomplete:

* “ASR is dead without cloud” → **Not true in developer mode**
* “Only wake word works locally” → **Incomplete**
* “No way to get transcripts” → **False (WebSocket output exists)**
* “Jibo can’t answer questions anymore” → **Also false now 🙂**

---
## What This Means

This is a big deal:

* Jibo’s core voice pipeline is still there
* The cloud was orchestration, not the whole system
* We can now rebuild the “brain” externally

---
## Next Steps

Where this naturally goes next:

* Hook wake word → automatically trigger STT
* Figure out how this behaves in “normal mode”
* See if Jibo tries to initiate outbound connections (old cloud flow)
* Intercept or replace those endpoints locally
* Build a simple always-on bridge service:
  * Wake word → STT → AI → TTS

---
## Final Thought

We didn’t just poke at endpoints here.

We proved Jibo can:

* hear
* understand
* and respond again

That’s a pretty great place to be.