Merge pull request 'Add STT/Voice Docs location and starter' (#3) from stt-docs into main
Reviewed-on: Kevin/JiboDocs#3 will add a link to the useful items list
This commit was merged in pull request #3.
This commit is contained in:
264
Documentation/Voice/local-asr-tts-roundtrip.md
Normal file
264
Documentation/Voice/local-asr-tts-roundtrip.md
Normal file
@@ -0,0 +1,264 @@
|
||||
# Local ASR, TTS, and Voice Round-Trip on Jibo (Post-Cloud)
|
||||
|
||||
> This document describes the first confirmed working voice interaction on a Jibo robot after official cloud services were discontinued.
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
Short version: Jibo can still have a full conversation loop locally.
|
||||
|
||||
We now have:
|
||||
|
||||
* Speech → text (STT) working locally
|
||||
* Text → speech (TTS) working locally
|
||||
* A working loop where Jibo hears something and responds
|
||||
|
||||
This is all happening without the original cloud services.
|
||||
|
||||
---
|
||||
|
||||
## Key Findings
|
||||
|
||||
Here’s what we now know for sure:
|
||||
|
||||
* Wake word detection (`hey jibo`) still works locally
|
||||
* Speaker ID is still running locally (even if it rejects us 😄)
|
||||
* `jibo-asr-service` can be started and controlled manually
|
||||
* ASR (speech recognition) is exposed over HTTP on port `8088`
|
||||
* TTS (speech output) is exposed over HTTP on port `8089`
|
||||
|
||||
### ASR Endpoints
|
||||
|
||||
Confirmed working endpoints:
|
||||
|
||||
* `/asr_simple_interface`
|
||||
* `/audio_source`
|
||||
* `/asr_control`
|
||||
* `/status`
|
||||
|
||||
### WebSocket Outputs
|
||||
|
||||
ASR results are streamed over WebSockets:
|
||||
|
||||
* `ws://<jibo-ip>:8088/port`
|
||||
* `ws://<jibo-ip>:8088/simple_port`
|
||||
|
||||
### Example STT Start Payload
|
||||
|
||||
```json
|
||||
{
|
||||
"command": "start",
|
||||
"task_id": "DEBUG:task3",
|
||||
"audio_source_id": "alsa1",
|
||||
"hotphrase": "none",
|
||||
"speech_to_text": true,
|
||||
"request_id": "stt_start3"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## What’s Actually Happening (Architecture)
|
||||
|
||||
Here’s the real flow in plain English:
|
||||
|
||||
1. We send a request to Jibo to start listening
|
||||
2. Jibo captures audio from its mic (ALSA)
|
||||
3. The ASR engine processes it
|
||||
4. Results come back over WebSocket
|
||||
5. Our app reads the transcript
|
||||
6. Our app decides what to say
|
||||
7. We send that to Jibo’s TTS
|
||||
8. Jibo speaks
|
||||
|
||||
Visual version:
|
||||
|
||||
```
|
||||
HTTP POST (/asr_simple_interface)
|
||||
↓
|
||||
ASR service captures audio
|
||||
↓
|
||||
Speech recognition runs locally
|
||||
↓
|
||||
WebSocket emits events
|
||||
↓
|
||||
External app receives transcript
|
||||
↓
|
||||
External logic decides response
|
||||
↓
|
||||
HTTP POST (/tts_speak)
|
||||
↓
|
||||
Jibo talks
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Example WebSocket Output
|
||||
|
||||
Here’s a trimmed real example of a final result:
|
||||
|
||||
```json
|
||||
{
|
||||
"event_type": "speech_to_text_final",
|
||||
"task_id": "DEBUG:task3",
|
||||
"utterances": [
|
||||
{
|
||||
"utterance": "what time is it",
|
||||
"score": 975.9
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
You’ll also see:
|
||||
|
||||
* `speech_to_text_incremental` (partial results)
|
||||
* `end_of_speech`
|
||||
* `hotphrase` (for "hey jibo")
|
||||
|
||||
---
|
||||
|
||||
## Demo Flow (How to Reproduce)
|
||||
|
||||
This is the important part.
|
||||
|
||||
### 1. Make sure you are in `int-developer` mode and ASR service is running
|
||||
|
||||
From ssh:
|
||||
|
||||
```
|
||||
/usr/local/bin/jibo-asr-service -c /usr/local/etc/jibo-asr-service.json
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 2. Connect to WebSocket
|
||||
|
||||
```
|
||||
ws://<jibo-ip>:8088/simple_port
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3. Start an STT task
|
||||
|
||||
POST to:
|
||||
|
||||
```
|
||||
http://<jibo-ip>:8088/asr_simple_interface
|
||||
```
|
||||
|
||||
With:
|
||||
|
||||
```json
|
||||
{
|
||||
"command": "start",
|
||||
"task_id": "DEBUG:task3",
|
||||
"audio_source_id": "alsa1",
|
||||
"hotphrase": "none",
|
||||
"speech_to_text": true
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 4. Speak to Jibo
|
||||
|
||||
Say something like:
|
||||
|
||||
> “what time is it”
|
||||
|
||||
---
|
||||
|
||||
### 5. Wait for final transcript
|
||||
|
||||
Watch for:
|
||||
|
||||
```
|
||||
event_type: speech_to_text_final
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 6. Send response to TTS
|
||||
|
||||
POST to:
|
||||
|
||||
```
|
||||
http://<jibo-ip>:8089/tts_speak
|
||||
```
|
||||
|
||||
With something like:
|
||||
|
||||
```json
|
||||
{
|
||||
"text": "It is demo time."
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 7. Jibo speaks 🎉
|
||||
|
||||
---
|
||||
|
||||
## Known Behaviors / Quirks
|
||||
|
||||
Some things we’ve seen so far:
|
||||
|
||||
* WebSocket connections can drop → reconnect logic helps
|
||||
* Incremental results can be messy or duplicated
|
||||
* Multiple transcript guesses can show up
|
||||
* Wake word (`task0`) runs alongside your custom task
|
||||
* Saying “hey jibo” during a manual STT session can interfere
|
||||
* Speaker ID often rejects (but doesn’t block STT)
|
||||
|
||||
---
|
||||
|
||||
## Corrections to Previous Assumptions
|
||||
|
||||
Some things we (and others) thought before that are now clearly wrong or incomplete:
|
||||
|
||||
* “ASR is dead without cloud” → **Not true in developer mode**
|
||||
* “Only wake word works locally” → **Incomplete**
|
||||
* “No way to get transcripts” → **False (WebSocket output exists)**
|
||||
* “Jibo can’t answer questions anymore” → **Also false now 🙂**
|
||||
|
||||
---
|
||||
|
||||
## What This Means
|
||||
|
||||
This is a big deal:
|
||||
|
||||
* Jibo’s core voice pipeline is still there
|
||||
* The cloud was orchestration, not the whole system
|
||||
* We can now rebuild the “brain” externally
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
Where this naturally goes next:
|
||||
|
||||
* Hook wake word → automatically trigger STT
|
||||
* Figure out how this behaves in “normal mode”
|
||||
* See if Jibo tries to initiate outbound connections (old cloud flow)
|
||||
* Intercept or replace those endpoints locally
|
||||
* Build a simple always-on bridge service:
|
||||
|
||||
* Wake word → STT → AI → TTS
|
||||
|
||||
---
|
||||
|
||||
## Final Thought
|
||||
|
||||
We didn’t just poke at endpoints here.
|
||||
|
||||
We proved Jibo can:
|
||||
|
||||
* hear
|
||||
* understand
|
||||
* and respond again
|
||||
|
||||
That’s a pretty great place to be.
|
||||
Reference in New Issue
Block a user