Development Plan

Summary

This document is the working implementation plan after the initial hosted-cloud scaffold.

It is intentionally broader than the current Node server. The Node server is a protocol oracle and discovery tool, not the complete map of Jibo.

Current Scope

stable .NET cloud scaffold
Azure-oriented architecture and data ownership
normalized runtime contracts for cloud-to-runtime handoff
bootstrap documentation for region injection and targeted device patching
starter endpoint coverage for account, notification, robot, loop, update, uploads, and core WebSocket acceptance
starter xUnit coverage for the .NET application layer

Next Implementation Scope

expand HTTP X-Amz-Target coverage from observed traffic and fixtures
grow WebSocket compatibility from stub acceptance into realistic turn orchestration
keep websocket parity fixture-driven, starting with exact sequencing and payload-shape fidelity for the successful joke vertical slice before claiming broader skill coverage
replace in-memory state with Azure SQL-backed persistence
add structured fixture replay tests
harden region/bootstrap docs by software version

Discovery Scope

We still need to map more than the current Node server expresses. Priority discovery areas:

all hostnames and service prefixes observed in real startup and turn traffic
skill launch and skill lifecycle flows
interactivity command families beyond the current joke flow
richer embodied speech and animation behaviors
upload, logging, backup, and key-sharing flows
per-version configuration differences and region handling

Current WebSocket Discovery Focus

The next fixture-driven websocket work should continue to separate three buckets:

discovered behavior Grounded by the Node oracle, sanitized fixtures, and live captures
implemented parity Only the narrow slices currently replayed and tested in .NET
future hypotheses Ideas to investigate later, but not behaviors to silently bake into the hosted cloud

Right now the strongest implemented vertical slice beyond basic listen completion is the successful joke turn:

CLIENT_ASR transcript-carrying turn completion
synthetic LISTEN result shaping
EOS
delayed joke SKILL_ACTION

That should remain the model for future websocket work: capture first, fixture second, parity third.

The latest live captures also support a second discovery track:

menu-driven CLIENT_NLU parity for clock, timer, and alarm flows
richer transcript-bearing CLIENT_ASR discovery beyond jokes
buffered-audio preservation for eventual real ASR in .NET

Near-term ASR work should stay staged:

preserve and replay the websocket audio payloads honestly
validate a local tool-based decode/transcribe loop in .NET
compare that against Azure-hosted STT before choosing a default production path

That keeps Node as the reverse-engineering oracle while letting the long-term .NET cloud gain real STT seams without pretending they are finished.

Working Cloud Framework

The current evidence in captures, fixtures, and Node behavior supports three main cloud interaction paths:

local Jibo behavior observed by the cloud The robot or its local skill stack already interpreted the turn and the cloud mainly tracks, acknowledges, or lightly completes it.
local Jibo behavior overridden or redirected by the cloud The robot reports the turn state, but the cloud chooses a different synthetic reply path.
raw audio interpreted by the cloud The robot sends buffered audio and the cloud performs transcript resolution before sending back LISTEN, EOS, and ESML-driven playback.

Those are the right primary buckets for now. Additional side channels may still emerge later, especially around proactive traffic, direct skill/service sockets, or future on-device OS changes, but they should be treated as extensions to this model until captures prove otherwise.

Speech, Animation, And ESML

The current joke flow is only a small foothold into Jibo expressiveness.

Future work should map:

direct speech modifiers
animation selection and filtering
embodied speech behaviors
ESML and SSML subsets
interactions between speech, visuals, and timing

Useful external references:

Future Scope

full endpoint inventory beyond the current Node mapping
OTA-driven recovery
paid hosted plans or donation-supported hosting
deeper on-device bridge and OS modernization
more capable skill/runtime integration
possible LLM or tool-use patterns inspired by workshop-era experimentation

MCP-Like Ideas

Recent MIT workshop materials suggest experimentation around modern AI tooling for Jibo, including an MCP-oriented idea. We should treat that as inspiration for future OpenJibo directions, not as a present dependency or supported integration.

4.9 KiB Raw Blame History