JiboExperiments/OpenJibo/docs/DesignDoc/original-server-design.md
2026-05-23 00:21:42 +03:00

Original Jibo Server (Pegasus) Design Document

Executive Summary

The original Jibo server, codenamed "Pegasus" (formerly V1.X), is a cloud-based microservices architecture that powers the Jibo social robot's conversational AI capabilities. It is built as a Lerna monorepo using Node.js/TypeScript and deployed via Docker containers. The system processes speech, performs natural language understanding, routes to appropriate skills, and manages proactive behaviors.

Architecture Overview

Monorepo Structure

The codebase is organized as a Lerna monorepo with the following main packages:

  • packages/hub - Central orchestration service
  • packages/parser - NLU (Natural Language Understanding) service
  • packages/history - Data persistence service (MongoDB)
  • packages/baseskill - Base class and framework for cloud skills
  • packages/interfaces - TypeScript interfaces and API contracts
  • packages/utils - Shared utility libraries
  • packages/chitchat-skill - Example conversational skill
  • packages/report-skill - Reporting skill
  • packages/lasso - External data integration service
  • packages/hub-client - Client library for hub communication
  • packages/history-client - Client library for history service
  • packages/test-utils - Testing utilities

Technology Stack

  • Language: TypeScript 2.5.3
  • Runtime: Node.js 8.9.4
  • Package Manager: Yarn 1.7.0
  • Containerization: Docker
  • Orchestration: Docker Compose (local), AWS ECS (production)
  • Database: MongoDB 3.6.0
  • Cache: Redis 3
  • NLU: Dialogflow (formerly API.ai)
  • ASR: Google Cloud Speech API
  • WebSocket: ws library
  • HTTP: Express.js
  • Authentication: JWT (jsonwebtoken)

Core Services

1. Hub Service (packages/hub)

The Hub is the central orchestrator that coordinates all interactions between the robot and cloud services.

Key Components

HubService (HubService.ts)

  • Main service class extending BaseService
  • Initializes and manages all hub components
  • Registers WebSocket and HTTP handlers

HubComponents - Dependency injection container:

  • parser: ParserClient - NLU service client
  • skillConfigManager: SkillConfigManager - Manages skill configurations
  • intentRouter: IntentRouter - Routes intents to skills
  • skillRequestMaker: SkillRequestMaker - Makes HTTP requests to skills
  • history: HistoryServiceClient - History service client
  • hubSettings: HubSettings - Hub configuration
  • settingsClient: SettingsClient - Settings service client

Endpoints

WebSocket Endpoints:

  • /listen and /v1/listen - Handles speech recognition and NLU
  • /proactive and /v1/proactive - Handles proactive triggers

HTTP Endpoints:

  • /skills and /v1/skills - Lists available skills
  • /healthcheck - Service health check

Listen Flow

The listen transaction follows a state machine implemented in ListenTransactionHandler:

States:
  WAIT_LISTEN → ASR → NLU → ROUTE → DONE
  WAIT_LISTEN → WAIT_CLIENT_ASR → NLU → ROUTE → DONE
  WAIT_LISTEN → WAIT_CLIENT_NLU → ROUTE → DONE

State Transitions:

  1. WAIT_LISTEN - Receives LISTEN message from robot
  2. ASR - Performs Automatic Speech Recognition using Google Cloud Speech API
    • Streams audio packets
    • Emits SOS (Start of Speech) when speech detected
    • Emits EOS (End of Speech) when speech ends
    • Handles timeouts (SOS timeout, max speech timeout)
  3. NLU - Sends ASR text to Parser service for intent recognition
    • Includes context (loop users, perception, etc.)
    • Supports external Dialogflow agents
  4. ROUTE - Intent Router determines which skill to launch
    • Matches NLU result against skill intent configurations
    • Decision Mediator can alter decisions based on external factors
    • Routes to on-robot skills or cloud skills
  5. DONE - Transaction complete
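
The three documented paths can be captured as a small transition table. What follows is a minimal sketch rather than the actual ListenTransactionHandler code: the state names come from the diagram above, while `TRANSITIONS`, `canTransition`, and `isValidPath` are hypothetical names introduced for illustration.

```typescript
// State names are taken from the state diagram; the table itself is inferred
// from the three documented paths and is an assumption, not original source.
type ListenState =
  | "WAIT_LISTEN"
  | "ASR"
  | "WAIT_CLIENT_ASR"
  | "WAIT_CLIENT_NLU"
  | "NLU"
  | "ROUTE"
  | "DONE";

const TRANSITIONS: Record<ListenState, ListenState[]> = {
  WAIT_LISTEN: ["ASR", "WAIT_CLIENT_ASR", "WAIT_CLIENT_NLU"],
  ASR: ["NLU"],
  WAIT_CLIENT_ASR: ["NLU"],
  WAIT_CLIENT_NLU: ["ROUTE"], // client already supplied the NLU result
  NLU: ["ROUTE"],
  ROUTE: ["DONE"],
  DONE: [],
};

// True when the table allows moving directly from one state to the other.
function canTransition(from: ListenState, to: ListenState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// True when the path starts at WAIT_LISTEN, ends at DONE, and every hop is allowed.
function isValidPath(path: ListenState[]): boolean {
  if (path.length === 0 || path[0] !== "WAIT_LISTEN") return false;
  for (let i = 1; i < path.length; i++) {
    if (!canTransition(path[i - 1], path[i])) return false;
  }
  return path[path.length - 1] === "DONE";
}
```

Keeping the allowed transitions in one table makes it straightforward to reject out-of-order messages, for example a CLIENT_NLU message arriving after server-side ASR has already started.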

Listen Transaction Handler (ListenTransactionHandler.ts):

  • Manages audio streaming via AudioBuffer
  • Creates ASRSession for speech recognition
  • Handles timeouts (ASR: 40s, Parser: 10s, Context: 5s, Skill: 10s)
  • Records speech history to MongoDB and optionally S3
  • Supports client-provided ASR/NLU (for menu clicks, etc.)
  • Handles skill redirects

Proactive Flow

The proactive system allows Jibo to initiate conversations based on context, history, and triggers.

Proactive Transaction Handler (ProactiveTransactionHandler.ts):

  1. Receives TRIGGER message from robot
  2. Waits for CONTEXT message (robot state)
  3. Action Selection:
    • Gets all proactive skill configurations
    • Filters by context rules (time, location, people present, etc.)
    • Filters by interaction history rules (frequency, recency)
    • Filters by user settings
    • Randomly selects from eligible actions
  4. Launches selected skill (on-robot or cloud)
  5. Returns match response or no-action response

Proactive Registration: Skills register proactive behaviors with:

  • Trigger types (time-based, event-based, surprise)
  • Context rules (when this can trigger)
  • Interaction history rules (how often it can trigger)
  • Settings rules (user preferences)
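
The action-selection step amounts to a chain of filters followed by a random pick. The sketch below illustrates that shape under stated assumptions: the document names the rule categories but not their fields, so `RobotContext`, `ProactiveCandidate`, `SelectionInput`, and `selectProactive` are all hypothetical, and each rule is reduced to its simplest form.

```typescript
// Hypothetical shapes: the document names the rule categories, not their fields.
interface RobotContext {
  hourOfDay: number;
  peoplePresent: number;
}

interface ProactiveCandidate {
  skillID: string;
  contextRule: (ctx: RobotContext) => boolean; // context rule
  minSecondsBetweenLaunches: number;           // interaction history rule
  settingKey?: string;                         // settings rule (user opt-out flag)
}

interface SelectionInput {
  ctx: RobotContext;
  nowSeconds: number;
  lastLaunchSeconds: Record<string, number | undefined>; // skillID -> last launch time
  settings: Record<string, boolean | undefined>;
}

// Filters by context, history, and settings, then picks one eligible candidate.
// The random source is injectable so the selection can be tested deterministically.
function selectProactive(
  candidates: ProactiveCandidate[],
  input: SelectionInput,
  random: () => number = Math.random,
): ProactiveCandidate | undefined {
  const eligible = candidates
    .filter((c) => c.contextRule(input.ctx))
    .filter((c) => {
      const last = input.lastLaunchSeconds[c.skillID];
      return last === undefined || input.nowSeconds - last >= c.minSecondsBetweenLaunches;
    })
    .filter((c) => c.settingKey === undefined || input.settings[c.settingKey] !== false);
  if (eligible.length === 0) return undefined; // corresponds to the no-action response
  return eligible[Math.floor(random() * eligible.length)];
}
```

Returning `undefined` when nothing survives the filters mirrors the no-action response described in the handler steps.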

2. Parser Service (packages/parser)

The Parser service performs Natural Language Understanding using Dialogflow.

ParserService (ParserService.ts):

  • Optionally starts a RobustParser process on port 8787
  • Initializes Dialogflow client
  • Initializes Robust Parser client
  • Handles POST requests to /v1/parse
  • Exposes state at /state endpoint

NLU Pipeline:

  1. Receives text, rules, and context
  2. Queries Dialogflow with configured agents
  3. Optionally queries Robust Parser (custom NLU)
  4. Returns intent, entities, and rules

Configuration:

  • Dialogflow API key
  • Robust Parser enable/disable
  • Multiple external agents support

3. History Service (packages/history)

The History service persists interaction data to MongoDB.

HistoryService (HistoryService.ts):

  • Two database clients:
    • SkillLaunchDBClient - Records skill launches
    • SpeechHistoryDBClient - Records speech interactions (optional)
  • HTTP endpoints:
    • /v1/skill/launch - Skill launch history
    • /v1/speech - Speech history (if enabled)
  • Health check endpoint

Data Stored:

  • Skill launches (skill ID, intent, timestamp, robot ID, account ID)
  • Speech interactions (ASR result, NLU result, audio file URL, error tracking)

4. Lasso Service (packages/lasso)

Lasso provides external data integration for skills.

Features:

  • OAuth2 credential management
  • Calendar client integration
  • Weather data (Dark Sky API)
  • Maps data (Google Maps API)
  • News data (AP News)
  • MongoDB for credential storage
  • Redis for caching

LassoService (LassoService.ts):

  • Manages OAuth2 flows
  • Provides relay endpoints for external APIs
  • Caches responses in Redis

Skill Framework

BaseSkill (packages/baseskill)

BaseSkill (BaseSkill.ts):

  • Abstract base class for all cloud skills
  • Extends BaseHttpHandler
  • Handles POST requests to /
  • Provides error handling
  • Tracks timing

GraphSkill (GraphSkill.ts):

  • Extends BaseSkill with graph-based state machine
  • Implements node-based conversation flow
  • Supports skill redirects
  • Tracks analytics events
  • Supports supplemental behaviors (parallel/sequence)

Graph System

The graph system provides a state machine framework for skills.

Graph (Graph.ts):

  • Directed graph of connected nodes
  • Supports subgraphs (hierarchical)
  • Exit transitions for graph termination
  • Validation (reachability, transition completeness)
  • GraphViz dot file generation
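
Reachability validation and DOT generation are both simple graph walks. The following sketch shows one way they could work, assuming a plain adjacency map in place of the real Graph class, whose internals are not shown in this document. `Edges`, `unreachableNodes`, and `toDot` are hypothetical names.

```typescript
// Adjacency map: node name -> (transition name -> destination node name).
// This plain-object shape is an assumption; the real Graph class is not shown.
type Edges = Record<string, Record<string, string>>;

// Breadth-first search from the start node; returns the nodes that are never reached.
function unreachableNodes(edges: Edges, start: string): string[] {
  const seen: Record<string, boolean> = Object.create(null);
  const queue: string[] = [start];
  seen[start] = true;
  while (queue.length > 0) {
    const current = queue.shift() as string;
    const out = edges[current] || {};
    for (const transition of Object.keys(out)) {
      const next = out[transition];
      if (!seen[next]) {
        seen[next] = true;
        queue.push(next);
      }
    }
  }
  return Object.keys(edges).filter((node) => !seen[node]).sort();
}

// Emits GraphViz DOT text with one labelled edge per transition.
function toDot(edges: Edges): string {
  const lines: string[] = ["digraph skill {"];
  for (const from of Object.keys(edges)) {
    for (const transition of Object.keys(edges[from])) {
      lines.push(`  "${from}" -> "${edges[from][transition]}" [label="${transition}"];`);
    }
  }
  lines.push("}");
  return lines.join("\n");
}
```

A validation pass would typically fail the build when `unreachableNodes` returns a non-empty list, since an unreachable node usually indicates a missing transition.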

GraphManager (GraphManager.ts):

  • Singleton per skill
  • Manages node IDs and mappings
  • Executes graph:
    • start() - Creates session, enters initial node
    • enterNode() - Calls node's enter method
    • exitNode() - Calls node's exit method with action results
    • executeTransition() - Moves to next node
  • Maintains session state (node ID, data, trace)

Node (Node.ts):

  • Abstract base class for graph nodes
  • Has transition names and destinations
  • Two lifecycle methods:
    • enter(data) - Called when node is entered, returns action or redirect
    • exit(data) - Called with action results, returns next transition
  • Supports graph traversal (BFS)
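
The enter/exit lifecycle can be illustrated with a much-reduced manager. This is a synchronous sketch under stated assumptions, not the GraphManager implementation: the real methods are asynchronous and return richer responses, and `MiniGraphManager`, `GraphNode`, `Session`, and the string actions are hypothetical simplifications.

```typescript
type Transition = "Done" | "Retry";

interface SessionData { attempts: number; }

// Simplified, synchronous stand-in for the Node lifecycle described above.
interface GraphNode {
  enter(data: SessionData): { action: string };
  exit(data: SessionData, actionResult: string): Transition;
}

interface Session {
  nodeID: string | null; // null once the graph has exited
  data: SessionData;
  trace: string[];
}

type Outgoing = { [T in Transition]?: string };

class MiniGraphManager {
  private nodes: Record<string, GraphNode>;
  private edges: Record<string, Outgoing>;
  private initial: string;

  constructor(nodes: Record<string, GraphNode>, edges: Record<string, Outgoing>, initial: string) {
    this.nodes = nodes;
    this.edges = edges;
    this.initial = initial;
  }

  // Creates a session and enters the initial node.
  start(): { session: Session; action: string } {
    const session: Session = { nodeID: null, data: { attempts: 0 }, trace: [] };
    return { session, action: this.enterNode(session, this.initial) };
  }

  // Exits the current node with the robot's action result, then follows the
  // returned transition. Returns the next action, or null when the graph exits.
  update(session: Session, actionResult: string): string | null {
    if (session.nodeID === null) return null;
    const transition = this.nodes[session.nodeID].exit(session.data, actionResult);
    const outgoing: Outgoing = this.edges[session.nodeID] || {};
    const next = outgoing[transition];
    if (next === undefined) {
      session.nodeID = null;
      return null;
    }
    return this.enterNode(session, next);
  }

  private enterNode(session: Session, nodeID: string): string {
    session.nodeID = nodeID;
    session.trace.push(nodeID);
    return this.nodes[nodeID].enter(session.data).action;
  }
}

// Usage: an "ask" node that retries until the robot reports "ok".
const manager = new MiniGraphManager(
  {
    ask: {
      enter: (data) => { data.attempts += 1; return { action: "ask-question" }; },
      exit: (_data, result) => (result === "ok" ? "Done" : "Retry"),
    },
    bye: {
      enter: () => ({ action: "say-goodbye" }),
      exit: () => "Done",
    },
  },
  { ask: { Done: "bye", Retry: "ask" }, bye: {} },
  "ask",
);
```

In this sketch `start()` plays the role of a LISTEN_LAUNCH request and each `update()` call plays the role of a LISTEN_UPDATE carrying action results; a node with no outgoing edge for the returned transition ends the session.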

Built-in Node Types:

  • DefaultNode - Simple terminal node
  • JCPNode - Returns JCP action
  • NoOpNode - No operation
  • TrueFalseNode - Conditional branching
  • SetLooperIDNode - Sets speaker ID

MIM (Multimodal Interaction Module) System:

  • ANFactory - Creates graph for playing MIM animations
  • Supports scripted responses, emotion responses, fallback responses
  • Semi-specific responses (context-aware)

Skill Request/Response Protocol

Skill Request Types (skill/request.ts):

  • LISTEN_LAUNCH - Launch skill from listen interaction
  • LISTEN_UPDATE - Update skill with action results
  • PROACTIVE_LAUNCH - Launch skill proactively

Skill Request Data:

{
  type: MessageType,
  msgID: UUID,
  ts: number,
  data: {
    general: { accountID, robotID, lang, release },
    runtime: { character, location, loop, perception, dialog },
    skill: { id, session? },
    result: any,  // Action results for UPDATE
    nlu: NLUResult,
    asr: ASRResult,
    memo?: any
  }
}

Skill Response Types (skill/response.ts):

  • SKILL_ACTION - Returns action to execute
  • SKILL_REDIRECT - Redirects to another skill
  • ERROR - Error response

Skill Action Data:

{
  action: JCPAction,  // JCP protocol behavior
  analytics?: AnalyticsData,
  final?: boolean,  // Is this the final response?
  fireAndForget?: boolean
}

JCP Action (skill/action.ts):

{
  type: ActionType.JCP,
  config: {
    version: "1.0.0",
    jcp: SupportedBehaviors  // SLIM, Sequence, Parallel, SetPresentPerson, ImpactEmotion
  }
}
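
The three response types lend themselves to a discriminated union on the caller's side. The sketch below shows how a caller might branch on them; it is an illustration under assumptions, not the Hub's code. The `"JCP"` literal stands in for `ActionType.JCP`, the `skillID` payload of a redirect is a hypothetical field, and the rule that `final` or `fireAndForget` means no further result is awaited is an interpretation of those flags.

```typescript
interface JCPAction {
  type: "JCP"; // stands in for ActionType.JCP; the literal value is an assumption
  config: { version: string; jcp: unknown };
}

type SkillResponse =
  | { type: "SKILL_ACTION"; data: { action: JCPAction; final?: boolean; fireAndForget?: boolean } }
  | { type: "SKILL_REDIRECT"; data: { skillID: string } } // payload field is hypothetical
  | { type: "ERROR"; data: { message: string; code: string } };

type HubStep =
  | { kind: "send-action"; action: JCPAction; awaitResult: boolean }
  | { kind: "launch-skill"; skillID: string }
  | { kind: "report-error"; code: string };

// Maps a skill response to the next thing the caller should do.
function nextHubStep(response: SkillResponse): HubStep {
  switch (response.type) {
    case "SKILL_ACTION":
      return {
        kind: "send-action",
        action: response.data.action,
        // Assumed: a final or fire-and-forget action needs no follow-up result.
        awaitResult: response.data.final !== true && response.data.fireAndForget !== true,
      };
    case "SKILL_REDIRECT":
      return { kind: "launch-skill", skillID: response.data.skillID };
    case "ERROR":
      return { kind: "report-error", code: response.data.code };
  }
}
```

Because the switch is exhaustive over `type`, adding a fourth response type to the union would surface as a compile error here rather than as a silent fall-through.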

Skill Configuration

SkillConfig (skill/config.ts):

{
  id: SkillID,
  intents: [{
    name: IntentName,
    entities?: EntityConfig[],
    memo?: any
  }],
  proactives?: ProactiveRegistration[],
  IHQueries?: IHQueryDefinitions,
  onRobot?: boolean,
  URL: string,
  settings?: ManifestSettings
}

Entity Config:

  • name - Entity name
  • value - Expected value
  • matchRule - 'EXACT' or 'NOT'
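
Intent routing against these configurations could look like the following. This is a minimal sketch, assuming that EXACT means the entity must equal the configured value, that NOT means it must differ, and that the first matching skill wins. None of these semantics are confirmed by the source, and `entityMatches` and `routeIntent` are hypothetical names.

```typescript
interface EntityConfig {
  name: string;
  value: string;
  matchRule: "EXACT" | "NOT";
}

interface IntentConfig { name: string; entities?: EntityConfig[]; }
interface SkillConfig { id: string; intents: IntentConfig[]; }
interface NLUResult { intent: string; entities: Record<string, string | undefined>; }

// Assumed semantics: EXACT requires equality, NOT requires inequality
// (so a missing entity satisfies NOT and fails EXACT).
function entityMatches(rule: EntityConfig, nlu: NLUResult): boolean {
  const actual = nlu.entities[rule.name];
  return rule.matchRule === "EXACT" ? actual === rule.value : actual !== rule.value;
}

// Returns the ID of the first skill with a matching intent config, or undefined.
function routeIntent(nlu: NLUResult, skills: SkillConfig[]): string | undefined {
  for (const skill of skills) {
    for (const intent of skill.intents) {
      const rules = intent.entities || [];
      if (intent.name === nlu.intent && rules.every((rule) => entityMatches(rule, nlu))) {
        return skill.id;
      }
    }
  }
  return undefined;
}
```

With these rules, two skills can share an intent name and be told apart purely by entity values, which is one plausible reason for having a NOT rule at all.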

Proactive Registration:

  • Trigger type and conditions
  • Context rules
  • Interaction history rules
  • Settings rules

Interfaces Package

The interfaces package defines all TypeScript interfaces for communication between services.

Key Interface Modules

service.ts - Base message types:

  • BaseMessage<T, D> - Generic message with type, msgID, timestamp, data
  • BaseResponse<T, D> - Response with final flag and timings
  • IAuthDetails - Authentication details (account ID, access keys)

hub/ - Hub-specific interfaces:

  • request.ts - LISTEN, CONTEXT, CLIENT_ASR, CLIENT_NLU messages
  • response.ts - ASR, NLU, LISTEN, SKILL_REDIRECT, ERROR responses
  • MessageType.ts - Message type enums
  • HubErrorCode.ts - Error code enums

skill/ - Skill-specific interfaces:

  • request.ts - LISTEN_LAUNCH, LISTEN_UPDATE, PROACTIVE_LAUNCH
  • response.ts - SKILL_ACTION, SKILL_REDIRECT, ERROR
  • action.ts - JCP action types
  • config.ts - Skill configuration
  • behaviors.ts - Supported JCP behaviors
  • analytics.ts - Analytics event types

nlu.ts - NLU interfaces:

  • NLURequestData - Text, rules, loop users, external agents
  • NLUResult - Intent, entities, rules
  • ExternalAgentRequest - External Dialogflow agent config

asr.ts - ASR interfaces:

  • ASRResult - Text, confidence, annotation
  • ASRConfig - Language, hints, timeouts

jibo/ - Jibo-specific data:

  • data.ts - GeneralData (account, robot, language), SkillData (session, trace)
  • runtime.ts - RuntimeContext (character, location, loop, perception, dialog)

proactive/ - Proactive interfaces:

  • Context field definitions
  • History rules
  • Settings rules
  • Proactive trigger/request/response

history/ - History interfaces:

  • Skill launch data
  • Speech history data

Utils Package

The utils package provides shared functionality.

BaseService (utils/service/BaseService.ts)

Base class for all Pegasus services:

Features:

  • Express.js HTTP server
  • WebSocket server (ws library)
  • JWT authentication
  • Request/response logging with jibo-log
  • New Relic monitoring
  • Health check endpoint
  • Error handling middleware

Methods:

  • addSocketHandler(path, handler) - Register WebSocket handler
  • addHttpHandler(path, handler) - Register HTTP handler
  • init(port) - Start server
  • close() - Stop server

Authentication:

  • JWT token verification
  • Bearer token scheme
  • Configurable secret via ETCO_server_hubTokenSecret

Logging:

  • Per-request log instances
  • Transaction ID tracking
  • Robot ID tracking
  • Configurable log levels per namespace

Other Utils

  • PegasusRequest - Enhanced Express request with Jibo headers
  • PegasusWebSocket - Enhanced WebSocket with auth and logging
  • JiboHeaders - Parses Jibo-specific headers (transID, robotID, logging config)
  • ResponseWrapper - Wraps WebSocket responses
  • HttpError - HTTP error with status code

Communication Protocols

WebSocket Protocol

Connection:

  • URL: ws://hub:9000/listen or ws://hub:9000/proactive
  • Authentication: Bearer token in Authorization header
  • Headers: x-jibo-transid, x-jibo-robotid, x-jibo-logging-config
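
As an illustration of these connection parameters, the helper below assembles the URL and headers for either endpoint. It is a sketch under assumptions: `buildHubConnection` and `HubConnection` are hypothetical names, and `x-jibo-logging-config` is left out because its value format is not described in this document.

```typescript
interface HubConnection {
  url: string;
  headers: Record<string, string>;
}

// Builds the WebSocket URL and request headers for a Hub transaction.
function buildHubConnection(
  endpoint: "listen" | "proactive",
  token: string,
  transID: string,
  robotID: string,
  host: string = "hub:9000", // default taken from the URLs listed above
): HubConnection {
  return {
    url: `ws://${host}/${endpoint}`,
    headers: {
      Authorization: `Bearer ${token}`,
      "x-jibo-transid": transID,
      "x-jibo-robotid": robotID,
    },
  };
}
```

With the ws library, the result would typically be passed as `new WebSocket(conn.url, { headers: conn.headers })`, since ws forwards a `headers` option to the underlying HTTP upgrade request.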

Message Format:

{
  "type": "MESSAGE_TYPE",
  "msgID": "uuid",
  "ts": 1234567890,
  "data": { ... }
}
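
A receiver needs to check this envelope before dispatching on `type`. The following sketch validates only the four fields shown above; `parseMessage` is a hypothetical helper, and the real services may validate more (or differently).

```typescript
interface BaseMessage<T extends string, D> {
  type: T;
  msgID: string;
  ts: number;
  data: D;
}

// Parses a text frame and returns the envelope, or null if it is malformed.
function parseMessage(raw: string): BaseMessage<string, unknown> | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch (err) {
    return null; // not JSON at all
  }
  if (typeof value !== "object" || value === null) return null;
  const fields = value as Record<string, unknown>;
  const type = fields["type"];
  const msgID = fields["msgID"];
  const ts = fields["ts"];
  if (typeof type !== "string" || typeof msgID !== "string" || typeof ts !== "number") {
    return null;
  }
  if (!("data" in fields)) return null;
  return { type, msgID, ts, data: fields["data"] };
}
```

The `data` payload is left as `unknown` on purpose, so that each message type can apply its own narrower check after the envelope has been accepted.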

Listen Flow Messages:

  1. Robot → Hub: LISTEN (with ASR config, rules, language)
  2. Robot → Hub: Audio packets (binary)
  3. Hub → Robot: SOS (Start of Speech)
  4. Robot → Hub: CONTEXT (runtime context)
  5. Hub → Robot: EOS (End of Speech)
  6. Hub → Robot: LISTEN (with ASR result, NLU result, match)
  7. Hub → Robot: SKILL_ACTION (if cloud skill)
  8. Robot → Hub: CMD_RESULT (action results)
  9. Hub → Robot: SKILL_ACTION (the next action, or the last one when the skill returns final=true)

Proactive Flow Messages:

  1. Robot → Hub: TRIGGER (trigger data)
  2. Robot → Hub: CONTEXT (runtime context)
  3. Hub → Robot: PROACTIVE (match or no-action)
  4. Hub → Robot: SKILL_ACTION (if cloud skill)

HTTP Protocol

Skill Request:

  • Method: POST
  • URL: http://skill-host:port/
  • Headers: Authorization, x-jibo-transid, x-jibo-robotid
  • Body: SkillRequest JSON

Parser Request:

  • Method: POST
  • URL: http://parser:8080/v1/parse
  • Body: NLURequestData JSON

Authentication & Security

JWT Authentication

Token Format:

{
  "id": "account-id",
  "accessKeyId": "client-id",
  "secretAccessKey": "client-secret",
  "friendlyId": "robot-name"
}

Verification:

  • Secret: ETCO_server_hubTokenSecret environment variable
  • Scheme: Bearer
  • Applied to WebSocket connections and HTTP endpoints
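
The two steps around signature verification, extracting the token and checking the decoded payload, can be sketched without any library. This assumes all four token fields are required strings, which the source does not state, and `extractBearerToken` and `isAuthDetails` are hypothetical helpers. Verifying the signature itself would be delegated to the `verify` function of jsonwebtoken with the configured secret.

```typescript
interface AuthDetails {
  id: string;
  accessKeyId: string;
  secretAccessKey: string;
  friendlyId: string;
}

// Returns the token from an "Authorization: Bearer <token>" header value, or null.
function extractBearerToken(authorization: string | undefined): string | null {
  if (authorization === undefined) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(authorization.trim());
  return match ? match[1] : null;
}

// Checks that a decoded token payload has the documented fields.
// Treating every field as a required string is an assumption.
function isAuthDetails(payload: unknown): payload is AuthDetails {
  if (typeof payload !== "object" || payload === null) return false;
  const fields = payload as Record<string, unknown>;
  return ["id", "accessKeyId", "secretAccessKey", "friendlyId"].every(
    (key) => typeof fields[key] === "string",
  );
}
```

Rejecting a request when either helper fails keeps malformed headers and unexpected payloads out of the transaction handlers.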

Network Security

  • All services run in Docker containers
  • Services communicate via Docker network (pegasus-nw)
  • External access via load balancer
  • TLS termination at load balancer

Deployment

Docker Compose (Local Development)

Services:

  • hub - Hub service (port 9000)
  • parser - Parser service (port 9005)
  • history - History service (port 9006)
  • chitchat-skill - Chitchat skill (port 9004)
  • report-skill - Report skill (port 9003)
  • lasso - Lasso service (port 9007)
  • redis - Redis cache (port 6379)
  • mongo_lasso - MongoDB for Lasso (port 27017)
  • history_cluster - MongoDB for History (from docker-compose-history-db.yml)

Configuration:

  • Environment variables prefixed with ETCO_ (ETCO = Environment TO Configuration)
  • Volume mounting: ./:/pegasus:consistent for live code editing
  • Debug ports: 5850-5855 for Node.js debugging
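
The variable names that appear in this document (for example ETCO_server_hubTokenSecret and ETCO_hub_recordSpeechHistory) suggest an `ETCO_<section>_<key>` layout. Assuming that layout holds, which the source does not confirm, a loader could map the environment into nested configuration as sketched below. `loadEtcoConfig` is a hypothetical name, and values are kept as raw strings.

```typescript
type Config = Record<string, Record<string, string>>;

// Maps ETCO_<section>_<key>=value entries to config[section][key] = value.
// The layout is inferred from variable names in this document, not confirmed.
function loadEtcoConfig(env: Record<string, string | undefined>): Config {
  const config: Config = {};
  for (const variable of Object.keys(env)) {
    const value = env[variable];
    const parts = variable.split("_");
    if (parts[0] !== "ETCO" || parts.length < 3 || value === undefined) continue;
    const section = parts[1];
    const key = parts.slice(2).join("_"); // keep any further underscores in the key
    config[section] = config[section] || {};
    config[section][key] = value;
  }
  return config;
}
```

In a Node.js service this would typically be called once at startup as `loadEtcoConfig(process.env)`.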

Build Process

Commands:

docker build -t pegasus_base:latest .
yarn docker:bootstrap
yarn docker:build
./pegasus.js build-docker-image --services hub

CLI Tool (cli/):

  • bootstrap - Install dependencies
  • build - Build TypeScript
  • test - Run tests
  • docker-run - Run commands in Docker
  • build-docker-image - Build Docker images for services

Production Deployment

  • AWS ECS (Elastic Container Service)
  • ECR (Elastic Container Registry) for Docker images
  • Application Load Balancer
  • MongoDB Atlas for production databases
  • ElastiCache for Redis
  • CloudWatch for logging
  • New Relic for monitoring

Data Flow Examples

Example 1: User Says "Tell Me a Joke"

  1. Robot → Hub: LISTEN message with ASR config
  2. Robot → Hub: Audio stream
  3. Hub: Streams audio to Google Cloud Speech API
  4. Hub: Detects SOS, emits SOS message
  5. Hub: Detects EOS, emits EOS message
  6. Robot → Hub: CONTEXT message (runtime state)
  7. Hub → Parser: POST /v1/parse with text "tell me a joke"
  8. Parser → Dialogflow: Query with "joke" intent rules
  9. Dialogflow → Parser: Intent="joke_tell", entities={}
  10. Parser → Hub: NLU result
  11. Hub → IntentRouter: Match intent to "joke-skill"
  12. Hub → joke-skill: POST LISTEN_LAUNCH request
  13. joke-skill: Executes graph, selects joke
  14. joke-skill → Hub: SKILL_ACTION with JCP behavior (SayText)
  15. Hub → Robot: SKILL_ACTION message
  16. Robot: Executes behavior, speaks joke
  17. Robot → Hub: CMD_RESULT with action result
  18. Hub → joke-skill: POST LISTEN_UPDATE request
  19. joke-skill: Returns final=true
  20. Hub → Robot: Final SKILL_ACTION

Example 2: Proactive Greeting

  1. Robot: Detects person entering room
  2. Robot → Hub: TRIGGER message with trigger data
  3. Robot → Hub: CONTEXT message (runtime state)
  4. Hub: Queries all proactive skill configs
  5. Hub: Filters by context (time, people present)
  6. Hub: Filters by history (last greeting time)
  7. Hub: Filters by settings (user greeting preference)
  8. Hub: Selects "greeting-skill"
  9. Hub → greeting-skill: POST PROACTIVE_LAUNCH request
  10. greeting-skill → Hub: SKILL_ACTION with greeting behavior
  11. Hub → Robot: PROACTIVE response with match
  12. Hub → Robot: SKILL_ACTION message
  13. Robot: Executes greeting

Error Handling

Error Types

Hub Error Codes (HubErrorCode.ts):

  • TIMEOUT_ASR - ASR timeout
  • TIMEOUT_PARSER - Parser timeout
  • TIMEOUT_CONTEXT - Context timeout
  • TIMEOUT_SKILL - Skill timeout
  • PARSER - Parser error
  • ASR - ASR error

Skill Request Errors (SkillRequestError):

  • SKILL_NOT_FOUND - Skill does not exist
  • TIMEOUT - Skill request timeout

Error Response Format

{
  "type": "ERROR",
  "msgID": "uuid",
  "ts": 1234567890,
  "final": true,
  "data": {
    "message": "Error description",
    "code": "ERROR_CODE"
  },
  "timings": {
    "total": 1234
  }
}
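
A small builder keeps error responses consistent with this format. The sketch below is an assumption-laden illustration: `makeErrorResponse` is a hypothetical helper, the code list is the Hub error codes named above, and `timings.total` is assumed to be elapsed milliseconds since the transaction started.

```typescript
type HubErrorCode =
  | "TIMEOUT_ASR"
  | "TIMEOUT_PARSER"
  | "TIMEOUT_CONTEXT"
  | "TIMEOUT_SKILL"
  | "PARSER"
  | "ASR";

interface ErrorResponse {
  type: "ERROR";
  msgID: string;
  ts: number;
  final: boolean;
  data: { message: string; code: HubErrorCode };
  timings: { total: number };
}

// Builds an error response; timings.total is assumed to be elapsed milliseconds.
function makeErrorResponse(
  code: HubErrorCode,
  message: string,
  msgID: string,
  startedAtMs: number,
  nowMs: number,
): ErrorResponse {
  return {
    type: "ERROR",
    msgID,
    ts: nowMs,
    final: true, // an error always ends the transaction in this sketch
    data: { message, code },
    timings: { total: nowMs - startedAtMs },
  };
}
```

Passing the clock values in, rather than reading the current time inside the function, keeps the builder deterministic and easy to test.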

Timeout Handling

  • ASR: 40 seconds (configurable via sosTimeout, maxSpeechTimeout)
  • Parser: 10 seconds
  • Context: 5 seconds
  • Skill: 10 seconds
  • Transaction: 60 seconds (configurable)
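
These budgets can be expressed as a table plus a pure check function. The sketch below is one possible shape, not the Hub's implementation: `STAGE_TIMEOUT_MS`, `checkTimeout`, and the `TimeoutCheck` result type are hypothetical, and pairing each stage with the matching `TIMEOUT_*` code is inferred from the error code names.

```typescript
type Stage = "ASR" | "PARSER" | "CONTEXT" | "SKILL";

// Default budgets from the list above, in milliseconds.
const STAGE_TIMEOUT_MS: Record<Stage, number> = {
  ASR: 40000,
  PARSER: 10000,
  CONTEXT: 5000,
  SKILL: 10000,
};
const TRANSACTION_TIMEOUT_MS = 60000;

type TimeoutCheck =
  | { kind: "ok" }
  | { kind: "stage"; code: string } // code is TIMEOUT_<stage>, an inferred pairing
  | { kind: "transaction" };

// Reports whether the current stage or the whole transaction has run out of time.
function checkTimeout(
  stage: Stage,
  stageStartedMs: number,
  transactionStartedMs: number,
  nowMs: number,
): TimeoutCheck {
  if (nowMs - stageStartedMs >= STAGE_TIMEOUT_MS[stage]) {
    return { kind: "stage", code: "TIMEOUT_" + stage };
  }
  if (nowMs - transactionStartedMs >= TRANSACTION_TIMEOUT_MS) {
    return { kind: "transaction" };
  }
  return { kind: "ok" };
}
```

The stage check runs first so that the more specific error code is reported when both limits are exceeded at once.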

Monitoring & Logging

Logging

jibo-log Integration:

  • Per-namespace log levels
  • Transaction ID correlation
  • Robot ID tracking
  • Structured logging support

Log Levels:

  • Configured via x-jibo-logging-config header
  • Per-namespace granularity
  • Environment variable: ETCO_server_logLevel

Monitoring

New Relic:

  • HTTP request tracking
  • WebSocket transaction tracking
  • Error tracking
  • Custom attributes (transID, robotID)

Health Checks:

  • /healthcheck endpoint on all services
  • Returns service-specific health data
  • Database connection status

Speech History Recording

Optional Features:

  • Record skill launches to MongoDB
  • Record speech interactions to MongoDB
  • Upload speech logs to S3 (JSON with audio base64)

Configuration:

  • ETCO_hub_recordLaunchHistory - Enable launch history
  • ETCO_hub_recordSpeechHistory - Enable speech history
  • ETCO_hub_recordSpeechLogBucket - S3 bucket for speech logs

Skill Development Guide

Creating a New Skill

  1. Extend GraphSkill:
export class MySkill extends GraphSkill<Transition> {
  constructor() {
    super('my-skill');
  }

  createGraph(): Graph<Transition> {
    const g = new Graph('My Skill', generateTransitions<Transition>(Transition));
    // Add nodes and transitions
    g.finalize();
    return g;
  }
}
  2. Define Transitions:
enum Transition {
  Done = 'Done',
  Retry = 'Retry'
}
  3. Create Nodes:
class MyNode extends Node<Transition> {
  async enter(data: Data): Promise<EnterResponse> {
    // Return action or redirect
    return { action: myJCPAction };
  }

  async exit(data: Data): Promise<ExitResponse> {
    // Return next transition
    return { transition: Transition.Done };
  }
}
  4. Create Skill Manifest:
{
  "id": "my-skill",
  "intents": [
    {
      "name": "my_intent",
      "entities": []
    }
  ],
  "onRobot": false
}
  5. Register with Hub:
  • Add skill config to skills-local.json or environment
  • Deploy skill service
  • Hub will load configuration

Skill Best Practices

  • Use graph for complex flows, direct responses for simple ones
  • Track analytics events for monitoring
  • Handle errors gracefully with try-catch
  • Use supplemental behaviors for parallel actions
  • Set appropriate timeouts
  • Log important events
  • Test with both LISTEN_LAUNCH and PROACTIVE_LAUNCH

Key Design Decisions

Why Graph-Based Skills?

  • State Management: Explicit state machine with session tracking
  • Visualization: GraphViz generation for debugging
  • Reusability: Subgraphs for common patterns
  • Testability: Isolated node testing
  • Maintainability: Clear flow structure

Why WebSocket for Robot Communication?

  • Low Latency: Real-time bidirectional communication
  • Audio Streaming: Binary message support for audio
  • Stateful: Single connection per transaction
  • Efficiency: No HTTP overhead for each message

Why Separate Services?

  • Scalability: Scale each service independently
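  • Isolation: A failure in one service is contained to the requests that depend on it (for example, a Parser failure surfaces as a PARSER or TIMEOUT_PARSER error rather than stopping the Hub)
  • Technology: Services could adopt different tech stacks, although all current packages use Node.js/TypeScript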
  • Deployment: Independent deployment cycles

Why Lerna Monorepo?

  • Code Sharing: Easy to share interfaces and utils
  • Versioning: Linked versioning for interdependent packages
  • Development: Single repository for all services
  • Testing: Integration tests across packages

Limitations & Known Issues

  1. Single Graph Manager: Skills cannot have concurrent sessions (singleton pattern)
  2. Sequential Skill Redirects: Only one level of redirect supported
  3. No Skill-to-Skill Communication: Skills must go through hub
  4. Fixed Timeouts: Hardcoded timeouts in some places
  5. No Skill Hot-Reload: Requires container rebuild for skill changes
  6. Limited NLU: Dialogflow dependency, no custom model training
  7. No Skill Versioning: Skills identified by ID only
  8. Synchronous Skill Requests: Hub waits for skill response (no async)

Future Considerations

  1. Skill Versioning: Support multiple versions of same skill
  2. Skill-to-Skill Direct Communication: Allow skills to call each other
  3. Async Skill Responses: Long-running skills with callback pattern
  4. Custom NLU Models: Support for custom trained models
  5. Skill Hot-Reload: Dynamic skill loading without restart
  6. Multi-Session Skills: Support concurrent skill sessions
  7. Skill Marketplace: Third-party skill distribution
  8. A/B Testing: Framework for testing skill variations

Conclusion

The original Jibo server (Pegasus) is a microservices system that handles real-time speech processing, natural language understanding, skill routing, and proactive behaviors for the Jibo robot. Its graph-based skill framework gives each skill an explicit, testable flow structure, and the split into separate services lets each one be scaled and deployed independently. The limitations listed above, notably the singleton graph manager and synchronous skill requests, are the main constraints to weigh when building on this design.