first stab at solving for menus and real STT

This commit is contained in:
Jacob Dubin
2026-04-16 15:40:28 -05:00
parent efe4dfd04e
commit fe1e11653f
19 changed files with 799 additions and 19 deletions

View File

@@ -108,6 +108,8 @@ The current websocket bridge now also includes server-driven raw-audio turn comp
The current richer websocket parity slice is still intentionally narrow:
- the successful joke path now has fixture-backed reply sequencing and partial payload-shape fidelity through `CLIENT_ASR -> LISTEN -> EOS -> delayed SKILL_ACTION`
- menu-side `CLIENT_NLU` parity is beginning to expand from live captures, starting with preserved clock-menu intent/rules/entities
- `.NET` now preserves buffered websocket audio frames so local tool-based ASR experiments can run without changing the stable cloud-first architecture
- this is not a claim of broad skill parity or full Jibo websocket coverage
## Important Docs

View File

@@ -55,6 +55,20 @@ Right now the strongest implemented vertical slice beyond basic listen completio
That should remain the model for future websocket work: capture first, fixture second, parity third.
The latest live captures also support a second discovery track:
- menu-driven `CLIENT_NLU` parity for clock, timer, and alarm flows
- richer transcript-bearing `CLIENT_ASR` discovery beyond jokes
- buffered-audio preservation for eventual real ASR in `.NET`
Near-term ASR work should stay staged:
1. preserve and replay the websocket audio payloads honestly
2. validate a local tool-based decode/transcribe loop in `.NET`
3. compare that against Azure-hosted STT before choosing a default production path
That keeps Node as the reverse-engineering oracle while letting the long-term `.NET` cloud gain real STT seams without pretending they are finished.
## Speech, Animation, And ESML
The current joke flow is only a small foothold into Jibo expressiveness.

View File

@@ -108,6 +108,65 @@ What remains intentionally unclaimed for that slice:
- whether additional websocket messages appear in other successful skill paths
- whether any timing gaps besides the observed 75 ms `EOS -> SKILL_ACTION` delay matter
### Latest Live Capture Additions From April 16, 2026
The newest repo-root websocket capture at [captures/websocket/20260416.events.ndjson](captures/websocket/20260416.events.ndjson) adds more grounded websocket discovery without implying broad protocol coverage.
Observed `CLIENT_ASR` transcript-bearing turns now include:
- `tell me a joke`
- `do a dance`
- `surprise me`
- `personal report`
- `tell me about the weather`
- `tell me about my calendar`
- `what does my commute look like`
- `tell me about the news`
Observed menu-driven `CLIENT_NLU` intents now include:
- `loadMenu`
- `askForTime`
- `askForDate`
- `start`
- `timerValue`
- `set`
- `alarmValue`
Observed entity/rule shapes from those menu flows include:
- `askForTime` with `entities.domain = "clock"` and `rules = ["clock/clock_menu"]`
- `askForDate` with the same `clock` menu rule family
- `timerValue` with timer duration entities
- `alarmValue` with alarm time entities such as `ampm` and `time`
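As a concrete illustration, the `askForTime` shape above corresponds to an inbound message like the following sketch. Only the intent/rules/entities structure is taken from the captures; the `transID` value here is a placeholder, not an observed one:

```json
{
  "type": "CLIENT_NLU",
  "transID": "example-trans-clock-time",
  "data": {
    "intent": "askForTime",
    "rules": ["clock/clock_menu"],
    "entities": { "domain": "clock" }
  }
}
```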
Current `.NET` parity for that new slice is still intentionally partial:
- menu-side `CLIENT_NLU` replies now preserve the observed inbound intent/rules/entities in the synthetic outbound `LISTEN` payload
- `askForTime` and `askForDate` are now fixture-backed as mapped menu intents
- `do a dance` is now recognized as a distinct chat/dance intent in the current synthetic path
Still unknown:
- whether `surprise me`, `personal report`, weather, calendar, commute, and news should map to richer skill-specific websocket payloads
- whether menu-side clock/timer/alarm flows require additional websocket messages beyond the currently observed `LISTEN` and `EOS`
- how much of each flow is actually completed robot-side versus merely acknowledged by the cloud
### Buffered Audio / ASR Direction
The `.NET` hosted implementation now has two STT lanes:
- existing synthetic transcript-hint replay for fixture-driven parity work
- a new opt-in local buffered-audio path that preserves websocket Ogg/Opus frames and can invoke external `ffmpeg` plus `whisper.cpp`
That local tool-based path is intentionally experimental and disabled by default. Its purpose is to let us iterate on real buffered-audio decoding in `.NET` without changing the stable cloud-first architecture or claiming production ASR parity yet.
Future provider options still under consideration:
- local decode/transcribe in `.NET` using preserved websocket audio plus external tools
- Azure Speech as a hosted STT option for the long-term cloud path
- direct managed Opus decode later if a library proves stable enough for the hosted deployment target
Current raw-audio fallback behavior remains explicitly synthetic:
- when a buffered-audio turn can be resolved through the synthetic transcript-hint seam, `.NET` now auto-finalizes and emits `LISTEN` + `EOS` + `SKILL_ACTION`
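Sketched as a reply sequence, the synthetic auto-finalize path emits something like the following. Only the message types and their ordering come from the observed flow; the `transID` is a placeholder and the `LISTEN`/`SKILL_ACTION` payload bodies are elided here:

```json
[
  { "type": "LISTEN", "transID": "example-trans" },
  { "type": "EOS", "transID": "example-trans", "data": {} },
  { "type": "SKILL_ACTION", "transID": "example-trans" }
]
```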

View File

@@ -108,3 +108,25 @@ Current raw-audio behavior is still a compatibility bridge:
- if buffered audio has a synthetic transcript hint, the server now auto-finalizes the turn and emits `LISTEN` + `EOS` + `SKILL_ACTION`
- if buffered audio crosses the finalize threshold without a usable transcript, the server now emits a Node-style fallback completion with `EOS` instead of hanging the turn forever
- this is intentionally not a claim of real ASR parity
## Buffered Audio STT
The current `.NET` websocket stack now preserves buffered Ogg/Opus websocket frames in memory for each in-flight turn.
That enables two distinct STT paths:
- fixture-oriented synthetic transcript hints for replay and parity tests
- an opt-in local tool-based path that can normalize the buffered Ogg pages, call `ffmpeg`, and then call `whisper.cpp`
The local tool path is intentionally off by default. It exists to help map real robot audio behavior while the stable hosted cloud remains the primary goal.
Configuration lives under `OpenJibo:Stt`:
- `EnableLocalWhisperCpp`
- `FfmpegPath`
- `WhisperCliPath`
- `WhisperModelPath`
- `WhisperLanguage`
- `TempDirectory`
This is not yet a claim of production-ready onboard ASR. It is a `.NET` discovery seam that keeps us compatible with the Node oracle while we evaluate longer-term options such as Azure-hosted STT or a managed decode/transcribe stack.
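For reference, a minimal `appsettings.json` sketch of that section might look like this. The option names, the off-by-default flag, and the `"en"` language default come from the current code; every path value below is an illustrative placeholder, not a shipped default:

```json
{
  "OpenJibo": {
    "Stt": {
      "EnableLocalWhisperCpp": false,
      "FfmpegPath": "/usr/bin/ffmpeg",
      "WhisperCliPath": "/opt/whisper.cpp/whisper-cli",
      "WhisperModelPath": "/opt/whisper.cpp/models/ggml-base.en.bin",
      "WhisperLanguage": "en",
      "TempDirectory": "/tmp/openjibo-stt"
    }
  }
}
```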

View File

@@ -8,22 +8,30 @@ public sealed class DemoConversationBroker : IConversationBroker
{
var transcript = (turn.NormalizedTranscript ?? turn.RawTranscript ?? string.Empty).Trim();
var lowered = transcript.ToLowerInvariant();
var clientIntent = turn.Attributes.TryGetValue("clientIntent", out var rawClientIntent)
? rawClientIntent?.ToString()
: null;
var semanticIntent = ResolveSemanticIntent(lowered, clientIntent);
var reply = semanticIntent switch
{
"time" => $"It is {DateTime.Now:hh:mm tt}.",
"date" => $"Today is {DateTime.Now:dddd, MMMM d}.",
"dance" => "Okay. Watch this.",
_ => transcript.Length == 0
? "I am listening."
: lowered.Contains("hello") || lowered.Contains("hi")
? "Hello from the OpenJibo cloud."
: lowered.Contains("joke")
? "Why did the robot bring a ladder? Because it wanted to reach the cloud."
: $"I heard: {transcript}"
};
var plan = new ResponsePlan
{
SessionId = turn.SessionId,
Status = ResponseStatus.Succeeded,
IntentName = semanticIntent,
Topic = "conversation",
DeviceId = turn.DeviceId,
TargetHost = turn.HostName,
@@ -72,4 +80,39 @@ public sealed class DemoConversationBroker : IConversationBroker
return Task.FromResult(plan);
}
private static string ResolveSemanticIntent(string loweredTranscript, string? clientIntent)
{
if (string.Equals(clientIntent, "askForTime", StringComparison.OrdinalIgnoreCase))
{
return "time";
}
if (string.Equals(clientIntent, "askForDate", StringComparison.OrdinalIgnoreCase))
{
return "date";
}
if (loweredTranscript.Contains("joke", StringComparison.Ordinal))
{
return "joke";
}
if (loweredTranscript.Contains("dance", StringComparison.Ordinal))
{
return "dance";
}
if (loweredTranscript.Contains("time", StringComparison.Ordinal))
{
return "time";
}
if (loweredTranscript.Contains("date", StringComparison.Ordinal) || loweredTranscript.Contains("day", StringComparison.Ordinal))
{
return "date";
}
return "chat";
}
}

View File

@@ -9,12 +9,12 @@ public sealed class ProtocolToTurnContextMapper
public TurnContext MapListenMessage(WebSocketMessageEnvelope envelope, CloudSession session, string messageType)
{
var turnState = session.TurnState;
var protocolOperation = messageType.ToLowerInvariant();
var attributes = new Dictionary<string, object?>(StringComparer.OrdinalIgnoreCase)
{
["messageType"] = messageType
};
var text = ExtractTranscript(envelope.Text, attributes);
if (!string.IsNullOrWhiteSpace(turnState.TransId))
{
@@ -35,6 +35,7 @@ public sealed class ProtocolToTurnContextMapper
{
attributes["bufferedAudioBytes"] = turnState.BufferedAudioBytes;
attributes["bufferedAudioChunks"] = turnState.BufferedAudioChunkCount;
attributes["bufferedAudioFrames"] = turnState.BufferedAudioFrames.Select(frame => frame.ToArray()).ToArray();
}
if (!string.IsNullOrWhiteSpace(turnState.AudioTranscriptHint))
@@ -66,7 +67,7 @@ public sealed class ProtocolToTurnContextMapper
};
}
private static string? ExtractTranscript(string? text, IDictionary<string, object?> attributes)
{
if (string.IsNullOrWhiteSpace(text))
{
@@ -99,6 +100,25 @@ public sealed class ProtocolToTurnContextMapper
}
if (data.TryGetProperty("intent", out var intent) && intent.ValueKind == JsonValueKind.String)
{
attributes["clientIntent"] = intent.GetString();
}
if (data.TryGetProperty("rules", out var rules) && rules.ValueKind == JsonValueKind.Array)
{
attributes["clientRules"] = rules.EnumerateArray()
.Where(item => item.ValueKind == JsonValueKind.String)
.Select(item => item.GetString() ?? string.Empty)
.Where(rule => !string.IsNullOrWhiteSpace(rule))
.ToArray();
}
if (data.TryGetProperty("entities", out var entities) && entities.ValueKind == JsonValueKind.Object)
{
attributes["clientEntities"] = entities.Clone();
}
if (intent.ValueKind == JsonValueKind.String)
{
return intent.GetString();
}

View File

@@ -10,11 +10,20 @@ public sealed class ResponsePlanToSocketMessagesMapper
{
var speak = plan.Actions.OfType<SpeakAction>().FirstOrDefault();
var skill = plan.Actions.OfType<InvokeNativeSkillAction>().FirstOrDefault();
var messageType = ReadAttribute(turn, "messageType");
var transId = turn.Attributes.TryGetValue("transID", out var transIdValue)
? transIdValue?.ToString() ?? string.Empty
: session.LastTransId ?? string.Empty;
var transcript = turn.NormalizedTranscript ?? turn.RawTranscript ?? string.Empty;
var clientIntent = ReadAttribute(turn, "clientIntent");
var rules = ReadRules(turn, messageType);
var outboundIntent = string.Equals(messageType, "CLIENT_NLU", StringComparison.OrdinalIgnoreCase) && !string.IsNullOrWhiteSpace(clientIntent)
? clientIntent!
: plan.IntentName ?? "unknown";
var outboundAsrText = string.Equals(messageType, "CLIENT_NLU", StringComparison.OrdinalIgnoreCase) && !string.IsNullOrWhiteSpace(clientIntent)
? clientIntent!
: transcript;
var entities = ReadEntities(turn, messageType);
var messages = new List<SocketReplyPlan>();
messages.Add(new SocketReplyPlan(JsonSerializer.Serialize(new
@@ -27,18 +36,18 @@ public sealed class ResponsePlanToSocketMessagesMapper
{
confidence = 0.95,
final = true,
text = outboundAsrText
},
nlu = new
{
confidence = 0.95,
intent = outboundIntent,
rules,
entities
},
match = new
{
intent = outboundIntent,
rule = rules.FirstOrDefault() ?? string.Empty,
score = 0.95
}
@@ -107,9 +116,13 @@ public sealed class ResponsePlanToSocketMessagesMapper
];
}
private static IReadOnlyList<string> ReadRules(TurnContext turn, string? messageType)
{
var attributeName = string.Equals(messageType, "CLIENT_NLU", StringComparison.OrdinalIgnoreCase)
? "clientRules"
: "listenRules";
if (!turn.Attributes.TryGetValue(attributeName, out var value))
{
return [];
}
@@ -122,12 +135,42 @@ public sealed class ResponsePlanToSocketMessagesMapper
};
}
private static object ReadEntities(TurnContext turn, string? messageType)
{
if (!string.Equals(messageType, "CLIENT_NLU", StringComparison.OrdinalIgnoreCase))
{
return new Dictionary<string, object?>();
}
if (!turn.Attributes.TryGetValue("clientEntities", out var value) || value is null)
{
return new Dictionary<string, object?>();
}
return value switch
{
JsonElement jsonElement when jsonElement.ValueKind == JsonValueKind.Object => jsonElement,
IDictionary<string, object?> dictionary => dictionary,
_ => new Dictionary<string, object?>()
};
}
private static string? ReadAttribute(TurnContext turn, string key)
{
return turn.Attributes.TryGetValue(key, out var value)
? value?.ToString()
: null;
}
private static object BuildSkillPayload(ResponsePlan plan, TurnContext turn, string transId, SpeakAction speak, InvokeNativeSkillAction? skill)
{
var isJoke = string.Equals(plan.IntentName, "joke", StringComparison.OrdinalIgnoreCase) ||
string.Equals(skill?.SkillName, "@be/joke", StringComparison.OrdinalIgnoreCase);
var isDance = string.Equals(plan.IntentName, "dance", StringComparison.OrdinalIgnoreCase);
var skillId = isJoke ? "@be/joke" : skill?.SkillName ?? "chitchat-skill";
var esml = isDance
? "<speak>Okay.<break size='0.2'/> Watch this.<anim cat='dance' filter='music, rom-upbeat' /></speak>"
: isJoke
? $"<speak><es cat='happy' filter='!ssa-only, !sfx-only' endNeutral='true'>{EscapeXml(speak.Text)}</es></speak>"
: $"<speak><es cat='neutral' filter='!ssa-only, !sfx-only' endNeutral='true'>{EscapeXml(speak.Text)}</es></speak>";
var mimId = isJoke ? "runtime-joke" : "runtime-chat";

View File

@@ -23,6 +23,10 @@ public sealed class WebSocketTurnFinalizationService(
turnState.FirstAudioReceivedUtc ??= DateTimeOffset.UtcNow;
turnState.BufferedAudioChunkCount += 1;
turnState.BufferedAudioBytes += envelope.Binary?.Length ?? 0;
if (envelope.Binary is { Length: > 0 })
{
turnState.BufferedAudioFrames.Add(envelope.Binary.ToArray());
}
turnState.LastAudioReceivedUtc = DateTimeOffset.UtcNow;
turnState.AwaitingTurnCompletion = true;
session.Metadata["lastAudioBytes"] = envelope.Binary?.Length ?? 0;
@@ -223,6 +227,7 @@ public sealed class WebSocketTurnFinalizationService(
session.TurnState.BufferedAudioChunkCount = 0;
session.TurnState.FirstAudioReceivedUtc = null;
session.TurnState.LastAudioReceivedUtc = null;
session.TurnState.BufferedAudioFrames.Clear();
session.TurnState.FinalizeAttemptCount = 0;
session.Metadata.Remove("audioTranscriptHint");
}
@@ -236,6 +241,7 @@ public sealed class WebSocketTurnFinalizationService(
turnState.LastAudioReceivedUtc = null;
turnState.BufferedAudioChunkCount = 0;
turnState.BufferedAudioBytes = 0;
turnState.BufferedAudioFrames.Clear();
turnState.FinalizeAttemptCount = 0;
turnState.AwaitingTurnCompletion = false;
turnState.SawListen = false;

View File

@@ -9,6 +9,7 @@ public sealed class WebSocketTurnState
public DateTimeOffset? LastAudioReceivedUtc { get; set; }
public int BufferedAudioChunkCount { get; set; }
public int BufferedAudioBytes { get; set; }
public List<byte[]> BufferedAudioFrames { get; } = [];
public int FinalizeAttemptCount { get; set; }
public bool AwaitingTurnCompletion { get; set; }
public bool SawListen { get; set; }

View File

@@ -0,0 +1,11 @@
namespace Jibo.Cloud.Infrastructure.Audio;
public sealed class BufferedAudioSttOptions
{
public bool EnableLocalWhisperCpp { get; set; }
public string? FfmpegPath { get; set; }
public string? WhisperCliPath { get; set; }
public string? WhisperModelPath { get; set; }
public string WhisperLanguage { get; set; } = "en";
public string? TempDirectory { get; set; }
}

View File

@@ -0,0 +1,42 @@
using System.Diagnostics;
namespace Jibo.Cloud.Infrastructure.Audio;
public sealed class ExternalProcessRunner : IExternalProcessRunner
{
public async Task<ExternalProcessResult> RunAsync(string fileName, IReadOnlyList<string> arguments, CancellationToken cancellationToken = default)
{
using var process = new Process
{
StartInfo = new ProcessStartInfo
{
FileName = fileName,
RedirectStandardOutput = true,
RedirectStandardError = true,
UseShellExecute = false,
CreateNoWindow = true
}
};
foreach (var argument in arguments)
{
process.StartInfo.ArgumentList.Add(argument);
}
process.Start();
var stdOutTask = process.StandardOutput.ReadToEndAsync(cancellationToken);
var stdErrTask = process.StandardError.ReadToEndAsync(cancellationToken);
await process.WaitForExitAsync(cancellationToken);
var stdOut = await stdOutTask;
var stdErr = await stdErrTask;
if (process.ExitCode != 0)
{
throw new InvalidOperationException($"External process '{fileName}' failed with exit code {process.ExitCode}: {stdErr}");
}
return new ExternalProcessResult(process.ExitCode, stdOut, stdErr);
}
}

View File

@@ -0,0 +1,8 @@
namespace Jibo.Cloud.Infrastructure.Audio;
public interface IExternalProcessRunner
{
Task<ExternalProcessResult> RunAsync(string fileName, IReadOnlyList<string> arguments, CancellationToken cancellationToken = default);
}
public sealed record ExternalProcessResult(int ExitCode, string StdOut, string StdErr);

View File

@@ -0,0 +1,153 @@
using System.Text.Json;
using Jibo.Runtime.Abstractions;
namespace Jibo.Cloud.Infrastructure.Audio;
public sealed class LocalWhisperCppBufferedAudioSttStrategy(
BufferedAudioSttOptions options,
IExternalProcessRunner processRunner) : ISttStrategy
{
public string Name => "local-whispercpp-buffered-audio";
public bool CanHandle(TurnContext turn)
{
return options.EnableLocalWhisperCpp &&
!string.IsNullOrWhiteSpace(options.FfmpegPath) &&
!string.IsNullOrWhiteSpace(options.WhisperCliPath) &&
!string.IsNullOrWhiteSpace(options.WhisperModelPath) &&
ReadBufferedAudioFrames(turn).Count > 0;
}
public async Task<SttResult> TranscribeAsync(TurnContext turn, CancellationToken cancellationToken = default)
{
var frames = ReadBufferedAudioFrames(turn);
if (frames.Count == 0)
{
throw new InvalidOperationException("Local whisper.cpp STT requires buffered websocket audio frames.");
}
var tempDirectory = options.TempDirectory;
if (string.IsNullOrWhiteSpace(tempDirectory))
{
tempDirectory = Path.Combine(Path.GetTempPath(), "openjibo-stt");
}
Directory.CreateDirectory(tempDirectory);
var baseName = $"turn-{turn.TurnId}";
var oggPath = Path.Combine(tempDirectory, $"{baseName}.ogg");
var wavPath = Path.Combine(tempDirectory, $"{baseName}.wav");
try
{
await File.WriteAllBytesAsync(oggPath, OggOpusAudioNormalizer.Normalize(frames), cancellationToken);
await processRunner.RunAsync(
options.FfmpegPath!,
["-y", "-i", oggPath, "-ar", "16000", "-ac", "1", "-f", "wav", wavPath],
cancellationToken);
var whisperResult = await processRunner.RunAsync(
options.WhisperCliPath!,
["-m", options.WhisperModelPath!, "-f", wavPath, "-l", options.WhisperLanguage],
cancellationToken);
var transcript = ExtractTranscript(whisperResult.StdOut);
if (string.IsNullOrWhiteSpace(transcript))
{
throw new InvalidOperationException("whisper.cpp returned no transcript for the buffered audio turn.");
}
return new SttResult
{
Text = transcript,
Provider = Name,
Locale = turn.Locale,
Metadata = new Dictionary<string, object?>(StringComparer.OrdinalIgnoreCase)
{
["bufferedAudioBytes"] = ReadBufferedAudioBytes(turn),
["bufferedAudioChunks"] = frames.Count,
["ffmpegPath"] = options.FfmpegPath,
["whisperCliPath"] = options.WhisperCliPath,
["wavPath"] = wavPath
}
};
}
finally
{
TryDelete(oggPath);
TryDelete(wavPath);
}
}
private static IReadOnlyList<byte[]> ReadBufferedAudioFrames(TurnContext turn)
{
if (!turn.Attributes.TryGetValue("bufferedAudioFrames", out var value) || value is null)
{
return [];
}
return value switch
{
byte[][] jagged => jagged,
IReadOnlyList<byte[]> typed => typed,
IEnumerable<byte[]> enumerable => enumerable.ToArray(),
JsonElement jsonElement when jsonElement.ValueKind == JsonValueKind.Array => jsonElement.EnumerateArray()
.Where(static item => item.ValueKind == JsonValueKind.Array)
.Select(static item => item.EnumerateArray().Select(static b => (byte)b.GetInt32()).ToArray())
.ToArray(),
_ => []
};
}
private static int ReadBufferedAudioBytes(TurnContext turn)
{
return turn.Attributes.TryGetValue("bufferedAudioBytes", out var bufferedAudioBytes) && bufferedAudioBytes is not null
? bufferedAudioBytes switch
{
int value => value,
long value => (int)value,
string value when int.TryParse(value, out var parsed) => parsed,
_ => 0
}
: 0;
}
private static string ExtractTranscript(string standardOutput)
{
var lines = standardOutput
.Split(['\r', '\n'], StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
var timecoded = lines
.Where(static line => line.StartsWith("[", StringComparison.Ordinal) && line.Contains("-->", StringComparison.Ordinal))
.Select(static line =>
{
var closingBracket = line.IndexOf(']');
return closingBracket >= 0 ? line[(closingBracket + 1)..].Trim() : line.Trim();
})
.Where(static line => !string.IsNullOrWhiteSpace(line))
.ToArray();
if (timecoded.Length > 0)
{
return string.Join(" ", timecoded).Trim();
}
return string.Join(" ", lines).Trim();
}
private static void TryDelete(string path)
{
try
{
if (File.Exists(path))
{
File.Delete(path);
}
}
catch
{
// Best-effort cleanup only.
}
}
}

View File

@@ -0,0 +1,114 @@
using System.Buffers.Binary;
using System.Text;
namespace Jibo.Cloud.Infrastructure.Audio;
internal static class OggOpusAudioNormalizer
{
private static readonly uint[] CrcTable = BuildCrcTable();
public static byte[] Normalize(IReadOnlyList<byte[]> pages)
{
if (pages.Count == 0)
{
return [];
}
var parsed = pages.Select(ParsePage).ToArray();
var baseGranule = parsed.Length > 1 ? parsed[1].GranulePosition : parsed[0].GranulePosition;
var normalized = new List<byte[]>(pages.Count);
for (var index = 0; index < pages.Count; index += 1)
{
var output = pages[index].ToArray();
var parsedPage = parsed[index];
var newGranule = index >= 1 && parsedPage.GranulePosition >= baseGranule
? parsedPage.GranulePosition - baseGranule
: 0UL;
BinaryPrimitives.WriteUInt64LittleEndian(output.AsSpan(6, 8), newGranule);
BinaryPrimitives.WriteUInt32LittleEndian(output.AsSpan(18, 4), (uint)index);
var headerType = output[5];
output[5] = index == pages.Count - 1
? (byte)(headerType | 0x04)
: (byte)(headerType & ~0x04);
output[22] = 0;
output[23] = 0;
output[24] = 0;
output[25] = 0;
BinaryPrimitives.WriteUInt32LittleEndian(output.AsSpan(22, 4), ComputeCrc(output));
normalized.Add(output);
}
return normalized.SelectMany(static page => page).ToArray();
}
private static ParsedOggPage ParsePage(byte[] buffer)
{
if (buffer.Length < 27)
{
throw new InvalidOperationException($"Buffered Ogg page is too short ({buffer.Length} bytes).");
}
if (!Encoding.ASCII.GetString(buffer, 0, 4).Equals("OggS", StringComparison.Ordinal))
{
throw new InvalidOperationException("Buffered audio frame did not begin with an OggS capture pattern.");
}
var pageSegments = buffer[26];
if (buffer.Length < 27 + pageSegments)
{
throw new InvalidOperationException("Buffered Ogg page segment table was truncated.");
}
var payloadLength = 0;
for (var index = 0; index < pageSegments; index += 1)
{
payloadLength += buffer[27 + index];
}
var expectedLength = 27 + pageSegments + payloadLength;
if (buffer.Length < expectedLength)
{
throw new InvalidOperationException("Buffered Ogg page payload was truncated.");
}
return new ParsedOggPage(BinaryPrimitives.ReadUInt64LittleEndian(buffer.AsSpan(6, 8)));
}
private static uint ComputeCrc(byte[] buffer)
{
uint crc = 0;
foreach (var value in buffer)
{
crc = (crc << 8) ^ CrcTable[((crc >> 24) ^ value) & 0xff];
}
return crc;
}
private static uint[] BuildCrcTable()
{
var table = new uint[256];
for (uint index = 0; index < table.Length; index += 1)
{
var remainder = index << 24;
for (var bit = 0; bit < 8; bit += 1)
{
remainder = (remainder & 0x80000000) != 0
? (remainder << 1) ^ 0x04c11db7
: remainder << 1;
}
table[index] = remainder;
}
return table;
}
private sealed record ParsedOggPage(ulong GranulePosition);
}

View File

@@ -1,5 +1,6 @@
using Jibo.Cloud.Application.Abstractions;
using Jibo.Cloud.Application.Services;
using Jibo.Cloud.Infrastructure.Audio;
using Jibo.Cloud.Infrastructure.Persistence;
using Jibo.Cloud.Infrastructure.Telemetry;
using Jibo.Runtime.Abstractions;
@@ -12,14 +13,19 @@ public static class ServiceCollectionExtensions
{
public static IServiceCollection AddOpenJiboCloud(this IServiceCollection services, IConfiguration? configuration = null)
{
var sttOptions = new BufferedAudioSttOptions();
if (configuration is not null)
{
services.Configure<WebSocketTelemetryOptions>(configuration.GetSection("OpenJibo:Telemetry"));
services.Configure<ProtocolTelemetryOptions>(configuration.GetSection("OpenJibo:ProtocolTelemetry"));
configuration.GetSection("OpenJibo:Stt").Bind(sttOptions);
}
services.AddSingleton(sttOptions);
services.AddSingleton<ICloudStateStore, InMemoryCloudStateStore>();
services.AddSingleton<IConversationBroker, DemoConversationBroker>();
services.AddSingleton<IExternalProcessRunner, ExternalProcessRunner>();
services.AddSingleton<ISttStrategy, LocalWhisperCppBufferedAudioSttStrategy>();
services.AddSingleton<ISttStrategy, SyntheticBufferedAudioSttStrategy>();
services.AddSingleton<ISttStrategySelector, DefaultSttStrategySelector>();
services.AddSingleton<IWebSocketTelemetrySink, FileWebSocketTelemetrySink>();

View File

@@ -12,6 +12,7 @@ Current fixture groups:
Current websocket fixture depth is uneven on purpose:
- `neo-hub-client-asr-joke.flow.json` now asserts a richer vertical slice than reply types alone. It captures the observed Node-oriented `CLIENT_ASR -> LISTEN -> EOS -> delayed SKILL_ACTION` joke turn with payload-shape expectations for `EOS` and joke `SKILL_ACTION`.
- `neo-hub-client-nlu-clock-ask-time.flow.json` captures a real menu-style `CLIENT_NLU` turn from the latest live captures and asserts that `.NET` preserves the observed NLU intent/rules/entities in the synthetic websocket reply instead of flattening everything into generic chat.
- The other websocket fixtures are still mainly sequencing fixtures. They are useful for replay and guardrails, but they should not be read as proof of broader payload parity.
Expand this folder whenever new robot traffic is captured and cleaned.

View File

@@ -0,0 +1,82 @@
{
"name": "neo-hub client nlu clock ask time flow",
"session": {
"hostName": "neo-hub.jibo.com",
"path": "/listen",
"kind": "neo-hub-listen",
"token": "fixture-clock-nlu-token"
},
"steps": [
{
"text": {
"type": "LISTEN",
"transID": "fixture-trans-clock-time",
"data": {
"lang": "en-US",
"rules": [
"clock/clock_menu",
"globals/global_commands_launch"
],
"mode": "CLIENT_NLU"
}
},
"expectedReplyTypes": [
"OPENJIBO_TURN_PENDING"
]
},
{
"text": {
"type": "CLIENT_NLU",
"transID": "fixture-trans-clock-time",
"data": {
"entities": {
"domain": "clock"
},
"intent": "askForTime",
"rules": [
"clock/clock_menu"
]
}
},
"expectedReplyTypes": [
"LISTEN",
"EOS"
],
"expectedReplies": [
{
"type": "LISTEN",
"jsonSubset": {
"type": "LISTEN",
"transID": "fixture-trans-clock-time",
"data": {
"asr": {
"text": "askForTime"
},
"nlu": {
"intent": "askForTime",
"rules": [
"clock/clock_menu"
],
"entities": {
"domain": "clock"
}
},
"match": {
"intent": "askForTime",
"rule": "clock/clock_menu"
}
}
}
},
{
"type": "EOS",
"jsonSubset": {
"type": "EOS",
"transID": "fixture-trans-clock-time",
"data": {}
}
}
]
}
]
}

View File

@@ -294,6 +294,42 @@ public sealed class JiboWebSocketServiceTests
Assert.Equal("trans-follow-up", session.LastTransId);
}
[Fact]
public async Task ClientNlu_ClockAskForTime_PreservesObservedIntentRulesAndEntities()
{
var listenReplies = await _service.HandleMessageAsync(new WebSocketMessageEnvelope
{
HostName = "neo-hub.jibo.com",
Path = "/listen",
Kind = "neo-hub-listen",
Token = "hub-clock-menu-token",
Text = """{"type":"LISTEN","transID":"trans-clock-time","data":{"lang":"en-US","rules":["clock/clock_menu","globals/global_commands_launch"],"mode":"CLIENT_NLU"}}"""
});
Assert.Single(listenReplies);
Assert.Equal("OPENJIBO_TURN_PENDING", ReadReplyType(listenReplies[0]));
var nluReplies = await _service.HandleMessageAsync(new WebSocketMessageEnvelope
{
HostName = "neo-hub.jibo.com",
Path = "/listen",
Kind = "neo-hub-listen",
Token = "hub-clock-menu-token",
Text = """{"type":"CLIENT_NLU","transID":"trans-clock-time","data":{"entities":{"domain":"clock"},"intent":"askForTime","rules":["clock/clock_menu"]}}"""
});
Assert.Equal(2, nluReplies.Count);
Assert.Equal("LISTEN", ReadReplyType(nluReplies[0]));
Assert.Equal("EOS", ReadReplyType(nluReplies[1]));
using var listenPayload = JsonDocument.Parse(nluReplies[0].Text!);
Assert.Equal("askForTime", listenPayload.RootElement.GetProperty("data").GetProperty("asr").GetProperty("text").GetString());
Assert.Equal("askForTime", listenPayload.RootElement.GetProperty("data").GetProperty("nlu").GetProperty("intent").GetString());
Assert.Equal("clock", listenPayload.RootElement.GetProperty("data").GetProperty("nlu").GetProperty("entities").GetProperty("domain").GetString());
Assert.Equal("clock/clock_menu", listenPayload.RootElement.GetProperty("data").GetProperty("nlu").GetProperty("rules")[0].GetString());
Assert.Equal("clock/clock_menu", listenPayload.RootElement.GetProperty("data").GetProperty("match").GetProperty("rule").GetString());
}
[Fact]
public async Task BufferedAudio_WithSyntheticTranscriptHint_FinalizesThroughSttSeam()
{
@@ -562,6 +598,7 @@ public sealed class JiboWebSocketServiceTests
[Theory]
[InlineData("fixtures\\neo-hub-client-asr-joke.flow.json")]
[InlineData("fixtures\\neo-hub-context-client-nlu.flow.json")]
[InlineData("fixtures\\neo-hub-client-nlu-clock-ask-time.flow.json")]
[InlineData("fixtures\\neo-hub-buffered-audio-synthetic-asr.flow.json")]
[InlineData("fixtures\\neo-hub-multichunk-audio-chat.flow.json")]
[InlineData("fixtures\\neo-hub-buffered-audio-pending.flow.json")]

View File

@@ -0,0 +1,116 @@
using Jibo.Cloud.Infrastructure.Audio;
using Jibo.Runtime.Abstractions;
namespace Jibo.Cloud.Tests.WebSockets;
public sealed class LocalWhisperCppBufferedAudioSttStrategyTests
{
[Fact]
public void CanHandle_ReturnsFalse_WhenLocalWhisperIsDisabled()
{
var strategy = new LocalWhisperCppBufferedAudioSttStrategy(
new BufferedAudioSttOptions
{
EnableLocalWhisperCpp = false,
FfmpegPath = "ffmpeg",
WhisperCliPath = "whisper-cli",
WhisperModelPath = "model.bin"
},
new FakeExternalProcessRunner());
var turn = new TurnContext
{
Attributes = new Dictionary<string, object?>
{
["bufferedAudioFrames"] = new[] { BuildMinimalOggPage() }
}
};
Assert.False(strategy.CanHandle(turn));
}
[Fact]
public async Task TranscribeAsync_UsesFfmpegAndWhisperCpp_WhenConfigured()
{
var tempDirectory = Path.Combine(Path.GetTempPath(), $"openjibo-stt-test-{Guid.NewGuid():N}");
Directory.CreateDirectory(tempDirectory);
try
{
var runner = new FakeExternalProcessRunner();
var strategy = new LocalWhisperCppBufferedAudioSttStrategy(
new BufferedAudioSttOptions
{
EnableLocalWhisperCpp = true,
FfmpegPath = "ffmpeg",
WhisperCliPath = "whisper-cli",
WhisperModelPath = "model.bin",
TempDirectory = tempDirectory
},
runner);
var turn = new TurnContext
{
TurnId = "turn-local-stt",
Locale = "en-US",
Attributes = new Dictionary<string, object?>
{
["bufferedAudioBytes"] = 47,
["bufferedAudioFrames"] = new[] { BuildMinimalOggPage() }
}
};
var result = await strategy.TranscribeAsync(turn);
Assert.Equal("tell me a joke", result.Text);
Assert.Equal("local-whispercpp-buffered-audio", result.Provider);
Assert.Equal(2, runner.Calls.Count);
Assert.Equal("ffmpeg", runner.Calls[0].FileName);
Assert.Equal("whisper-cli", runner.Calls[1].FileName);
Assert.Equal(47, result.Metadata["bufferedAudioBytes"]);
}
finally
{
if (Directory.Exists(tempDirectory))
{
Directory.Delete(tempDirectory, recursive: true);
}
}
}
private static byte[] BuildMinimalOggPage()
{
return
[
0x4F, 0x67, 0x67, 0x53,
0x00,
0x02,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x01, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00,
0x01,
0x13,
0x4F, 0x70, 0x75, 0x73, 0x48, 0x65, 0x61, 0x64, 0x01, 0x01, 0x38, 0x01, 0x80, 0xBB, 0x00, 0x00, 0x00, 0x00, 0x00
];
}
private sealed class FakeExternalProcessRunner : IExternalProcessRunner
{
public List<(string FileName, IReadOnlyList<string> Arguments)> Calls { get; } = [];
public Task<ExternalProcessResult> RunAsync(string fileName, IReadOnlyList<string> arguments, CancellationToken cancellationToken = default)
{
Calls.Add((fileName, arguments));
if (string.Equals(fileName, "ffmpeg", StringComparison.OrdinalIgnoreCase))
{
var outputPath = arguments.Last();
File.WriteAllBytes(outputPath, [0x52, 0x49, 0x46, 0x46]);
return Task.FromResult(new ExternalProcessResult(0, string.Empty, string.Empty));
}
return Task.FromResult(new ExternalProcessResult(0, "[00:00:00.000 --> 00:00:01.000] tell me a joke", string.Empty));
}
}
}