Embodied Speech Markup Language (ESML) is a specialized XML-based markup language designed to control how virtual humans, avatars, or robots communicate. Unlike standard text-to-speech (TTS), which focuses only on audio, ESML "embodies" the speech by synchronizing the voice with non-verbal behaviors such as gestures, facial expressions, and posture.
It acts as a bridge between the "brain" of an AI (the text it wants to say) and the "body" of the character (how it should move while saying it).
1. Key Components of ESML
ESML allows developers to tag text with specific instructions that the animation engine interprets in real-time.
- Prosody Control: Adjusting pitch, rate, and volume to make the voice sound more human and less robotic.
- Gestural Markers: Telling the avatar exactly when to point, shrug, or nod during a sentence.
- Facial Expression Tags: Triggering emotions like <smile> or <frown> that coincide with the spoken words.
- Synchronization: Ensuring that a "pointing" gesture happens exactly when the avatar says the word "there."
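The components above can be combined in a single tagged utterance. The fragment below is a minimal sketch of what such markup might look like; the element and attribute names (`esml`, `prosody`, `gesture`, `smile`) are illustrative assumptions modeled on the description above and on SSML conventions, not a confirmed ESML schema.

```xml
<!-- Hypothetical ESML utterance; tag names are illustrative, not a confirmed schema -->
<esml>
  <!-- Prosody control: slow the rate and raise the pitch for a friendly tone -->
  <prosody rate="slow" pitch="+10%">
    Welcome! <smile/> I'm glad you're here.
  </prosody>
  <!-- Synchronization: the pointing gesture spans the word "there" -->
  Your seat is right
  <gesture type="point" target="seat-3">there</gesture>.
</esml>
```

An animation engine consuming markup like this would align the gesture's start and end times with the audio timestamps of the enclosed words, so the point lands exactly on "there."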
Warning
The above explanation is AI-generated. Learn more at: ESML-SDK.pdf