EMMA (Extensible MultiModal Annotation) is a standard developed by the W3C Multimodal Interaction Working Group. Its primary purpose is to provide an interoperable, XML-based way to represent user inputs, especially inputs from different modalities in a multimodal application.
These examples illustrate EMMA 2.0, which, unlike EMMA 1.0, was not finalized as a W3C Recommendation. However, EMMA 2.0 has more capabilities than EMMA 1.0, especially the ability to represent system outputs.
This page lets you put together some of the basic information that might go into an EMMA document and see the resulting EMMA markup. You can enter whatever tokens and slot values you like and see how they would be formatted in EMMA.
Be aware that there are many other optional EMMA elements and attributes that aren't shown here. Among the features not illustrated are grouping multiple semantically related inputs from different modalities into a single input ("group"), representing additional, application-specific metadata ("info"), representing alternative interpretations, and representing the processing history of a single input ("derivation" and "derived-from").
Let's say the user is ordering a pizza by speaking their request. A speech recognizer recognizes the words they said, and those words are represented in EMMA as the tokens of the input.
Tokens are included in the EMMA document as an attribute on one of several EMMA elements, probably the most common of which is "<interpretation>". This attribute is optional; not every input has tokens, especially a non-verbal one like a drawing or photograph.
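As a sketch, a minimal EMMA document carrying the recognized words as tokens might look like the following (the namespace URI is the one defined for EMMA 1.0, and the id and token values are just illustrations; later fragments omit the enclosing "<emma:emma>" element for brevity):

  <emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma">
    <emma:interpretation id="int1"
        emma:tokens="I want a large pepperoni pizza">
      <!-- the interpretation of the tokens goes here -->
    </emma:interpretation>
  </emma:emma>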
Application-specific properties and their values (this information will go into the "<interpretation>" tag).
This is the meaning of what the user said, typed, or wrote. It is the result of interpreting, or understanding, the "tokens", while possibly taking into account the context in which the speech occurred. For example, if someone is ordering a pizza, one of the properties might be "size" and a possible value would be "large".
Note that this input is not part of the EMMA namespace; the contents of this element are entirely up to the application.
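As a sketch, the pizza order might be represented inside the interpretation like this (the "pizza", "size", and "topping" elements are hypothetical, application-defined names, not part of EMMA):

  <emma:interpretation id="int1"
      emma:tokens="I want a large pepperoni pizza">
    <pizza>
      <size>large</size>
      <topping>pepperoni</topping>
    </pizza>
  </emma:interpretation>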
The confidence attribute can have a value between 0 and 1.0, where 0 means no confidence at all and 1.0 means certainty. Typical values of "confidence" will vary depending on the technology, e.g. speech recognition vs. handwriting recognition, and on the interpreting platform. That is, different speech recognizers may generate different confidences for the same input, depending on the details of their recognition algorithms. This attribute is optional.
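For example, an interpreter that is fairly sure of its result might annotate it like this (the value 0.8 is only an illustration):

  <emma:interpretation id="int1" emma:confidence="0.8">
    <!-- application-specific interpretation -->
  </emma:interpretation>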
System's confidence in the interpretation:
The signal attribute has a URI value which designates the location of the signal. This attribute is optional.
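For example, if the captured audio were stored at a (hypothetical) URI, the interpretation could point to it like this:

  <emma:interpretation id="int1"
      emma:signal="http://example.com/signals/order123.wav">
    <!-- application-specific interpretation -->
  </emma:interpretation>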
The location of the signal:
Many types of input take appreciable time, for example, speech, handwriting, or typing. The start and end timestamps provide information about when the input started and stopped, and consequently how long it took. Another attribute, "duration", is also available when the duration is of interest but the start and end timestamps are not available. There are absolute and relative timestamps, although here we only illustrate the absolute timestamps. Absolute timestamps are represented in milliseconds since January 1, 1970, which is not very human-readable, but there are widely available functions for converting this format to a human-readable form.
Starting Time: The starting time of the input would normally be provided by the speech recognizer or other EMMA generator, but for the purposes of this exercise, we will assume the starting time is "right now", or more precisely, the moment the "Display EMMA" button is clicked. Of course, components located on different platforms in a distributed application might have different clocks. Consequently, the timestamps might not be perfectly synchronized in a distributed application.
Ending time: (we will assume the ending time is "right now" + 3 seconds)
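As a sketch, an input that began at an (illustrative) absolute time and ended three seconds (3000 milliseconds) later would be annotated like this:

  <emma:interpretation id="int1"
      emma:start="1087995961542"
      emma:end="1087995964542">
    <!-- application-specific interpretation -->
  </emma:interpretation>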
The medium of the input represents whether the input was visual (via an image), acoustic (via sound), or tactile (through touch, for example, via a mouse or keyboard).
This input will go into the "medium" attribute of the "<interpretation>" tag.
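For spoken input, for example, the medium would be acoustic:

  <emma:interpretation id="int1" emma:medium="acoustic">
    <!-- application-specific interpretation -->
  </emma:interpretation>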
The mode of input provides more detail about the input. For example, an input using the visual medium might be a still image or a video.
The choices represent an open set of modes of input, so modes other than those listed here are possible. This input will go into the "mode" attribute of the "<interpretation>" tag.
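For example, spoken input in the acoustic medium would typically be annotated with the "voice" mode:

  <emma:interpretation id="int1"
      emma:medium="acoustic" emma:mode="voice">
    <!-- application-specific interpretation -->
  </emma:interpretation>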
This attribute describes the function of the input. "dialog" is input that is part of a spoken dialog; "recording" is simply a recording, for example, capturing a drawing; "verification" is input used to decide who the person is; and "transcription" is input that is meant to be converted from an analog format, such as speech or handwriting, to written text. Other functions are possible besides the ones enumerated in the spec, because the set of possible values of "function" is an open set.
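In the pizza-ordering example, where the input is a turn in a spoken dialog, the function would be "dialog":

  <emma:interpretation id="int1" emma:function="dialog">
    <!-- application-specific interpretation -->
  </emma:interpretation>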
This input is verbal
Verbal input refers to input representing some form of language, for example, speech, handwriting, typing, or sign language. This is in contrast to a non-verbal input such as a drawing or a photograph of an object.
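For example, the spoken pizza order is verbal, so it would be marked with the boolean "verbal" attribute:

  <emma:interpretation id="int1" emma:verbal="true">
    <!-- application-specific interpretation -->
  </emma:interpretation>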
This input is uninterpreted. This might be because the interpreter couldn't find an interpretation, or because the input is inherently not interpreted, such as an audio recording. If the input is uninterpreted, the "interpretation" element must be empty.
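For example, a sketch of an uninterpreted audio recording, where the interpretation element is empty and simply points to the (hypothetical) location of the signal:

  <emma:interpretation id="int1"
      emma:uninterpreted="true"
      emma:signal="http://example.com/signals/recording1.wav"/>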