Integrated Communication: The Reciprocal Voice-Caption Model

Call for Captioning
As visual communications, and video-based communications in particular, become more prevalent in our day-to-day lives, the adage "A picture is worth a thousand words" takes on new meaning. One view holds that as image and voice play larger roles in our daily experience, text will become less important and perhaps superfluous. However, it is arguable, and in fact likely, that text will remain an integral mode of communication for the foreseeable future. Text can facilitate video communication in the form of closed captioning. I propose that we work toward what I will call a reciprocal voice-caption model: a system able to transcribe the speech of multiple users and to produce speech from text.

Reciprocal Voice-Caption Model: Definition
This system could be a single integrated hardware-and-software configuration or a collection of diverse tools used in concert. The transcription function should be able to transcribe multi-participant conversations, display the text in caption format for analog or digital broadcast, and archive the text in an ASCII file that could serve a variety of purposes. The system should accommodate users in both local and remote locations and require little or no human mediation of the technology. The production function should incorporate a "read back" feature that reproduces transcribed excerpts of the conversation aloud. In addition, the system should be able to read aloud non-transcribed text that users type in. A sketch of the interfaces this definition implies follows below.
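
To make the definition concrete, the following is a minimal sketch, in Python, of the interfaces the model implies. All of the names here (CaptionLine, Recognizer, Synthesizer, VoiceCaptionSession) are hypothetical, and the recognition and synthesis backends are deliberately left abstract, since the model does not prescribe particular technologies.

    from dataclasses import dataclass, field
    from typing import List, Protocol

    @dataclass
    class CaptionLine:
        """One transcribed utterance, tagged with its speaker and time."""
        speaker: str          # participant name or channel
        start_seconds: float  # offset from the start of the session
        text: str

    class Recognizer(Protocol):
        """Any speech-recognition backend that turns audio into caption lines."""
        def transcribe(self, audio_frame: bytes) -> List[CaptionLine]: ...

    class Synthesizer(Protocol):
        """Any text-to-speech backend."""
        def speak(self, text: str) -> None: ...

    @dataclass
    class VoiceCaptionSession:
        """Ties the two functions together: transcription in, speech out."""
        recognizer: Recognizer
        synthesizer: Synthesizer
        archive: List[CaptionLine] = field(default_factory=list)

        def on_audio(self, audio_frame: bytes) -> List[CaptionLine]:
            # Transcription function: transcribe, archive, return lines for display.
            lines = self.recognizer.transcribe(audio_frame)
            self.archive.extend(lines)
            return lines

        def read_back(self, first: int, last: int) -> None:
            # Production function: reproduce an archived excerpt out loud.
            for line in self.archive[first:last]:
                self.synthesizer.speak(line.text)

        def speak_text(self, text: str) -> None:
            # Production function: read aloud text a user typed rather than spoke.
            self.synthesizer.speak(text)

        def save_ascii(self, path: str) -> None:
            # Archive the conversation as a plain ASCII file.
            with open(path, "w", encoding="ascii", errors="replace") as f:
                for line in self.archive:
                    f.write("[%8.1f] %s: %s\n"
                            % (line.start_seconds, line.speaker, line.text))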

Reciprocal Voice-Caption Model: Rationale
Integrating text into audio-video interactions is worthwhile for a variety of reasons. Captions or subtitles aid comprehension of spoken words, even for native speakers of a given language, and may be particularly helpful in environments where audio quality is questionable. The caption text itself gives users a record of a meeting and could provide the building blocks for future documents. Such a system would additionally benefit hearing- and visually-impaired members of a collaborative community. Finally, it opens up new possibilities for browsing information: users could conceivably search captions for relevant keywords or more easily follow several conversations simultaneously.
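
As a small illustration of the browsing possibility, the snippet below searches an archived transcript for a keyword. It assumes the ASCII archive format of the sketch above, and the file name meeting.txt is hypothetical.

    def search_captions(path: str, keyword: str) -> list:
        """Return every archived caption line that mentions the keyword."""
        hits = []
        with open(path, "r", encoding="ascii", errors="replace") as f:
            for line in f:
                if keyword.lower() in line.lower():
                    hits.append(line.rstrip("\n"))
        return hits

    # e.g., find every mention of "budget" in a saved meeting transcript
    for hit in search_captions("meeting.txt", "budget"):
        print(hit)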

State of the Art
At present, live captioning is a human-driven process based upon the stenography model of transcribing phonemes. Automated speech recognition technology is becoming more sophisticated, however, and it is reasonable to expect systems capable of transcribing live, naturally flowing speech from a variety of users, combining what is formally known as 'continuous speech recognition' with speaker independence. Speech reproduction technology exists, but quality varies widely, and most systems perform at too low a level to be of widespread use at this time; this technology, too, will evolve toward higher levels of functionality. Speech recognition and reproduction currently exist as discrete applications that combine software and hardware. It would not be necessary to merge them into one application in order to use them in tandem, as sketched below, though a single integrated application may be the optimal scenario to consider developing.
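
The tandem arrangement amounts to a thin layer of glue between two separate programs. The sketch below illustrates the idea; the command names recognize and speak are hypothetical stand-ins for discrete recognition and reproduction applications, not real products.

    import subprocess

    # Hypothetical programs: "recognize" is assumed to emit one transcribed
    # line per utterance; "speak" is assumed to read text on stdin and play
    # it aloud. Neither name refers to a real product.
    recognizer = subprocess.Popen(["recognize", "--continuous"],
                                  stdout=subprocess.PIPE, text=True)
    synthesizer = subprocess.Popen(["speak"],
                                   stdin=subprocess.PIPE, text=True)

    def read_back(text: str) -> None:
        # Route any text (archived or typed) through the reproduction program.
        synthesizer.stdin.write(text + "\n")
        synthesizer.stdin.flush()

    # Glue loop: display each caption as it arrives and keep the ASCII record.
    with open("archive.txt", "w") as archive:
        for line in recognizer.stdout:
            print(line, end="")
            archive.write(line)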