MEETING BROWSER: TRACKING AND SUMMARIZING MEETINGS

To provide rapid access to meetings between human beings, transcription, tracking, retrieval, and summarization of ongoing human-to-human conversation have to be achieved. In DARPA- and DoD-sponsored work (projects GENOA and CLARITY) we aim to develop strategies to transcribe human discourse and provide rapid access to the structure and content of this human exchange. The system consists of four major components: 1) the speech transcription engine, based on the JANUS recognition toolkit; 2) the summarizer, a statistical tool that attempts to find salient and novel turns in the exchange; 3) the discourse component, which attempts to identify the speech acts; and 4) the non-verbal structure, including speaker types and non-verbal visual cues. The meeting browser also attempts to identify the speech acts found in the turns of the meeting and to track topics. The browser is implemented in Java and also includes video capture of the individuals in the meeting. It attempts to identify the speakers and their focus of attention from acoustic and visual cues.

1. THE MEETING RECOGNITION ENGINE

The speech recognition component of the meeting browser is based on the JANUS Switchboard recognizer trained for the 1997 NIST Hub-5E evaluation [3]. The gender-independent, vocal tract length normalized, large-vocabulary recognizer features dynamic, speaking-mode-adaptive acoustic and pronunciation models [2], which allow for robust recognition of conversational speech as observed in human-to-human dialogs.

1.1 Speaking Mode Dependent Pronunciation Modeling

In spontaneous conversational human-to-human speech as observed in meetings there is a large amount of variability due to accents, speaking styles, and speaking rates (also known as the speaking mode [6]). Because current recognition systems usually use only a relatively small number of pronunciation variants for the words in their dictionaries, the amount of variability that can be modeled is limited. Increasing the number of variants per dictionary entry may seem to be the obvious solution, but doing so actually results in an increase in error rate. This is explained by the greater confusion between the dictionary entries, particularly for short, reduced words. We developed a probabilistic model based on context-dependent phonetic rewrite rules to derive a list of possible pronunciations for all words or sequences of words [2][4]. In order to reduce the confusion of this expanded dictionary, each variant of a word is annotated with an observation probability. To this end we automatically retranscribe the corpus based on all allowable variants, using flexible utterance transcription graphs (Flexible Transcription Alignment, FTA [5]) and speaker-adapted models. The alignments are then used to train a model of how likely each form of variation (i.e., each rewrite rule) is, and how likely a variant is to be observed in a given context (acoustic, word, speaking mode, or dialogue). For decoding, the probability of encountering a pronunciation variant is then defined as a function of the speaking style (phonetic context, linguistic context, speaking rate, and duration). The probability function is learned through decision trees from rule-generated pronunciation variants as observed on the Switchboard corpus [2].
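The effect of this weighting can be illustrated with a small sketch. The fragment below is not the JANUS implementation: the use of one decision tree per word, the restriction of the mode features to speaking rate and word duration, and all class and function names are assumptions made for illustration. It learns p(variant | word, mode features) from FTA-style alignments and yields an additive log-weight for a pronunciation variant during search.

    # Sketch of mode-dependent pronunciation variant weighting (illustrative only).
    # Assumptions: one decision tree per word, and mode features limited to
    # speaking rate and word duration; the actual models use richer contexts.
    import math
    from collections import defaultdict

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class VariantWeighter:
        """Learns p(variant | word, mode features) from FTA-style alignments."""

        def __init__(self):
            self.trees = {}      # word -> fitted decision tree
            self.variants = {}   # word -> sorted list of variant labels

        def train(self, alignments):
            """alignments: iterable of (word, variant, mode_features) tuples,
            e.g. ('going', 'g ow ih n', [5.2, 0.31]) from retranscribed data."""
            by_word = defaultdict(lambda: ([], []))
            for word, variant, feats in alignments:
                X, y = by_word[word]
                X.append(feats)
                y.append(variant)
            for word, (X, y) in by_word.items():
                labels = sorted(set(y))
                self.variants[word] = labels
                tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
                tree.fit(np.array(X), [labels.index(v) for v in y])
                self.trees[word] = tree

        def log_weight(self, word, variant, mode_features):
            """Additive log-score for using `variant` of `word` during decoding."""
            if word not in self.trees or variant not in self.variants[word]:
                return 0.0  # unseen word/variant: leave the score untouched
            probs = self.trees[word].predict_proba([mode_features])[0]
            idx = self.variants[word].index(variant)
            return math.log(max(probs[idx], 1e-6))

In the recognizer, such a log-weight would be scaled and added to the acoustic and language model scores of the corresponding hypothesis.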
1.2 Experimental Setup

To date, we have experimented with three different meeting environments and tasks to assess the performance in terms of word accuracy and summarization quality: i) Switchboard human-to-human telephone conversations, ii) research group meetings recorded in the Interactive Systems Labs, and iii) simulated crisis management meetings (three participants), which also include video capture of the individuals. We report results from speech recognition experiments for the first two conditions.

1) Human-to-Human Telephone Conversations

The test set used to evaluate the flexible transcription alignment approach consisted of the Switchboard and CallHome partitions of the 1996 NIST Hub-5E evaluation set. All test runs were carried out using a Switchboard recognizer trained with the JANUS Recognition Toolkit (JRTk) [4]. The preprocessing begins by extracting MFCC-based feature vectors every 10 ms. A truncated LDA transformation is applied to a concatenation of the MFCCs and their first- and second-order derivatives. Vocal tract length normalization and cepstral mean subtraction are applied to reduce speaker and channel differences (a sketch of this front end is given after Table 2). The rule-based expanded dictionary used in these tests contained 1.78 pronunciation variants per word, compared to 1.13 in the baseform dictionary (PronLex).

The first set of results in Table 1 is based on a recognizer whose polyphonic decision trees were still trained on Viterbi alignments based on the unexpanded dictionary. We compare a baseline system trained on the base dictionary with an FTA-trained system using the expanded dictionary, tested in two different ways: with the base dictionary and with the expanded one. It turns out that FTA training reduces the word error rate significantly, which indicates that FTA and pronunciation modeling improved the quality of the training transcriptions. Due to the added confusion of the expanded dictionary, testing with the large dictionary without any weighting of the variants yields slightly worse results than testing with the baseline dictionary.

Condition                              SWB WER    CH WER
Baseline                               32.2%      43.7%
FTA training, test w/ base dict        30.7%      41.9%
FTA training, test w/ expanded dict    31.1%      42.5%

Table 1: Recognition results using flexible transcription alignment training and label boosting. The test using the expanded dictionary was done without weighting the variants.

Adding vowel-stress-related questions to the phonetic clustering procedure and regrowing the polyphonic decision tree based on FTA labels improved the performance by 2.6% absolute on Switchboard and 2.2% absolute on CallHome. Table 2 shows results for mode-dependent pronunciation weighting. We gain an additional ~2% absolute by weighting the pronunciations based on mode-related features.

Condition            SWB WER    CH WER
Unweighted           28.7%      38.6%
Weighted p(r|w)      27.1%      36.7%
Weighted p(r|w,m)    26.7%      36.1%

Table 2: Results using different pronunciation variant weighting schemes.
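The front end described above can be sketched as follows. This is an illustrative fragment rather than the JANUS preprocessing code: the number of cepstra (13), the LDA output dimension (32), the use of librosa and scikit-learn, and the function names are assumptions, and vocal tract length normalization is omitted for brevity.

    # Sketch of the acoustic front end: MFCCs every 10 ms, first/second
    # derivatives, mean subtraction, and a truncated LDA projection.
    import numpy as np
    import librosa
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def frontend_features(wav_path, lda=None):
        y, sr = librosa.load(wav_path, sr=16000)
        hop = int(0.010 * sr)                            # 10 ms frame shift
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
        d1 = librosa.feature.delta(mfcc, order=1)        # first-order derivatives
        d2 = librosa.feature.delta(mfcc, order=2)        # second-order derivatives
        feats = np.vstack([mfcc, d1, d2]).T              # frames x 39
        feats -= feats.mean(axis=0, keepdims=True)       # cepstral mean subtraction
        if lda is not None:                              # truncated LDA projection
            feats = lda.transform(feats)
        return feats

    def fit_lda(frame_features, frame_labels, dim=32):
        """frame_labels: per-frame state/phone labels, e.g. from forced
        alignments; dim must be smaller than the number of distinct labels."""
        lda = LinearDiscriminantAnalysis(n_components=dim)
        return lda.fit(frame_features, frame_labels)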
2) Research Group Meetings

In a second experiment we used data recorded during internal group meetings at our lab. We placed lapel microphones on three out of ten participants and recorded the signals on those three channels. Each meeting was approximately one hour in length, for a total of three hours of speech on which to adapt and test. Since we have no additional training data collected in this particular environment, the following unsupervised adaptation techniques were used to adapt a read-speech, clean-environment Wall Street Journal dictation recognizer to the meeting conditions:

1. MLLR-based adaptation: In our system we employ a regression tree, constructed using an acoustic similarity criterion for the definition of regression classes. The tree is pruned as necessary to ensure sufficient adaptation data at each leaf. For each leaf node we calculate a linear transformation that maximizes the likelihood of the adaptation data. The number of transformations is determined automatically.

2. Iterative batch-mode unsupervised adaptation: The quality of adaptation depends directly on the quality of the hypotheses on which the alignments are based. We iterate the adaptation procedure, improving both the acoustic models and the hypotheses they produce. Significant gains were observed during the first two iterations, after which performance converges.

3. Adaptation with confidence measures: Confidence measures were used to automatically select the best candidates for adaptation. We used the stability of a hypothesis in a lattice as an indicator of confidence. If, in rescoring the lattice with a variety of language model weights and insertion penalties, a word appears in every possible top-1 hypothesis, acoustic stability is indicated. Such acoustic stability often identifies a good candidate for adaptation. Using only these words in the adaptation procedure produces 1-2% gains in word accuracy over blind adaptation [9] (a sketch of this selection step is given after Table 3).

The baseline performance of the JRTk-based WSJ recognizer on the Hub4-Nov94 test set is about 7% WER. These preliminary experiments suggest that, due to the effects of spontaneous human-to-human speech, significant differences in recording conditions, significant crosstalk on the recorded channels, significantly different microphone characteristics, and inappropriate language models, the error rate on meetings is in the range of 40-50% WER.

              Adaptation Iterations
Speaker     0       1       2       Adaptation Gain
maxl        51.7    45.3    45.2    12%
fdmg        48.4    43.8    44.9    9%
flsl        63.8    59.5    59.6    7%
Total       54.8    49.6    49.9

Table 3: Error rates (%) for three different speakers in a research group meeting using JRTk trained on WSJ dictation data.
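The stability-based word selection of technique 3 can be sketched as follows. The fragment is illustrative rather than the system's lattice code: rescore is an assumed caller-supplied function that returns the top-1 hypothesis of a lattice for a given language model weight and insertion penalty, the weight and penalty grids are placeholders, and word stability is approximated by intersecting bags of words rather than by the lattice alignment actually used.

    # Sketch of selecting acoustically stable words for unsupervised adaptation.
    # `rescore(lattice, lm_weight, penalty)` is an assumed caller-supplied
    # function returning the top-1 hypothesis as a list of words.
    from collections import Counter
    from itertools import product

    def stable_words(lattice, rescore,
                     lm_weights=(8, 12, 16, 20),        # placeholder grid
                     insertion_penalties=(-10, 0, 10)):  # placeholder grid
        """Return the words that survive every rescoring pass unchanged."""
        hypotheses = [rescore(lattice, w, p)
                      for w, p in product(lm_weights, insertion_penalties)]
        # A word (with its occurrence count) counts as stable only if it
        # appears at least that often in every top-1 hypothesis.
        stable = Counter(hypotheses[0])
        for hyp in hypotheses[1:]:
            stable &= Counter(hyp)
        return stable

    def adaptation_segments(utterances, rescore):
        """Keep only the stable words of each utterance as adaptation targets."""
        selected = []
        for utt_id, lattice in utterances:
            words = stable_words(lattice, rescore)
            if words:
                selected.append((utt_id, list(words.elements())))
        return selected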