Broadcast news story segmentation using latent topics on data manifold

This paper proposes using Laplacian Probabilistic Latent Semantic Analysis (LapPLSA) for broadcast news story segmentation. The latent topic distributions estimated by LapPLSA replace term frequency vectors as the representation of sentences and are used to measure the cohesive strength between sentences. Subword n-grams serve as the basic term units in the computation, and dynamic programming is used for story boundary detection. LapPLSA projects the data into a low-dimensional semantic topic representation while preserving the intrinsic local geometric structure of the data. This locality-preserving property is intended to make the estimated latent topic distributions more robust to noise from automatic speech recognition (ASR) errors. Experiments are conducted on ASR transcripts of the TDT2 Mandarin broadcast news corpus. The proposed approach is compared with other approaches that use dimensionality reduction techniques with the locality-preserving property, and with two different topic modeling techniques. Experimental results show that the proposed approach achieves the highest F1-measure, 0.8228, significantly outperforming the best previous approaches.
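
To make the segmentation step concrete, here is a minimal Python sketch of boundary detection by dynamic programming over per-sentence topic distributions. It does not implement LapPLSA: the topic vectors are assumed to be given. The scoring function (cosine similarity of each sentence to the mean of its story, minus a fixed `penalty` per story) and the names `cosine`, `story_score`, `segment`, and `penalty` are illustrative assumptions, not the objective used in the paper.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def story_score(topics, start, end):
    """Assumed cohesion score for sentences[start:end]:
    sum of each sentence's similarity to the segment's mean distribution."""
    n = end - start
    dims = len(topics[start])
    mean = [sum(topics[k][d] for k in range(start, end)) / n for d in range(dims)]
    return sum(cosine(topics[k], mean) for k in range(start, end))


def segment(topics, penalty=0.5):
    """Return the sorted indices at which a new story starts (excluding 0).

    best[end] holds the best total score for the first `end` sentences;
    back[end] holds the start index of the last story in that solution.
    `penalty` is a hypothetical per-story cost that discourages over-splitting.
    """
    n = len(topics)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            score = best[start] + story_score(topics, start, end) - penalty
            if score > best[end]:
                best[end], back[end] = score, start

    # Walk the back-pointers from the last sentence to recover boundaries.
    boundaries = []
    end = n
    while end > 0:
        start = back[end]
        if start > 0:
            boundaries.append(start)
        end = start
    return sorted(boundaries)


# Toy input: three sentences dominated by topic 0, then two by topic 1.
toy_topics = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.1, 0.9], [0.2, 0.8]]
boundaries = segment(toy_topics)
```

On the toy input, the only boundary falls before the fourth sentence (index 3), where the dominant topic changes. A larger `penalty` yields fewer, longer stories; in practice such a parameter would need tuning on held-out data.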