TPC: An automatically generated comprehensive English-Persian parallel corpus

Parallel corpora are among the most important resources for research in machine translation, bilingual lexicography, and linguistics. This paper describes the process of building a large-scale (about 400,000 sentence pairs) English-Persian parallel corpus called the Tehran Parallel Corpus (TPC). The aim of the study is to introduce the structure of TPC and describe the materials used to construct it. In addition, some useful tools developed within the project are introduced, and three kinds of statistical machine translation systems trained on TPC are examined. To obtain a high-quality parallel corpus, uncertain alignments identified by a MaxEnt classifier have been removed from the corpus. As an intrinsic evaluation, 1,600 sentence pairs are randomly sampled and manually compared with a gold-standard test set. As an extrinsic evaluation, three phrase-based SMT systems trained on TPC are used. The results demonstrate the superiority of our translation systems over the Google English-to-Persian translation system in terms of BLEU and TER metrics.