Automatic Stopwords Identification from Very Small Corpora

Abstract

Natural Language Processing tools rely on language-specific linguistic resources, which may be unavailable for many languages. Since building them manually is complex, it would be desirable to learn these resources automatically from sample texts. In this paper we focus on stopwords, i.e., terms that are not relevant to understanding the topic and content of a document. Specifically, we compare the performance of different techniques proposed in the literature when applied to very small corpora (even single documents), as may be the case for highly local languages that lack an extensive literature. Experiments show that simple term frequency is an extremely reliable indicator, which outperforms other, more complex approaches. While the study is conducted on Italian, the approach is generic and applicable to other languages.
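
As a minimal illustration of the term-frequency idea, the following Python sketch ranks the tokens of a single document by raw frequency and returns the most frequent ones as stopword candidates. The function name `candidate_stopwords`, the regex tokenizer, the `top_k` cutoff, and the sample sentence are all assumptions made for illustration; they are not the procedure or the data used in the experiments.

```python
import re
from collections import Counter


def candidate_stopwords(text: str, top_k: int = 10) -> list[str]:
    """Return the top_k most frequent lowercase tokens in text.

    Hypothetical helper: the tokenizer and the fixed cutoff are
    illustrative assumptions, not the paper's actual settings.
    """
    # With str patterns, \w+ is Unicode-aware in Python 3, so accented
    # Italian letters (e.g. in "è" or "più") stay inside their tokens.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # most_common orders terms by descending raw frequency.
    return [term for term, _ in counts.most_common(top_k)]


# Made-up sample text, used only to show the call.
sample = (
    "Il gatto dorme sul divano e il cane dorme sul tappeto. "
    "Il gatto e il cane sono amici."
)
print(candidate_stopwords(sample, top_k=3))
```

A fixed `top_k` is only one possible way to cut the ranked list; a frequency threshold would be an equally plausible choice in a sketch like this.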