A Review of Tools and Techniques for Preprocessing of Textual Data

Abstract

With the wide availability of computing facilities, a huge amount of data is now available in electronic form. This data must be processed to discover new facts and knowledge, but dealing with huge datasets is challenging because real-world data is generally incomplete, inconsistent, and prone to errors or outliers. More than 80% of the data is unstructured or semi-structured. Data preprocessing prepares such data for analysis and has become an essential step in data mining. It accounts for 80% of the total effort in any data mining project and directly affects the quality of the results. Selecting the right preprocessing technique and tool helps to speed up the data mining process. This paper discusses different preprocessing techniques, surveys and compares the tools available for text preprocessing, and outlines the challenges faced, such as the need to know the sentence structure of a language to perform tokenization, the difficulty of constructing domain-specific stop-word lists, and over-stemming and under-stemming.