IMPROVING THE COMPRESSION ALGORITHMS PERFORMANCE FOR SCANNED YORUBA PDF FILES

The advancement and accessibility of digital computers, together with the introduction of the Internet and the World Wide Web, have led to a massive information explosion all over the world. As a result, large volumes of newspapers, magazines, and other printed documents are now available, carrying information and knowledge from many different areas. Documents in PDF format facilitate office automation and the move towards a paperless office. However, PDFs can become inconveniently large when they contain a large amount of high-resolution content such as images and graphics, or simply a very large number of pages. To make the information and knowledge embedded in these PDF documents accessible and shareable with the public, the data size needs to be reduced using different mechanisms. This study was conducted to develop an Amharic PDF document compression system that applies an effective page segmentation technique to identify text and non-text blocks and reconstruct PDF document layouts, in order to optimize storage space requirements and transmission bandwidth. The first step of the proposed approach is separating textual and non-textual objects. A combination of page segmentation techniques, namely connected components with dilation and connected-component area, height, and width analysis, is applied to detect the graphics parts of a document. In the experiments, the proposed approach achieved an average accuracy of 78% on this step. The next step after separating textual and non-textual objects is column block detection for the textual objects. Similar page segmentation techniques are applied to segment column layouts. The proposed technique identified column layouts with an accuracy of 89%, and all coordinate information about each column block is stored for the reconstruction stage. Finally, the extracted objects are compressed using the Huffman compression algorithm.
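The text and non-text separation step described above can be sketched as follows. This is a minimal illustration in pure Python, not the implementation used in the study: the function names (`dilate`, `connected_components`, `classify`), the square structuring element, and the height and area thresholds are all assumptions chosen for the example.

```python
from collections import deque


def dilate(img, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def connected_components(img):
    """Return (top, left, bottom, right, area) for each 8-connected region."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(y, x)])
                top, left, bottom, right, area = y, x, y, x, 0
                while queue:  # breadth-first flood fill of one region
                    cy, cx = queue.popleft()
                    area += 1
                    top, bottom = min(top, cy), max(bottom, cy)
                    left, right = min(left, cx), max(right, cx)
                    for ny in range(max(0, cy - 1), min(h, cy + 2)):
                        for nx in range(max(0, cx - 1), min(w, cx + 2)):
                            if img[ny][nx] and not seen[ny][nx]:
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                boxes.append((top, left, bottom, right, area))
    return boxes


def classify(boxes, max_text_height=3, max_text_area=20):
    """Split boxes into text and graphics using illustrative thresholds."""
    text, graphics = [], []
    for box in boxes:
        top, _, bottom, _, area = box
        height = bottom - top + 1
        if height > max_text_height or area > max_text_area:
            graphics.append(box)
        else:
            text.append(box)
    return text, graphics
```

Dilation merges neighbouring glyphs into a single line-sized component, so the size analysis that follows sees text lines and pictures rather than individual characters. In practice the thresholds would have to be tuned to the scan resolution and the font sizes of the documents.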
The proposed approach was tested on different PDF documents and compressed the extracted objects with a compression ratio of less than 50%, which is a better result than that of existing commercial compression tools. The proposed approach is also capable of reconstructing the compressed data after decompression. Based on the stored layout coordinate information, the non-textual blocks and textual columns of the original PDF documents were reconstructed with an average accuracy of 74%. For correctly segmented column and paragraph blocks, the proposed technique has an accuracy of 92%. However, the performance of the proposed method is greatly affected by black shading introduced into the PDF document images during scanning, and irregularly shaped images with non-rectangular text blocks resulted in the loss of some text and difficulty in segmentation.
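To make the compression and decompression steps concrete, the following is a minimal sketch of Huffman coding over a byte string. It is an illustration only, not the study's implementation: the function names are hypothetical, the encoded output is kept as a string of '0' and '1' characters for readability, and the compression ratio is assumed to mean compressed size divided by original size, ignoring the cost of storing the code table.

```python
import heapq
from collections import Counter


def build_codes(data: bytes) -> dict:
    """Build a Huffman code table mapping each byte value to a bit string."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, partial code table).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, codes1 = heapq.heappop(heap)
        c2, _, codes2 = heapq.heappop(heap)
        # Merge the two least frequent subtrees, prefixing their codes.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(data: bytes, codes: dict) -> str:
    """Concatenate the code of every byte into one bit string."""
    return "".join(codes[b] for b in data)


def decode(bits: str, codes: dict) -> bytes:
    """Invert the prefix code by matching bits greedily."""
    inverse = {code: sym for sym, code in codes.items()}
    out = bytearray()
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return bytes(out)


def compression_ratio(data: bytes, bits: str) -> float:
    """Compressed size over original size, both measured in bits."""
    return len(bits) / (8 * len(data))
```

Because Huffman codes are prefix-free, `decode(encode(data, codes), codes)` returns the original bytes, which corresponds to the lossless reconstruction after decompression mentioned above. A real system would also pack the bits into bytes and store the code table alongside the layout coordinates.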
