Linux Format

Qt Box Editor

Version: 1.12rc1 Web: https://github.com/zdenop/qt-box-editor


Tesseract is a great example of optical character recognition (OCR) technology. You might think that Tesseract should belong to the OpenCV family, but in fact it came out before OpenCV. Tesseract is a free alternative to ABBYY FineReader, a commercial product that delivers state-of-the-art OCR quality. There are many ways you can achieve a FineReader-like experience with Tesseract in Linux, and perhaps the best one would be using the gImageReader front-end (see LXF229). You’ll notice that while Tesseract has almost no trouble with quality images like screen grabs or high-resolution scans of laser printouts, it stumbles over less-readable images.

Various Tesseract training tutorials describe how to tackle this problem. The core idea is to take a sample image, extract the characters from it ‘as is’ into a Box file, and then manually edit that file to correct all erroneous characters. Tesseract can then match the way a letter looks in the image with the correct Unicode symbol. The more valid pairs Tesseract has learned, the more precise future recognition attempts will be.
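To make the idea of a Box file more concrete, here is a minimal Python sketch of what correcting one entry by hand amounts to. It assumes the usual Tesseract box layout of one glyph per line (symbol, left, bottom, right, top, page, with coordinates measured from the bottom-left corner of the image). The `Box` class, the helper names and the sample line are hypothetical, made up for illustration, and are not part of Tesseract or QtBoxEditor.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One line of a box file (assumed layout: symbol left bottom right top page)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the five numeric fields are always found,
    # whatever the recognised symbol at the start of the line is.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def format_box_line(box: Box) -> str:
    # Write the entry back in the same space-separated layout.
    return f"{box.char} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


# Hypothetical sample: the letter 'm' was misread as 'rn'.
sample = "rn 120 340 138 362 0"
box = parse_box_line(sample)
box.char = "m"  # the manual correction: same rectangle, right symbol
corrected = format_box_line(box)
print(corrected)
```

Only the symbol changes here; the rectangle stays the same, which is exactly the pairing of image region and correct character that training relies on.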

Editing a Box file is the most time-consuming operation. It requires lots of patience and diligence. QtBoxEditor is a tool that helps the process along by providing a smart GUI. It shows the source image on the right and a narrow spreadsheet-like area on the left. Navigating between cells is very fast and can be controlled by the arrow keys.

Compared with a conventional text editor, QtBoxEditor enables you to complete an average page nearly twice as fast. When you move to the next row in the ‘spreadsheet’ area, the application highlights the corresponding letter on the image. When working with scanned old typewriter sheets or other poorly decipherable images, Tesseract sometimes makes errors when detecting letter ‘boxes’. Luckily, QtBoxEditor features a selection tool that makes it simple to correct the box.
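A typical box-detection error is a single letter split into two boxes. As a rough sketch of what fixing that by hand involves, the Python snippet below joins two boxes into one rectangle that covers both. It assumes box entries of the form (symbol, left, bottom, right, top, page); the `merge_boxes` function and the sample values are hypothetical and do not reflect how QtBoxEditor works internally.

```python
def merge_boxes(a, b, char):
    """Return one box covering both a and b, labelled with the given symbol.

    Each box is a (symbol, left, bottom, right, top, page) tuple.
    """
    if a[5] != b[5]:
        raise ValueError("boxes are on different pages")
    return (
        char,
        min(a[1], b[1]),  # leftmost edge
        min(a[2], b[2]),  # lowest edge
        max(a[3], b[3]),  # rightmost edge
        max(a[4], b[4]),  # highest edge
        a[5],
    )


# Hypothetical sample: a blurry 'm' detected as separate 'r' and 'n' boxes.
left_half = ("r", 120, 340, 128, 362, 0)
right_half = ("n", 129, 340, 138, 361, 0)
merged = merge_boxes(left_half, right_half, "m")
print(merged)
```

In the GUI the same result comes from redrawing the box with the selection tool, which is far quicker than working out the coordinates yourself.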

With a bit of effort Tesseract can ‘learn’ to read blurry letters.
