Linux Format

Dump your paper docs with perfect OCR

Anyone who’s been faced with the arduous task of transcribing text from either printed material or a digital scan of printed text will no doubt have heard of optical character recognition (OCR) technology. OCR is one of the earliest examples of machine learning, whereby a computer model is trained to recognise shapes on a digital image and translate those shapes into text characters. Once the shape of each letter is identified and translated into editable text, words followed by sentences, paragraphs and entire tracts of text can be extracted from the digital scan.

OCR has roots going back to the 1980s, and while commercial engines perform increasingly miraculous conversions – not just on typed text, but also handwriting – open source engines continue to develop alongside them. Linux is blessed with several OCR engines, all with roots in commercial products, but now open sourced and completely free to use. The best known of these – which we’ll focus on in this tutorial – is Tesseract (https://github.com/tesseract-ocr/tesseract), a command-line OCR engine that can be used on its own or paired with a number of graphical front-ends to perform OCR across a variety of usage scenarios, from extracting editable text directly from scanned documents to converting everything from PDFs and image files to screen grabs and image-based subtitle tracks in media files.
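To give a flavour of what using a command-line OCR engine on its own looks like, here is a minimal Python sketch that wraps a basic Tesseract run. It assumes the tesseract binary is already installed and on your PATH, and that the English language data (eng) is available. The function names build_tesseract_command and ocr_image are our own illustrative helpers, not part of Tesseract.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base="stdout", lang="eng"):
    """Build the argument list for a basic Tesseract run.

    Tesseract's basic form is: tesseract IMAGE OUTPUTBASE [-l LANG].
    Using "stdout" as the output base prints the recognised text
    instead of writing it to OUTPUTBASE.txt.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Return the text Tesseract recognises in image_path.

    Hypothetical helper: assumes tesseract is installed and that the
    requested language data is present on the system.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, "stdout", lang),
        capture_output=True,  # collect the recognised text
        text=True,            # decode output as str rather than bytes
        check=True,           # raise if Tesseract exits with an error
    )
    return result.stdout
```

Calling ocr_image("scan.png"), where scan.png stands in for any scanned page you have to hand, should return the recognised text as a string. How accurate that text is depends heavily on the quality of the scan.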

Before going further, check the box opposite for a quick look at Tesseract and two of its main open-source rivals – note, you can install all three at once and try different ones to see which produces the best results.

Marks, set, scan!

The obvious place to start is byOCR engine. It exists in various forms – including standalone
