Linux Format

Dump your paper docs with perfect OCR

Dec 12, 2023 10 minutes

Anyone who’s been faced with the arduous task of transcribing text from either printed material or a digital scan of printed text will no doubt have heard of optical character recognition (OCR) technology. OCR is one of the earliest examples of machine learning, whereby a computer model is trained to recognise shapes on a digital image and translate those shapes into text characters. Once the shape of each letter is identified and translated into editable text, words followed by sentences, paragraphs and entire tracts of text can be extracted from the digital scan.
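To make the shape-to-character idea concrete, here is a deliberately tiny sketch in Python. It uses nearest-template matching on hand-drawn 3x5 bitmaps, which is a simplified stand-in chosen for illustration and not how a trained engine such as Tesseract works internally. The glyph templates and function names are all invented for this example.

```python
# Toy 3x5 glyph templates: '#' is ink, '.' is background.
# These shapes are made up for illustration only.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def distance(glyph, template):
    """Count the pixels where two equally sized glyphs differ."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognise(glyph):
    """Return the character whose template differs in the fewest pixels."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))


def recognise_line(glyphs):
    """Translate a sequence of glyph bitmaps into a string of characters."""
    return "".join(recognise(g) for g in glyphs)


# A 'T' with one pixel of scanning noise (a gap in the stem).
noisy_t = ["###", ".#.", ".#.", "...", ".#."]

print(recognise_line([TEMPLATES["L"], TEMPLATES["I"], noisy_t]))  # prints LIT
```

Even with a missing pixel, the damaged glyph is still closer to the T template than to any other, which is the basic reason OCR can cope with imperfect scans. Real engines replace the fixed templates with models trained on many examples of each character.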

OCR has roots going back to the 1980s, and while commercial engines perform increasingly miraculous conversions – not just on typed text, but also handwriting – open source engines continue to develop alongside them. Linux is blessed with several OCR engines, all with roots in commercial products, but now open sourced and completely free to use. The best known of these – which we’ll focus on in this tutorial – is Tesseract (https://github.com/tesseract-ocr/tesseract), a command-line OCR engine that can be used on its own or paired with a number of graphical front-ends. Between them, these cover a variety of usage scenarios, from extracting editable text directly from scanned documents to converting PDFs, image files, screen grabs and image-based subtitle tracks in media files.
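Because Tesseract is a command-line tool, it is easy to drive from a script. The following is a minimal sketch, assuming Tesseract and its English language data are already installed and on your PATH. The helper names and the file name scan.png are hypothetical; the command itself uses Tesseract's standard form of an input image, an output base (here the special value stdout) and the -l language option.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="eng"):
    """Return the argument list for a basic Tesseract run.

    Passing 'stdout' as the output base tells Tesseract to print the
    recognised text instead of writing it to a .txt file.
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH; install it first")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract reports failure
    )
    return result.stdout


if __name__ == "__main__":
    # 'scan.png' is a hypothetical file name used for illustration.
    sample = Path("scan.png")
    if sample.exists() and shutil.which("tesseract"):
        print(ocr_image(sample))
```

Keeping the command construction separate from the call that runs it makes it simple to inspect exactly what will be executed before pointing the script at a folder full of scans.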

Before going further, check the box opposite for a quick look at Tesseract and two of its main open-source rivals – note, you can install all three at once and try different ones to see which produces the best results.

Marks, set, scan!

The obvious place to start is byOCR engine. It exists in various forms – including standalone


