Optical Character Recognition Software (OCR) Information
Optical character recognition (OCR) software translates images of printed, handwritten, or typewritten text into editable digital text, usually in ASCII format. The digital text can then be opened and used in desktop publishing software, word processors, and other computer applications. This process is also known as scan-to-text. Organizations use OCR software to reduce data-entry errors and to speed the processing of older paper or image-based archives.
OCR software works by analyzing a document and either comparing the text with the fonts stored in the software’s database or noting shapes and features common to most characters. It then creates a text document based on the characters it recognized.
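The font-comparison approach can be illustrated with a minimal sketch. It assumes each character has already been isolated and reduced to a tiny 5x3 bitmap; the `TEMPLATES` table and the function names are hypothetical and stand in for a real font database, which would be far larger and more tolerant of size and skew.

```python
# Hypothetical 5x3 bitmap "font database": '#' is ink, '.' is background.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
}


def similarity(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            same += (g == t)
    return same / total


def recognize(glyph):
    """Return (character, score) for the stored template closest to glyph."""
    scored = ((ch, similarity(glyph, tpl)) for ch, tpl in TEMPLATES.items())
    return max(scored, key=lambda pair: pair[1])


def recognize_line(glyphs):
    """Recognize a sequence of isolated glyphs and join them into text."""
    return "".join(recognize(g)[0] for g in glyphs)
```

A glyph with one damaged pixel, such as `["###", ".#.", ".#.", "...", ".#."]`, still matches "T" because that template agrees with it on the most pixels. Feature-based recognition, the other approach mentioned above, would compare strokes, loops, and intersections instead of raw pixels.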
Like most digital processes, OCR evolved from dedicated hardware machines, with specialized circuit boards and limited computing and storage power, to the current software-based processes. This shift was made possible by large advances in the power, speed, and storage of personal computers. OCR software today can be used on many different computers attached to a wide variety of scanning devices. This versatility is the main advantage of a software-based system over the older dedicated OCR machines.
Most Windows computers have a basic OCR program built into the standard photo-fax viewer application that will work with a standard PC-compatible scanner. Free or low-priced open-source programs are also available on the web. These applications are fine for personal use but may not be adequate for heavy professional use or for difficult-to-read images. There are also websites that offer free conversions of uploaded images, but those sites may be too insecure or too slow for all but the simplest work.
OCR software can convert scanned images into searchable PDF files. This is most often used to create modern digital records from traditional paper archives or, more generally, to convert old media to new media. OCR software can also create a digital image file such as JPEG, GIF, or PNG if the page contains no recognizable text or consists entirely of images. It can also create Excel files from printed tables, or HTML files from complicated text and image layouts, combining binary image files with text content and tags. The default output format for OCR software is usually a PDF file.
OCR software interprets visual scans, then isolates the textual parts of a document from other elements, such as images, charts, wrinkles, creases, stains, spots, and tables. Most OCR software allows users to select either an entire document for scanning, or specific parts or chapters. After the text is selected, the OCR software analyzes and interprets each character. It then checks whole words and matches them against a standard and/or custom dictionary. Some OCR software and OCR systems are capable of reproducing formatted outputs that closely approximate the original document in terms of images, columns, and other non-textual components.
Pattern recognition, artificial intelligence, and machine vision are used to convert scanned images into text that is then added to searchable databases. This allows the retrieval of scanned images based upon their content. Search features vary among OCR systems. Some OCR software allows search results to be stored for future use.
Additional considerations when selecting optical character recognition software include the quality and contrast of the scanned image. As a rule, images that are dirty or damaged, or printed on wrinkled paper, are more difficult for the OCR software to interpret. The contrast between text and background is a prime factor. For example, documents that consist of black text against a white background provide 100% contrast, thus increasing the probability that the OCR software will interpret the text properly.
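The step of checking whole words against a dictionary can be sketched as follows. This is a minimal illustration, not how any particular OCR product works: the `DICTIONARY` word list, the function names, and the 0.8 similarity cutoff are assumptions, and the fuzzy matching relies on `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Hypothetical word list; a real product would load a standard dictionary
# and optionally merge in a user-supplied custom one.
DICTIONARY = ["character", "optical", "recognition", "scanner", "software"]


def correct_word(word, dictionary=DICTIONARY, cutoff=0.8):
    """Return the closest dictionary word, or the word unchanged if none is close."""
    lowered = word.lower()
    if lowered in dictionary:
        return word  # already a known word, leave it alone
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Apply the dictionary check to every whitespace-separated word."""
    return " ".join(correct_word(w) for w in text.split())
```

With this word list, `correct_text("0ptical character recogniti0n s0ftware")` returns `"optical character recognition software"`, repairing the common confusion between the digit 0 and the letter o. Words with no close dictionary entry pass through unchanged, so unknown names and numbers are not overwritten.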
These are the most important aspects to look for in an optical character recognition software package:
- Character recognition accuracy
- Page layout reconstruction accuracy
- Support for multiple languages
- Compatibility with the host computer’s speed and operating system
- Support for searchable PDF output as well as HTML, XLS, and other formats
- Quality of the user interface
Most important, behind the interface of every OCR software application is a character-recognition engine that does the work of converting images into text. The best graphical user interface (GUI) or variety of output options can't make up for the limits of the recognition engine behind it. The quality of the recognition engine is therefore the most important aspect to consider in any OCR software purchase decision.
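One possible way to compare recognition engines is to run each on the same test page and measure how closely its output matches a hand-typed reference. The sketch below assumes character accuracy is defined as one minus the edit distance divided by the reference length. That definition and the function names are assumptions for illustration; the text above does not prescribe a specific metric.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Return a score in [0, 1]; 1.0 means the output equals the reference."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

For example, `character_accuracy("optical", "0ptical")` counts one substitution in seven characters. Running the same measurement over a set of representative documents, including damaged or low-contrast pages, gives a rough basis for comparing engines before a purchase.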