Optical character recognition

Advanced systems capable of producing a high degree of accuracy for most fonts are now common, with support for a variety of image file formats as input.[2]

Some systems are capable of reproducing formatted output that closely approximates the original page, including images, columns, and other non-textual components.

Early optical character recognition may be traced to technologies involving telegraphy and creating reading devices for the blind.[4]

Concurrently, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that, when moved across a printed page, produced tones that corresponded to specific letters or characters.[5]

In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called a "Statistical Machine" for searching microfilm archives using an optical code recognition system.

On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind.

LexisNexis was one of the first customers, and bought the program to upload legal paper and news documents onto its nascent online databases.

In the 2000s, OCR was made available online as a service (WebOCR), in a cloud computing environment, and in mobile applications like real-time translation of foreign-language signs on a smartphone.

OCR engines have been developed into software applications specializing in various subjects such as receipts, invoices, checks, and legal billing documents.[14]

Instead of merely using the shapes of glyphs and words, online handwriting recognition is able to capture motion, such as the order in which segments are drawn, the direction, and the pattern of putting the pen down and lifting it.
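The dynamic signal described above can be sketched in a few lines. This is a hypothetical illustration (the function name, the feature choice, and the sample strokes are my own), not any particular engine's pipeline: each stroke is the pen trajectory between a pen-down and a pen-up, and the features record exactly the motion cues unavailable to offline OCR.

```python
# Hypothetical sketch: extracting dynamic features from pen strokes for
# online handwriting recognition. Each stroke is a list of (x, y) points
# captured between pen-down and pen-up.
import math

def stroke_features(strokes):
    """Return a flat feature sequence: one direction angle (radians) per
    consecutive point pair, with a sentinel marking each pen lift."""
    PEN_LIFT = None  # sentinel separating strokes in the feature sequence
    features = []
    for stroke in strokes:
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            features.append(math.atan2(y1 - y0, x1 - x0))
        features.append(PEN_LIFT)
    return features

# A crude capital "T": a horizontal bar stroke, then a vertical stem stroke.
# Stroke order and direction survive here, unlike in a scanned bitmap.
t_shape = [[(0, 0), (2, 0)], [(1, 0), (1, 2)]]
feats = stroke_features(t_shape)
```

A real recognizer would feed such a sequence to a sequence model rather than inspect angles directly; the point is only that order, direction, and pen lifts are first-class inputs.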

For proportional fonts, more sophisticated techniques are needed because whitespace between letters can sometimes be greater than that between words, and vertical lines can intersect more than one character.
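The gap-classification problem above can be sketched as follows, assuming character bounding boxes have already been extracted for one text line. The adaptive split used here (word break when a gap exceeds the midpoint of the smallest and largest gap) is one simple heuristic among many, not a standard algorithm:

```python
# Hypothetical sketch of word segmentation for a proportional font: given
# character bounding boxes (x_start, x_end) on one line, classify each
# inter-character gap as within-word or between-words. A fixed pixel
# threshold fails for proportional fonts, so the threshold adapts to the
# gaps actually observed on the line.
def split_words(boxes):
    gaps = [b[0] - a[1] for a, b in zip(boxes, boxes[1:])]
    if not gaps:
        return [boxes]
    threshold = (min(gaps) + max(gaps)) / 2  # crude adaptive split point
    words, current = [], [boxes[0]]
    for gap, box in zip(gaps, boxes[1:]):
        if gap > threshold:  # large gap: start a new word
            words.append(current)
            current = []
        current.append(box)
    words.append(current)
    return words

# Two narrow characters, a wide inter-word gap, then two more characters.
line = [(0, 8), (10, 18), (30, 38), (40, 48)]
words = split_words(line)
```

Note this sketch degenerates when all gaps are nearly equal (a one-word line), which is the desired behavior there.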

Others like OCRopus and Tesseract use neural networks which are trained to recognize whole lines of text instead of focusing on single characters.[27]

The OCR result can be stored in the standardized ALTO format, a dedicated XML schema maintained by the United States Library of Congress.
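A minimal ALTO document can be read with the standard library. The element and attribute names below (TextLine, String, CONTENT, WC for word confidence) follow the ALTO schema, but the tiny document is a hand-made illustration, not real engine output:

```python
# Reading OCR output stored as ALTO XML with the standard library.
import xml.etree.ElementTree as ET

ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <TextBlock>
      <TextLine>
        <String CONTENT="Optical" WC="0.98"/>
        <String CONTENT="character" WC="0.95"/>
        <String CONTENT="recognition" WC="0.91"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}
root = ET.fromstring(ALTO)
# Each String element carries the recognized word and its confidence.
words = [(s.get("CONTENT"), float(s.get("WC")))
         for s in root.iterfind(".//alto:String", NS)]
text = " ".join(w for w, _ in words)
```

Keeping per-word confidences alongside the text is what makes downstream correction interfaces (like the one described below for the National Library of Finland) possible.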

Beyond an application-specific lexicon, better performance may be had by taking into account business rules, standard expressions, or rich information contained in color images.
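The business-rule idea can be sketched as follows; the field format (a DD/MM/YYYY invoice date) and the confusion table are hypothetical examples of my own, but the pattern — repair common glyph confusions, then validate against the rule — is the technique the sentence above describes:

```python
# Hedged sketch of rule-based OCR post-correction: if a business rule says a
# field must match a pattern, common OCR glyph confusions (O/0, l/1, I/1, S/5)
# can be repaired before validation.
import re

CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})
DATE_RULE = re.compile(r"^\d{2}/\d{2}/\d{4}$")  # hypothetical field format

def correct_numeric_field(raw):
    """Apply confusion fixes, then accept only if the rule is satisfied."""
    fixed = raw.translate(CONFUSIONS)
    return fixed if DATE_RULE.fullmatch(fixed) else None

result = correct_numeric_field("l2/O3/2Ol9")  # noisy OCR of a date
```

Because the repair is only accepted when the rule matches, ordinary prose containing O and l is left untouched by the validation step.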

Palm OS used a special set of glyphs, known as Graffiti, which are similar to printed English characters but simplified or modified for easier recognition on the platform's computationally limited hardware.

The National Library of Finland has developed an online interface for users to correct OCRed texts in the standardized ALTO format.[33]

Commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) had the mission to foster the improvement of automated technologies for understanding machine-printed documents, and it conducted the most authoritative of the Annual Tests of OCR Accuracy from 1992 to 1996.[39][34]

Web-based OCR systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years.

Reading the Amount line of a check (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly.
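The small-dictionary effect can be sketched with the standard library: because the written-out amount line draws from a tiny vocabulary, each noisy OCR token can be snapped to its nearest lexicon entry. The lexicon below is abbreviated, and `difflib` stands in for whatever character-level distance a real engine would use:

```python
# Sketch of lexicon-constrained recognition for a check's legal amount line:
# snap each noisy OCR token to the closest word in a small number-word lexicon.
import difflib

LEXICON = ["one", "two", "three", "four", "five", "six", "seven", "eight",
           "nine", "ten", "twenty", "thirty", "forty", "fifty", "hundred",
           "thousand", "dollars", "and", "cents"]

def snap(token):
    """Return the closest lexicon word, or the token itself if nothing is close."""
    match = difflib.get_close_matches(token.lower(), LEXICON, n=1, cutoff=0.5)
    return match[0] if match else token

noisy = "tvvo hundr3d fifty doll4rs"  # simulated OCR confusions (w->vv, e->3, a->4)
cleaned = " ".join(snap(t) for t in noisy.split())
```

With an unrestricted dictionary, "hundr3d" could resolve to many English words; restricting candidates to number words is what makes the correction nearly unambiguous.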

Video of the process of scanning and real-time optical character recognition (OCR) with a portable scanner
Occurrence of laft and last in Google's n-grams database, in English documents from 1700 to 1900, based on OCR scans for the "English 2009" corpus
Occurrence of laft and last in Google's n-grams database, based on OCR scans for the "English 2012" corpus[34]
Searches for words containing a long s in the "English 2012" corpus or later are normalized to a round s.
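The normalization described above amounts to a single character mapping: the archaic long s (U+017F, "ſ") is folded to a round "s", so a query for "laſt" and one for "last" hit the same entries. A minimal sketch (the function name is my own, and a real system would apply the same folding when indexing):

```python
# Fold the archaic long s (U+017F) to a round "s" so that historical
# spellings and modern queries match the same n-gram entries.
def normalize_long_s(text):
    return text.replace("\u017f", "s")

query = normalize_long_s("la\u017ft")  # "laſt" with a long s
```

Note this does not repair OCR errors that already misread ſ as f (as in "laft" in the plots above); it only unifies correctly recognized long-s spellings with their modern forms.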