10 November 2008

Google uses OCR to index scanned documents.

Until now, when Google indexed content on the web, scanned documents such as
academic papers and government reports were not picked up by its spiders.
This was because a scanned document appears as one large image rather than
as text.

Using Optical Character Recognition (OCR) technology, Google
is now able to read these documents and convert them into text, which is then
indexed in the search engine.
Before OCR, Google was only able to search the
filename and the limited metadata associated with these files in order to
include them in search results.
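
The difference this makes to search can be pictured with a small Python sketch. This is only an illustration, not Google's actual pipeline: the document record, its field names and the `build_index` helper are all hypothetical, and the `ocr_text` field simply stands in for whatever text a real OCR engine would extract from the scan.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents, use_ocr_text):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        # Without OCR, only the filename and metadata are searchable.
        searchable = doc["filename"] + " " + doc["metadata"]
        if use_ocr_text:
            searchable += " " + doc["ocr_text"]
        for word in tokenize(searchable):
            index[word].add(doc_id)
    return index


# Hypothetical scanned document; "ocr_text" stands in for real OCR output.
documents = {
    "scan-001": {
        "filename": "report_516.pdf",
        "metadata": "safety commission publication",
        "ocr_text": "Repairing aluminum wiring in older homes",
    },
}

before = build_index(documents, use_ocr_text=False)
after = build_index(documents, use_ocr_text=True)

print("aluminum" in before)  # False: the word only appears inside the scan
print(after["aluminum"])     # {'scan-001'}
```

In this toy setup a search for "aluminum" finds nothing until the OCR text is added to the index, which is the gap the new feature is meant to close.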

Google’s technology turns the scanned “images of
text” into computer-readable text. As with PDF files, you will be able to view
either the original version or the text-only version Google created. To see an
example of this technology, search Google for “repairing aluminum wiring”
(the first result should be a scanned document).

Tamar Staff Member
