Google uses OCR to index scanned documents.

Alex Christie

10 November 2008

Until now, scanned documents such as academic papers and government
reports were not picked up by search engine spiders when they indexed
content on the web. The reason is that a scanned document appears as
one large image instead of text.

Using Optical Character Recognition (OCR) technology, Google
is now able to read these documents and turn them into text, which is then
indexed by the search engine.
Before OCR, Google could only search the filename and the limited metadata
associated with these files in order to include them in search results.
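
To make the difference concrete, here is a minimal sketch in Python of a toy search index. It is an illustration only and says nothing about how Google actually builds its index. The document, its field names and the `ocr_text` value are all made up, and `ocr_text` simply stands in for whatever an OCR step would produce from the scanned image.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents, use_ocr_text):
    """Build an inverted index mapping each token to a set of document ids.

    The filename and metadata are always indexed. The OCR text is indexed
    only when use_ocr_text is True.
    """
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        fields = [doc["filename"], doc["metadata"]]
        if use_ocr_text:
            fields.append(doc["ocr_text"])
        for field in fields:
            for token in tokenize(field):
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical scanned document. "ocr_text" is a stand-in for OCR output.
documents = {
    "doc1": {
        "filename": "report_0042.pdf",
        "metadata": "Scanned government report",
        "ocr_text": "Repairing aluminum wiring in older homes",
    },
}

before = build_index(documents, use_ocr_text=False)
after = build_index(documents, use_ocr_text=True)

print(search(before, "aluminum wiring"))  # set()
print(search(after, "aluminum wiring"))   # {'doc1'}
```

With only the filename and metadata indexed, a search for words that appear inside the scan finds nothing. Once the recognised text is indexed as well, the same search returns the document.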

Google’s technology will turn the scanned “images of
text” into computer-readable text. As with PDF files, you will be able to view
either the original version or the text-only version Google created. To see an
example of this technology, search for repairing aluminum wiring
(the first result should be a scanned document).
