What lurks in the deep web?
I have seen a recent flurry of interest in a new bit of lingo in the marketplace: the deep web. It sounds like a ghastly tentacled apparition lurking beneath the tranquil Web 2.0 waters of the Internet, but the NY Times describes it differently. According to them, the deep web consists of the trillions of pages that web spiders do not traditionally crawl because they are hidden behind forms or inside databases. All the search engines want a piece of this deep web, but I don't think we have thought through the problems of crawling dynamic databases like this…
Let's construct a scenario. It would be nice if search engines could fill in the query forms for all potential flights in an airline's system. For example, "All BA flights from London to Cape Town on the 29th Feb 2012" will return a number of results. The problem arises when you think about the sheer number of query combinations: most of them will not correspond to actual flights and will come back empty. There may be only one flight per month between Cape Town and London, and the time of day is yet another variable. Crawling this way could potentially yield billions of results from one database. If I were a savvy webmaster, I would block all web-bots from crawling my database content almost immediately. I cannot imagine the load on my servers from pointless permutations, which could also result in other companies serving up my data and making money out of it.
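To get a feel for how quickly those permutations pile up, here is a rough sketch in Python. The airport list, the hourly time slots and the "one flight a month each way" figure are all made-up assumptions for illustration, not data from any real airline.

```python
from itertools import permutations

# Hypothetical figures, chosen only to show how the numbers multiply.
AIRPORTS = ["LHR", "CPT", "JFK", "SYD", "NRT"]
DAYS_PER_YEAR = 365
TIME_SLOTS_PER_DAY = 24  # assume the form accepts one departure hour


def count_queries(n_airports, n_days=DAYS_PER_YEAR, n_slots=TIME_SLOTS_PER_DAY):
    """Number of distinct (origin, destination, day, hour) form submissions."""
    ordered_routes = n_airports * (n_airports - 1)
    return ordered_routes * n_days * n_slots


# Cross-check the formula by enumerating the routes for the small example.
routes = list(permutations(AIRPORTS, 2))
total_queries = len(routes) * DAYS_PER_YEAR * TIME_SLOTS_PER_DAY

# Assume only London <-> Cape Town actually flies, once a month each way.
real_flights = 2 * 12
empty_fraction = 1 - real_flights / total_queries

print(f"{total_queries:,} queries for {len(AIRPORTS)} airports")
print(f"{empty_fraction:.4%} of them return nothing")
print(f"{count_queries(200):,} queries for a 200-airport network")
```

Under these invented numbers, five airports already need 175,200 form submissions, and a 200-airport network needs roughly 350 million, almost all of which come back empty.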
Another drain is the volume of data that will have to be indexed and stored, essentially doubling the storage needs for the world's data. In the traditional case, most search engines store only the text part of a web page, but databases consist of nothing but text, so there is little to leave out. For the most part, these databases have also been very well designed, using structures and keys that are invisible to the web-bots. A crawler sees only the rendered results, in which the same values are repeated on page after page. Unless the new crawlers are magicians, the crawled and indexed data will be bigger than the actual databases from which it comes.
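A toy comparison makes the size argument a little more concrete. The sketch below assumes a hypothetical two-table layout (airports stored once, flights pointing at them by key) and a hypothetical result-page format; neither comes from a real system, and the character counts are only a stand-in for storage.

```python
# Hypothetical normalized tables: each airport name is stored once and
# flights refer to it by a short key.
airports = {
    1: "London Heathrow Airport, United Kingdom",
    2: "Cape Town International Airport, South Africa",
}

# Rows are (flight_id, origin_key, destination_key, day_of_month).
flights = [(n, 1, 2, n) for n in range(1, 29)]
flights += [(n + 28, 2, 1, n) for n in range(1, 29)]


def normalized_size():
    """Characters needed to store both tables as comma-separated rows."""
    airport_chars = sum(len(f"{key},{name}") for key, name in airports.items())
    flight_chars = sum(len(",".join(map(str, row))) for row in flights)
    return airport_chars + flight_chars


def crawled_size():
    """Characters a crawler collects from the rendered result pages.

    The crawler never sees the keys, so every result spells out both
    airport names in full.
    """
    pages = [
        f"Flight {fid}: {airports[o]} to {airports[d]} on day {day}"
        for fid, o, d, day in flights
    ]
    return sum(len(page) for page in pages)


print("normalized:", normalized_size(), "characters")
print("crawled:   ", crawled_size(), "characters")
```

The crawled copy comes out larger because it repeats text that the database stores only once. A real index would add its own overhead on top of that.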
The issue then arises of what to do with these massive volumes of data. If it is possible to link up all the data in such a way that it becomes meaningful, it could be very powerful. Maybe then the tentacled monster we dread will emerge from the deep web. Or maybe you just end up with servers full of useless, boring data…