Tuesday, December 18, 2007

The Challenge of Duplicates on the Web


A few weeks back, Chad Walters, the Search Architect at Powerset, told me an interesting anecdote from his time heading the Runtime efforts at Yahoo!.
There is a fundamental problem in estimating the number of hits for search queries: duplicates in Web Search results. Duplicates and near-duplicates occur for many reasons, such as mirroring, RSS feeds, trackbacks, and so on. They add a fair amount of noise to Web Search engines' estimates of the sizes of their indexes, which are always a matter of some pride to the big guys (Google, Yahoo! & Microsoft), and rightly so.
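To make the problem concrete, here is a minimal sketch of w-shingling, one classic technique for detecting near-duplicate pages (due to Broder et al.). It is only an illustration, not how any particular engine actually does it; the 4-word shingle width and the sample sentences are assumptions of mine.

# w-shingling: represent a document by the set of all w-word
# substrings ("shingles") it contains, then compare documents by
# the Jaccard overlap of their shingle sets. Pages with high
# overlap are treated as near-duplicates.

def shingles(text, w=4):
    """Return the set of w-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A page and a lightly edited mirror of it share most of their
# shingles, so they score far above unrelated pages:
page   = "the quick brown fox jumps over the lazy dog near the river"
mirror = "the quick brown fox jumps over the lazy dog by the river"
print(jaccard(shingles(page), shingles(mirror)))  # prints 0.5

In practice an engine would not compare all pairs of pages directly; it would hash each shingle set down to a short sketch (e.g., with min-hashing) so that near-duplicates can be found at web scale.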
Anyway, a French researcher, Jean Véronis, did some analysis of the number of hits reported by search engines and writes about it in an interesting article here.
Here are some charts from his blog (http://aixtal.blogspot.com/2006/07/search-crazy-duplicates-1.html):


[Chart: Google with similar pages]

[Chart: Google without similar pages]

[Chart: Yahoo! with similar pages]

[Chart: Yahoo! without similar pages]

The fluctuation in the estimates is apparent when similar pages are included.

To quote Jean Véronis from his blog,

It’s interesting to note that:
* once duplicates are removed, Google and Yahoo’s figures are about the same;
* Yahoo’s curves are much more stable than Google’s.


I found this interesting in the context of the battle for supremacy in Web Search.
Maybe Yahoo! does a better job than it's given credit (or market share) for :)

I did some work on near-duplicate detection on the Web as part of my graduate research at Stanford, with Andreas Paepcke of the Stanford InfoLab, and that's one reason I found this interesting. My work can be found here.
(Thanks to Chad for this interesting tidbit.)
