A few weeks back, Chad Walters, the Search Architect at Powerset, told me an interesting anecdote from the time he headed the Runtime efforts at Yahoo!.
There is a fundamental problem in estimating the number of hits for search queries: duplicates in Web Search results. Duplicates and near-duplicates occur for many reasons, such as mirroring, RSS feeds, trackbacks, etc. They add a fair amount of noise to search engines' estimates of the size of their indexes, which is always a matter of some pride to the big guys (Google, Yahoo! & Microsoft), and rightly so.
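For readers wondering what "near-duplicate detection" looks like at its very simplest, here is a toy sketch in Python. It is purely illustrative (shingle size and threshold are arbitrary choices of mine), and it is not the approach used by any of these engines, nor the method from my own research: it just compares two pages by the overlap of their word shingles.

```python
# Toy near-duplicate check: word shingles + Jaccard similarity.
# The shingle size k and the threshold are illustrative values only.

def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0.0 when either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicate(doc1, doc2, threshold=0.8):
    """Flag two documents as near-duplicates if their shingle sets overlap heavily."""
    # e.g. near_duplicate(page_a, page_b) -> True when most shingles are shared
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold
```

Real systems work at a very different scale, of course, and rely on compact fingerprints rather than comparing full shingle sets pairwise; the sketch is only meant to convey the basic idea of measuring textual overlap.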
Anyway, a French researcher, Jean Véronis, did some analysis of the number of hits reported by search engines and wrote an interesting article about it here.
Here are some charts from his blog (http://aixtal.blogspot.com/2006/07/search-crazy-duplicates-1.html):
* Google with similar pages
* Google without similar pages
* Yahoo! with similar pages
* Yahoo! without similar pages
The fluctuation in the estimates is apparent when similar pages are included.
To quote Jean Véronis from his blog,
It’s interesting to note that:
* once duplicates are removed, Google and Yahoo's figures are about the same;
* Yahoo's curves are much more stable than Google's.

I found this interesting in the context of the battle for supremacy in Web Search.
Maybe Yahoo! does a better job than it's given credit (or market share) for :)
I did some work on near-duplicate detection on the Web as part of my graduate research at Stanford with Andreas Paepcke of the Stanford InfoLab, and that's one reason I found this interesting. My work can be found here.
(Thanks to Chad for this interesting tidbit)