Thursday, December 27, 2007

Roll with the punches..tomorrow is another day..


The title of today's post has only a vague connection with its substance. Today's post is about a betting strategy for roulette. This strategy has an intrinsic flaw, and I am going to leave it to my readers to ponder what the flaw is. I'll explain the flaw in a later post.
In roulette, casinos typically offer odds such that the player's expected return is negative. By the law of large numbers, the house's average take per bet converges to that edge over many bets, which is how casinos stay in business.
The betting strategy is remarkably simple. If you lose a round, you double your bet. This is also called 'doubling up' or the Martingale strategy. The intuition is that a long string of losses is very unlikely, so you hope to cover your losses when you eventually win. You have to win sometime, right? And at that point you would have recovered all your losses.
Do you see the problem with this strategy?
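
If you would rather experiment than reason it out, here is a minimal sketch of the strategy in Python. The details are my assumptions, not part of the puzzle: an even-money bet on red on an American wheel (18 winning pockets out of 38), a base bet of 1 unit, and the hypothetical function name play_martingale. It reports the net winnings and the largest bet placed, and leaves the conclusions to you.

```python
import random

def play_martingale(rounds, base_bet=1, p_win=18 / 38, rng=None):
    """Simulate `rounds` even-money bets, doubling the stake after each loss.

    Assumption: p_win defaults to betting on red on an American wheel.
    Returns (net_winnings, largest_bet_placed).
    """
    rng = rng or random.Random()
    net = 0
    bet = base_bet
    largest_bet = base_bet
    for _ in range(rounds):
        largest_bet = max(largest_bet, bet)
        if rng.random() < p_win:
            net += bet
            bet = base_bet  # win: collect and go back to the base bet
        else:
            net -= bet
            bet *= 2  # loss: double the next bet
    return net, largest_bet

if __name__ == "__main__":
    rng = random.Random(2007)
    for trial in range(5):
        net, largest = play_martingale(1000, rng=rng)
        print(f"trial {trial}: net = {net}, largest bet = {largest}")
```

Try varying the number of rounds and watch both numbers.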

Wednesday, December 19, 2007

The Monty Hall Puzzle

Here's a fun probability puzzle. It is called the Monty Hall Puzzle because it's based on an American game show, Let's Make a Deal.

Say you're on a game show where the host invites you to open one of 3 doors. One of the doors has a prize (a car) behind it and the other 2 doors have goats. The host knows which door has the prize. After you pick a door, the host opens one of the remaining 2 doors, always one that does not have the prize, and reveals a goat. Now he offers you the option of switching your choice from your current pick to the other remaining door.

The question is, what should you do?
A. It doesn't matter whether you switch; the probability of winning is the same
B. Always switch
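
If you want to check your answer empirically, the game is easy to simulate. The following is a rough sketch under the rules stated above; the function names play_monty_hall and win_rate are mine, and I assume the host picks at random when he has two goat doors to choose from.

```python
import random

def play_monty_hall(switch, rng):
    """Play one game; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is neither the original pick nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

def win_rate(switch, trials=100_000, seed=None):
    """Estimate the probability of winning for a fixed stay/switch policy."""
    rng = random.Random(seed)
    wins = sum(play_monty_hall(switch, rng) for _ in range(trials))
    return wins / trials

if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```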

Tuesday, December 18, 2007

The Challenge of Duplicates on the Web


A few weeks back, Chad Walters, the Search Architect at Powerset, told me an interesting anecdote from the time he headed the Runtime efforts at Yahoo!.
Estimating the number of hits for a search query is fundamentally hard because of duplicates in Web search results. Duplicates and near duplicates occur for many reasons, such as mirroring, RSS feeds and trackbacks. They add a fair amount of noise to search engines' estimates of the size of their indexes, which is always a matter of some pride to the big guys (Google, Yahoo! & Microsoft), and rightly so.
Anyway, a French researcher, Jean Véronis, did some analysis on the number of hits reported by search engines and wrote an interesting article about it here.
Here are some charts from his blog (http://aixtal.blogspot.com/2006/07/search-crazy-duplicates-1.html):


Google with similar pages

Google without similar pages

Yahoo! with similar pages

Yahoo! without similar pages

The fluctuation in the estimates is apparent when similar pages are included.

To quote Jean Véronis from his blog,

It’s interesting to note that:
* once duplicates are removed, Google and Yahoo’s figures are about the same;
* Yahoo’s curves are much more stable than Google’s.


I found this interesting in the context of the battle for supremacy in Web Search.
Maybe Yahoo! does a better job than it's given credit (or market share) for :)

I worked on near-duplicate detection on the Web as part of my graduate research at Stanford, with Andreas Paepcke of the Stanford InfoLab, which is one reason I found this interesting. My work can be found here.
(Thanks to Chad for this interesting tidbit)

Saturday, December 15, 2007

What has Search done to the Web?


I came across this interesting video from 1996 where Marc Andreessen is interviewed about Netscape and the future of the Web in general.
It's fascinating to hear Marc's vision of the Internet and the inevitable question of how Netscape sees Microsoft. OK, how does this relate to Web Search? Let me tell you.
When Marc is asked about the impact of the browser on the Internet, he explains it as follows (to paraphrase): the browser basically made it easier for more people to view the Web. Only as more people started viewing the Web did it make sense for more people to create Web pages. He adds a nice analogy here: it's just like how we wouldn't have books if there weren't any readers.
I see Web Search as being fundamentally similar to this. The exposure and access to information that Web search engines like Google gave to the Web have definitely fueled the growth of the WWW. While the impact of Web search on the quantity of Web pages is fairly clear, I am not too sure what the impact on quality is. Then again, quality is too subjective an attribute for the most part.

Powerset (where I currently work) is building a Natural Language Search Engine. It is starting out with a search engine for Wikipedia, and will then move to the WWW. On the same note, I think Powerset has the potential to improve the quality of Wikipedia by offering better search for it (enabling more people to find what they want, edit what they want, etc.).
Google and other Web Search Engines have contributed a fair bit to Wikipedia's growth by showing Wikipedia results in the top search results for a lot of queries.
John Battelle's blog post cites Google and Yahoo! as showing Wikipedia results in 27% and 31% of search queries respectively.
It's going to be fun to see how Google's Knol plays into all of this..

Tuesday, December 4, 2007

Wikipedia Bias

I guess a lot has been said about the nerd bias in Wikipedia.
I found this hilarious website http://www.wikigroaning.com that tries to quantify the bias (I have no idea how they get their numbers although some quick checks indicate that it is not completely random).
My favorite was World vs. World of Warcraft. :)

Google at Stanford


This blog is primarily meant to be at the crossroads of A.I. and Web Search. I guess the previous posts were more or less on the theory of probabilistic models. This time we digress a bit for some Stanford news.

Last week, Marissa Mayer was over at Stanford to give a talk at the Stanford IEEE Chapter to a packed audience. Marissa is the VP of Search Products and User Experience at Google. I have always been a fan of Marissa's and it was great to meet her in person.

Her talk revolved around the early Google days and the general philosophy that Google typically follows with its product launches and strategy. She explained these with interesting real life anecdotes. She fielded plenty of questions from all sides of the park and also stuck around for more questions after the talk.

Some of the points I remember are:
* launch early
* listen to the data
* it's OK to be unconventional
* solve big problems

When someone in the audience asked for her thoughts on the Semantic Web, she made a distinction between the 'semantic web' and 'understanding semantics'. The former refers to hand-built ontologies that enforce some form of structure; the latter, to understanding meaning and intent. She seemed to think the latter had promise.
This is exactly my stand on the topic. (Incidentally, I work at Powerset, which does the latter as well.)

She also touched upon Google's ambitions with respect to Books, Machine Translation, Earth etc.