A few weeks back, Chad Walters, the Search Architect at Powerset, told me an interesting anecdote from the time he headed the Runtime efforts at Yahoo!.
There is a fundamental problem in estimating the number of hits for search queries: duplicates in Web Search results. Duplicates and near-duplicates occur for many reasons, such as mirroring, RSS feeds, trackbacks, etc. They add a fair amount of noise to search engines' estimates of the size of their indexes, which is always a matter of some pride to the big guys (Google, Yahoo! & Microsoft), and rightly so.
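For readers wondering what "near-duplicate detection" looks like at its very simplest, here is a toy sketch in Python. It is purely illustrative (shingle size and threshold are arbitrary choices of mine), and it is not the approach used by any of these engines, nor the method from my own research: it just compares two pages by the overlap of their word shingles.

```python
# Toy near-duplicate check: word shingles + Jaccard similarity.
# The shingle size k and the threshold are illustrative values only.

def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0.0 when either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicate(doc1, doc2, threshold=0.8):
    """Flag two documents as near-duplicates if their shingle sets overlap heavily."""
    # e.g. near_duplicate(page_a, page_b) -> True when most shingles are shared
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold
```

Real systems work at a very different scale, of course, and rely on compact fingerprints rather than comparing full shingle sets pairwise; the sketch is only meant to convey the basic idea of measuring textual overlap.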
Anyway, a French researcher, Jean Véronis, did some analysis of the number of hits reported by search engines and wrote an interesting article about it here.
Here are some charts from his blog (http://aixtal.blogspot.com/2006/07/search-crazy-duplicates-1.html):
* Google with similar pages
* Google without similar pages
* Yahoo! with similar pages
* Yahoo! without similar pages
The fluctuation in the estimates is apparent when similar pages are included.
To quote Jean Véronis from his blog,
It’s interesting to note that:
* once duplicates are removed, Google and Yahoo's figures are about the same;
* Yahoo's curves are much more stable than Google's.

I found this interesting in the context of the battle for supremacy in Web Search.
Maybe Yahoo! does a better job than it's given credit (or market share) for :)
I did some work on near-duplicate detection on the Web as part of my graduate research at Stanford with Andreas Paepcke of the Stanford InfoLab, and that's one reason I found this interesting. My work can be found here.
(Thanks to Chad for this interesting tidbit)