Date: Fri, 14 Jan 94 19:35:47 PST
From: Ronny Kohavi
Message-Id: <9401150335.AA26467@Starry.Stanford.EDU>
To: George John, Karl Pfleger, Don
Subject: Be a judge for a bet
Cc: Scott Roy

Background: Scott Roy and I met today to discuss MLC++ and Scott's learning algorithm. He seemed too optimistic and complained of my criticisms. To settle such disparate views, we made a bet, and you three are asked to be the judges. We know your time is valuable, so you will each get $5 from the loser (or $2 from each party in case of a draw -- a $1 incentive to avoid a draw).

Ronny

This bet is between Ron Kohavi (referred to as Ronny) and Howard Scott Roy (referred to as Scott). The bet involves a machine learning experiment. Three judges, George John, Karl Pfleger, and Don Geddis, will determine who wins the bet, or whether it is a draw. If the judges find a winner, the loser will take the winner to a good Chinese dinner (good means that the cost of the meal for the winner is at least $30).

Details: Ronny will run C4.5, a commercially available learning algorithm. Scott will announce that he has a machine learning program BEFORE May 1, 1994. After Scott's announcement that he is ready, the judges will announce 3 datasets that they think are appropriate for comparing Scott's algorithm with Quinlan's C4.5. The datasets should consist of one artificial problem on a discrete domain, and 2 datasets for real-world problems taken from the Irvine (UCI) repository. The attribute values and names will be changed so as not to reveal the real database name. Both Ronny and Scott will receive the same TRAINING set to adjust their learning algorithms. One day will be given, after which Scott, Ronny, and a representative of the judges will meet. Scott's program and C4.5 will be executed and tested on the TEST SET (unknown to Scott and Ronny). The judges will then decide who won, or whether it is a draw.
The winner's algorithm must perform better on at least 2 out of the 3 datasets, where better means higher accuracy on the test set. If one of the programs aborts (crashes) or for some reason does not classify instances in the test set, that counts as a loss for that dataset. If Scott does not make his announcement before May 1st, he loses.

+-------------------------------------------------------------------+
| Ronny Kohavi - ronnyk@CS.Stanford.Edu                             |
|                                                                   |
| "The one real object of education is to leave a man in a          |
|  condition of continually asking questions" / Bishop Creighton    |
+-------------------------------------------------------------------+
_______________________________________________________________________________

Date: Fri, 29 Apr 94 09:55:20 -0700
From: H. Scott Roy
Message-Id: <9404291655.AA03661@Schmendrick.Stanford.EDU>
To: Ronny Kohavi
Subject: Re: Bet Scott vs Ronny
Cc: Don, George John, Karl Pfleger, Scott Roy

Hi folks,

I will indeed have a program ready. It will only be a shell of its full potential, but I'll hop on the woofing wagon early and confidently (?!) predict that C4.5 will go down to ignominious defeat. Garlic eggplant is my favorite, Ronny, just so you can start scouting the Chinese restaurants to find which one cooks up the best.

So I am hereby postdating this message to 4/30, 11:59 pm, and announcing that my program, MultiClass, is ready and waiting in its corner.

One question for Ronny: what, precisely, do you mean by accuracy? MultiClass generates probability distributions, so it can give the complete log likelihood of the test set based on its model. That measure has a distinct advantage in that a program gets penalized for making random guesses. I can, of course, also just determine the maximum-likelihood class and measure how many of those guesses are correct. Which measure shall we use? Judges?

Scott
_______________________________________________________________________________

Date: Tue, 3 May 94 14:33:13 -0700
From: H. Scott Roy
Message-Id: <9405032133.AA11132@Schmendrick.Stanford.EDU>
To: Ronny Kohavi
Subject: Game Time
Cc: Don, George John, Karl Pfleger, Scott Roy

Hi folks,

The hour is at hand. I've just unearthed the last bug I care to correct in my program and am ready to get under way. Shall we convene at a central site to run things? I'll hereby put on my most optimistic face and confidently predict a sound thrashing for C4.5.

Scott
_______________________________________________________________________________

Date: Tue, 3 May 94 14:50:59 -0700
From: H. Scott Roy
Message-Id: <9405032150.AA11161@Schmendrick.Stanford.EDU>
To: Ronny Kohavi
Subject: Re: The Last Dataset
Cc: Geddis@cs.stanford.edu, kpfleger@cs.stanford.edu, gjohn@cs.stanford.edu, hsr@cs.stanford.edu

| C4.5 runs fine on the datasets, so I guess Scott will have to pay up.
| Scott, don't forget that I'm a vegetarian when you scout for that
| restaurant. Italian is good, and Mondays and Fridays I'm going
| folk-dancing. This Thursday is fine.

Er, 'scuse me Ronny, but I haven't conceded just yet. My latest assessments leave me somewhat more optimistic than I was this morning (my training and testing routines were reading the data differently, with rather spectacular consequences for the accuracy results). Don't count your eggplants until we see which program wins. But win or lose, Thursday should be fine.

Scott
_______________________________________________________________________________

Date: Wed, 4 May 1994 14:46:22 +0800
From: Ronny Kohavi
Message-Id: <9405042146.AA28631@starry.Stanford.EDU>
To: kpfleger@hpp.stanford.edu
Cc: Scott Roy, George John, Don
Subject: Re: so what's the result?

Karl> So what happened?
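(An aside on Scott's earlier question about what "accuracy" should mean: his two candidate measures can be sketched as below. The function names, the toy dictionaries standing in for MultiClass's per-instance distributions, and the example numbers are all illustrative, not taken from the bet.)

```python
import math

def accuracy(probs, labels):
    # Fraction of instances whose most probable (maximum-likelihood)
    # class matches the true label -- the measure the bet ended up using.
    hits = sum(1 for p, y in zip(probs, labels) if max(p, key=p.get) == y)
    return hits / len(labels)

def mean_log_likelihood(probs, labels, eps=1e-12):
    # Average log-probability the model assigns to the true class.
    # Uniform guessing over k classes scores log(1/k), so hedged random
    # guesses are penalized relative to confident, correct predictions.
    return sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)

# Hypothetical predictions over two classes "a" and "b":
probs = [{"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}]
labels = ["a", "b"]
print(accuracy(probs, labels))  # 0.5: the 50/50 guess scores a miss
print(mean_log_likelihood(probs, labels))
```

Note how the two measures diverge: the 50/50 prediction costs a full point of 0/1 accuracy either way, while under log likelihood it costs exactly log(1/2), no more and no less.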
Karl> My guess is that there weren't enough instances in either
Karl> George's or my training sets to constrain the large number of
Karl> degrees of freedom of Scott's models, preventing his stuff from
Karl> finding reasonable solutions, but I've been expecting to see an
Karl> announcement from the winner....

George's dataset: C4.5: 78.7, MultiClass: 65.6, baseline: 55.3
Karl's dataset:   C4.5: 74.2, MultiClass: 65.8, baseline: 63.3

Let me point out that while choosing C4.5's flags via 10-fold cross-validation (10-CV), I wrote down that I should use -m1 for George's dataset, which would have decreased the accuracy to 76, and -m40 for Karl's, which would have increased the accuracy to 75 (both insignificant variations in accuracy). With -m40, the tree has only 11 nodes, so the model is very comprehensible. Since every leaf has at least 40 instances, it seems the Pima dataset has enough instances, but not enough discriminatory power. I'll leave it to Scott to explain what happened.

BTW, I offered Scott another bet on June 1 (same conditions), but he declined.

+---------------------------------------------------------------------+
| Ronny Kohavi - ronnyk@CS.Stanford.Edu                               |
|                                                                     |
| Picture a robot on a psychiatric couch:                             |
| Doc, my intelligence may be artificial, but my problems are real.   |
+---------------------------------------------------------------------+
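(Editor's footnote: the "baseline" figures in Ronny's table are presumably the accuracy of the standard default rule that always predicts the majority class, the floor any learner must beat. A minimal sketch of that baseline, with made-up labels rather than the bet's actual data:)

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    # Find the most common class in the training labels, then score
    # the constant prediction of that class on the test labels.
    default = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == default) / len(test_labels)

# Hypothetical two-class label sets (not the bet's datasets):
train = ["neg"] * 55 + ["pos"] * 45
test = ["neg"] * 60 + ["pos"] * 40
print(majority_baseline(train, test))  # 0.6
```

On this reading, MultiClass's 65.8 on Karl's dataset beat its 63.3 baseline by only 2.5 points, which is consistent with Karl's too-many-degrees-of-freedom guess.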