
Re: DM: Looking for three datasets

From: Ronny Kohavi
Date: Thu Dec 2 09:41:14 1999
Organization: Blue Martini Software

Jarek Sacha wrote: 
> I am trying to locate three datasets: corral, m-of-n-3-7-10, and 
> shuttle-small (3866 test, 1934 train). 
> The first two are synthetic. The last one is probably a smaller version of
> the Statlog shuttle dataset.

They're all in
Note that for many datasets we provided default "train" and
"test" sets, in case you're not doing cross-validation.

Corral is an artificial example designed to show that decision
trees might pick a really bad attribute for the root.  It's
explained in John, G., Kohavi, R., and Pfleger, K., "Irrelevant
features and the subset selection problem," in Machine Learning:
Proceedings of the Eleventh International Conference, 1994,
available off

and in my thesis (off the above web page at the top). 

The m-of-n-3-7-10 dataset represents the concept that at least
three of the bits numbered three to nine are set to one (bits one,
two, and ten are irrelevant). Such target concepts are common in
medical domains, where a patient needs to exhibit at least m of a
set of n symptoms to be diagnosed with some disease (Spackman 1988).
The most interesting things about this concept are that Naive-Bayes
is unable to learn it even though it can be represented as a
hyperplane, and that performance improves if you hide a relevant
feature (page 107 in my thesis).
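
To make the target concept concrete, here is a minimal Python sketch of the
labeling rule as described above. The names (m_of_n_label, all_instances,
RELEVANT_BITS) are hypothetical, and the sketch assumes bit one is the first
element of each tuple; the actual dataset files may order or encode the
attributes differently.

```python
from itertools import product

# Assumption: bits are numbered 1..10, and bits 3..9 are the relevant ones.
RELEVANT_BITS = range(3, 10)
M = 3  # at least this many relevant bits must be set


def m_of_n_label(bits):
    """Return 1 if at least M of the relevant bits are 1, else 0."""
    if len(bits) != 10:
        raise ValueError("expected exactly 10 bits")
    # bits[i - 1] converts the 1-based bit number to a 0-based index.
    return int(sum(bits[i - 1] for i in RELEVANT_BITS) >= M)


def all_instances():
    """Enumerate all 2**10 bit vectors together with their labels."""
    return [(bits, m_of_n_label(bits)) for bits in product((0, 1), repeat=10)]
```

For example, m_of_n_label((1, 1, 0, 0, 0, 0, 0, 0, 0, 1)) is 0 because only
irrelevant bits are set, while setting bits three, four, and five gives 1.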

   -- Ronny
