Re: DM: Classification problem

From: Earl S. Harris, Jr.
Date: Tue, 30 May 2000 08:10:43 -0400

Organization: The MITRE Corporation

"T.S. Lim" wrote:
 >
 >  >From: "Yannis Kopanas" <ikopanas@ee.upatras.gr>
 >  >To: <datamine-l@nautilus-sys.com>
 >  >Subject: DM: Classification problem
 >  >Date: Thu, 25 May 2000 07:17:18 +0300
 >  >Reply-To: datamine-l@nautilus-sys.com
 >  >
 >  >
 >  >My problem has to do with the data set. I have two classes (the good guys
 >  >and the bad guys). Unfortunately, the bad guys are only 20 while the good
 >  >guys are 99980. Does anybody know how to deal with it?
 >  >Thanks in advance.
 >  >     Yannis
 >
 > Your case is very extreme. Usually, I'd suggest playing with the prior

Extremely uneven? Yes.  Extremely uncommon? That depends on your domain.

 > probabilities and misclassification costs. How important are those 20 "bad
 > guys"?
 >

Also, if your learner doesn't allow you to set prior probabilities or
misclassification costs, you might try adding 50 copies of each bad guy
to your training sample.  I wouldn't remove good guys from your sample,
because your sample isn't insanely large (and I believe this practice
encourages overfitting).
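
A minimal sketch of the replication idea above, in plain Python. The
function name `oversample_minority`, the label strings, and the toy rows
are assumptions made for illustration; only the 20 / 99980 split and the
50 copies come from this thread.

```python
from collections import Counter


def oversample_minority(rows, labels, minority_label, copies):
    """Return new lists with `copies` extra duplicates of each minority row.

    The original rows are kept; nothing is removed from the majority class.
    """
    out_rows, out_labels = list(rows), list(labels)
    for row, label in zip(rows, labels):
        if label == minority_label:
            out_rows.extend([row] * copies)
            out_labels.extend([label] * copies)
    return out_rows, out_labels


# Toy data with the class proportions from the thread (features are made up).
rows = [[i] for i in range(100000)]
labels = ["bad"] * 20 + ["good"] * 99980

new_rows, new_labels = oversample_minority(rows, labels, "bad", copies=50)
counts = Counter(new_labels)
print(counts["bad"], counts["good"])
```

Duplicate only the training sample. If the copies leak into the data you
evaluate on, the same bad guy shows up on both sides and the scores look
better than they are.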

Basically, you want to tell the learner that classifying the bad guys is
important.

Lastly, accuracy isn't a useful metric in this domain.  By saying
everyone is a good guy, you get 99.98% accuracy (99980 out of 100000), but
no insight into catching bad guys.  Consider using precision and recall as
your metrics for measuring the effectiveness of your rules. Informally, if
some rule identifies X members as bad and Y of them were actually bad, the
rule's precision is Y/X. And if your sample has Z bad guys, that same
rule's recall is Y/Z.
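
The X, Y, Z definitions above can be sketched in a few lines of Python.
The function name `precision_recall`, the record ids, and the example rule
are hypothetical; only the class counts come from this thread.

```python
def precision_recall(flagged_bad, actual_bad):
    """Precision = Y/X and recall = Y/Z, with
    X = records the rule flags as bad,
    Y = flagged records that really are bad,
    Z = all bad records in the sample.
    """
    flagged_bad, actual_bad = set(flagged_bad), set(actual_bad)
    x = len(flagged_bad)
    y = len(flagged_bad & actual_bad)
    z = len(actual_bad)
    precision = y / x if x else 0.0  # a rule that flags nobody scores 0 here
    recall = y / z if z else 0.0
    return precision, recall


total = 100000
actual_bad = set(range(20))  # ids 0..19 play the 20 bad guys

# "Everyone is a good guy": nothing is flagged.
baseline_accuracy = (total - len(actual_bad)) / total
baseline_p, baseline_r = precision_recall(set(), actual_bad)

# A made-up rule that flags 40 records, 10 of which are really bad.
flagged = set(range(10)) | set(range(100, 130))
rule_p, rule_r = precision_recall(flagged, actual_bad)

print(baseline_accuracy, baseline_p, baseline_r)
print(rule_p, rule_r)
```

The baseline is almost perfectly accurate yet has zero recall, while the
made-up rule has precision 10/40 and recall 10/20, which says far more
about how well it catches bad guys.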

I hope this helps.

Earl Harris Jr.

 > --
 > T.S. Lim
 > tslim@recursive-partitioning.com
 > www.Recursive-Partitioning.com
 >
 > ------------------------------------------------------------
 > Get paid to write review! http://recursive-partitioning.epinions.com
