Building classifiers with unrepresentative training instances: Experiences from the KDD Cup 2001 competition

Anurandha, B.; Janakiraman, Anand; Sarawagi, Sunita; Haritsa, Jayant (2002) Building classifiers with unrepresentative training instances: Experiences from the KDD Cup 2001 competition. Workshop on Data Mining Lessons.

Abstract

In this paper we discuss our experiences in participating in the KDD Cup 2001 competition. The task involved classifying organic molecules as either active or inactive in their binding to a receptor. The classification task presented three challenges: a highly skewed class distribution, a number of features exceeding the training set size by two orders of magnitude, and non-representative training instances. Of these, we found the third challenge the most interesting and novel. We present our process of experimenting with a number of classification methods before finally converging on an ensemble of decision trees constructed using a novel attribute partitioning method. Decision trees provided a partial shield from the differences in data distribution, and the ensemble provided stability by exploiting the redundancy in the large set of features. Finally, we employed semi-supervised learning to incorporate characteristics of the test set into the classification model. We were the second runners-up in the competition. We followed up the competition with further research in semi-supervised learning and obtained an accuracy higher than that of the winning entry.
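
The abstract does not describe the attribute partitioning method itself, so the following is only a minimal illustrative sketch of the general idea of training one tree per disjoint block of attributes and combining them by majority vote. Everything in it is an assumption made for illustration: binary features, a random split of attributes into blocks, depth-1 trees (stumps) standing in for full decision trees, and the hypothetical names `partition_attributes`, `fit_stump`, `fit_ensemble`, and `predict`. It is not the authors' method.

```python
import random
from collections import Counter

def partition_attributes(n_features, n_blocks, seed=0):
    """Split feature indices into disjoint blocks (random split; an assumption)."""
    idx = list(range(n_features))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_blocks] for i in range(n_blocks)]

def fit_stump(X, y, features):
    """Depth-1 tree over binary features: keep the feature with fewest training errors."""
    default = Counter(y).most_common(1)[0][0]
    best = None
    for f in features:
        # Count labels on each side of the split on feature f.
        sides = {0: Counter(), 1: Counter()}
        for row, label in zip(X, y):
            sides[row[f]][label] += 1
        # Each side predicts its majority label (overall majority if the side is empty).
        rule = {v: (c.most_common(1)[0][0] if c else default) for v, c in sides.items()}
        errors = sum(rule[row[f]] != label for row, label in zip(X, y))
        if best is None or errors < best[0]:
            best = (errors, f, rule)
    return best[1], best[2]

def fit_ensemble(X, y, n_blocks, seed=0):
    """Train one stump per attribute block."""
    blocks = partition_attributes(len(X[0]), n_blocks, seed)
    return [fit_stump(X, y, block) for block in blocks]

def predict(ensemble, row):
    """Majority vote over the per-block stumps."""
    votes = Counter(rule[row[f]] for f, rule in ensemble)
    return votes.most_common(1)[0][0]

# Toy data: features 0-2 are redundant copies of the label, feature 3 is uninformative.
X = [[1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 1, 0], [0, 0, 0, 0]]
y = [1, 0, 1, 0]
ensemble = fit_ensemble(X, y, n_blocks=3)
```

Because the informative features are redundant, most blocks contain at least one of them, so the vote stays correct even when one block holds only the uninformative attribute. This is the kind of stability through redundancy the abstract refers to, shown here on toy data only.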

Item Type:	Article
Source:	Copyright of this article belongs to ResearchGate GmbH
ID Code:	128417
Deposited On:	20 Oct 2022 09:02
Last Modified:	14 Nov 2022 11:43