Anurandha, B. ; Janakiraman, Anand ; Sarawagi, Sunita ; Haritsa, Jayant (2002) Building classifiers with unrepresentative training instances: Experiences from the KDD Cup 2001 competition. Workshop on Data Mining Lessons.
Abstract
In this paper we discuss our experiences in participating in the KDD Cup 2001 competition. The task involved classifying organic molecules as either active or inactive in their binding to a receptor. The classification task presented three challenges: a highly skewed class distribution, a large number of features, exceeding the training set size by two orders of magnitude, and non-representative training instances. Of these, we found the third challenge the most interesting and novel. We present our process of experimenting with a number of classification methods before finally converging on an ensemble of decision trees constructed using a novel attribute partitioning method. Decision trees provided a partial shield from the differences in data distribution, and the ensemble provided stability by exploiting the redundancy in the large set of features. Finally, we employed semi-supervised learning to incorporate characteristics of the test set into the classification model. We were the second runner-up in the competition. We followed up the competition with further research in semi-supervised learning and obtained an accuracy higher than that of the winning entry.
Item Type: | Article |
---|---|
Source: | Copyright of this article belongs to ResearchGate GmbH |
ID Code: | 128417 |
Deposited On: | 20 Oct 2022 09:02 |
Last Modified: | 14 Nov 2022 11:43 |