New rule induction algorithms with improved noise tolerance and scalability
Abstract

As data storage capacities continue to increase due to rapid advances in information technology, there is a growing need for scalable data mining algorithms able to sift through large volumes of data in a short amount of time. Moreover, unlike artificially prepared data, real-world data is inherently imperfect owing to the presence of noise. Consequently, there is also a need for robust algorithms capable of handling noise, so that the discovered patterns are reliable and exhibit good predictive performance on future data. This has led to ongoing research in data mining aimed at developing algorithms that are both scalable and robust. The most straightforward approach to scalability is to design efficient algorithms that can process large datasets in a relatively short time. Efficiency may be achieved by employing suitable rule mining constraints that drastically cut down the search space. The first part of this thesis focuses on improving a state-of-the-art rule induction algorithm, RULES-6, which incorporates certain search space pruning constraints in order to scale to large datasets. However, these constraints are insufficient and have not been exploited fully, so the algorithm generates overly specific rules, which increases both learning time and the size of the rule set. To address these issues, a new algorithm, RULES-7, is proposed, which uses deep rule mining constraints from association learning. This results in a significant drop in execution time on large datasets while boosting the classification accuracy of the model on future data. A novel comparison heuristic is also proposed for the algorithm, which further improves classification accuracy without increasing execution time.
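To make the idea of rule induction with search-space pruning concrete, the following is a minimal sketch of a greedy rule specialisation step with a simple minimum-coverage constraint that discards overly specific candidate rules early. This is an illustrative toy, not the RULES-6/7 algorithm itself; the function names and the `min_cover` parameter are assumptions made for the example.

```python
def covers(rule, example):
    """A rule is {attribute_index: required_value}; an empty rule covers everything."""
    return all(example[a] == v for a, v in rule.items())

def learn_one_rule(examples, labels, target, min_cover=2):
    """Greedily specialise a rule for class `target`, pruning candidate
    specialisations that cover fewer than `min_cover` examples -- a simple
    search-space constraint in the spirit of those discussed in the thesis."""
    rule, current_purity = {}, 0.0
    while True:
        best, best_purity = None, current_purity
        for a in range(len(examples[0])):
            if a in rule:
                continue
            for v in {e[a] for e in examples}:
                trial = {**rule, a: v}
                covered = [l for e, l in zip(examples, labels) if covers(trial, e)]
                if len(covered) < min_cover:   # pruning constraint: skip over-specific rules
                    continue
                purity = covered.count(target) / len(covered)
                if purity > best_purity:
                    best, best_purity = trial, purity
        if best is None:
            return rule          # no admissible specialisation improves purity
        rule, current_purity = best, best_purity
        if current_purity == 1.0:
            return rule          # rule covers only target-class examples
```

On a toy weather-style dataset, `learn_one_rule([('sunny','hot'), ('sunny','mild'), ('rainy','hot'), ('rainy','mild')], ['yes','yes','no','no'], 'yes')` specialises on the first attribute and stops, since that single condition already yields a pure rule. A tighter coverage threshold prunes more aggressively, trading rule specificity for speed, which is the trade-off the thesis addresses.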
Since an overwhelming majority of induction algorithms are unable to handle the continuous data ubiquitous in the real world, it is also necessary to develop an efficient discretisation procedure whereby continuous attributes can be treated as discrete. By generalising the raw continuous data, discretisation speeds up the induction process and yields a simpler, more intelligible model that is also more accurate on future data. Many preprocessing discretisation techniques have been proposed to date, of which the entropy-based technique has by far been accepted as the most accurate. However, the technique is suboptimal for classification because it fails to identify the cut points within the value range of each class for a continuous attribute, which degrades classification accuracy. The second part of this thesis presents a new discretisation technique which utilises the entropy-based principle but takes a class-centred approach to discretisation. The proposed technique not only increases the efficiency of rule induction but also improves the classification accuracy of the induced model. Another issue with existing induction algorithms concerns the way covered examples are dealt with when a new rule is formed. To avoid problems such as fragmentation and small disjuncts, the RULES family of algorithms marks covered examples instead of removing them. This, however, tends to increase overlap between rules. The third part of this thesis proposes a new hybrid pruning technique to address the overlap issue and thereby reduce the rule set size. It also proposes an incremental post-pruning technique designed specifically to handle noisy data. This leads to improved induction performance as well as better classification accuracy.
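The entropy-based principle mentioned above can be sketched as follows: candidate cut points between adjacent distinct values of a sorted continuous attribute are scored by the class-entropy of the two resulting partitions, and the cut minimising the weighted entropy is chosen. This is a minimal single-cut sketch of the classic entropy approach, not the class-centred technique proposed in the thesis; the function names are assumptions for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the binary split of a
    continuous attribute that minimises class entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_point, best_h = None, float("inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # only cut between distinct attribute values
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        h = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if h < best_h:
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_h = h
    return best_point, best_h
```

For example, with values `[1, 2, 3, 10, 11, 12]` and labels `['a', 'a', 'a', 'b', 'b', 'b']`, the best cut is 6.5 with weighted entropy 0, since both partitions are pure. The limitation the thesis targets is visible here: the criterion evaluates only global boundaries between class regions, without examining the value range occupied by each class individually.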
Shehzad, Khurram (2009). New rule induction algorithms with improved noise tolerance and scalability. PhD Thesis, Cardiff University. File: 54945/1/U585334.pdf