Chapter 1
Introduction
Data mining is a tool for analysing data and recovering the essential patterns from it. In order to enhance the accuracy of data analysis, a new kind of data model is introduced; this chapter presents the domain of study and an overview of its objectives and motivations. In addition, the organization of the remaining document is reported in this chapter.
1.1 Introduction
Data mining is a tool for analysing input data and extracting hidden, informative patterns; in doing so it establishes the relationships between the available attributes. Broadly, two kinds of classification algorithms are available. The first are transparent models, in which a data structure or rule set is developed for identifying the class labels. The second are opaque models, in which the relationships between attributes are computed in a hidden manner and are not clearly visible at first sight. However, the accuracy of opaque models is usually much higher than that of transparent models.
The decision tree algorithm is a popular classification technique in machine learning and data mining. In the presented study, decision tree algorithms are investigated and their performance is improved using available techniques. Decision tree algorithms are transparent model-development algorithms used for pattern recognition and learning. They differ from problem-solving trees, but can be used for problem solving with a small set of changes. The proposed decision tree algorithm includes a small set of modifications to improve the classification accuracy and the computational cost of tree growth.
This chapter gives a basic overview of the study domain. Additionally, the motivation, the objectives and general machine learning issues are discussed.
1.2 Motivation
Data mining, or knowledge discovery, is seen as an important tool for modern business to transform data into an informational advantage. Mining is a process of finding correlations among dozens of fields in relational databases and extracting useful information that can be used to increase revenue, cut costs, or both. Classification is a supervised machine learning process and an important task in data mining. Scalability is a major issue when mining large data sets. Thangaparvathi B et al [1] present a scalable decision tree algorithm which requires only one pass over a huge dataset for classification and is also efficient compared with available methods such as SLIQ, SPRINT and RainForest. It overcomes the drawback of the RainForest algorithm, which addresses the scalability issue but requires a pass over the dataset at each level of decision tree construction. The experimental results show that the given algorithm outperforms the RainForest algorithm for decision tree construction in the time dimension. Additionally, the algorithm significantly reduces the sorting cost and hence the whole execution time.
1.3 Objectives
The main objective of the proposed study is to develop an efficient and effective decision tree algorithm that improves on the accuracy and computational complexity of the traditional algorithms. Therefore the following work is included in the proposed study.
Study of different decision tree algorithms: in this phase, different frequently used algorithms and their functioning are studied.
Study of different issues in decision tree algorithms: in this phase, the disadvantages of these algorithms are evaluated.
Design and implementation of a new decision tree algorithm: in this phase, using the collected literature, a new algorithm is designed and implemented using JAVA technology.
Performance study of the proposed algorithm: in this phase, the performance of the proposed algorithm is evaluated and compared with the traditional algorithms.
This section has outlined the work involved in the proposed study. The next section provides background on machine learning and its issues.
1.4 General Issues of Supervised Learning Algorithms
Inductive machine learning is the process of learning a set of rules from instances (examples in a training set) or, more generally speaking, creating a classifier that can be used to generalize to new instances. The process of applying supervised ML to a real-world problem is described in Figure 1.1.
The first step is collecting the dataset. If a requisite expert is available, then s/he could suggest which fields (attributes, features) are the most informative. If not, then the simplest method is that of ‘brute-force,’ which means measuring everything available in the hope that the right (informative, relevant) features can be isolated. However, a dataset collected by the ‘brute-force’ method is not directly suitable for induction. It contains in most cases noise and missing feature values, and therefore requires significant pre-processing [2].
The second step is data preparation and pre-processing. Depending on the circumstances, researchers have a number of methods to choose from to handle missing data. Hodge & Austin (2004) have introduced a survey of contemporary techniques for outlier (noise) detection, identifying the techniques' advantages and disadvantages.

Instance selection is used not only to handle noise but also to cope with the infeasibility of learning from very large datasets. Instance selection in these datasets is an optimization problem that attempts to maintain the mining quality while minimizing the sample size. It reduces data and enables a data mining algorithm to function and work effectively with very large datasets. There is a variety of procedures for sampling instances from a large dataset.

Feature subset selection is the process of identifying and removing as many irrelevant and redundant features as possible. This reduces the dimensionality of the data and enables data mining algorithms to operate faster and more effectively. The fact that many features depend on one another often unduly influences the accuracy of supervised ML classification models. This problem can be addressed by constructing new features from the basic feature set; this technique is called feature construction/transformation. These newly generated features may lead to the creation of more concise and accurate classifiers. In addition, the discovery of meaningful features contributes to better comprehensibility of the produced classifier, and a better understanding of the learned concept [3].
Figure 1.1 process of supervised ML
1.5 Decision Tree Improvement Issues
Decision tree structures are simple and can easily be turned into rules, and they are therefore a favoured technique for building understandable models. Because of this clarity they are also suitable for more complex business applications.
Various decision tree tools are available, each with its own advantages and disadvantages; examples include Mind tree, AC2 and others.
Handling the present software is not an easy task: the tools are complex to work with and require expert hands to obtain the required analysis. The following are issues with the present decision tree tools [4].
1.5.1 Requires Experience
Business owners and managers must have a certain level of experience to complete decision-tree analysis, a technique often used in finance or management accounting-related business functions. Decision trees typically require certain knowledge of quantitative or statistical experience to complete the process accurately. Failing to accurately understand decision trees can lead to a garbled outcome of business opportunities or decision possibilities [4].
1.5.2 Incomplete Information
Decision trees typically require internal and external information relating to the problem domain and its operating environment. Business owners must be able to gather the basic pieces of information to assess accurately the opportunities listed on the decision tree. It can also be difficult to include variables on the decision tree, exclude duplicate information or express information in a logical, consistent manner. They must also decide whether the decision tree should represent dollars, percentages or a combination. Completing the decision tree using only one consistent set of information can therefore be difficult [4].
1.5.3 Too Much Information
While incomplete information can create difficulties in the decision-tree process, too much information can also be an issue. Rather than making a decision and advancing their company's mission or vision, owners and managers spend more time looking at decision trees. Decision trees can require more analysis than other analysis methods and slow down the decision-making process. Keeping all these problems in mind, after evaluating the above tools we found that not all of the important parameters for evaluating a model for business applications are covered, and that an ensemble can be merged into the models to enhance their performance [4].
1.6 Document Organization
This section briefly introduces the organization of the thesis and the contents of each chapter. The remaining document is organized as follows.
The introduction to the thesis is given in Chapter 1. This section describes the general introduction, the need and theory.
Chapter 2 reviews different existing and emerging technologies that are related to the work presented in this thesis under the title literature survey.
Chapter 3 analyses the whole research work by explaining in detail the work performed.
Chapter 4 gives the implementation of the solution and the suggested approach for solving the problem.
Chapter 5 describes the various results that are obtained after the complete implementation of the project.
Chapter 6 provides conclusions from the work described in previous chapters and discusses possibilities for future development.
Chapter 2
Literature survey
This chapter provides a deep understanding of the research background and the existing technology which supports the proposed work. In addition, recent efforts and contributions on decision tree based approaches are also given in this chapter.
2.1 Decision Trees
The most common data mining and decision support algorithms are neural networks, logistic regression and decision trees. Among these classification algorithms, decision trees are the most commonly used because they are easy to understand and cheap to implement. They provide a modelling technique that is easy for humans to understand and that simplifies the classification process.
Decision tree classifiers are used successfully in many diverse areas such as radar signal classification, character recognition, remote sensing, medical diagnosis, expert systems and speech recognition, to name only a few. Perhaps the most important feature of decision tree classifiers is their capability to break down a complex decision-making process into a collection of simpler decisions, thus providing a solution which is often easier to interpret.
Decision tree induction is a classification method and a supervised learning algorithm. A supervised learning algorithm (such as classification) is often preferred to an unsupervised one (such as clustering) because its prior knowledge of the class labels of data records makes feature/attribute selection easy, and this leads to good prediction/classification accuracy. The early decision tree algorithms are CLS and ID3; C4.5 is an improved version of ID3 which introduced new methods and functions, such as adopting the information gain ratio, handling continuous attributes, and validating the model by k-fold cross-validation, as mentioned in [5, 6]. It has been broadly applied to information extraction from remote sensing images, disaster weather forecasting, correlation analysis of environmental variables, and so on.
During the training process, the decision tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. The most common methods for evaluating the splits are the gini index and entropy.
The information gain ratio is the basis for choosing the split attribute of the decision tree in C4.5: the attribute that has the maximum gain ratio is chosen for the root node of the decision tree. To construct the decision tree for a training set T, T is divided into n subsets in accordance with the calculated gain ratio. If all the tuples contained in a subset Ti belong to the same class, the node becomes a leaf node of the decision tree and splitting stops. Every subset of T that does not satisfy this condition is split recursively to construct the branches of the tree, until all the tuples contained in a subset belong to the same category. After generating a decision tree, we can extract rules from the tree and use them to classify new data.
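To make the split-selection step concrete, the following is a minimal Java sketch of how a C4.5-style gain ratio can be computed from class-frequency counts. It is illustrative only: the class name, the array-based class distributions and the method signatures are assumptions for the example, not the thesis implementation.

import java.util.List;

public class GainRatio {

    // Shannon entropy (base 2) of a class distribution given as counts.
    static double entropy(int[] classCounts, int total) {
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Gain ratio of splitting 'total' records (class distribution
    // 'parentCounts') into the given partitions, one per attribute value;
    // each partition carries its own class distribution.
    static double gainRatio(int[] parentCounts, int total, List<int[]> partitions) {
        double infoGain = entropy(parentCounts, total);
        double splitInfo = 0.0;
        for (int[] part : partitions) {
            int size = 0;
            for (int c : part) size += c;
            if (size == 0) continue;
            double w = (double) size / total;
            infoGain -= w * entropy(part, size);          // information gain
            splitInfo -= w * (Math.log(w) / Math.log(2)); // split information
        }
        // Guard against a trivial split before dividing.
        return splitInfo == 0.0 ? 0.0 : infoGain / splitInfo;
    }
}

The attribute whose gain ratio is largest would then be chosen as the split attribute, exactly as described above.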
2.2 Advantages of Decision Tree
Decision trees have various advantages [7]:
Simple to understand and construct. It’s easy to understand decision tree models after a short explanation.
Requires little data preparation. Other techniques often require data normalization, dummy variables to be created and blank values to be removed.
Able to handle both numerical and categorical data. Other techniques are usually specialized for datasets that have only one type of variable. For example, association rules can be used only with nominal variables, while neural networks can be used only with numerical variables.
Uses a white box model. If a given situation is visible in a model the justification for the circumstance is easily explained by Boolean logic. An example of a black box model is an artificial neural network since the explanation for the results is complex to understand.
Possible to validate a model using statistical tests. This makes it possible to account for the reliability of the model.
Robust. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
Performs well with large data in a short time. Large amounts of data can be analysed using standard computing resources.
2.3 Disadvantages of Decision Tree
Decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree.
Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning are necessary to avoid this problem.
There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. In such cases, the decision tree becomes prohibitively large. Approaches to solve the problem involve either changing the representation of the problem domain (known as propositionalisation) or using learning algorithms based on more expressive representations (such as statistical relational learning or inductive logic programming).
For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favour of those attributes with more levels [7].
2.4 Decision Tree Algorithms
This section describes the different decision tree algorithms from which the issues and solutions are derived.
2.4.1 Supervised Learning in Quest (SLIQ) Algorithm
SLIQ is a decision tree classifier that can handle both numerical and categorical attributes and builds compact and accurate trees. It uses a pre-sorting technique in the tree-growing phase and an inexpensive pruning algorithm. It is suitable for the classification of large disk-resident datasets, independently of the number of classes, attributes and records [8].
Tree Building
MakeTree(Training Data T)
    Partition(T)

Partition(Data S)
    if (all points in S are in the same class)
        then return;
    evaluate splits for each attribute A;
    use the best split to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
The gini index is used to evaluate the 'goodness' of the alternative splits for an attribute.
If a data set T contains examples from n classes, gini(T) is defined as

gini(T) = 1 - Σ_j (p_j)^2

where p_j is the relative frequency of class j in T. After splitting T into two subsets T1 and T2, the gini index of the split data is defined as

gini_split(T) = (|T1|/|T|) gini(T1) + (|T2|/|T|) gini(T2)
The first technique implemented by SLIQ is a scheme that eliminates the need to sort data at each node. It creates a separate list for each attribute of the training data, and a separate list, called the class list, for the class labels attached to the examples. SLIQ requires that the class list and only one attribute list be kept in memory at any time.
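As an illustration of the formulas above, here is a minimal Java sketch that computes gini(T) and gini_split(T) from precomputed class-frequency counts; the class and method names are assumptions for the example, not part of SLIQ itself.

public class Gini {

    // gini(T) = 1 - sum_j (p_j)^2, where p_j is the relative frequency
    // of class j among 'total' records.
    static double gini(int[] classCounts, int total) {
        double sum = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sum += p * p;
        }
        return 1.0 - sum;
    }

    // gini_split(T) = (|T1|/|T|) gini(T1) + (|T2|/|T|) gini(T2);
    // the candidate split with the smallest value is the best one.
    static double giniSplit(int[] t1Counts, int t1Total,
                            int[] t2Counts, int t2Total) {
        int total = t1Total + t2Total;
        return ((double) t1Total / total) * gini(t1Counts, t1Total)
             + ((double) t2Total / total) * gini(t2Counts, t2Total);
    }
}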
2.4.2 The Rain-Forest Framework
We first introduce the well-known greedy top-down decision tree induction schema. Then we show how this schema can be refined to the generic RainForest Tree Induction Schema, and detail how the separation of scalability issues from quality concerns is achieved [9].
Decision tree algorithms build the tree top-down in the following way: at the root node r, the database is examined and the best splitting criterion crit(r) is computed. Recursively, at a non-root node n, its partition f(n) is examined and from it crit(n) is computed. (This is the well-known schema for top-down decision tree induction; for example, a specific instance of this schema exists for binary splits.) This schema is given in Table 2.1. A thorough examination of the algorithms in the literature shows that the greedy schema can be refined to the generic RainForest Tree Induction Schema shown in Table 2.1. Most decision tree algorithms (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, SPRINT and QUEST) proceed according to this generic schema, and we do not know of any algorithm in the literature that does not adhere to it. In the remainder of this section, we denote by CL a representative classification algorithm.
Note that at a node n, the utility of a predictor attribute a as a possible splitting attribute is examined independently of the other predictor attributes: the information about the class label distribution for each distinct attribute value of a is sufficient. We define the AVC-set of a predictor attribute a at node n to be the projection of f(n) onto a and the class label, whereby counts of the individual class labels are aggregated. We define the AVC-group of a node n to be the set of all AVC-sets at node n. (The acronym AVC stands for Attribute-Value, Class label.) Note that the size of the AVC-set of a predictor attribute a at node n depends only on the number of distinct attribute values of a and the number of class labels in f(n).
Input: node n, data partition D, classification algorithm CL
Output: decision tree for D rooted at n

Top-Down Decision Tree Induction Schema:
BuildTree(Node n, data partition D, algorithm CL)
(1) Apply CL to D to find crit(n)
(2) let k be the number of children of n
(3) if (k > 0)
(4)     Create k children c_1, ..., c_k of n
(5)     Use the best split to partition D into D_1, ..., D_k
(6)     for (i = 1; i <= k; i++)
(7)         BuildTree(c_i, D_i)
(8)     end for
(9) end if

RainForest Refinement:
(1a) for each predictor attribute p
(1b)     Call CL to find the best partitioning using the AVC-set of p
(1c) end for
(2a) k = splitting decided by CL
Table 2.1 RainForest algorithm
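To make the AVC-set idea concrete, the following illustrative Java sketch builds the AVC-set of a single categorical predictor attribute in one scan over a partition. The record layout (string arrays with fixed attribute and class-label columns) is an assumption for the example, not part of the RainForest framework itself.

import java.util.HashMap;
import java.util.Map;

public class AvcSet {

    // Builds the AVC-set of one predictor attribute: for every distinct
    // attribute value, the aggregated count of each class label in the
    // partition f(n). A single scan over the partition is sufficient, and
    // the result size depends only on the number of distinct values and
    // class labels, not on the partition size.
    static Map<String, Map<String, Integer>> build(String[][] partition,
                                                   int attrIndex, int classIndex) {
        Map<String, Map<String, Integer>> avc =
                new HashMap<String, Map<String, Integer>>();
        for (String[] record : partition) {
            String value = record[attrIndex];
            String label = record[classIndex];
            Map<String, Integer> counts = avc.get(value);
            if (counts == null) {
                counts = new HashMap<String, Integer>();
                avc.put(value, counts);
            }
            Integer c = counts.get(label);
            counts.put(label, c == null ? 1 : c + 1);
        }
        return avc;
    }
}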
2.5 Related Study
This section lists the different contributions and modified algorithms which provide guidelines for the design and development of the new data model.
The main objective of the paper by Sanjay Kumar Malik et al [10] is to compare decision tree classification algorithms for data analysis. Classification is an important task in data mining, because today's databases are rich with hidden information that can be used for making intelligent business decisions. To extract that information, classification is a form of data analysis that can be used to build models describing important data classes or to predict future data trends. Several classification techniques have been proposed over the years, e.g., neural networks, genetic algorithms, the naive Bayesian approach, decision trees, the nearest-neighbour method, etc. In this paper attention is restricted to the decision tree technique after considering all its advantages compared to other techniques. There exist a large number of algorithms for inducing decision trees, such as CHAID, FACT, C4.5 and CART, but in this paper five decision tree classification algorithms are considered: ID3, SLIQ, SPRINT, PUBLIC and RAINFOREST.

Aimed at the problems of huge computation, large tree size and over-fitting of the testing data for multivariate decision tree (MDT) algorithms, Dianhong Wang et al [11] proposed a novel rough-set-based multivariate decision tree (RSMDT) method. In this paper, the positive region degree of condition attributes with respect to decision attributes in rough set theory is used for selecting attributes in multivariate tests. A new concept of extended generalization of one equivalence relation corresponding to another is introduced and used for the construction of multivariate tests. They experimentally test the RSMDT algorithm in terms of classification accuracy, tree size and computing time, using 36 UCI Machine Learning Repository data sets selected via the Weka platform, and compare it with C4.5, classification and regression trees (CART), classification and regression trees with linear combinations (CART-LC), Oblique Classifier 1 (OC1), and Quick Unbiased Efficient Statistical Trees (QUEST). The experimental results indicate that the RSMDT algorithm significantly outperforms the comparison classification algorithms, with improved classification accuracy, relatively small tree size and shorter computing time.
Relational databases are the most popular repository for structured data, and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Classification is an important task in data mining and machine learning which has been studied extensively; because of its usefulness, the development of classification across multiple database relations becomes important. Multi-relational classification is the procedure of building a classifier based on information stored in multiple relations and making predictions with it. There are many popular approaches for finding patterns in data. T. Hemalatha et al [12] provide an insight into various classification methods including ILP (Inductive Logic Programming), relational database, emerging pattern and associative approaches. Their characteristics and detailed comparisons have also been provided.
According to S. S. Thakur et al [13], a classification technique (or classifier) is a systematic approach used in building classification models from an input data set. The model generated by the learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before. Therefore, a key objective of the learning algorithm is to build models with good generalization capability, i.e. models that accurately predict the class labels of previously unknown records. The accuracy or error rate computed from the test set can also be used to compare the relative performance of different classifiers on the same domain. Using a decision tree classifier on test records whose class labels were known, the accuracy obtained was good and the average error rate was equally acceptable.
Small footprint LiDAR (Light Detection And Ranging) has been proposed as an effective tool for measuring detailed biophysical characteristics of forests over broad spatial scales. However, by itself LiDAR yields only a sample of the true 3D structure of a forest. In order to extract useful, forestry-relevant information, this data must be interpreted using mathematical models and computer algorithms that infer or estimate specific forest metrics. For these outputs to be useful, algorithms must be validated and/or calibrated using a subsample of 'known' metrics measured using more detailed, reliable methods such as field sampling. Lei Wang et al [14] describe a novel method for delineating and deriving metrics of individual trees from LiDAR data based on watershed segmentation. Because of the costs involved with collecting both LiDAR data and field samples for validation, they use synthetic LiDAR data to validate and assess the accuracy of the algorithm. This synthetic LiDAR data is generated using a simple geometric model of Loblolly pine (Pinus taeda) trees and a simulation of LiDAR sampling. Their results suggest that point densities greater than 2, and preferably greater than 4, points per m² are necessary to obtain accurate forest inventory data from Loblolly pine stands. However, the results also demonstrate that the detection errors (i.e. the accuracy and biases of the algorithm) are intrinsically related to the structural characteristics of the forest being measured. They argue that experiments with synthetic data are directly useful to forest managers to guide the design of operational forest inventory studies. In addition, the development of LiDAR simulation models and experiments with the data they generate represents a fundamental and useful approach to designing, improving and exploring the accuracy and efficiency of LiDAR algorithms.
Zeynep Akata et al [15] benchmark several SVM objective functions for large-scale image classification. They consider one-vs-rest, multi-class, ranking, and weighted approximate ranking SVMs. A comparison of online and batch methods for optimizing the objectives shows that online methods perform as well as batch methods in terms of classification accuracy, but with a significant gain in training speed. Using stochastic gradient descent, they can scale the training to millions of images and thousands of classes. Experimental evaluation shows that ranking-based algorithms do not outperform the one-vs-rest strategy when a large number of training examples are used. Furthermore, the gap in accuracy between the different algorithms shrinks as the dimension of the features increases. They also show that learning the optimal rebalancing of positive and negative examples through cross-validation can result in a significant improvement for the one-vs-rest strategy. Finally, early stopping can be used as an effective regularization strategy when training with online algorithms. Following these 'good practices', they were able to improve the state of the art on a large subset of 10K classes and 9M images of ImageNet from 16.7% Top-1 accuracy to 19.1%.

Ekapong Chuasuwan et al [16] present a novel application of genetic algorithms to feature selection. The main purpose is to provide a proper subset of features for decision tree construction in the classification task. A new method using a 'Significant Matrix' within the genetic algorithm is presented. Its main function is to calculate the relationship between each feature and the class label, which is assigned as a fitness value for the population. The algorithm selects important features by considering the class of the data and the smallest number of features in the Significant Matrix; it then updates the feature number and the record number and repeats the process until a stop condition is met. Classification by decision tree is used to verify the importance of the features selected by the proposed method. The model was tested with 11 different datasets. The results show that the method yields high classification accuracy and higher satisfaction compared to classification using an artificial neural network. Experimental results show that the proposed method not only provides higher accuracy, but also reduces complexity by using fewer features of the dataset.
Yael Ben-Haim et al [17] propose a new algorithm for building decision tree classifiers. The algorithm is executed in a distributed environment and is especially designed for classifying large data sets and streaming data. It is empirically shown to be as accurate as a standard decision tree classifier, while being scalable for processing of streaming data on multiple processors. These findings are supported by a rigorous analysis of the algorithm's accuracy. The essence of the algorithm is to quickly construct histograms at the processors, which compress the data to a fixed amount of memory. A master processor uses this information to find near-optimal split points for terminal tree nodes. The analysis shows that guarantees on the local accuracy of split points imply guarantees on the overall tree accuracy.
Automatic electrocardiogram (ECG) heart beat classification is significant for the diagnosis of heart failures. The purpose of this study is to evaluate the effect of the C4.5 decision tree method in creating a model that will detect and separate normal and congestive heart failure (CHF) cases on long-term ECG time series. The research was conducted in two stages: feature extraction using an autoregressive (AR) module, and classification by applying the C4.5 decision tree method. The ECG signals were obtained from the BIDMC Congestive Heart Failure database and classified in different experiments. In the experimental results, Zerina Mašetić et al [18] showed that the proposed method reached 99.86% classification accuracy (sensitivity 99.77%, specificity 99.93%, area under the ROC curve 0.998) and has potential in detecting congestive heart failures.
P. Rajendran et al [19] provide a scheme for classifying images according to their description; the main focus of image mining in the proposed method is the classification of brain tumours in CT scan brain images. The major steps involved in the system are: pre-processing, feature extraction, association rule mining and a hybrid classifier. The pre-processing step is done using median filtering, and edge features are extracted using the Canny edge detection technique. Two image mining approaches combined in a hybrid manner are proposed in this paper. The frequent patterns in the CT scan images are generated by the frequent pattern tree (FP-Tree) algorithm, which mines the association rules. The decision tree method is used to classify the medical images for diagnosis. This system makes the classification process more accurate, and the hybrid method improves efficiency compared with traditional image mining methods. The experimental results on a pre-diagnosed database of brain images showed 97% sensitivity and 95% accuracy respectively. Physicians can make use of this accurate decision tree classification phase for classifying brain images into normal, benign and malignant for effective medical diagnosis.
2.6 Chapter Summary
The previous sections covered the different research articles and papers that provide guidance in identifying the problem domain and relevant solutions for the proposed study. This section provides the key points and a summary of the study.
Figure 2.1 simple decision tree
In machine learning, the user trains the machine to perform intellectual work using previous knowledge, so that it is able to make decisions according to previous experience. For this purpose, machine learning models are broadly classified into two categories: opaque models and transparent models.
An opaque data type is a data type that is incompletely defined in an interface, so that its values can only be manipulated by calling subroutines that have access to the missing information. The concrete representation of the type is hidden from its users. A data type whose representation is visible is called transparent.
In opaque data models, the intermediate stages do not expose an understandable form of the data; in other words, the data patterns are hidden, and we cannot find the visual or logical patterns by direct inspection. On the other hand, in a transparent model the data is organized in a data structure pattern that the user can easily evaluate visually.
In the diagram above, the user can easily find the decisions with pen and paper by traversing the visualized data structure.
However, analysis of different kinds of data models shows that the accuracy of a decision tree, or of any transparent data model, is considerably lower than that of opaque data models such as neural networks, SVM, SVR, etc. Moreover, there are various performance-related issues, such as accuracy, overfitting, ambiguity and test cases. Effort is therefore required to optimize the performance and resolve these issues in the selected domain of study.
The next chapter provides the problem formulation and the design of the proposed model for solving the different issues in the problem space.
Chapter 3
Proposed solution
This chapter provides a detailed understanding of the issues and challenges accepted for solution development. In addition, the complete solution design and the proposed algorithm are also included in this chapter.
3.1 System Overview
Data mining and knowledge discovery are techniques for finding targeted information in databases. In such systems the data is consumed by algorithms and the relationships between the attributes of the data are discovered. Among the various data analysis tasks, two major areas are classification and clustering. Classification is a supervised process of data analysis in which both the data attributes and their class labels are available for learning. In clustering, by contrast, the data contains only the attribute set, and the algorithm groups the data automatically. In the presented work, supervised data mining algorithms are investigated.
A number of supervised classification techniques are available, but the decision tree algorithm is a popular and classical approach to data analysis. In this classification scheme, the data model is prepared in the form of a tree data structure, where each node represents an attribute and the edges between nodes represent the relationships between attributes. The leaf nodes of the decision tree show the class labels.
This chapter presents the design of a new efficient and effective decision tree which improves classification accuracy and minimizes the number of rules needed for comparison and prediction. The proposed data model uses pre-processing techniques, a voting mechanism and n-fold cross-validation to enhance the classification performance; the voting technique in particular helps to reduce the number of rules generated by the system.
3.2 Problem Domain
This section details the issues and challenges accepted for enhancing the decision tree.
In any machine learning algorithm or model, the main ingredient is the training data: using this training data, models are built according to the defined data structure and are able to evaluate patterns as needed. There are mainly three kinds of data sets: nominal, numerical and mixed.
For nominal data sets, the tree treats each value as a symbol and finds patterns; for numerical data, the model finds relationships between the values; and for mixed data sets, a combined approach is used. The main issues are the following.
If an attribute uniquely identifies each instance of the data set, the tree forms trivial rules based on this attribute.
If an attribute contains the same value in every instance, no classification is found and the tree degenerates to a single node.
Handling missing attribute values: sometimes a data set contains null or unknown values in its attributes, and these cause unwanted behaviour in the data structure.
Choosing testing methods for the evaluation of the decision model, because different methods of model testing are used to evaluate the performance of the system.
3.3 Solution Domain
To address the issues discussed in the above section, a solution is presented here that includes the following improvements in the proposed algorithm.
1. Pre-processing to eliminate unique and constant attribute values
If the data set contains a unique attribute (one whose value differs for every instance), an under-fitting condition occurs and no classification takes place; likewise, if all values of an attribute are identical, no classification takes place. Thus the following steps are used to pre-process the data (a Java sketch is given after the listing).
1. Find the number of instances I
2. For each attribute, find the number of distinct values A
3. If (A == I)
4.     Remove the attribute (it uniquely identifies every instance)
5. Else if (A == 1)
6.     Remove the attribute (it has the same value in every instance)
7. End if
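A minimal Java sketch of this pre-processing step is given below; it assumes the dataset is held as rows of string values and returns the indexes of the attributes worth keeping (the class and method names are illustrative, not the thesis implementation).

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PreProcess {

    // Returns the indexes of the attributes worth keeping. An attribute is
    // removed when it is unique per instance (distinct values == number of
    // instances) or constant (one distinct value), since neither case
    // yields a useful split.
    static List<Integer> usefulAttributes(String[][] data, int numAttributes) {
        int instances = data.length;
        List<Integer> keep = new ArrayList<Integer>();
        for (int a = 0; a < numAttributes; a++) {
            Set<String> distinct = new HashSet<String>();
            for (String[] row : data) {
                distinct.add(row[a]);
            }
            if (distinct.size() != instances && distinct.size() != 1) {
                keep.add(a);
            }
        }
        return keep;
    }
}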
2. Voting scheme for optimum learner selection
In this phase, optimum learners are selected to produce a minimum-size tree. This process takes place in the following manner.
Let a base learner L be applied on the dataset D to produce the classification tree; to optimize the classification performance the following steps take place (a Java sketch follows the listing).
1. Apply L on the dataset D
2. Find the instances that are misclassified by L
3. Create a new instance of learner L as Li
4. Classify the data using Li
5. If all instances are classified
6.     Return performance
7. Else
8.     For each learner, find confidence(misclassified)
9.         If confidence > 75%
10.            Aggregate the tree
11.        End if
12.    End for
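The loop above can be sketched in Java as follows. The Learner, Tree and Instance types and the confidence method are hypothetical placeholders for the actual implementation classes; only the control flow and the 75% threshold come from the steps listed above.

import java.util.ArrayList;
import java.util.List;

public class VotingScheme {

    // Hypothetical placeholders for the real implementation classes.
    interface Learner { Tree train(List<Instance> data); }
    interface Tree {
        boolean predictsCorrectly(Instance i);
        double confidenceOn(List<Instance> data); // confidence in percent
    }
    static class Instance { /* attribute values and a class label */ }

    // Steps 1-12 above: train a base tree, retrain on the misclassified
    // instances, and aggregate a new tree only when its confidence on the
    // misclassified set exceeds 75%.
    static List<Tree> buildTrees(Learner base, List<Instance> data) {
        List<Tree> aggregated = new ArrayList<Tree>();
        Tree t = base.train(data);                        // step 1
        aggregated.add(t);
        List<Instance> misclassified = new ArrayList<Instance>();
        for (Instance i : data) {                         // step 2
            if (!t.predictsCorrectly(i)) misclassified.add(i);
        }
        while (!misclassified.isEmpty()) {                // steps 3-4
            Tree ti = base.train(misclassified);
            if (ti.confidenceOn(misclassified) > 75.0) {  // steps 8-11
                aggregated.add(ti);
            } else {
                break; // no sufficiently confident learner remains
            }
            List<Instance> still = new ArrayList<Instance>();
            for (Instance i : misclassified) {
                if (!ti.predictsCorrectly(i)) still.add(i);
            }
            misclassified = still;
        }
        return aggregated;                                // steps 5-6
    }
}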
3. N-fold cross-validation
Cross-validation, also called rotation estimation, is a model validation technique for assessing how the results of an analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction or classification, and one wants to estimate how accurately a model will perform in practice. A round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset, and validating the analysis on the other subset. To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k - 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter.
In our model evaluation process we use 4 folds to validate the model. The accuracy is computed as follows: suppose that during the experiments the accuracy in the four folds is found to be 82, 84, 83 and 81. Then the system reports the average classification accuracy, (82 + 84 + 83 + 81) / 4 = 82.5.
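A compact Java sketch of this k-fold procedure is given below; the Evaluator hook standing in for the implemented classifiers is an assumption for the example.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CrossValidation {

    // Hypothetical hook standing in for the implemented classifiers:
    // train on 'train', return the accuracy (in percent) on 'test'.
    interface Evaluator {
        double trainAndTest(List<String[]> train, List<String[]> test);
    }

    // k-fold cross-validation: each fold is used exactly once as the
    // validation set; the k accuracies are averaged into one estimate.
    static double kFold(List<String[]> data, int k, Evaluator eval) {
        List<String[]> shuffled = new ArrayList<String[]>(data);
        Collections.shuffle(shuffled); // random partitioning
        int foldSize = shuffled.size() / k;
        double sum = 0.0;
        for (int f = 0; f < k; f++) {
            int from = f * foldSize;
            int to = (f == k - 1) ? shuffled.size() : from + foldSize;
            List<String[]> test = new ArrayList<String[]>(shuffled.subList(from, to));
            List<String[]> train = new ArrayList<String[]>(shuffled.subList(0, from));
            train.addAll(shuffled.subList(to, shuffled.size()));
            sum += eval.trainAndTest(train, test);
        }
        return sum / k; // e.g. (82 + 84 + 83 + 81) / 4 = 82.5
    }
}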
3.4 Methodology
The proposed decision tree algorithm works in three main phases:
1. Pre-processing
Put simply, the requirement is to design a decision tree that provides higher performance while consuming fewer resources, such as building time and memory. For that purpose we create a model that pre-processes the data first and then feeds it to the C4.5 decision tree.
C4.5 is a decision tree algorithm derived from the ID3 algorithm, and its performance is higher than that of ID3.
The pre-processing performs one simple change, which is reflected in the tree-building structure and in its performance.
Pre-processing algorithm (a Java sketch of the grouping step is given after the listing):
1. Read all data from the data set.
2. Create a group of data for each target class. For example, if we have a data set with three classes A, B and C, then we create three groups.
3. Apply the data of one group to build the tree, then do the same with group B and, in the same way, with group C.
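A minimal Java sketch of the grouping step follows; it assumes (for illustration only) that the class label is stored in the last column of each record.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClassGrouping {

    // Partition the dataset by class label (assumed to be the last
    // column), e.g. classes A, B and C yield three groups.
    static Map<String, List<String[]>> groupByClass(List<String[]> data) {
        Map<String, List<String[]>> groups =
                new HashMap<String, List<String[]>>();
        for (String[] row : data) {
            String label = row[row.length - 1];
            List<String[]> group = groups.get(label);
            if (group == null) {
                group = new ArrayList<String[]>();
                groups.put(label, group);
            }
            group.add(row);
        }
        return groups;
    }
}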
2. C4.5 model improvement
To achieve better performance we make a small change to the C4.5 algorithm. The pseudocode for the C4.5 decision tree algorithm is described below.
INPUT: experimental data set D, described by discrete-valued attributes.
OUTPUT: a decision tree T created from the experimental dataset.
Create the node N;
IF all instances belong to the same class C THEN
    return node N as a leaf node labeled with class C;
IF the attribute list is empty THEN
    return node N as a leaf node labeled with the most common class;
Select the attribute with the highest information gain in the attribute list as the test_attribute;
Label node N with the test_attribute;
FOR each known value ai of the test_attribute, divide the samples:
    Generate a new branch for test_attribute = ai from node N;
    Let Ci be the set of samples with test_attribute = ai;
    IF Ci is empty THEN
        add a leaf node labeled with the most common class;
    ELSE add the subtree returned by applying Generate_decision_tree recursively to Ci.
Table 3.1 proposed algorithm
This chapter has demonstrated the proposed methodology and its modelling; the next section summarizes the entire work.
3.5 Chapter Summary
This chapter has provided a detailed understanding of the classification issues and the feasible optimum solutions. In addition, the new algorithm design and its development strategy have been reported in detail.
Chapter 4
Implementation
This chapter describes the implementation of the proposed decision tree algorithm. First the required tools and techniques are discussed, and then the code implementation and GUI development of the system are provided.
4.1 Environmental study
The software and hardware required to successfully implement the system are listed in this section.
Recommended hardware
- 2.0 GHz processor (Pentium D and above)
- Minimum 2 GB RAM
- 25 GB hard disk space
Recommended Software
- Operating System (Windows XP and above)
- Netbeans 6.7.1
NETBEANS IDE 6.7.1
NetBeans IDE is a modular, standards-based integrated development environment (IDE) which provides the tools and graphical interface for developing desktop and web solutions at basic and enterprise level. The NetBeans IDE is written in the Java programming language. The NetBeans project comprises an open-source IDE written in Java and an application platform which can be used as a generic structure to build any kind of application. In addition, NetBeans supports more than one programming language, such as C++, PHP and others.
JDK 1.6
Java Platform, Standard Edition 6 (Java SE 6), code-named 'Mustang', is the new name and version of what was previously known as J2SE. The release is identified as product version 6 and developer version 1.6.0. Java SE 6 has two products provided under the platform: the Java SE Development Kit 6 (JDK 6) and the Java SE Runtime Environment 6 (JRE 6). Java SE 6 offers a high level of maturity, stability, scalability and security in its Java implementation. It introduces several new features and enhancements, especially better GUI performance and better handling of GUI application behaviour, plus developments and new features in the server-side and Java core.
4.2 Implementation using Code
This section describes the reference classes and libraries, and the implemented classes with their designed functions and methods.
4.2.1 Reference Classes
This section contains the reference classes and libraries information which is utilized for implementation of the system.
S. no Class name Description
1 java.util.ArrayList The java.util.ArrayList class provides a resizable array and implements the List interface. Important points about ArrayList: it implements all optional list operations and permits all elements, including null; it provides methods to manipulate the size of the array used internally to store the list; and its constant factor is low compared to that of the LinkedList implementation.
2 java.util.HashMap Hash table based implementation of the Map interface. This implementation provides all of the optional map operations, and permits null values and the null key.
3 java.sql.* Provides the API for accessing and processing data stored in a data source (usually a relational database)
4 java.util.Random An instance of java.util.Random is used to generate a stream of pseudorandom numbers. For the sake of absolute portability of Java code, particular algorithms are specified for the class Random and Java implementations must use them; however, subclasses of Random are permitted to use other algorithms, so long as they adhere to the general contracts for all the methods.
5 java.io.File The Java.io.File class is an abstract representation of file and directory pathnames. Instances may or may not denote an actual file-system object such as a file or a directory. If it does denote such an object then that object resides in a partition. A partition is an operating system-specific portion of storage for a file system
6 java.awt.image.BufferedImage The BufferedImage subclass describes an Image with an accessible buffer of image data. A BufferedImage is comprised of a ColorModel and a Raster of image data. The number and types of bands in the SampleModel of the Raster must match the number and types required by the ColorModel to represent its color and alpha components.
7 javax.swing.JTable The JTable is used to display and edit regular two-dimensional tables of cells. See How to Use Tables in The Java Tutorial for task-oriented documentation and examples of using JTable.
Table 4.1 Reference classes
4.2.2 Implemented Classes
This section describes the classes that are implemented in order to realize the proposed decision tree algorithm.
S. no Class name Description
1 Decision_T This is the initial project class; it contains the main method that initiates execution of the algorithms
2 Form1 This is the GUI implementation class; it contains the navigation options and algorithm selection for processing the input dataset to generate and test the data models
3 NCrossValidation This user-defined class implements the n-fold cross-validation technique for classification performance assessment
4 Rain_Forest This class implements the functions and methods for constructing the decision tree using the rain forest concept
5 SLIQ This class is an implementation of the SLIQ algorithm; it includes the methods and functions to process the data and construct the decision tree
6 StringTokenizer This class is used to pre-process the data set and produce tokens of the dataset attributes
7 Attribute This class lists the input dataset attributes and enables a programmer to find and manipulate them
Table 4.2 implemented classes
4.2.3 Methods and Signatures
This section includes the essential methods and functions that are implemented to execute desired task in implemented system.
S. No. Methods and function Description
1 DoClassification This function calls the selected classifier to process the input dataset
2 NCrossValidation This function accepts the data instances and the number of folds for validating the generated data model
3 getPre This function predicts the class label of the data instance provided as input
4 NumLeaves This function counts the number of leaf nodes in the developed tree
5 ClassifyInstance This function classifies instances using the trained data model
6 MakeTree This function creates the tree from the input instances
7 ComputeInfoGain This function computes the information gain of the selected attributes
Table 4.3 methods and signatures
4.3 GUI navigation
This section provides the information regarding the navigation of the current system and their implemented GUI.
Figure 4.1 opening dataset
Figure 4.2 algorithm selection
Figure 4.3 Rain Forest Execution
To simulate the execution and performance of the algorithms, the input dataset is first provided to the system as shown in figure 4.1; then the algorithm selection is performed, where the implemented algorithms are listed in a dropdown list. The start button then executes the selected algorithm to process the data, as shown in figures 4.3 and 4.4. The given screens also include the performance parameters and the obtained performance values of the algorithms for comparative and relative observations.
Figure 4.4 SLIQ Algorithm
Chapter 5
Result analysis
This chapter provides the performance evaluation of the proposed data model. In addition, to justify the proposed solution, a comparative study is made with respect to the SLIQ and traditional rain-forest algorithms.
5.1 Accuracy
The accuracy of a classification algorithm measures the total number of correctly identified patterns over the given samples. The accuracy of the classification can be evaluated using the following formula.
accuracy = (total correctly classified samples / total input samples) × 100
Figure 5.1 accuracy
The comparative accuracy of all three algorithms is given in figure 5.1. In this diagram the blue line shows the performance of SLIQ, the red line shows the performance of the traditional rain forest algorithm and the green line shows the performance of the proposed rain forest algorithm. The X axis represents the data size and the Y axis shows the percentage accuracy of the system. According to the evaluated results, the performance of the proposed algorithm is considerably better than that of the other two algorithms.
5.2 Error Rate
The error rate measures the total amount of data misclassified during the classification task. It can be calculated using the following formulas.
error rate = 100 - accuracy
or
error rate = (total misclassified samples / total input samples) × 100
Figure 5.2 error rate
The comparative error rate of the algorithms is given in figure 5.2. In this diagram the blue line shows the performance of SLIQ, the red line shows the performance of the traditional rain forest algorithm and the green line shows the performance of the proposed rain forest algorithm. The X axis represents the data size and the Y axis shows the percentage error rate. According to the obtained results, the error rate of the proposed system is less than that of the other two algorithms and decreases continuously as the size of the data increases.
5.3 Memory Used
Memory consumption is also termed memory used or space complexity. The memory consumption of the three algorithms is given in figure 5.3. In this diagram the blue line shows the performance of SLIQ, the red line shows the performance of the traditional rain forest algorithm and the green line shows the performance of the proposed rain forest algorithm. The X axis represents the data size and the Y axis shows the memory consumed in KB. According to the obtained results, the proposed rain forest algorithm consumes less memory than the two traditional algorithms. Lower memory consumption implies a smaller tree size and a higher-performance decision capability.
Figure 5.3 memory consumption
5.4 Time Consumption
The amount of time required to develop the data model from the algorithm and input data is known as the time consumption or time complexity of the algorithm. The comparative time complexity of the algorithms is given in figure 5.4. In this diagram the blue line shows the performance of SLIQ, the red line shows the performance of the traditional rain forest algorithm and the green line shows the performance of the proposed rain forest algorithm. The X axis represents the data size and the Y axis shows the time consumption in milliseconds. According to the obtained results, the proposed algorithm consumes less time for model building, and the difference increases as the amount of input data increases.
Figure 5.4 time consumption
Chapter 6
Conclusion and future work
This chapter draws conclusions from the entire study of decision tree algorithms and their methods of performance enhancement. Based on the different experiments and design aspects, some essential facts are observed and provided as the conclusion of the research work; additionally, some future extensions of the presented work are also given in this chapter.
6.1 Conclusion
The decision tree algorithm is a classical approach to supervised machine learning and data mining. A number of decision tree algorithms are available, such as ID3, C4.5 and others. Decision tree algorithms are able to develop transparent data models; however, in order to maintain the transparency and the relationships between attributes, these algorithms are computationally expensive in terms of memory and time. Therefore a number of approaches have been developed in recent years by which classifiers are claimed to provide efficient classification accuracy with less complexity.
Among these classification performance enhancement techniques, the OVA and rain-forest algorithms play an essential role. In the presented work the rain forest framework is investigated and, in comparison with the traditional modelling, a new rain-forest algorithm is proposed and implemented. The implemented technique builds on the C4.5 algorithm and voting schemes to further enhance the classification accuracy, reduce the size of the tree and minimize the ambiguity in the data.
The proposed model is implemented using JAVA technology, and a comparative study is performed with respect to the traditional rain-forest algorithm and the SLIQ algorithm. The comparisons among these algorithms are performed in terms of accuracy, error rate, space complexity and time complexity. The comparative performance is summarized in table 6.1.
S. No. Parameters Proposed RF Traditional RF SLIQ
1 Accuracy High Low Low
2 Error rate Low High High
3 Memory consumed Low High High
4 Time consumed Low High High
Table 6.1 performance summary
According to the obtained performance, the proposed algorithm produces higher accuracy, a lower error rate, and less time and memory consumption compared with the traditional rain forest algorithm and the SLIQ algorithm. Thus the proposed data model provides efficient and effective results during classification.
6.2 Future Work
The proposed data model is efficient and accurate and provides effective results compared with the traditional data models. In the near future, the classification performance can be further optimized in terms of memory consumption and training time. In addition, real-world applications of the presented work can be developed.
References
[1] Thangaparvathi B, Anandhavalli D, 'An Improved Algorithm of Decision Tree for Classifying Large Data Set Based on Rain Forest Framework', IEEE, 2010
[2] S. B. Kotsiantis, ‘Supervised Machine Learning: A Review of Classification Techniques’, Informatica 31 (2007) 249-268 249
[3] Khushboo Sharma and Manisha Rajpoot, ‘Comparative Study of Supervised Learning Algorithm For Sensor Data’, International Journal of Advanced Technology & Engineering Research (IJATER), Volume 2, Issue 4, July 2012
[4] Smitha .T, V. Sundaram, ‘Comparative Study of Data Mining Algorithms for High Dimensional Data Analysis’, International Journal of Advances in Engineering & Technology, Sept 2012. Vol. 4, Issue 2, pp. 173-178
[5] 'Brain Decoding of FMRI Connectivity Graphs Using Decision Tree Ensembles', IEEE, 2010
[6] S. Keskar, R. Banerjee, 'Time-Recurrent HMM Decision Tree to Generate Alerts for Heart-Guard Wearable Computer', Computing in Cardiology, 2011
[7] Anshu Tiwari, Dr. Vijay Anant Athavale, ‘A Survey on Frequently Used Decision Tree Algorithm and There Performance Analysis’, International Journal of Engineering and Innovative Technology (IJEIT) Volume 1, Issue 6, June 2012
[8] Prof. Pier Luca Lanzi, ‘SLIQ A fast Scalable Classifier for Data Mining’, http://home.deib.polimi.it/lanzi/msi/presentazioni/03%20SLIQ.pdf
[9] Johannes Gehrke, Raghu Ramakrishnan, Venkatesh Ganti, ‘RainForest – A Framework for Fast Decision Tree Construction of Large Datasets’, Proceedings of the 24th VLDB Conference New York, USA, 1998
[10] Sanjay Kumar Malik, Sarika Chaudhary, ‘Comparative Study of Decision Tree Algorithms For Data Analysis’, International Journal of Research in Computer Engineering and Electronics, Page 1, ISSN 2319-376X, VOL :2 ISSUE : 3 (June: 2013)
[11] Dianhong Wang, Xingwen Liu, Liangxiao Jiang, Xiaoting Zhang, Yongguang Zhao, ‘Rough Set Approach to Multivariate Decision Trees Inducing’, Journal of Computers, VOL. 7, NO. 4, APRIL 2012
[12] T.Hemalatha, J. Samatha, Ch. Vani Priyanka, A.Lavanya, Ch.Ranjith Kumar, ‘Classification Approaches on Relational Databases’, IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.4, April 2013
[13] S. S. Thakur, and J. K. Sing, ‘Mining Customer’s Data for Vehicle Insurance Prediction System using Decision Tree Classifier’, Int. J. on Recent Trends in Engineering and Technology, Vol. 9, No. 1, July 2013
[14] Lei Wang, Andrew G. Birt, Charles W. Lafon, David M. Cairns, Robert N. Coulson, Maria D. Tchakerian, Weimin Xi, Sorin C. Popescu, James M. Guldin,’Computer-based synthetic data to assess the tree delineation algorithm from airborne LiDAR survey’, Accepted: 2 November 2011, Springer Science Business Media, LLC 2011
[15] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid, ‘Good Practice in Large-Scale Learning for Image Classification’, IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE), 2014, 36 (3), pp.507-520
[16] Ekapong Chuasuwan, Narissara Eiamkanitchat, ‘The Feature Selection for Classification by Applying the Significant Matrix with SPEA2’, 2013 International Computer Science and Engineering Conference (ICSEC): ICSEC 2013 English Track Full Papers
[17] Yael Ben-Haim, Elad Tom-Tov, 'A Streaming Parallel Decision Tree Algorithm', Journal of Machine Learning Research 11 (2010) 849-872
[18] Zerina Mašetić, Abdulhamit Subasi, 'Detection of congestive heart failures using C4.5 Decision Tree', Southeast Europe Journal of Soft Computing, Vol. 2, No. 2, Sep. 2013, pp. 74-77
[19] P.Rajendran, M. Madheswaran, ‘Hybrid Medical Image Classification Using Association Rule Mining with Decision Tree Algorithm’, Journal of Computing, Vol 2, Issue 1, January 2010, ISSN 2151-9617
PUBLICATION
[1] Aashoo Bais, Manish Shrivastava, 'An Enhanced Approach of Data Mining using Decision Tree', National Journal of Engineering Science and Management, Vol-4, Issue-1, June 2012.
[2] Aashoo Bais, Manish Shrivastava, 'Implementation Of A Decision Tree', International Journal of Engineering and Advanced Technology, Vol-2, Issue-2, December 2012.