Artificial Intelligence
Scenario 2-AI Classifier Development, Optimisation and Selection
Philip Margetson
14054452
Contents
1. Introduction
2. AI Classifiers
2.1 Decision Trees
2.2. Artificial Neural Networks
3. Data Set
3.1 Data Distribution & Normality
3.2 Predictive
3.3 Outliers
3.4 Attribute Measurement Scales
4. Prediction of Classifier Performance
4.1 Strengths and Weaknesses of Decision Trees
4.2 Strengths and Weaknesses of ANN
4.3 Prediction/conclusion
5. Pre-processing
5.1 Strategy for Missing Values
5.2 Strategy for Outliers
5.3 Decision Tree Test of Data
6. Main Experiments
6.1 Decision Tree
6.1.1 Confidence Factor
6.1.2 Minimum Number of Objects
6.2 Artificial Neural Networks
6.2.1 Learning Rate
6.2.2 Momentum
6.2.3 Single Layer ANN
6.2.4 Multi-Layer ANN
7. Conclusions
7.1 Best Pruned Tree
7.2 Diagram of Best Pruned Tree
7.3 Best Artificial Neural Network
7.4 Diagram of Best Artificial Neural Network
7.5 Table of Best Classifiers
7.6 Reflection of Results
8. References
1. Introduction
For this task, I have been asked to design and develop a decision-making system trained on a historical dataset of mammographic records. The provided dataset has some missing attribute values and shows no obvious trends or correlations between data points, meaning the relationship between the inputs and the outcome is non-linear. In order to improve the accuracy of the classifier, some pre-processing of the data will be carried out.
I will be exploring two different classifier approaches: decision trees and artificial neural networks (ANNs). I will compare and contrast the two approaches and analyse the results they produce in order to find the best possible classifier for the situation.
2. AI Classifiers
2.1 Decision Trees
A decision tree is a mathematical model used to make decisions. The tree contains branch nodes, each representing a choice between a number of alternatives, and these are followed until a leaf node is reached; the leaf node represents a decision.
Decision trees are mostly used in classification problems and can work with both categorical/nominal and continuous inputs. A decision tree is refined, or "learns", from a provided dataset containing past examples of input data together with the known classification for each example. Once a tree has been trained on a dataset, it can be used to make predictions for unseen cases, given the necessary input data [1].
Fig 2.1.1 inserted below is an example decision tree of whether someone should play tennis or not given the input weather data.
Fig 2.1.1 Decision tree diagram [2]
Decision Tree Parameters
Parameters are used to tune a decision tree classifier so that it produces the best possible results. In this section I outline the parameters I will examine in detail in order to create the best decision-tree-based classifier for the mammographic dataset.
Confidence Factor
The confidence factor influences the amount of pruning applied to the decision tree: the smaller the confidence factor, the greater the amount of pruning. This factor decides whether to replace a near-leaf node and its child leaves with a single leaf node. In more detail, J48 compares the upper limits of the error confidence intervals for the two candidate trees. For the unpruned tree, the upper error estimate is calculated as a weighted average over its child leaves. Whichever tree has the lower estimated upper limit on the error rate "wins" and is selected [3].
Max depth
The name of this parameter is fairly self-explanatory: it defines the depth of the decision tree from the root node to its leaves. The deeper the tree, the more splits it has and the more information it captures about the data. A deeper tree will usually provide greater accuracy on the training set, but not necessarily on the testing set, because a deep tree lets the model overfit the training data, producing a closely fitted decision boundary that may not correspond to the actual boundary between classes.
Min Number of Objects
This parameter defines the minimum number of instances that each leaf must contain; a split is only made if at least two of the resulting branches would contain this many instances. In other words, it sets the minimum amount of data per branch at each split [4].
Number of Folds
The number of folds parameter in Weka will remain unchanged throughout my experiments. As an example, if there are 100 cases of labelled data, they are divided into 10 sets of equal size. Each set is then held out in turn: the remaining 90 cases are used for training the tree and the held-out 10 cases are used for testing it [5]. The classification accuracy (CA) is the average performance over the 10 runs.
Unpruned/Pruned
Pruning is a method used to reduce the size of a decision tree. It works by removing nodes from the tree in order to reduce complexity and help generalisation. Pruning can cause a reduction in accuracy on the training data, but it often leads to an increase in accuracy on previously unseen testing data [5].
Within the J48 classifier, there are two different post-pruning methods. The first, subtree replacement, works by replacing an internal node of the tree with a leaf (terminal) node; this leads to a smaller tree and a reduction in the number of tests along a specific path down the tree. The second, subtree raising, moves a node upwards towards the root of the tree, replacing other nodes in its path. This often has only a very small effect on the decision tree model, and it is computationally expensive, meaning it can take a long time to complete.
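To illustrate where these parameters sit in practice, the sketch below builds a J48 tree programmatically and evaluates it with 10-fold cross-validation using the Weka Java API. This is a minimal sketch only: the file name mammographic.arff, the random seed and a Weka 3.8-style classpath are my assumptions, and the equivalent settings can simply be entered in the Weka Explorer instead.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mammographic.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);          // Severity is the last attribute

        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f); // smaller value = heavier pruning
        tree.setMinNumObj(2);            // minimum number of instances per leaf
        tree.setSubtreeRaising(true);    // enable the second pruning operation

        // 10-fold cross-validation: each fold is held out once for testing
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.printf("Classification accuracy: %.2f%%%n", eval.pctCorrect());
    }
}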
2.2 Artificial Neural Networks
An Artificial Neural Network (ANN) is a mathematical model inspired by the way biological neural networks in the human brain process information. The fundamental unit of a neural network is the "neuron", also known as a "node". It receives input from other nodes within the network or from external sources. Each input has a weight associated with it, assigned according to the relative importance of that connection, so that the network can be effective.
The diagram inserted below (Fig 2.2.1) shows a neuron with two inputs and a bias input (b) of 1, producing an output (Y). The neuron first computes the sum of the input values multiplied by their associated weights, adds the bias to this summation, and then applies a non-linear function (known as the activation function) to this sum in order to produce the output of the neuron (Y) [6].
Fig 2.2.1 Neuron Diagram [6]
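Written as a formula, the output of the neuron in Fig 2.2.1 is Y = f(w1·x1 + w2·x2 + b), where w1 and w2 are the weights on the two inputs, b is the bias and f is the activation function. One common choice, and the one used by the nodes of Weka's MultilayerPerceptron, is the sigmoid f(z) = 1 / (1 + e^(-z)), which squashes the weighted sum into the range 0 to 1.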
These neurons are connected to other neurons across layers. The layers between the input layer and the output layer are known as "hidden layers".
Fig 2.2.2 Hidden layer diagram [6]
There are two types of neural network. The simplest form of ANN is the single-layer perceptron, which does not contain any hidden layer, only an input and an output layer.
The multi-layer perceptron contains one or more hidden layers. The multi-layer perceptron is far more practical in real-world applications, as it can handle extremely complex problems and can be used to create mathematical models for regression analysis [6].
Neural Network Parameters
Learning rate
The learning rate dictates how quickly the neural network is trained on the training set. A low rate means the network is trained more slowly than with a higher learning rate.
The diagram below outlines the effects that a change in learning rate can have on the classifier. As shown, if the learning rate is too low, it will take many updates to find the global minimum, because each update differs only slightly from the previous one [6]. If the learning rate is too large, it becomes much more difficult to locate the minimum, as it can be "overshot" with each update.
Fig 2.2.3 Learning rate diagram [7]
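In update-rule form, each weight w is adjusted by w ← w − η·(∂E/∂w), where η is the learning rate and ∂E/∂w is the gradient of the error E with respect to that weight. A small η therefore means many small steps towards the minimum, while a large η risks stepping over it, as the diagram above illustrates.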
Hidden layers
Hidden layers are used for data that is not linearly separable. The size of a hidden layer is usually somewhere between the size of the input and output layers. Using hidden layers allows the network to approximate non-linear functions.
In a neural network with multiple layers, each layer applies a transformation to the output of the previous layer.
Momentum
Gradient descent is used to reach the global minimum of the error. However, while searching for it, the algorithm may become stuck in a local minimum which it mistakes for the global minimum. This can be mitigated by adjusting the momentum value; this parameter is linked to the learning rate, so as momentum increases, the learning rate should be kept small to avoid the global minimum being missed.
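Adding momentum extends the update rule given above to Δw(t) = −η·(∂E/∂w) + α·Δw(t−1), so each step also reuses a fraction α (the momentum) of the previous step. This helps the search roll through shallow local minima, which is why a larger α is normally paired with a smaller learning rate η.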
Training Time
This parameter in Weka controls the number of epochs that the backpropagation algorithm will compute (where an epoch is one forward and backward pass over all the training examples).
3. Data set
3.1 Data Distribution & Normality
One test I have performed on the mammographic dataset is the fat pencil test, which is used to get a quick understanding of whether the data follows a normal distribution. I have also provided a histogram for each attribute to give a visual understanding of the data distribution.
I have performed these tests in order to understand the data and find which methods will be suitable for replacing the missing values.
Bi-RADS
Age
Shape
Margin
Density
Severity
3.2 Predictive
Mammography is a type of breast imaging using x-rays, with the aim of detecting breast cancer at an early stage (when the cancer is most treatable). With this method of detection there can be both false positives and false negatives, with overdiagnosis estimated at between 1% and 54% [https://www.cancer.org/healthy/find-cancer-early/cancer-screening-guidelines/american-cancer-society-guidelines-for-the-early-detection-of-cancer.html]. These overdiagnosed cases meet the definition of cancer but will never progress to cause any symptoms or death. The false negatives can also be extremely dangerous: Samuel S. Epstein claims that in women aged 40 to 49, 1 in 4 cancers is missed at each mammography.
These errors may be reduced with the use of an appropriate AI classifier, which takes the various input attributes and helps come to a decision on whether or not a mass is severe.
According to breastcancer.org, the probability of developing invasive breast cancer (IBC) clearly increases as a person's age increases. Using the statistics available at [http://www.breastcancer.org/symptoms/understand_bc/risk/understanding], I have created a graph to show this link (inserted below).
As well as the link between age and breast cancer, according to the NHS there is also a link between the density of breast tissue and the risk of developing breast cancer, due to there being more cells that can become cancerous [https://www.nhs.uk/conditions/breast-cancer/causes/#dense-breast-tissue].
3.3 Outliers
An outlier is a data point that is vastly different from the rest of the data in a particular set. An example in the supplied mammographic dataset is within BI-RADS, where there is a single data point with a value of 55.
3.4 Attribute Measurement Scales
The attributes within the mammographic dataset are composed of different measurement scales, I have outlined each of these below.
BI-RADS
BI-RADS is an acronym for "Breast Imaging Reporting and Data System". In the dataset it is an ordinal scale between 0 and 5 (5 indicating the highest suspicion of malignancy), with the values corresponding to the following:
0 - Incomplete
1 - Negative
2 - Benign findings
3 - Probably benign
4 - Suspicious abnormality
5 - Highly suspicious of malignancy
Age
This attribute is self-explanatory and refers to the patient’s age in years.
Shape
The shape attribute is a nominal value (nominal scales are used for labelling variables, without any quantitative value), with each number corresponding as follows:
1 - Round
2 - Oval
3 - Lobular
4 - Irregular
Margin
The margin attribute describes the characteristics of the "edge" of the mass [http://breast-cancer.ca/mass-chars/]. This, again, is a nominal scale, as follows:
1 - Circumscribed
2 - Microlobulated
3 - Obscured
4 - Ill-defined
5 - Spiculated
Density
Density refers to the amount of fat cells present and the density of the suspicious cells. It is measured on an ordinal scale, shown as follows:
1 - High
2 - Iso
3 - Low
4 - Fat-containing
Severity
Severity is a binary attribute that refers to whether the mass is benign (0) or malignant (1).
Within the dataset I have replaced these values with false and true respectively (true corresponding to malignant) in order to make this class suitable for processing with Weka.
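I made this substitution by editing the dataset file directly; an alternative would be to convert the numeric class into a nominal one with Weka's NumericToNominal filter, as in the hedged sketch below (the file names and the 1-based attribute index of Severity are assumptions about the file layout).

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class ClassToNominal {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mammographic.arff");   // hypothetical file name
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("6");                      // assumed index of Severity
        toNominal.setInputFormat(data);
        Instances converted = Filter.useFilter(data, toNominal);
        converted.setClassIndex(converted.numAttributes() - 1);
        DataSink.write("mammographic_nominal_class.arff", converted);
    }
}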
4. Prediction of classifier performance
4.1 Strengths and weaknesses of decision tree when applied
Using a decision tree brings numerous benefits. Firstly, DTs are much easier to interpret than ANNs, as a DT can be displayed to the user as what is essentially a flowchart of decisions based on the input data. In contrast, an ANN is only visualised as layers of neurons connected by many weighted links, so DTs are much easier to understand. DTs also have the advantage over ANNs that they can be used to generate IF-THEN-ELSE statements, meaning they can produce rules which can be applied in other areas of programming or in general life.
A drawback of using a DT for a dataset is that decision trees can overfit the data [http://www.saedsayad.com/decision_tree_overfitting.htm]. This happens when the tree develops a hypothesis that reduces the error on the training set, but at a cost to CA on the testing set, because the training data has been "learned" rather than general rules being derived and applied to the testing set.
4.2 Strengths and weaknesses of ANN when applied
An advantage of using an ANN for this dataset is that ANNs perform well on data that is not linearly separable, meaning they are suitable for more arbitrary functions where the input data is less "black and white".
A disadvantage of using an ANN for the dataset is that it is a "black box": it gives results without a clear explanation (unlike DTs), meaning it is difficult for a user to see how decisions were made based on the input data. As well as this, an ANN can take a long time to train when compared to a DT [https://towardsdatascience.com/introduction-to-neural-networks-advantages-and-applications-96851bd1a207] [https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/node13.html].
4.3 Prediction/conclusion
I predict that the most suitable classifier for the provided dataset will be a DT, due to its ability to derive IF-THEN-ELSE rules and the fact that a DT can be understood visually. This gives an understanding of the data which can then be applied to other problems and challenges.
5. Pre-processing
5.1 Strategy for missing values
BI-RADS
For the BI-RADS attribute, I have decided to replace it with the median of the known values (4) due to its ordinal data type.
Age
As the age attribute has a relatively normal symmetrical distribution, I have decided to replace any missing values with the mean value of all the recorded ages rounded to 2 decimal places (55.49).
Shape
In order to replace the missing values for the Shape attribute, I have decided to use the modal value (4). My justification for this is that the data type is of a nominal type meaning this is the only suitable measure of central tendency.
Margin
This attribute, again, is a nominal type. Similar to how I replaced the missing values in Shape I will be doing so again for Margin by using the modal value of 1.
Density
For this attribute the missing values will be replaced by using the median values as the Density is measured on an ordinal scale.
Severity
This attribute is the outcome of the input data, whilst it has no missing values, I have replaced the 1 and 0 values with true and false respectively (true corresponding to malignant). My reasoning for this is that Weka does not allow numeric inputs for the class attribute.
5.2 Strategy for outliers
The only outlier I have chosen to deal with within the dataset is in the BI-RADS attribute (it can easily be seen in the probability plot for BI-RADS shown above). As the data for BI-RADS is of an ordinal type between the values of 0 and 5, the value of 55 is most likely due to human error on data input, so I will be replacing it with 5, as this was probably the intended value for this particular record.
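I carried out these replacements directly in the dataset file. The same strategy could, for example, be scripted against the raw file with the Weka API, as in the sketch below; the file names, the attribute order and the Density median of 3 are assumptions, while the other constants are the values stated in section 5.1.

import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;

public class PreprocessSketch {
    public static void main(String[] args) throws Exception {
        // Assumed attribute order: BI-RADS, Age, Shape, Margin, Density, Severity
        Instances data = DataSource.read("mammographic_raw.arff");
        for (int i = 0; i < data.numInstances(); i++) {
            Instance inst = data.instance(i);
            if (inst.isMissing(0)) inst.setValue(0, 4);      // BI-RADS -> median
            if (inst.isMissing(1)) inst.setValue(1, 55.49);  // Age     -> mean
            if (inst.isMissing(2)) inst.setValue(2, 4);      // Shape   -> mode
            if (inst.isMissing(3)) inst.setValue(3, 1);      // Margin  -> mode
            if (inst.isMissing(4)) inst.setValue(4, 3);      // Density -> median (assumed value)
            if (inst.value(0) > 5) inst.setValue(0, 5);      // BI-RADS outlier 55 -> 5
        }
        DataSink.write("mammographic_clean.arff", data);
    }
}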
5.3 Decision tree test of data
As Weka is capable of replacing missing values itself, I decided to test the classification accuracy produced when a J48 tree is run with default values on the dataset with the missing values left in, and then compare this to a J48 tree constructed after I have replaced the values myself as detailed above. The following screenshot shows the run where Weka replaced the values; as you can see, it produces a classification accuracy of 82.102%.
The next screenshot shows the run where I replaced the values using appropriate measures of central tendency; as you can see, it now produces a CA of 83.0385%.
As shown in the screenshots, the J48 classifier performs slightly better with the values that I have replaced; because of this, I shall use this dataset for my main experiments in order to produce the best classifier.
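For reference, the baseline run in which Weka itself fills in the missing values could be reproduced with the ReplaceMissingValues filter (which uses the mean for numeric attributes and the mode for nominal ones). This is a hedged sketch; the raw file name and seed are assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class BaselineRun {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("mammographic_raw.arff"); // hypothetical file name
        raw.setClassIndex(raw.numAttributes() - 1);

        ReplaceMissingValues fill = new ReplaceMissingValues();   // Weka's own replacement
        fill.setInputFormat(raw);
        Instances filled = Filter.useFilter(raw, fill);

        Evaluation eval = new Evaluation(filled);
        eval.crossValidateModel(new J48(), filled, 10, new Random(1));
        System.out.printf("Default J48 accuracy: %.3f%%%n", eval.pctCorrect());
    }
}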
6. Main experiments
6.1 Decision Tree
6.1.1 Confidence Factor
Plan for Confidence Factor
In order to find the classifier with the best confidence factor for the dataset, my plan is to start with the default value of 0.25 and explore values below and above it in increments of 0.05.
After this initial test I will explore the values that produce a higher CA (classification accuracy) in more detail in order to find any peak in results (this time changing the increment to 0.01).
After the second test I will run a third, in even greater detail, changing the increment between CFs again, down to 0.001.
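The first sweep can be automated with a short loop over the confidence factor; the sketch below mirrors the plan above (the file name and seed are assumptions, and the same values can equally be entered by hand in the Weka Explorer).

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ConfidenceFactorSweep {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mammographic_clean.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);
        for (int i = 1; i <= 10; i++) {
            float cf = 0.05f * i;                 // 0.05, 0.10, ..., 0.50
            J48 tree = new J48();
            tree.setConfidenceFactor(cf);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.printf("CF=%.2f  CA=%.2f%%%n", cf, eval.pctCorrect());
        }
    }
}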
Confidence Factor Test Results
Test 1
CF:      0.05   0.10   0.15   0.20   0.25   0.30   0.35   0.40   0.45   0.50
CA (%):  81.71  81.80  81.83  81.83  82.00  82.12  82.12  81.17  82.01  81.83
After this first test, I noticed a peak in CA around 0.3-0.4. Because of this, I decided to investigate this area further.
Test 2
CF:      0.30   0.31   0.32   0.33   0.34   0.35   0.36   0.37   0.38   0.39   0.40
CA (%):  81.71  81.80  81.83  81.83  82.00  82.12  82.12  81.17  82.01  81.83  82.17
During the second test, the values reached a peak CA at 0.38. I then decided to further investigate the values between 0.37 and 0.39 in even more detail.
Test 3
CF:      0.377  0.378  0.379  0.380  0.381  0.382  0.383  0.384  0.385
CA (%):  82.23  82.23  82.25  82.25  82.25  82.23  82.24  81.24  82.24
When I ran test 3, I found the highest CA at a CF of 0.380. After this peak the CA decreased slightly as the CF increased; however, it did slightly increase again around 0.384 and 0.385. Because of this increase, and to be sure that I had indeed found the best CF, I decided to look at values greater than 0.385 to ensure that the CA does not increase again.
Test 4
CF:      0.386  0.387  0.388  0.389
CA (%):  82.24  82.22  82.21  82.21
During test 4, the values decreased further after CF 0.386. This leads me to my conclusion that the best performing CF is 0.38 which provides a classification accuracy of 82.25%.
6.1.2 Minimum Number of Objects
Plan of Experiments
For this experiment, I will start with the value of 2 for minimum number of objects, and then increase this value up to 50 initially. If the CA values are showing an increase around this value, I will then carry on increasing this MNO value in later experiments until there’s a marked decrease in CA in order to find the most highly pruned tree.
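As an aside, Weka's CVParameterSelection meta-classifier could perform a similar search automatically; the hedged sketch below asks it to try 25 values of the -M (minimum number of objects) option between 2 and 50, with the file name assumed. My main experiments were nevertheless run manually in the Explorer, as planned above.

import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class MinNumObjSearch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mammographic_clean.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);

        CVParameterSelection search = new CVParameterSelection();
        search.setClassifier(new J48());
        search.addCVParameter("M 2 50 25"); // option, lower bound, upper bound, number of steps
        search.buildClassifier(data);
        System.out.println("Best J48 options: "
                + Utils.joinOptions(search.getBestClassifierOptions()));
    }
}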
Test 1
MNO:     2      5      10     15     20     30     40     50     60     70     80
CA (%):  82.00  82.09  81.98  82.23  82.62  82.23  83.08  83.09  82.27  81.31  80.03
During the first test I noticed a trend towards better results the higher the MNO was, resulting in a peak around 50 and then a steady decline afterwards. Because of this, I decided to investigate the values between 40 and 50.
Test 2
MNO:     40     42     44     46     48
CA (%):  83.08  82.04  83.12  83.14  83.14
As the CA values within this test increased as the MNO approached 48, I decided to run one more test to see if I could reach an even more pruned tree without sacrificing classification accuracy.
Test 3
MNO:     47     48     49     50
CA (%):  83.12  83.14  83.14  83.09
During this test, I found that an MNO of 49 gave the joint-highest CA whilst also producing the most pruned tree.
To conclude, in this set of experiments, the most highly pruned tree I found used 49 as the value for MNO, giving a CA of 83.14%.
6.2 Artificial Neural Networks
Plan
The first step in order to find the best performing ANN with a single layer of neurons is to conduct experiments to find the optimal learning rate and momentum values. The reason for doing this is to find the right trade-off between training time and finding the lowest possible error.
After this is completed, I will then be conducting further experiments to find what number of neurons in a single layer produces the neural network with the highest classification accuracy.
After this has been determined, in order to find the highest CA in an ANN with multiple layers, I will be selecting the best performing number of neurons of the single layer and experimenting by adding layers to this to see if I can find an even better performing classifier when the ANN contains multiple layers.
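The experiments in the following subsections vary the MultilayerPerceptron parameters described in section 2.2. The sketch below shows where each of them is set in the Weka Java API, starting from Weka's default values; the file name and seed are assumptions, and the same fields are available in the Explorer's classifier options.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AnnExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mammographic_clean.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setLearningRate(0.3);   // varied in section 6.2.1 (0.3 is Weka's default)
        mlp.setMomentum(0.2);       // varied in section 6.2.2 (0.2 is Weka's default)
        mlp.setHiddenLayers("a");   // varied in 6.2.3/6.2.4, e.g. "1" or "1, 5, 4"
        mlp.setTrainingTime(500);   // number of epochs, left at the default

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(mlp, data, 10, new Random(1));
        System.out.printf("CA = %.2f%%%n", eval.pctCorrect());
    }
}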
6.2.1 Learning Rate
As stated in my plan, the first step was to determine the best learning rate of the ANN for the dataset.
Learning Rate Test
Learning rate:  0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
CA (%):         82.44  82.27  82.22  82.27  82.06  81.90  82.06  81.99  81.86  82.95
After this test, I concluded that the optimal learning rate was 0.1 as this produced the highest CA. Now that the learning rate is determined, the next step is to find the best rate of momentum for the ANN.
6.2.2 Momentum
Momentum Test 1
Momentum:  0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
CA (%):    82.38  82.44  82.44  82.34  82.27  82.26  82.33  82.08  82.14  67.00
From the results of this first momentum test, I found a peak in CA between 0.2 and 0.3. I decided that there might be an even higher value between these points, so I conducted further testing between 0.2 and 0.3 in increments of 0.01.
Momentum Test 2
Momentum:  0.21   0.22   0.23   0.24   0.25   0.26   0.27   0.28   0.29
CA (%):    82.43  82.43  82.44  82.46  82.45  82.43  82.47  82.46  82.48
As the results from the second test show, there was indeed a further increase in CA between 0.2 and 0.3, with the best choice for the momentum being 0.29, which produces a CA of 82.48%.
Now that the best choices of learning rate and momentum had been found (0.1 and 0.29 respectively), I could feed in these parameters in order to find the best number of neurons in a single layer.
6.2.3 Single Layer ANN
Number of Neurons Test
Neurons:  1      2      3      4      5      6      7      8      9
CA (%):   83.42  82.95  82.48  82.22  82.48  81.87  82.32  82.35  82.23
The highest CA produced by the ANN, with a learning rate of 0.1 and momentum of 0.29, was 83.42% with 1 neuron in a single layer.
As I determined that the best number of neurons in one layer was 1, I will use this in the first layer and then add subsequent neurons and layers in order to find the best CA possible. The test below shows experiments to find the best CA using 2 layers with different numbers of neurons.
6.2.4 Multi-Layer ANN
2 Layer ANN Test
Layers:  1,1    1,2    1,3    1,4    1,5
CA (%):  83.33  83.28  83.42  83.41  83.42
During this experiment I found that when I was using 2 layers, the best CA was produced when using both 3 and 5 neurons on this second layer. Due to this, I decided to add a third layer and experiment using both 3 and 5 neurons on the second layer. The test results for this are in the following tables.
3 Layer ANN Tests
Layers:  1,3,1  1,3,2  1,3,3  1,3,4
CA (%):  80.09  75.25  83.42  83.39

Layers:  1,5,1  1,5,2  1,5,3  1,5,4  1,5,6
CA (%):  55.93  83.32  83.38  83.45  83.32
By adding a third layer, I found a new best CA with 1 neuron on the first layer, 5 on the second and 4 on the third. This produced a CA of 83.45% which was my highest result throughout my ANN experiments. As adding a third layer caused an increase in CA, I decided to further experiment by adding a 4th layer to see if this causes any further increase.
4 Layer ANN Test
Layers:  1,5,4,1  1,5,4,2  1,5,4,3  1,5,4,4  1,5,4,5
CA (%):  52.94    52.94    52.94    52.94    52.94
The results of this experiment showed a drastic decrease in CA, leading me to conclude that the best arrangement was 1,5,4.
7. Conclusions
7.1 Best Pruned Tree
The screenshot below shows the best pruned decision tree I could produce for the mammographic dataset. As you can see, it has a size of 9 with 5 leaves (the number of leaves corresponds to the number of terminal nodes, each of which holds the outcome of a particular path down the tree). To generate this tree, I set the confidence factor to 0.38 and the minimum number of objects to 49. The tree produced gives a classification accuracy of 82.93%.
7.2 Diagram of Best Pruned Tree
The screenshot below is a diagram of the best pruned tree that was detailed in the previous subsection.
7.3 Best Artificial Neural Network
During my experiments I found the best ANN to be constructed of 3 hidden layers, with 1 neuron in the first layer, 5 in the second and 4 in the third layer. I used 0.1 and 0.29 for the learning rate and momentum parameters, which gave a classification accuracy in Weka Explorer of 83.96%.
7.4 Diagram of Best Neural Network
7.5 Table of Best Classifiers
Classifier Type     Parameters                              Classification Accuracy
Single-Layer ANN    LR 0.1, Momentum 0.29                   83.42%
Multi-Layer ANN     LR 0.1, Momentum 0.29, Layers 1,5,4     83.45%
Decision Tree       CF 0.38, MNO 49                         83.14%
7.6 Reflection of Results
To conclude, all of the different classifiers I have tried gave similar values of CA, the strongest of which was the multi-layer perceptron with 3 hidden layers, providing a CA of 83.45%.
For the client that requested a decision-making system for the provided dataset, I would recommend the decision tree (with the parameters listed above in 7.5) that provided 83.14% accuracy. Although this is 0.31 percentage points less accurate than the multi-layer perceptron, the decision tree allows IF-THEN-ELSE rules to be viewed based on the input parameters and makes it possible to see exactly how each decision is made, meaning knowledge can be extracted from it and applied in other situations.
8. References
[1] dzone.com. 2018. An Introduction to Machine Learning With Decision Trees – DZone AI. [ONLINE] Available at: https://dzone.com/articles/machine-learning-with-decision-trees. [Accessed 15 April 2018]
[2] Null Pointer Exception. 2018. A Tutorial to Understand Decision Tree ID3 Learning Algorithm – The Null Pointer Exception. [ONLINE] Available at: https://nullpointerexception1.wordpress.com/2017/12/16/a-tutorial-to-understand-decision-tree-id3-learning-algorithm/. [Accessed 15 April 2018]
[3] CS345, Machine Learning, Entropy-Based Decision Tree Induction (ID3). 2018. CS345, Machine Learning, Entropy-Based Decision Tree Induction (ID3). [ONLINE] Available at: http://www.cs.bc.edu/~alvarez/ML/statPruning.html. [Accessed 15 April 2018].
[4] WEKA – Details of J48 pruning parameters. 2018. WEKA – Details of J48 pruning parameters. [ONLINE] Available at: http://weka.8497.n7.nabble.com/Details-of-J48-pruning-parameters-td42456.html. [Accessed 15 April 2018].
[5]
[6] the data science blog. 2018. A Quick Introduction to Neural Networks – the data science blog. [ONLINE] Available at: https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/. [Accessed 1 March 2018].
[7] Jeremy's Blog. 2018. Setting the learning rate of your neural network.. [ONLINE] Available at: https://www.jeremyjordan.me/nn-learning-rate/. [Accessed 15 April 2018].
http://gautam.lis.illinois.edu/monkmiddleware/public/analytics/decisiontree.html
http://www.six-sigma-material.com/Normality-Assumption.html
https://support.minitab.com/en-us/minitab/18/help-and-how-to/graphs/how-to/probability-plot/interpret-the-results/key-results/