1. How many features are there for the iris dataset? How many examples? How many labels?
2. Why is it important to split the dataset into a training set and a test set? Why does a classification model need to be trained on the training set, with prediction performance measured on the test set?
3. How can correlation analysis help identify the best features for the classification task? Based on the correlation analysis results, what are the best features for the iris data?
4. Which class is easier to identify than the other two classes for the iris dataset? How can you tell?
5. Which classification model produces a better test result for the iris data: linear SVM trained on all features, or linear SVM trained on the two best features? What does this tell you?
6. Why does linear SVM not produce a good result for the two-moons example?
7. Compare linear and kernel SVM in terms of predictive performance and training speed. What conclusions can you draw?
8. Why do we need to perform parameter selection when training classification models for predictive modelling?
9. Why can't we choose the classifier parameter that produces the best training performance?
10. What is cross-validation, and why is it an effective technique for parameter selection in classifier training?
Answer:
1. There are four features in the iris dataset. These features are measured in centimetres.
The features are:
- Sepal length
- Sepal width
- Petal length
- Petal width
Each column is a feature (also known as: predictor, attribute, independent variable, input, regressor, covariate).
There are 50 samples for each species (Iris setosa, Iris virginica and Iris versicolor) of iris flower. This results in 150 records (examples), each with the 4 features listed above. Each row is an observation (also known as: sample, example, instance, record).
Labels are also known as targets. Each value that we predict is the response (also known as: target, outcome, label, dependent variable).
Classification is supervised learning where the label is categorical. There are 150 labels in the iris dataset, falling under 3 categories:
0= Setosa
1= Versicolor
2= Virginica
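The counts above are easy to verify directly. A minimal sketch, assuming scikit-learn's bundled copy of the iris dataset:

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)   # the four features, measured in cm
print(iris.data.shape)      # (150, 4): 150 examples, 4 features
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(iris.target.shape)    # (150,): one label (0, 1 or 2) per example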
2. In machine learning, a model is essentially an algorithm whose parameters need to be adjusted so that it performs well at its task, i.e. so that it predicts the values we want it to.
We train the model on data that we call the training set. The training data already contains the actual values the model should predict, so the learning algorithm adjusts the parameter values to fit the data in the training set.
To know whether the trained model is good overall, we use a test set: data for which we also know the true values, but which was never shown to the model during training. If the model performs well on the test set too, we can say the machine learning model is good.
It is important to learn the predictive model (i.e. the classifier) on the training set and test its performance on the test set. The purpose of predictive modelling is to create models that can predict well on future, unseen data. Hence it is important to keep the training and test data separate and never use the test data for learning the predictive model.
A classification model can be used to predict the class label of unknown records. A classification technique is a systematic approach to building classification models from an input set. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before.
First, a training set consisting of records whose class labels are known must be provided. The training set is used to build a classification model, which is subsequently applied to the test set, whose class labels are withheld from the model at prediction time.
Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model.
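A minimal sketch of such a split, assuming scikit-learn; the 80/20 ratio and random_state are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# Hold back 20% of the examples; the model never sees (X_test, y_test)
# during training, so the test score estimates out-of-sample accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)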
3. Data correlation is the degree to which one set of data corresponds to another. For the classification problem, feature selection aims to select a subset of highly discriminant features; in other words, it selects features that are capable of discriminating between samples that belong to different classes.
For the problem of feature selection for classification, due to the availability of label information, the relevance of features is assessed as the capability of distinguishing classes.
For example, a feature fi is said to be relevant to a class cj if fi and cj are highly correlated.
Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
Based on the correlation analysis results, we can see that petal_length and petal_width are the best features for iris classification. As the pair plots show, both features are highly correlated with the class label.
If you try to train a model on a set of features that have little or no correlation with the target, it will give inaccurate results.
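A sketch of the correlation computation with pandas, treating the encoded class label (0/1/2) as a numeric column, which is a rough but common heuristic:

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Rank features by absolute correlation with the label; the petal
# measurements typically come out on top (around 0.95).
print(df.corr()['target'].abs().sort_values(ascending=False))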
4. As per the correlation analysis results, the class Setosa (target value 0) is easier to identify than the other two classes (1 = Versicolor, 2 = Virginica) for the iris dataset.
As evident in the plotted graph, Setosa (represented by the blue colour) is easily separable from the other two iris species: there is a clear boundary around its cluster, which makes it easy to distinguish from the other two classes.
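This separability can be reproduced with a simple scatter plot of the two best features; the feature choice and plotting details below are illustrative:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data[:, 2:4], iris.target  # petal length, petal width

# Setosa forms an isolated cluster; the other two classes overlap.
for label, name in enumerate(iris.target_names):
    plt.scatter(X[y == label, 0], X[y == label, 1], label=name)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.legend()
plt.show()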
5. The test performance of the classifiers can be inspected by plotting their decision boundaries, where differently coloured regions correspond to different classes.
As per the test performance, linear SVM trained on all features gives a better result than linear SVM trained on the two best features for the iris dataset: test accuracy rises to 95% from the initial 85%. This tells us that the remaining two features, although individually less discriminative, still carry complementary information that improves classification.
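A sketch of the comparison, assuming scikit-learn's SVC; the exact accuracies depend on the train/test split, so the 85%/95% figures may vary with random_state:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Train one linear SVM on all four features, and one on the two best
# (petal length and petal width, columns 2 and 3).
svm_all = SVC(kernel='linear').fit(X_train, y_train)
svm_two = SVC(kernel='linear').fit(X_train[:, 2:4], y_train)
print('all features:', svm_all.score(X_test, y_test))
print('two features:', svm_two.score(X_test[:, 2:4], y_test))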
6. Linear SVM does not produce a good result for the two-moons example because, although this is a binary classification problem, the two classes interleave in crescent shapes and are not linearly separable: no straight line can separate the targets of this dataset well.
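A sketch reproducing the failure, assuming scikit-learn's make_moons as the two-moons data; the sample count and noise level are illustrative:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: no straight line separates them.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
linear_svm = SVC(kernel='linear').fit(X, y)
print('linear SVM accuracy:', linear_svm.score(X, y))  # well below 1.0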
7. Kernel SVM achieves better performance in terms of higher accuracy than linear SVM.
Accuracy of Linear SVM = 86.0 %
Accuracy of Kernel SVM = 93.4 %
Kernel SVM produces a nonlinear decision boundary (a curve) to separate the points of the two classes, showing different regions in different colours, while linear SVM produces a linear decision boundary (a straight line), which is not appropriate in this case.
Though kernel SVM is more effective, it is slower than linear SVM to train. When we measured the average time by fitting both the linear and kernel SVM classifiers 3 * 100 times, the results were as follows:
Linear SVM:
100 loops, best of 3: 11 ms per loop
Kernel SVM:
100 loops, best of 3: 22.8 ms per loop
We can conclude that kernel SVM is the better classifier in terms of predictive performance, while linear SVM is the better classifier in terms of training speed.
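A sketch of how such an accuracy/speed comparison might be run, assuming make_moons data; timings will differ by machine and the parameters below are illustrative:

import timeit
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel)
    # Best of 3 repeats of 100 fits each, mirroring %timeit's reporting.
    t = min(timeit.repeat(lambda: clf.fit(X_train, y_train),
                          number=100, repeat=3)) / 100
    print(f'{kernel}: accuracy={clf.score(X_test, y_test):.3f}, '
          f'fit time={t * 1e3:.1f} ms')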
8. We need to perform parameter selection when training classification models for predictive modelling because the choice of parameter values strongly affects the performance of the model, in particular its predictive (test) performance.
Based on the test results, we can see that both training and test performance are affected by the choice of parameter. For example, we varied the regularisation parameter C to see its effect on performance.
Increasing the value of C improves training performance but not test performance, because the model overfits, as sketched below.
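A sketch of such a sweep over C, assuming an RBF-kernel SVC on make_moons data; the grid of values is illustrative:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Larger C penalises training errors more heavily, so training accuracy
# climbs while test accuracy eventually drops (overfitting).
for C in (0.01, 1, 100, 10000):
    clf = SVC(kernel='rbf', C=C).fit(X_train, y_train)
    print(f'C={C}: train={clf.score(X_train, y_train):.3f}, '
          f'test={clf.score(X_test, y_test):.3f}')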
9. We can’t choose the classifier parameter that produces the best training performance because maximizing training accuracy rewards overly complex models which overfit the training data.
Instead, there is an effective approach called cross-validation that performs parameter selection using only the training dataset.
10. Cross-validation is used to assess the predictive performance of models: to judge how they perform outside the sample, on new data such as a test set.
The motivation for cross-validation techniques is that when we fit a model, we fit it to a training dataset; without cross-validation we only have information on how the model performs on in-sample data. Ideally, we would like to see how accurately the model predicts on new data. In science, theories are judged by their predictive performance.
k-fold cross-validation is the variant most commonly used in machine learning.
Cross-validation is an effective technique for parameter selection in classifier training because it uses the data efficiently (every observation is used for both training and validation) and it provides a more accurate estimate of out-of-sample accuracy.
In our example, we can see that the accuracy achieved by the kernel SVM classifier trained with the optimal parameter is higher than that produced by the kernel SVM classifier trained with the default parameter value. This validates the importance and effectiveness of parameter selection, as sketched below.
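A sketch of parameter selection by k-fold cross-validation on the training set only, assuming scikit-learn's GridSearchCV and make_moons data:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold cross-validation over a grid of C values; every training
# observation serves in both the training and validation folds.
search = GridSearchCV(SVC(kernel='rbf'),
                      param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X_train, y_train)
print('best C:', search.best_params_['C'])
print('test accuracy:', search.score(X_test, y_test))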
References:
Jason Brownlee (2016) Your First Machine Learning Project in Python Step-By-Step [online]. Available from: https://machinelearningmastery.com/machine-learning-in-python-step-by-step/ [Accessed 21 May 2018].
Karlijn Willems (2017) Python Exploratory Data Analysis Tutorial [online]. Available from: https://www.datacamp.com/community/tutorials/exploratory-data-analysis-python [Accessed 21 May 2018].
Roberto Lopez (2018) Iris flowers classification [online]. Available from: https://www.neuraldesigner.com/learning/examples/iris_flowers_classification [Accessed 21 May 2018].
Ritchie Ng (2018) Cross-Validation [online]. Available from: https://www.ritchieng.com/machine-learning-cross-validation/ [Accessed 21 May 2018].