BUS5PA Predictive Analytics-Model Performance Comparison
1) Create a SAS library for the provided data set ORGANICS as a data source for the project.
2) Set the roles for the analysis variables as shown above.
3) Examine the distribution of the target variable. What is the proportion of individuals who purchased organic products?
4) The variable DemClusterGroup contains collapsed levels of the variable DemCluster. Presume that, based on previous experience, you believe that DemClusterGroup is sufficient for this type of modeling effort. Set the model role for DemCluster to Rejected.
5) Finish the ORGANICS data source definition.
6) Add the ORGANICS data source to the Organics diagram workspace.
b. As noted above, only TargetBuy is used for this analysis, and it should have a role of Target. Can TargetAmt be used as an input for a model that is used to predict TargetBuy? Why or why not? Please explain with justification.
Answer:
- A SAS library named Project has been created, and a data source has been created from the SAS data set ORGANICS, as shown in Fig. 2 and Fig. 3.
- As specified in the business case assignment, roles have been set for the analysis variables of the ORGANICS data source, as shown in Fig. 4.
- TargetBuy has been defined as the target variable. In percentage terms, 24.77% of individuals purchased organic products and the remaining 75.23% did not. The percentage distribution is shown in Fig. 5.
- DemCluster has been set to Rejected, as shown in Fig. 4.
- The ORGANICS data source definition has been completed, as shown in Fig. 3.
- Fig. 6 shows that the ORGANICS data source has been added to the Organics diagram workspace.
- TargetAmt cannot be used as a predictor of TargetBuy. TargetBuy indicates whether an individual purchased organic products, whereas TargetAmt records how many organic products were bought. TargetAmt is only recorded when TargetBuy is Yes, i.e. for individuals who purchased organic products, so using it as an input would leak the outcome the model is trying to predict. Hence TargetAmt cannot be used as an input to a model that predicts TargetBuy. The supermarket's objective is to develop a loyalty model by understanding whether customers have purchased any of the organic products, so TargetBuy is the appropriate target variable. A quick check of this relationship is sketched below.
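As an illustrative aside (the assignment itself is done through SAS Enterprise Miner), the target distribution and the TargetBuy/TargetAmt relationship can be checked with a few lines of Python. This is a minimal sketch assuming the ORGANICS table has been exported to a file named organics.csv with columns TargetBuy and TargetAmt.

```python
# Hypothetical check of the ORGANICS target variables (pandas).
import pandas as pd

organics = pd.read_csv("organics.csv")   # assumed export of the ORGANICS data set

# Proportion of purchasers (TargetBuy = 1); the report quotes 24.77%.
print(organics["TargetBuy"].value_counts(normalize=True))

# TargetAmt is only populated for buyers, so it would leak the target:
# every non-buyer has TargetAmt = 0 (or missing).
print(organics.groupby("TargetBuy")["TargetAmt"].describe())
```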
Decision tree based modeling and analysis
- From the Sample tab, a Data Partition node has been added to the diagram and connected to the ORGANICS data source node. As specified in the assignment, 50% of the data has been assigned to training and 50% to validation (Fig. 7 and Fig. 8).
- In Fig. 9, it has been shown that the Decision Tree node has been added to the workspace and it has been connected to the Data partition node.
- A Decision Tree model has been created autonomously, with average squared error chosen as the subtree model assessment criterion, as depicted in Fig. 10 and Fig. 11.
- Using the average squared error criterion, the optimal tree has 29 leaves, as shown in Fig. 12.
- The first split uses the age variable, dividing the training data into two subsets. The first subset contains individuals with age less than 44.5, and in this subset TargetBuy = 1 has a higher-than-average concentration. The second subset contains individuals with age greater than or equal to 44.5, and in this subset TargetBuy = 0 has a higher-than-average concentration. The decision tree model grown autonomously under the average squared error assessment is shown in Fig. 13.
- A second Decision Tree node has been added to the diagram and connected to the Data Partition node, as depicted in Fig. 14.
- In the Properties panel of the new Decision Tree node, the maximum number of branches has been set to 3 to allow three-way splits, as shown in Fig. 15.
- This decision tree model has also been created using average squared error as the assessment criterion, as depicted in Fig. 16.
- Under the average squared error criterion, the optimal tree has 33 leaves; the subtree assessment plot is shown in Fig. 17. The first tree (part c) had 29 leaves in its optimal tree. The misclassification rate of Tree 2 (Train: 0.1848) is marginally lower than that of Tree 1 (Train: 0.1851), whereas the average squared error of Tree 1 (Train: 0.1329) is lower than that of Tree 2 (Train: 0.1330). Hence, in terms of average squared error the tree with 29 leaves performs marginally better, and in terms of misclassification rate the tree with 33 leaves performs marginally better. However, complexity increases with the number of leaves, and a tree with fewer leaves is less complex and more reliable.
- Based on average squared error, the decision tree model with the smallest average squared error between the actual and predicted class, i.e. Tree 1, appears better than Tree 2, since 0.1329 < 0.1330. A lower average squared error indicates that the model performs better as a predictor because it is "wrong" less often. A sketch of this partition, pruning and comparison workflow is given below.
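Outside SAS Enterprise Miner, the same partition-prune-compare workflow can be approximated with scikit-learn. The sketch below is hypothetical: it assumes the organics.csv export used earlier, it will not reproduce the exact SAS figures, and scikit-learn trees only make binary splits, so the three-way-split variant (Tree 2) is not reproduced.

```python
# Hypothetical sketch: 50/50 partition, tree pruned on validation ASE,
# then the two fit statistics discussed above.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

organics = pd.read_csv("organics.csv")                        # assumed export
X = pd.get_dummies(organics.drop(columns=["TargetBuy", "TargetAmt"]))
X = X.fillna(X.median())   # crude fill so the sketch runs; EM trees handle missing values directly
y = organics["TargetBuy"].to_numpy()

# 50% training / 50% validation, stratified on the target.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=1)

def ase(model, X_, y_):
    """Average squared error between posterior probability and 0/1 target."""
    p = model.predict_proba(X_)[:, 1]
    return float(np.mean((p - y_) ** 2))

# Grow a large tree, then keep the cost-complexity subtree with the lowest
# validation ASE (analogous to the average squared error subtree criterion).
path = DecisionTreeClassifier(min_samples_leaf=50, random_state=1) \
    .cost_complexity_pruning_path(X_tr, y_tr)
best = min(
    (DecisionTreeClassifier(min_samples_leaf=50, ccp_alpha=a, random_state=1)
     .fit(X_tr, y_tr) for a in path.ccp_alphas),
    key=lambda m: ase(m, X_va, y_va))

print("leaves:", best.get_n_leaves())
print("validation ASE:", ase(best, X_va, y_va))
print("validation misclassification:", 1 - best.score(X_va, y_va))
```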
Regression based modeling and analysis
- In the Organics diagram, a StatExplore node has been added, connected to the ORGANICS data source and run, as shown in Fig. 18.
- Missing value imputation is needed for regression because, in SAS Enterprise Miner, regression models ignore observations that contain missing values, which reduces the size of the training data. Less training data can strikingly weaken the predictive power of these models. In this case, imputation is used to overcome the obstacle of missing data, so missing values must be imputed before the models are fitted. Imputing missing values is also required before comparing a model that discards observations with missing values against a decision tree, which handles missing values directly.
- For class variables, Default Input Method has been set to Default Constant Value and Default Character Value has been set to U, and for interval variables Default Input Method is Mean (shown in Fig. 19).
- To create imputation indicators for all imputed inputs, Indicator Variable Type has been set to Unique, and Indicator Variable Role has been set to Input (shown in Fig. 20).
- Regression node has been added to the diagram and connected to the Impute node which has been depicted in Fig. 21.
- Selection Model has been set to Stepwise, and Selection Criterion has been set to Validation Error, as shown in Fig. 22.
- The result of the regression model is shown in Fig. 23. The selected model, based on the error rate for the validation data, is the model trained in Step 6, which consists of the following variables: IMP_DemAffl, IMP_DemAge, IMP_DemGender, M_DemAffl, M_DemAge and M_DemGender (i.e. affluence grade, age and gender) (Fig. 24). Hence, for the supermarket management, affluence grade, age and gender are the main parameters for understanding consumer loyalty and formulating a predictive model. The odds ratio estimates show that the important parameters for this model are the imputed values of gender (Female and Male), affluence grade and age. A sketch of this imputation-plus-regression setup is given below.
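As an illustration of the imputation settings described above, the following hypothetical scikit-learn sketch imputes the mean for interval inputs and a constant "U" level for class inputs, adds missing-value indicator inputs (the M_ variables), and fits a logistic regression. The input names DemAffl, DemAge and DemGender are assumptions; the stepwise selection itself is performed by the SAS Enterprise Miner Regression node and is not reproduced here.

```python
# Hypothetical sketch: imputation with indicators, then logistic regression.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

organics = pd.read_csv("organics.csv")                 # assumed export
y = organics["TargetBuy"]
X = organics[["DemAffl", "DemAge", "DemGender"]]       # assumed input names

preprocess = ColumnTransformer([
    # Interval inputs: mean imputation plus a missing-value indicator column.
    ("interval", SimpleImputer(strategy="mean", add_indicator=True),
     ["DemAffl", "DemAge"]),
    # Class inputs: impute a constant "U" level, then one-hot encode.
    ("class", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="U")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["DemGender"]),
])

model = Pipeline([("prep", preprocess),
                  ("logit", LogisticRegression(max_iter=1000))]).fit(X, y)
print("training accuracy:", model.score(X, y))
```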
The average squared error (ASE) of prediction is used to estimate the error in prediction of a model fitted using the training data, as shown in Equation 1 (Atkinson, 1980):

$$\mathrm{ASE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad \text{(Equation 1)}$$

Here $y_i$ is the $i$th observed target value in the validation data set, $\hat{y}_i$ is its predicted value from the fitted model, and $n$ is the validation sample size. The ASE output from SAS is shown in Fig. 26. The ASE for this model is 0.138587 on the training data and 0.137156 on the validation data. A well-fitted predictive model produces training and validation ASE values that are close to each other, as they are here. An overfit model produces a smaller ASE on the training data but higher values on the validation and test data, while an underfit model exhibits higher values for all data roles. A small numeric illustration of Equation 1 is given below.
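A minimal numeric illustration of Equation 1, using made-up target values and predicted probabilities:

```python
# Toy example of the ASE calculation (values are invented for illustration).
import numpy as np

y = np.array([1, 0, 0, 1, 0])            # observed 0/1 targets
p = np.array([0.8, 0.2, 0.4, 0.6, 0.1])  # predicted probabilities
ase = np.mean((y - p) ** 2)              # (1/n) * sum of squared errors
print(ase)                               # 0.082
```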
Open ended discussion
- Three models, i.e. Decision Tree 1, Decision Tree 2 and the Regression model, have been compared using the Model Comparison node. Fig. 27 and Fig. 28 show the model comparison process and result respectively. Tree 1 has been selected in the Fit Statistics.
Model performance comparison
As per the Fit Statistics (Fig. 31), Tree 1 is the selected model; its validation misclassification rate is the lowest, i.e. 0.185. From Table 1, the Kolmogorov-Smirnov statistic and ROC index (area under the curve) are effectively the same for Decision Tree 1 and Tree 2, and both are slightly better than the regression model. In terms of average squared error, Tree 1 and Tree 2 are also effectively the same and both perform better than the regression model. Hence, it can be concluded that Tree 1 is the best performer of the three models. How these fit statistics can be computed is sketched below.
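As a hypothetical illustration (not the Model Comparison node itself), the fit statistics in Table 1 can be computed from validation targets and predicted probabilities, for example using arrays y_va and p_va from the earlier tree sketch:

```python
# Hypothetical fit statistics: misclassification, ASE, ROC index and
# Kolmogorov-Smirnov statistic, given 0/1 targets and predicted probabilities.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

def fit_statistics(y_va, p_va):
    misclassification = float(np.mean((p_va >= 0.5) != y_va))  # 0.5 cut-off
    ase = float(np.mean((p_va - y_va) ** 2))                   # average squared error
    roc_index = roc_auc_score(y_va, p_va)                      # area under the ROC curve
    # KS statistic: maximum separation between the score distributions
    # of buyers and non-buyers.
    ks = ks_2samp(p_va[y_va == 1], p_va[y_va == 0]).statistic
    return {"misclassification": misclassification, "ASE": ase,
            "ROC index": roc_index, "KS": ks}
```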
- After analyzing the three models, Decision Tree 1 is the best performer for this business case. The decision tree method is a popular data mining technique (Hastie, Tibshirani, & Friedman, 2009); it is easy to use, robust in the presence of missing data, and has better interpretability. Decision trees are usually flexible, while regression models are comparatively inflexible, for example when additional terms such as interaction or polynomial terms must be added. Decision trees can also deal with missing values without any imputation, whereas a regression model usually needs missing values imputed before the model is built. Decision trees are nonparametric and highly robust, while regression models are parametric and sensitive to influential points (Berry & Linoff, 1997). Hence, decision trees are also frequently used in the pre-processing phase for a logistic regression.
- The advantages of decision trees include: implicit variable screening and selection (the top nodes of the tree are the most important variables in the data set); less data preparation (data does not need to be normalized, and decision trees are less sensitive to missing data and outliers); no assumption of linearity; and graphical output that is easy to explain, i.e. decisions based on cut points.
While the regression model estimates relationships among variables, it identifies key patterns in large data sets and is often used to determine how independent variables are related to the dependent variable, and to explore the forms of those relationships.
High dimensionality increases the risk of overfitting due to correlations between redundant variables, increases computation time, increases the cost and difficulty of data collection for the model, and makes the model harder to interpret. For the organics data, the misclassification rate is lowest for the decision tree, and the ROC chart window shows that both the decision tree and regression models have good predictive accuracy. In this case, applying decision trees to consumer loyalty analysis will be valuable for predictive modeling and for understanding the consumer segments.
Extending current knowledge with additional reading
The supermarket's objective is to develop a loyalty model based on whether customers have purchased any of the organic products. Hence the model needs to fit the real world.
Just getting things wrong: the problem should be identified clearly; without a clear objective, the model will fail. In this business case, there were two candidate target variables, TargetBuy and TargetAmt. TargetAmt is simply a by-product of TargetBuy and is not a binary variable. Selecting the target variable is therefore one of the most important steps, and the model would go wrong if TargetAmt were selected as the target variable.
Overfitting: as a model becomes more complex (more leaves in a decision tree, or more training iterations for a neural network), it appears to fit the training data better, but in reality it fits noise as well as signal. In this business case, the misclassification rate of Tree 2 (Train: 0.1848) is marginally lower than that of Tree 1 (Train: 0.1851), while the average squared error of Tree 1 (Train: 0.1329) is lower than that of Tree 2 (Train: 0.1330). Hence the tree with 29 leaves performs marginally better in terms of average squared error and the tree with 33 leaves performs marginally better in terms of misclassification rate. But because complexity increases with the number of leaves, the less complex and more reliable tree, i.e. Tree 1, is more suitable for the model. The typical train-versus-validation pattern is sketched below.
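A hypothetical sketch of this pattern, again assuming the organics.csv export: as the tree is allowed more leaves, training ASE keeps falling while validation ASE eventually stops improving and can start to rise.

```python
# Hypothetical overfitting illustration: training vs validation ASE as the
# maximum number of leaves grows.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

organics = pd.read_csv("organics.csv")                        # assumed export
X = pd.get_dummies(organics.drop(columns=["TargetBuy", "TargetAmt"]))
X = X.fillna(X.median())         # crude fill so the sketch runs
y = organics["TargetBuy"].to_numpy()
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=1)

for max_leaves in (5, 10, 20, 40, 80, 160):
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves,
                                  random_state=1).fit(X_tr, y_tr)
    ase_tr = ((tree.predict_proba(X_tr)[:, 1] - y_tr) ** 2).mean()
    ase_va = ((tree.predict_proba(X_va)[:, 1] - y_va) ** 2).mean()
    print(max_leaves, round(ase_tr, 4), round(ase_va, 4))
```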
Sample bias: the sample for this analysis covers 5 geographical regions and 13 television regions across the world. Hence different sets of consumers are represented, and sample bias should not be present in the data.
Future not being like the past: in this business case, the model has been built on past data about the supermarket's consumers. It will not always be true that a consumer who purchased the product in the past will buy it in the future; various extraneous factors may affect loyalty and consumers' purchases.
References
- Berry, M.J. and Linoff, G., 1997. Data mining techniques: for marketing, sales, and customer support. John Wiley & Sons, Inc.
- Hastie, T., Tibshirani, R. and Friedman, J., 2009. Overview of supervised learning. In The elements of statistical learning (pp. 9-41). Springer New York.