31005 Advanced Data Analytics - Exploration of the Dataset
The main thing is to choose a project that you’re interested in and passionate about.
Choice 1: Programming ID3
The first option is to program ID3 using the algorithm described in class. You need to develop software to solve a supervised learning problem (i.e. to build a model against a training set), then run the software against a test dataset and report the accuracy of the model. Your program should do the following things:
- Read a training dataset and a test dataset. The datasets are in the form of text files. See below.
- Build a model using the training data.
- Print out a representation of the model (i.e. the tree or similar).
- Run the test data against the model, work out the accuracy of the model (i.e. how many samples it classified correctly) and print out a confusion matrix to summarise the results.
The ID3 algorithm
You should build a decision tree using the ID3 algorithm given in the 3rd lecture (it is a pretty simple algorithm; feel free to learn it yourself if you choose to start this assignment before Week 3). This algorithm uses the information gain measure to calculate the splits. You should build the decision tree using the training data supplied, then calculate the error on the supplied test/validation data. Since the mushroom dataset is categorical, you will not need to consider the complexities added by real-valued attributes. There is missing data in the mushroom dataset (flagged by "?" values). Don't treat the missing data specially; simply treat "?" as another value for the attribute in question. Also, do not worry about pruning the tree.
The program must display a text representation of the decision tree. You are free to display the tree in any way you think makes sense, so long as it shows what attributes are tested at each node in the tree. It is acceptable to utilise diagnosis tools provided by machine learning packages for the display of the tree ** as long as the tree is built by your own program, i.e. it is NOT acceptable to form a 2nd tree using the package, and display the 2nd tree directly **.
Hint #1: The trick with building the decision tree is not really the ID3 algorithm, which is fairly straightforward. The tricky bit is managing the dataset. Remember that you need to be able to easily split the dataset based on the value of a specific attribute. That means you need to devise a suitable data structure that makes it easy to perform this split and to work out class frequencies.
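For illustration only (this is not part of the assignment specification), one possible data structure is a list of per-sample dictionaries, with helpers to partition the samples by an attribute's value and to count class frequencies. The names samples, split_by_attribute and class_counts below are hypothetical, a minimal sketch rather than a prescribed design:

```python
from collections import Counter, defaultdict

def split_by_attribute(samples, attribute):
    """Partition a list of dict-like samples by the value of one attribute."""
    partitions = defaultdict(list)
    for sample in samples:
        partitions[sample[attribute]].append(sample)
    return partitions  # maps attribute value -> list of samples with that value

def class_counts(samples, class_attribute="class"):
    """Frequency of each class label (e.g. edible/poisonous) in a subset."""
    return Counter(sample[class_attribute] for sample in samples)
```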
Hint #2: Think carefully about the entropy function you need to use when calculating information gain. It's not quite as simple as in our theoretical discussion. Specifically, what happens when the subset of the dataset you're looking at has only one of the two class values, i.e. all the mushrooms are edible or all are poisonous? How will you deal with this?
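One way to handle this edge case (a minimal sketch; the function name and the convention that 0 * log2(0) is treated as 0 are my own choices) is to skip zero-probability terms, so that a subset containing only one class has entropy 0:

```python
import math
from collections import Counter

def entropy(samples, class_attribute="class"):
    """Shannon entropy of the class labels in a subset of samples."""
    counts = Counter(sample[class_attribute] for sample in samples)
    total = len(samples)
    result = 0.0
    for count in counts.values():
        p = count / total
        if p > 0:                       # treat 0 * log2(0) as 0, never call log2(0)
            result -= p * math.log2(p)
    return result                        # returns 0.0 when all samples share one class
```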
Hint #3: Carefully follow the online learning materials provided in Week 3.
Choice 1-alternative: Programming an algorithm of your choice
The second option allows you to choose another algorithm to program, so long as you seek approval from me. One potential method is a multilayer perceptron neural network. You may use a supporting mathematical library to help with the details so long as you code the machine learning algorithm part yourself. Note: It is not acceptable to simply write code to call the Java Weka algorithm or the Python scikit-learn code for the algorithm. I expect you to write the main algorithm yourself. The dataset to be used for the classification (or regression) problem will need to be determined in consultation with me, but as a default we would probably use the mushroom dataset from choice 1 if it makes sense.
Choice 2: Doing a data mining project
The third choice is to use an existing package to solve a data mining problem. If you want to do this it will not be enough to just use one classification algorithm and copy the output. You need to explore the data, systematically try several algorithms and parameter settings to find the best (by evaluating the quality of the classifiers) and then provide a recommendation.
Answer
Introduction
With the emergence of big data, it has become important for industries to analyse the data available about their business processes and customers in order to improve performance. In this way, organisations can gain a competitive advantage in their respective markets.
Nowadays the retail sector has become one of the most competitive industries. In order to survive in the market, organisations rely heavily on undirected mass marketing.
Every potential customer receives similar catalogues, advertising mail, pamphlets and announcements. As a result, most customers become annoyed by the sheer number of offers, and the response rate for these campaigns drops for every organisation.
The following report covers the exploration of a dataset from a retail organisation, the modelling of the selected dataset and the implementation of different classifier algorithms. In addition, the sections of this report interpret the insights gained from exploring the selected dataset.
Exploration of the dataset
The selected BigMart dataset was collected from the link below:
The dataset contains 8,523 rows and twelve columns representing different attributes of the products: 'Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type', 'Item_MRP', 'Outlet_Identifier', 'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales'.
The statistical summary of the selected dataset is given below:
| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
| --- | --- | --- | --- | --- | --- |
| count | 7060.000000 | 8523.000000 | 8523.000000 | 8523.000000 | 8523.000000 |
| mean | 12.857645 | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
| std | 4.643456 | 0.051598 | 62.275067 | 8.371760 | 1706.499616 |
| min | 4.555000 | 0.000000 | 31.290000 | 1985.000000 | 33.290000 |
| 25% | 8.773750 | 0.026989 | 93.826500 | 1987.000000 | 834.247400 |
| 50% | 12.600000 | 0.053931 | 143.012800 | 1999.000000 | 1794.331000 |
| 75% | 16.850000 | 0.094585 | 185.643700 | 2004.000000 | 3101.296400 |
| max | 21.350000 | 0.328391 | 266.888400 | 2009.000000 | 13086.964800 |
The above table provides a statistical description of all the numerical attributes in the selected dataset. Item_Weight has a total of 7,060 rows, while Item_Visibility, Item_MRP and Item_Outlet_Sales each have 8,523 rows. From the table it is evident that the minimum, mean and maximum values of Item_MRP are 31.29, 140.99 and 266.89 respectively.
For Item_Weight, the minimum, mean and maximum values are 4.555, 12.86 and 21.35 respectively.
For the attribute Item_Visibility, the minimum value is zero. This does not make sense: if a product is sold from a store, its visibility cannot be 0.
On the other hand, Outlet_Establishment_Year ranges from 1985 to 2009, so the age of each store can be used to examine its impact on the sales of products from that store.
The lower count for Item_Weight compared to Item_Outlet_Sales indicates that there are missing values in the selected dataset.
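A minimal sketch of how this summary and the missing-value check can be reproduced with pandas; the file name Train.csv is an assumption and should be adjusted to the actual download:

```python
import pandas as pd

# Load the BigMart data (file name assumed; adjust to the actual download)
df = pd.read_csv("Train.csv")

print(df.shape)            # expected: (8523, 12)
print(df.describe())       # statistical summary of the numerical columns
print(df.isnull().sum())   # per-column count of missing values
```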
Modelling
For this project a linear regression model is used. Simple linear regression is a method for summarising and defining the relationship between two continuous variables. One, denoted x, is the independent or explanatory variable; the other, denoted y, is the dependent variable.
There are two main types of regression model: simple linear regression and multiple linear regression. Simple linear regression uses a single independent variable, whereas multiple linear regression uses more than one independent variable for prediction. When finding the best-fit line, a polynomial or curvilinear function can also be fitted; this is known as polynomial or curvilinear regression.
In implementing the linear regression model, the main objective is to fit a straight line to the distribution of the selected features. The best-fit line is the one that lies closest to all the points, which reduces the prediction error, i.e. the distance of the data points from the fitted line.
The following are some of the properties of the linear regression line:
- The plotted regression line passes through the mean of the independent variable and the mean of the dependent variable.
- The method is also known as Ordinary Least Squares (OLS), as the fitted line minimises the sum of squared residuals.
- The regression coefficient describes the change in Y for a unit change in X; in other words, it depicts how Y tends to change as the value of X increases or decreases.
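A minimal sketch of fitting an ordinary least squares line with scikit-learn, using Item_MRP to predict Item_Outlet_Sales; this single-feature choice is illustrative only, not the report's final model:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("Train.csv")       # file name assumed, as above

# Single-feature OLS fit: Item_MRP -> Item_Outlet_Sales
X = df[["Item_MRP"]]                # independent (explanatory) variable
y = df["Item_Outlet_Sales"]         # dependent variable

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("coefficient:", model.coef_[0])   # change in y per unit change in x
```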
Implementing the algorithm
Linear regression is implemented here as part of machine learning, a branch of artificial intelligence. The developed algorithm enables computer systems to adapt their behaviour and make predictions for specific data points based on the empirical data used for training (Chen and Zhang 2014). A central focus of machine learning is discovering knowledge in data by recognising patterns and making informed decisions based on that data.
Machine learning can help retailers become more precise and granular in making predictions (Davenport 2013). Examples of machine learning methods include natural language processing, association rule learning and ensemble learning (Chen and Zhang 2014).
Thematic analysis was chosen as the data analysis method for this study, as it was appropriate for identifying patterns and themes relating to the use of big data analytics. Patterns and themes were identified by following the process of thematic analysis, broken down into six stages, as briefly discussed below. 1. Becoming familiar with the data that had been gathered from the semi-structured interviews; the key approach here was transcription and comparison against the original interviews for accuracy. 2. Reading all of the transcriptions and generating codes which described interesting features of the respondents' answers. 3. Identifying themes, which involved reading all of the codes and assigning them a common theme. Examples of initial themes were: use, challenge, definition, barriers to adoption, future, design, value, technologies, development, perception and motivation. 4. Sorting the codes under their respective themes, and re-reading the coded extracts under each theme in order to identify sub-themes. The process of identifying sub-themes grouped common codes together and removed some of the codes which did not form part of a coherent group.
Findings
For the selected dataset, the number of unique values in each of the columns was investigated, as listed below:
| Attribute | count |
| --- | --- |
| Item_Identifier | 1559 |
| Item_Weight | 416 |
| Item_Fat_Content | 5 |
| Item_Visibility | 7880 |
| Item_Type | 16 |
| Item_MRP | 5938 |
| Outlet_Identifier | 10 |
| Outlet_Establishment_Year | 9 |
| Outlet_Size | 4 |
| Outlet_Location_Type | 3 |
| Outlet_Type | 4 |
| Item_Outlet_Sales | 3493 |
From the above table it can be seen that the dataset contains 10 unique outlets, each with its own identifier, and 1,559 different items.
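These counts can be obtained directly with pandas; a short sketch, again assuming the hypothetical file name Train.csv:

```python
import pandas as pd

df = pd.read_csv("Train.csv")     # file name assumed
print(df.nunique())               # number of distinct values per column
```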
In the next stage of the project, the most important factors affecting the sales of products at a store were investigated. The relationships between the different factors were examined through their correlations, as depicted in the heat map below.
From the heat map, it is evident that Item_MRP has the most significant impact on the outlet sales of products from the stores.
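A minimal sketch of how such a correlation heat map can be produced with pandas and seaborn, using the same assumed Train.csv file:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("Train.csv")                        # file name assumed

# Correlation matrix of the numeric columns, shown as a heat map
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation between numeric attributes")
plt.tight_layout()
plt.show()
```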
At this stage, the distribution of items by fat content and of outlets by size is examined. These distributions are given below.
The frequencies of Item_Fat_Content in the dataset are:
| Item_Fat_Content | Frequency |
| --- | --- |
| Low Fat | 8485 |
| Regular | 4824 |
| low fat | 522 |
| LF | 195 |
| reg | 178 |
For the different outlet sizes, the frequencies of the stores are:
| Outlet_Size | Frequency |
| --- | --- |
| Medium | 2793 |
| Small | 2388 |
| High | 932 |
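The frequency tables above can be reproduced with value_counts(); the exact counts depend on whether the training file alone or the combined training and test files are loaded. Note that 'Low Fat', 'low fat' and 'LF' (and 'Regular'/'reg') are clearly the same categories recorded inconsistently, so a label clean-up such as the sketch below (the replacement mapping is my own suggestion) is advisable before modelling:

```python
import pandas as pd

df = pd.read_csv("Train.csv")                      # file name assumed

print(df["Item_Fat_Content"].value_counts())
print(df["Outlet_Size"].value_counts())

# Merge the inconsistent spellings into two canonical labels
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(
    {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"}
)
print(df["Item_Fat_Content"].value_counts())       # now only Low Fat / Regular
```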
Further findings
In further investigation, the relationship between the different item types and item weight was examined. A box plot was used to show this relationship, leading to the following plots.
From the plots it is evident that household products and seafood are available in a particularly wide range of weights.
The following plot depicts several variables together, which helps in determining the relationships between them.
From this plot it can be said that most of the stores established before 1990 have larger sales than the newer BigMart stores. Moreover, items with weights between 10 and 15 are the ones most commonly sold by the stores.
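A minimal sketch of the box plot of Item_Weight by Item_Type, assuming the same data file as in the earlier snippets:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("Train.csv")                      # file name assumed

plt.figure(figsize=(12, 5))
sns.boxplot(data=df, x="Item_Type", y="Item_Weight")
plt.xticks(rotation=90)                            # 16 item types need rotated labels
plt.tight_layout()
plt.show()
```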
Difficulties encountered
In this data mining project, it was found that numerous rows in the dataset contain missing values, which may lead to an incorrect classifier being fitted. The missing values therefore had to be removed or imputed in order to make the dataset, and the resulting model, reliable.
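A minimal sketch of one common imputation strategy for this dataset (mean for the numeric Item_Weight, mode for the categorical Outlet_Size); the specific strategy is an assumption, not the only reasonable choice:

```python
import pandas as pd

df = pd.read_csv("Train.csv")                      # file name assumed

# Numeric attribute: fill missing weights with the mean weight
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

# Categorical attribute: fill missing outlet sizes with the most frequent value
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

print(df.isnull().sum())                           # should now report no missing values
```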
The analysis of big data to gain insights is a relatively new idea. Big data analytics has been defined in various ways, and there appears to be a lack of consensus on the definition. It has been defined in terms of the technologies and techniques used to analyse large-scale, complex data to help improve the performance of a firm. Kwon et al. (2014) characterise big data analytics as the application of advanced analytical techniques to large data sets. Fisher et al. (2012) define big data analytics as a workflow which distils terabytes of low-value data down into more granular data of high value. For the purposes of this report, big data analytics is defined as the use of analytical techniques and technologies to analyse big data in order to obtain information that is of value for decision-making.
The majority of big data is available in an unstructured state that does not, by itself, provide any value to business organisations. With the right set of tools and analytics, however, it is possible to extract relevant insights from unstructured data.
With a carefully crafted model, it is possible to make predictions for a desired feature selected from the database. A predictive model built on the selected dataset can help in finding trends and insights about sales and business processes, which in turn drive operational efficiencies. In this way an organisation can create and launch new products and gain advantages over its competitors. Exploiting the value of big data in this way also removes the tremendous effort that would otherwise be required for sampling.
Furthermore, the analysis of big data can bring other benefits. These include the launch of new products and services with customer-centric product recommendations that better meet customer requirements. In this way data analytics can facilitate growth in the market.
Previously, obtaining insights from such a huge amount of data was too costly to be practical.
Evaluation
The underlying concept of this project was knowledge discovery, also known as data mining, through the use of different packages in the Python language. The core of this process is machine learning, which requires defining the features for the classification process.
Descriptive analytics is the set of techniques used to describe and report on the past (Davenport and Dyché 2013). Retailers can use descriptive analytics to describe and summarise sales by location and stock levels. Examples of such techniques include data visualisation, descriptive statistics and some data mining methods (Camm et al., 2014).
Predictive analytics consists of a set of techniques which use statistical models and empirical methods on past data in order to make accurate predictions about the future or to determine the effect of one variable on another.
In the retail industry, predictive analytics can extract patterns from data to make forecasts about future sales, repeat visits by customers and the likelihood of an online purchase (Camm et al. 2014). Examples of predictive techniques which can be applied to big data include data mining methods and linear regression.
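A minimal sketch of how such a predictive model can be evaluated on a random split of the data; the feature set, test size and metrics below are illustrative choices, not the report's final configuration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("Train.csv")                                 # file name assumed
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

features = ["Item_Weight", "Item_Visibility", "Item_MRP"]     # illustrative feature set
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Item_Outlet_Sales"], test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, predictions) ** 0.5)
print("R^2 :", r2_score(y_test, predictions))
```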
Conclusion
In order to avoid a declining response and success rate, it is important to provide personalised recommendations to customers that are specific to their needs. To do that, it is important to determine the factors that have a significant impact when fitting the regression line. For this, feature engineering is the most important stage, where the variables are selected for the modelling process.
For this project, multiple models were investigated from different perspectives using the selected dataset. The selected dataset is small in size, which may limit its usefulness for an organisation's large-scale sales model. Throughout the project it was found that the dataset has missing values for several attributes in numerous rows; these were managed in the data cleansing process in order to build a better classification model.
Even though different combinations of feature sets were used for modelling, the results deviated from each other because of the noisy dataset. Nevertheless, the models which yielded higher accuracies suggest that the dataset contains a demonstrable amount of useful information.
Some of the noise may have been introduced by the random split of the complete dataset, and removing it remains an area for improvement. The results of the project indicate that linear models are not especially suitable for this kind of task, as linear regression was designed to predict continuous attributes rather than categorical data.
References
Davenport, T.H. and Dyché, J., 2013. Big data in big companies. International Institute for Analytics, 3.