Ict110 Data Setup Answers Assessment Answers

Submit your assignment to Blackboard Task 2. Please follow the submission instructions on Blackboard. The assignment will be marked out of a total of 100 marks and forms 30% of the total assessment for the course. ALL assignments will be checked for plagiarism automatically by the SafeAssign system provided by Blackboard. Refer to your Course Outline or the Course Web Site for a copy of the “Student Misconduct, Plagiarism and Collusion” guidelines. Assignment submission extensions will only be granted under the official Faculty of Arts, Business and Law Guidelines. Requests for an extension to an assignment MUST be made to the course coordinator prior to the date of submission; requests made on the day of submission or after the submission date will only be considered in exceptional circumstances.

ICT110 Introduction to Data Science Assignment 2

Background

A research team planned to study the health development of the world over the past 15 years. The team retrieved a dataset of Health and Population Statistics between 2001 and 2015 from the World Bank.

The dataset contains the following attributes:

• Birth rate, crude (per 1,000 people)

• Fertility rate, total (births per woman)

• Adolescent fertility rate (births per 1,000 women ages 15-19)

• Death rate, crude (per 1,000 people)

• Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total)

• Cause of death, by injury (% of total)

• Cause of death, by non-communicable diseases (% of total)

• Mortality caused by road traffic injury (per 100,000 people)

• Health expenditure per capita (current US$)

• GNI per capita, Atlas method (current US$)

• Health expenditure, private (% of GDP)

• Health expenditure, public (% of GDP)

• Health expenditure, total (% of GDP)

• Maternal mortality ratio (national estimate, per 100,000 live births)

• Immunization, BCG (% of one-year-old children)

• Life expectancy at birth, male (years)

• Life expectancy at birth, female (years)

• Life expectancy at birth, total (years)

• School enrollment, primary (% gross)

• School enrollment, secondary (% gross)

• School enrollment, tertiary (% gross)

• School enrollment, tertiary, female (% gross)

• Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)

• Unemployment, female (% of female labor force) (modeled ILO estimate)

• Unemployment, male (% of male labor force) (modeled ILO estimate)

• Unemployment, total (% of total labor force) (modeled ILO estimate)

More details about the data attributes and data content can be found in the attached documents.

Assignment Task

You are a member of the team and need to perform data analysis on countries in the region of East Asia & Pacific. The team has not set any specific goal for the analysis. Therefore, you have the freedom to explore the data and dig out anything you find interesting or significant.

You have been requested to prepare a data analysis report about your work and explain your findings. The potential audiences include other researchers, business representatives, and government agencies. They may have limited ICT or mathematical knowledge. To prepare the report, please follow the following outline:

1. Introduction Provide an introduction to the problem. Include background material as appropriate: who cares about this problem, what impact it has, where does the data come from.

2. Data Setup Describe how to load the data, and the libraries needed. Provide an overview of the data about its dimensions and structures.

3. Exploratory Data Analysis Perform 3 one-variable analyses. Plot at least one graph for each variable. Explain why the selected graph is appropriate. Perform 2 two-variable analyses. Plot at least one graph for each analysis. Explain why the selected graph is appropriate. The analysis can be performed on all years and all countries, or on a subset of your interest.

4. Advanced Analysis

4.1 Clustering Briefly explain the concept of clustering and k-means. Try to do a clustering analysis to group countries according to some selected attributes.

4.2 Linear Regression Briefly explain the concept of linear regression. Try to do 2 linear regression analyses. Plot the learned models. The analysis can be performed on all years and all countries, or on a subset of your interest.

Answer:

1. Introduction

This research examines the health and development conditions of New Zealand. A subset of the original dataset collected from the World Bank has been used to keep the analysis simple. New Zealand has been chosen because it is a moderately populated country, and it was of interest to understand the health conditions of such a country. The factors considered for analysis are total unemployment, gross national income (GNI) per capita and life expectancy at birth. Unemployment is known to be a factor that directly affects the gross national income of a country. It might also affect life expectancy at birth. The relationships between these variables are examined in this research.

2. Data Setup

The data have been extracted from the World Bank in comma-delimited (.csv) format. As the whole analysis has been performed in RStudio, the exported file was imported into R. The data contain 26 attributes over 15 years, from 2001 to 2015, for the countries of East Asia and the Pacific. Only the part of the data needed for the analysis was extracted. The year 2015 has been eliminated because a lot of information was missing for that year. Table 2.1 shows the R-Codes for data extraction.

Various libraries have been used in R to run the analysis. These libraries are listed as follows:

dplyr: This library is used for filtering the data
ggplot2: This is a library that is used to plot the data
reshape2: With the help of this library, the data can be reshaped
cluster: This library is loaded for the k-means clustering analysis (the kmeans() function itself is part of the stats package that ships with R)

Table 2.1: R-Codes for the Libraries Used and the Extraction of the Data

# Libraries necessary for analysis

library(dplyr)

library(ggplot2)

library(reshape2)

library(cluster)

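# Extracting the data ("UTF-8-BOM" removes the byte-order mark that otherwise corrupts the first column name)

data <- read.csv(file.choose(), sep = ",", header = TRUE, na.strings = "..", fileEncoding = "UTF-8-BOM")

# Keep only the rows for New Zealand

data <- filter(data, Country.Code == "NZL")

# Drop the identifier columns and the sparsely populated 2015 column

data <- subset(data, select = -c(Country.Name, Country.Code, Series.Name, X2015..YR2015.))

# Remove every series that has a missing value in any year

data <- na.omit(data)

# Reshape to long format (one row per series and year), then to one column per series

data <- melt(data, id.vars = "Series.Code")

data <- dcast(data, variable ~ Series.Code, value.var = "value")

# Make the series columns available by name for the analysis below

attach(data)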
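The outline also asks for an overview of the dimensions and structure of the data. The following is a minimal sketch of how that overview could be produced. It uses a small made-up data frame called `toy` (a hypothetical stand-in with invented values, not the World Bank figures) that imitates the layout of the export, with one row per series and one column per year, and it uses only base R.

```r
# Toy stand-in for the World Bank export: two series, three year columns.
# All values are invented for illustration only.
toy <- data.frame(
  Series.Code    = c("NY.GNP.PCAP.CD", "SL.UEM.TOTL.ZS"),
  X2001..YR2001. = c(100, 5.0),
  X2002..YR2002. = c(110, 4.5),
  X2003..YR2003. = c(120, 4.0),
  stringsAsFactors = FALSE
)

# Turn "one row per series" into "one row per year" by transposing the year columns
wide <- as.data.frame(t(toy[, -1]))
names(wide) <- as.character(toy$Series.Code)

# Recover the year from column names such as "X2001..YR2001."
wide$Year <- as.integer(substr(rownames(wide), 2, 5))
rownames(wide) <- NULL

dim(wide)    # number of rows (years) and columns
str(wide)    # type of each column
head(wide)   # first few rows
```

Applied to the real `data` object, the same three calls (`dim`, `str`, `head`) would report how many years and series survive the filtering steps.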

3. Exploratory Data Analysis

Descriptive analysis is conducted first for the chosen variables. The first variable is the gross national income (GNI) per capita of New Zealand. The standard deviation of GNI is 8741.171, which is quite high relative to the mean of 27,500. This indicates that the yearly GNI values are not close to the average and are quite scattered. The distribution of GNI is shown with a boxplot in figure 3.1. Negative skewness is observed from the figure, which is consistent with the mean (27,500) lying below the median (28,370).

Table 3.1: R-Codes to Obtain Summary Statistics for Gross National Income

# summary statistics

summary(NY.GNP.PCAP.CD)

sd(NY.GNP.PCAP.CD)

var(NY.GNP.PCAP.CD)

# boxplot

boxplot(NY.GNP.PCAP.CD, main = "Boxplot for Gross National Income", xlab = "Gross National Income", col = 5)

Table 3.2: Results of the Summary Statistics for Gross National Income

Min.: 13800

1st Qu.: 22470

Median : 28370

Mean : 27500

3rd Qu. : 31610

Max. : 41670

St. Dev. : 8741.171

Variance : 76408069

The second variable analysed is the total unemployment of New Zealand. The standard deviation of total unemployment is 1.136, which is quite low. This indicates that the yearly unemployment values are close to the average and are not widely scattered. The distribution of total unemployment is shown with a histogram in figure 3.2. Approximate symmetry is observed from the figure.

Table 3.3: R-Codes to Obtain Summary Statistics for Total Unemployment

# Summary

summary(SL.UEM.TOTL.ZS)

sd(SL.UEM.TOTL.ZS)

var(SL.UEM.TOTL.ZS)

# histogram

hist(SL.UEM.TOTL.ZS, main = "Histogram for Total Unemployment", xlab = "Total Unemployment", col = 5)

Table 3.4: Results of the Summary Statistics for Total Unemployment

Min.: 3.700

1st Qu. : 4.050

Median : 5.350

Mean : 5.207

3rd Qu. : 6.175

Max. : 6.900

St. Dev. : 1.136435

Variance : 1.291483

The third variable analysed is the life expectancy at birth in New Zealand. The standard deviation of life expectancy at birth is 0.9044, which is quite low. This indicates that the yearly values of life expectancy at birth are close to the average and are not widely scattered. The distribution of life expectancy at birth is shown with a boxplot in figure 3.3. Approximate symmetry is observed from the figure.

Table 3.5: R-Codes to Obtain Summary Statistics for Total Life expectancy at Birth

# summary

summary(SP.DYN.LE00.IN)

sd(SP.DYN.LE00.IN)

var(SP.DYN.LE00.IN)

# boxplot

boxplot(SP.DYN.LE00.IN, main = "Boxplot for Total Life Expectancy at Birth", xlab = "Total Life Expectancy at Birth", col = 5)

Table 3.6: Results of the Summary Statistics for Total Life expectancy at Birth

Min.: 78.69

1st Qu. : 79.62

Median : 80.25

Mean : 80.21

3rd Qu. : 80.85

Max. : 81.41

St. Dev. : 0.9043788

Variance : 0.8179011
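The three standard deviations above are measured in different units (US$, percent and years), so calling one "high" and another "low" is easier to justify on a common scale. One possible way to do this, offered here as an illustrative sketch and not as part of the original analysis, is the coefficient of variation (standard deviation divided by mean). The figures are copied from Tables 3.2, 3.4 and 3.6; the vector names are hypothetical.

```r
# Means and standard deviations as reported in Tables 3.2, 3.4 and 3.6
means <- c(gni = 27500, unemployment = 5.207, life_expectancy = 80.21)
sds   <- c(gni = 8741.171, unemployment = 1.136435, life_expectancy = 0.9043788)

# Coefficient of variation: spread relative to the size of the mean
cv <- sds / means
round(cv, 2)
```

On this scale GNI varies by roughly a third of its mean and unemployment by roughly a fifth, while life expectancy varies by about one percent, which supports the descriptions of GNI as widely scattered and life expectancy as tightly concentrated.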

Table 3.7: R-Codes for Scatterplot and Correlation between GNI and Unemployment

# scatterplot

plot(SL.UEM.TOTL.ZS~NY.GNP.PCAP.CD, data=data, main="Scatterplot of Gross National Income and Total Unemployment",xlab="Gross National Income (Per Capita)", ylab="Total Unemployment (% of total labor force)", col=2, pch=19)

# correlation coefficient

cor(SL.UEM.TOTL.ZS, NY.GNP.PCAP.CD)

Table 3.8: R-Codes for Scatterplot and Correlation between GNI and Life Expectancy at Birth

# scatterplot (life expectancy on the y-axis, GNI on the x-axis, matching the axis labels)

plot(SP.DYN.LE00.IN ~ NY.GNP.PCAP.CD, data=data, main="Scatterplot of Gross National Income and Total Life Expectancy at Birth",xlab="Gross National Income (Per Capita)", ylab="Total Life Expectancy at Birth (years)", col=2, pch=19)

# correlation coefficient

cor(NY.GNP.PCAP.CD, SP.DYN.LE00.IN)

After the exploratory analysis, an advanced analysis is conducted on the variables, consisting of a clustering analysis and a regression analysis. From the analysis conducted so far, the relationship between GNI and life expectancy is very strong, while that between GNI and unemployment is moderate.
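The correlation coefficients themselves are not printed in this report, but in a simple linear regression with one predictor the R-squared equals the square of the correlation coefficient. As a rough sketch, the correlations can therefore be recovered from the R-squared values reported in the regression output later in this report (0.1774 and 0.9697). Both fitted slopes are positive, so the positive square root is taken; the vector names are hypothetical.

```r
# R-squared values taken from the two regression summaries in this report
r_squared <- c(unemployment_vs_gni = 0.1774, life_expectancy_vs_gni = 0.9697)

# With a single predictor, |r| = sqrt(R-squared); the sign follows the slope
r <- sqrt(r_squared)
round(r, 3)
```

This gives a correlation of roughly 0.42 between GNI and unemployment and roughly 0.98 between GNI and life expectancy, which matches the "moderate" and "very strong" descriptions above.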

4.1 Cluster Analysis

The relationships obtained above are the main motivation for the clustering analysis. The most strongly related variables, GNI and life expectancy, have been chosen for clustering with k-means. In k-means, the rows of a data frame are grouped into clusters according to their closeness to the cluster means (Guha and Mishra 2016).

Table 4.1 provides the R-Codes that have been used to conduct the k-means clustering analysis (Celebi, Kingravi and Vela 2013). The analysis is represented diagrammatically in figure 4.1.

Table 4.1: R-Codes for Clustering Analysis

# Data extraction for k-means clustering

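# Read the full file into a separate object so that the New Zealand subset in data is not overwritten

alldata <- read.csv(file.choose(), sep = ",", header = TRUE, na.strings = "..", fileEncoding = "UTF-8-BOM")

data3 <- filter(alldata, Series.Code %in% c("NY.GNP.PCAP.CD" , "SP.DYN.LE00.IN"))

data3 <- subset(data3, select = -(X2015..YR2015.))

# melt() takes the identifier columns through id.vars; the year columns become the measured values

data3 <- melt(data3, id.vars = c("Series.Name", "Series.Code", "Country.Name", "Country.Code"))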

data4 <- dcast(data3, formula = Country.Code ~ Series.Code, mean)

data4 <- na.omit(data4)

View(data4)

# Clustering

grpdata <- kmeans(data4[,c("NY.GNP.PCAP.CD" , "SP.DYN.LE00.IN")],centers = 3, nstart = 10)

grpdata

o = order(grpdata$cluster)

data.frame(data4$Country.Code[o], grpdata$cluster[o])

# plotting data

plot(data4$NY.GNP.PCAP.CD, data4$SP.DYN.LE00.IN, type="n", xlim=c(0,50000), main="k-means Clustering" ,xlab="Gross National Income", ylab="Life Expectancy at Birth")

text(x=data4$NY.GNP.PCAP.CD,y=data4$SP.DYN.LE00.IN,labels=data4$Country.Code,col=grpdata$cluster+1)
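To illustrate the k-means idea on its own, the sketch below clusters a small synthetic data set; the values are randomly generated and are not World Bank figures, and the names `income`, `life` and `toy` are hypothetical. It also standardises the two attributes with `scale()` before clustering. That step is an assumption of this sketch and is not part of the analysis in Table 4.1; it is shown because income in US$ has a far larger numeric range than life expectancy in years and would otherwise dominate the distance calculation.

```r
set.seed(1)

# Synthetic stand-in: five low-income and five high-income "countries"
income <- c(rnorm(5, mean = 2000,  sd = 200),
            rnorm(5, mean = 40000, sd = 2000))
life   <- c(rnorm(5, mean = 65, sd = 1),
            rnorm(5, mean = 81, sd = 1))
toy <- data.frame(income, life)

# Standardise so that both attributes contribute comparably to the distance
toy_scaled <- scale(toy)

# k-means with two clusters and ten random starts; the best start is kept
fit <- kmeans(toy_scaled, centers = 2, nstart = 10)

fit$cluster                                           # cluster label of each row
table(fit$cluster, rep(c("low", "high"), each = 5))   # labels against the known groups
```

Each row is assigned to the cluster whose mean it is closest to, and the means are recomputed until the assignments stop changing. Using several random starts (`nstart`) reduces the chance of ending in a poor local solution.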

4.2 Linear Regression

To establish the nature and strength of the relationship obtained in the correlation analysis, regression analysis has been performed. In this analysis, the value of the dependent variable is predicted from the independent variable (Montgomery, Peck and Vining 2015). The relationship is denoted by the following formula:

y = β₀ + β₁x + ε

Here, x and y are the independent and the dependent variables respectively, and ε is the error term. The intercept β₀ represents the expected value of y when x is zero, and the slope β₁ indicates the amount of increase or decrease in the value of y for a one-unit increase in the value of x (Kabacoff 2015).
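As a small illustration of how β₀ and β₁ are estimated, the sketch below generates synthetic data from a known line and compares the textbook least-squares formulas with the result of `lm()`. The data and the true coefficient values (2 and 0.5) are assumptions chosen for the example and have no connection to the World Bank data.

```r
set.seed(42)

# Synthetic data from a known line: beta0 = 2, beta1 = 0.5, plus random noise
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)

# Least-squares estimates for a single predictor
beta1 <- cov(x, y) / var(x)            # slope
beta0 <- mean(y) - beta1 * mean(x)     # intercept

# The same estimates from R's built-in linear model function
fit <- lm(y ~ x)

c(beta0 = beta0, beta1 = beta1)
coef(fit)
```

Both approaches return the same pair of numbers, which shows that `lm()` with one predictor is computing exactly this intercept and slope.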

A regression of total unemployment on GNI is fitted first, with unemployment as the dependent variable. The results show that 17.74 percent of the variability in unemployment can be explained by GNI (R-Square). The relationship is expressed with the following equation:

Total Unemployment = 3.701 + (0.0000548 * GNI) + Error

# Regression for unemployment on GNI

Reg1 <- lm(formula = SL.UEM.TOTL.ZS ~NY.GNP.PCAP.CD, data = data)

summary(Reg1)

# plotting line

plot1 <- ggplot(data, aes(x= NY.GNP.PCAP.CD, y= SL.UEM.TOTL.ZS)) + geom_point(shape=1) + scale_x_continuous(name = "Gross National Income") + scale_y_continuous(name = "Total Unemployment")+ geom_smooth(method=lm) +theme_bw()+ ggtitle("Regression of Total Unemployment on Gross National Income")

plot1

Residuals:

Min 1Q Median 3Q Max

-1.5421 -1.0199 0.2389 0.9129 1.1836

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.701e+00 9.790e-01 3.780 0.00262 **

NY.GNP.PCAP.CD 5.477e-05 3.404e-05 1.609 0.13360

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.073 on 12 degrees of freedom

Multiple R-squared: 0.1774, Adjusted R-squared: 0.1089

F-statistic: 2.589 on 1 and 12 DF, p-value: 0.1336

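The second regression takes life expectancy at birth as the dependent variable and GNI as the independent variable. The results show that 96.97 percent of the variability in life expectancy can be explained by GNI (R-Square). The relationship is expressed with the following equation:

Life Expectancy at Birth = 77.41 + (0.000102 * GNI) + Error

# Regression of Life Expectancy at birth on GNI

Reg2 <- lm(formula = SP.DYN.LE00.IN ~ NY.GNP.PCAP.CD, data = data)

summary(Reg2)

# plotting line

plot8 <- ggplot(data, aes(x=NY.GNP.PCAP.CD, y=SP.DYN.LE00.IN)) + geom_point(shape=1) + scale_x_continuous(name = "Gross National Income") + scale_y_continuous(name = "Total Life Expectancy at Birth")+ geom_smooth(method=lm) +theme_bw()+ ggtitle("Regression of Life Expectancy at Birth on Gross National Income")

plot8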

Residuals:

Min 1Q Median 3Q Max

-0.24688 -0.10782 -0.02588 0.02552 0.27936

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.741e+01 1.496e-01 517.33 < 2e-16 ***

NY.GNP.PCAP.CD 1.019e-04 5.202e-06 19.58 1.78e-10 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.164 on 12 degrees of freedom

Multiple R-squared: 0.9697, Adjusted R-squared: 0.9671

F-statistic: 383.5 on 1 and 12 DF, p-value: 1.783e-10
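As a quick check of how the two fitted equations are used, the sketch below plugs the mean GNI from Table 3.2 into each of them. A least-squares line always passes through the point of means, so the predictions should land close to the mean unemployment (5.207) and mean life expectancy (80.21) reported earlier. The mean GNI of 27500 is the rounded value printed by `summary()`, so only approximate agreement is expected; the variable names are hypothetical.

```r
# Mean GNI per capita as printed in Table 3.2 (rounded by summary())
mean_gni <- 27500

# Coefficients copied from the two regression summaries
unemployment_hat    <- 3.701 + 5.477e-05 * mean_gni
life_expectancy_hat <- 77.41 + 1.019e-04 * mean_gni

round(c(unemployment = unemployment_hat,
        life_expectancy = life_expectancy_hat), 2)
```

The same arithmetic gives a prediction for any other GNI value within the observed range of roughly 13,800 to 41,670; predictions outside that range would be extrapolation.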

From the analysis, GNI and life expectancy at birth have shown an extremely strong positive relationship. Thus, a high GNI appears to go together with a longer life expectancy at birth. The relationship between GNI and total unemployment was not significant.

6. Reflections

I faced a lot of problems in handling the large dataset, which had many missing values. A lot of extraction had to be done to obtain a proper subset that was fit for the analysis.

References

Celebi, M.E., Kingravi, H.A. and Vela, P.A., 2013. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1), pp.200-210.

Guha, S. and Mishra, N., 2016. Clustering data streams. In Data Stream Management (pp. 169-187). Springer Berlin Heidelberg.

Kabacoff, R., 2015. R in action: data analysis and graphics with R. Manning Publications Co.

Montgomery, D.C., Peck, E.A. and Vining, G.G., 2015. Introduction to linear regression analysis. John Wiley & Sons.
