FIT5197 Statistical Data Modelling
Build a linear regression model using the specific “auto mpg train.csv” provided with the assignment to predict mpg (miles per gallon). The second file “auto mpg test.csv” will be used for evaluation.
There are some missing values listed as “?”. Describe your strategy for treating missing values and update (edit by hand) the file accordingly.
What does this imply about the predictors for your model?
Can you improve your model with different predictors?
Try out some different ratios or products of the better predictor variables. How will you evaluate the different alternative predictors on your existing model (not using the test set)?
There are some missing values listed as “?”. Describe your strategy for treating missing values, but note that it is sometimes OK to leave a missing value as a separate categorical value (we call this “missing informative”).
Consider a binomial distribution with n=500 and θ=0.001. Use the appropriate CDF functions in R to compute p(k<10 | n=500, θ=0.001)
(a) the exact value
(b) the value according to the Gaussian approximation in lectures
(c) the value according to the Poisson approximation in lectures
(d) write down a concise formula for the exact value.
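The quantities in (a) to (c) can be sketched numerically as follows. This is a minimal sketch in Python rather than R, using only the standard library; the corresponding R calls (`pbinom`, `pnorm`, `ppois`) are noted in the comments. It assumes that p(k<10) means P(k ≤ 9), and that the Gaussian approximation uses mean nθ, variance nθ(1 – θ) and a continuity correction at 9.5; the version given in lectures may differ on that last point. The sum in (a) is also one way to write the concise formula asked for in (d).

```python
from math import comb, exp, factorial, sqrt
from statistics import NormalDist

n, theta = 500, 0.001

# (a) exact binomial CDF, P(k <= 9); in R: pbinom(9, 500, 0.001)
exact = sum(comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(10))

# (b) Gaussian approximation with continuity correction (an assumption);
#     in R: pnorm(9.5, mean, sd)
mean = n * theta
sd = sqrt(n * theta * (1 - theta))
gaussian = NormalDist(mean, sd).cdf(9.5)

# (c) Poisson approximation with rate n * theta; in R: ppois(9, 0.5)
lam = n * theta
poisson = sum(exp(-lam) * lam**k / factorial(k) for k in range(10))

print(exact, gaussian, poisson)
```

Because the expected count nθ is only 0.5, all three values are extremely close to 1, and the Poisson value tracks the exact one far more closely than the Gaussian value does.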
IQ is supposed to be Gaussian with a mean of 100 and a standard deviation of 15. At a high-school reunion, where everyone attends, 2 of your classmates out of 40 claim to have IQs greater than 150. What is the probability that 2 or more would have an IQ greater than 140? Represent your solution as an expression of θ=p(IQ>140) and give θ.
Answer:
Question one
- P(A ∪ B ∪ C) = P(A) + P(B) + P(C) – P(A ∩ B) – P(A ∩ C) – P(B ∩ C) + P(A ∩ B ∩ C)
- i.e., to eliminate double counting, we add the probabilities of the individual events and then subtract the intersection of each pair of events. Outcomes common to all three events are added three times and then subtracted three times, so the intersection of all three events is added back.
P(A ∪ B ∪ C ∪ D) = P(A) + P(B) + P(C) + P(D) – P(A ∩ B) – P(A ∩ C) – P(A ∩ D) – P(B ∩ C) – P(B ∩ D) – P(C ∩ D) + P(A ∩ B ∩ C) + P(A ∩ B ∩ D) + P(A ∩ C ∩ D) + P(B ∩ C ∩ D) – P(A ∩ B ∩ C ∩ D).
i.e., the reasoning is the same as in the previous case, but we now subtract the intersection of all four events, because adding back the four triple intersections counts it once too often.
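The alternating add-and-subtract pattern can be checked on a small example. The sketch below is illustrative only: it assumes equally likely outcomes, so that counting elements stands in for probabilities, and the sets A, B, C and D are hypothetical.

```python
from itertools import combinations

def union_size_by_inclusion_exclusion(sets):
    """Size of the union via alternating sums of intersection sizes."""
    total = 0
    for r in range(1, len(sets) + 1):
        sign = 1 if r % 2 == 1 else -1  # add odd-sized groups, subtract even
        for group in combinations(sets, r):
            total += sign * len(set.intersection(*group))
    return total

# Hypothetical events on a small sample space
A, B, C, D = {1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}, {1, 4, 7, 8}

three = union_size_by_inclusion_exclusion([A, B, C])
four = union_size_by_inclusion_exclusion([A, B, C, D])
print(three, len(A | B | C))
print(four, len(A | B | C | D))
```

Dividing each count by the size of the sample space gives the probability form of the same identity.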
Question two
P(x) = 1 -
Q(P) = = since we are only interested in the functional form
This is also the inverse CDF since it is not invertible.
10% quantile, Q(P) = = 0.1
2x – x^2 = 0.2
x = (2 ± √(4 – 0.8)) / 2
= 1.894 and 0.1056.
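As a quick check of the arithmetic, the quadratic 2x – x^2 = 0.2 can be solved directly. This is a minimal sketch; it only solves the equation as written and does not decide which root is the quantile, since only a root inside the support of the distribution is valid and the support is not restated in this answer.

```python
from math import sqrt

# Rearranged form: x^2 - 2x + 0.2 = 0
a, b, c = 1.0, -2.0, 0.2
disc = b * b - 4 * a * c  # discriminant, 4 - 0.8 = 3.2

roots = sorted([(-b - sqrt(disc)) / (2 * a), (-b + sqrt(disc)) / (2 * a)])
print([round(r, 4) for r in roots])  # [0.1056, 1.8944]
```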
Question three
H(G) = 1/k i.e since the probability of the groups is the same.
H(I) = 1/nk
H(I/G) = H(G) * H(I)
H(G,I) = H(G) + H(I).
Question four
Mean (x) =
= q + 1-q
Variance(x) = E(x^2) –(E(x))^2
= q + 1-q –
Question five.
Check the required R code attached.
Question six
First we construct the Z score as:
Z = (140 – 100) / 15
= 2.67
From the Z table, P(Z > 2.67) = 0.0038, hence θ = p(IQ > 140) = 0.0038.
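K ~ Bin(40, θ), where K is the number of classmates with an IQ greater than 140 and θ = 0.0038.
P(K ≥ 2) = 1 – (1 – θ)^40 – 40θ(1 – θ)^39
= 0.0102.
Hence the probability of two or more people having an IQ of more than 140 is about 0.0102.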
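The calculation can be sketched in a few lines. This is a minimal sketch, assuming the number of classmates K with an IQ above 140 follows Bin(40, θ) and that "2 or more" means P(K ≥ 2) = 1 – P(K = 0) – P(K = 1). It uses the unrounded z = 40/15 rather than the table value, so θ and the final probability can differ slightly from hand-rounded figures.

```python
from math import comb
from statistics import NormalDist

# theta = p(IQ > 140) for IQ ~ N(100, 15^2)
theta = 1 - NormalDist(mu=100, sigma=15).cdf(140)

n = 40
# P(K >= 2) = 1 - P(K = 0) - P(K = 1)
p_at_least_two = 1 - sum(
    comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(2)
)
print(round(theta, 4), round(p_at_least_two, 4))
```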
Question seven
∼N(0,)
∼N(0,1)
∴ ∼χ2 (1) = Γ(1/2,2)
∼Γ(1/2,2) = Γ(1/2,2)
E ~ = Γ(3/2,2)
Mean = (3/2)(2)
= 3
Variance = (3/2)(2^2)
= 6
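The mean of 3 and variance of 6 can be checked by simulation. The sketch below assumes the derivation concerns the sum of three independent squared N(0,1) variables, which is what combining three Γ(1/2, 2) terms into Γ(3/2, 2) suggests; the variable names in the original working are not recoverable, so this is an illustration rather than a reproduction of it.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n_samples = 100_000
# Each sample: sum of three squared independent standard normals
samples = [
    sum(random.gauss(0, 1) ** 2 for _ in range(3)) for _ in range(n_samples)
]

sample_mean = sum(samples) / n_samples
sample_var = sum((x - sample_mean) ** 2 for x in samples) / n_samples
print(sample_mean, sample_var)
```

With this many samples the two printed values should land close to 3 and 6 respectively.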
Question eight
The samples are from normally distributed populations, hence the sample means are also normally distributed.
We test the hypothesis for the difference in mean.
Mean(A) = 11.17
Mean(B) = 11.9875
n(A) = 10
n(B) = 8
H0: μ(A) = μ(B)
H1: μ(A) ≠ μ(B)
At the 5% level of significance, the critical values of Z are -1.96 and +1.96, hence we reject the null hypothesis if -1.96 ≤ Z ≤ 1.96 is not met.
Z = (Mean(A) – Mean(B)) / √(Variance(A)/n(A) + Variance(B)/n(B))
= -28.19.
We reject the null hypothesis since -28.19 is less than -1.96, hence we conclude that the levels of contamination in the two lakes are different.
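The decision rule can be sketched as a small function. The means and sample sizes below are taken from the answer, but the two variances are hypothetical placeholders, because the raw lake measurements are not included here; the resulting Z is therefore illustrative and is not expected to reproduce -28.19.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean_a, mean_b, var_a, var_b, n_a, n_b):
    """Z statistic for H0: mu_A = mu_B, given the two variances."""
    return (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)

# Means and sample sizes are from the text; the variances are HYPOTHETICAL
# placeholders, since the underlying data is not available in this answer.
z = two_sample_z(11.17, 11.9875, var_a=0.01, var_b=0.01, n_a=10, n_b=8)

critical = NormalDist().inv_cdf(0.975)  # two-sided 5% level, about 1.96
reject_h0 = abs(z) > critical
print(z, critical, reject_h0)
```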
Question nine
Since there were more heart attacks among the people taking the placebo than among those taking aspirin, it is clear that taking aspirin does not increase the chance of a heart attack.
Question ten
Computing the correlation coefficient,
r = 0.4838 (refer to the attached Excel sheet for the source data).
Height and salary are correlated, as shown by the above result. To confirm whether the relationship is due to chance or is significant, we use a t test as follows:
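t = r √((n – 2) / (1 – r^2))
= 0.4838 √(10 / (1 – 0.4838^2))
= 1.748
With n – 2 = 10 degrees of freedom, t = 1.75 is below the 5% critical values (1.812 one-tailed, 2.228 two-tailed), so p > 0.05 and the data do not show a significant relationship between a lawyer's height and salary at the 5% level.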
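The t statistic for a correlation coefficient can be sketched as below. The value r = 0.4838 is from the answer; the sample size n = 12 is an assumption, inferred because it is the value that reproduces t = 1.748, and is not stated in the text.

```python
from math import sqrt

def correlation_t(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))

# r is from the text; n = 12 is an ASSUMPTION (inferred, not stated)
t = correlation_t(0.4838, 12)
print(round(t, 3))  # 1.748
```

The result is then compared with the critical value of the t distribution with n - 2 degrees of freedom at the chosen significance level.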