Simple Linear Regression and Predictive Modeling
The data for Assignment #10 is the Nutrition Study data. It is a 16-variable dataset with n=315 records that you have seen and worked with previously. The data was obtained from medical record information and observational self-report of adults. The dataset consists of categorical, continuous, and composite scores of different types. A data dictionary is not available for this dataset, but for most of the variables the qualities measured can easily be inferred from the variable and category names. For the composite variables, higher scores translate into having more of that quality. The QUETELET variable is essentially a body mass index. It can be googled for more detailed information. It is the ratio of BodyWeight (in lbs) to (Height (in inches))^2. The ratio is then rescaled by an adjustment factor so that the numbers become meaningful. Specifically, a QUETELET above 25 is considered overweight, while a QUETELET above 30 is considered obese. There is no other information available about this data.
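As a rough illustration of the QUETELET calculation described above, here is a minimal R sketch. It assumes the conventional imperial-unit factor of 703, which this handout does not state, and the function names quetelet_index and quetelet_class are hypothetical helpers, not part of the dataset or of any package. The cutoffs of 25 and 30 are the ones given in the text.

```r
# Body mass index from imperial units: factor * weight (lb) / height (in)^2.
# ASSUMPTION: 703 is the conventional factor; the handout only says
# "an adjustment factor" and does not give its value.
quetelet_index <- function(weight_lb, height_in, factor = 703) {
  factor * weight_lb / height_in^2
}

# Classify with the cutoffs from the text: above 25 overweight, above 30 obese.
# cut() uses right-closed intervals, so exactly 25 is "not overweight".
quetelet_class <- function(q) {
  cut(q,
      breaks = c(-Inf, 25, 30, Inf),
      labels = c("not overweight", "overweight", "obese"))
}

# Three made-up adults, all 68 inches tall.
q <- quetelet_index(weight_lb = c(150, 180, 220), height_in = c(68, 68, 68))
print(round(q, 1))
print(quetelet_class(q))
```

The dataset already contains QUETELET, so nothing here needs to be computed for the assignment; the sketch only shows where the numbers come from.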
1) Download the Nutrition Data and read it into RStudio. We will work with the entire data set for this assignment.
2) There are 11 variables that are clearly continuous variables. For this assignment, you should consider the Quetelet variable to be the dependent response variable (Y). All other continuous variables should be considered independent or explanatory variables. Make a scatterplot of each continuous variable (X) with Y. You should have 10 different scatterplots. Obtain Pearson Product Moment Correlations for each X variable with Y. You can do this in table form or individually; it does not matter. Either way, combine the scatterplot with the correlation information and discuss the appropriateness of simple linear regression for each scatterplot. Which variable seems most predictive of Quetelet (Y)?
3) Often, the explanatory variables are correlated among themselves. Obtain a standard correlation matrix for all of the explanatory variables. Then, obtain a heat matrix of the correlations (see the correlation classroom for an example of this). Are there groups, or subsets, of explanatory variables that seem to clump together in that they are highly correlated among themselves?
4) Use the explanatory variable that is most highly correlated with Y, and fit a simple linear regression model. Call this Model 1. Report the prediction equation for Model 1, interpret the coefficients, and report the R-squared statistic as a measure of goodness of fit. Set up and report the results of the hypothesis test for the slope parameter (beta1).
5) Pick one of the remaining explanatory variables. Add that variable into the regression Model 1 from task 4). Re-fit the linear regression model (note, it is now a multiple regression model – why?). Call this Model 2. Report the prediction equation for Model 2, interpret the coefficients, report and interpret the R-squared statistic. How much has R-squared changed from Model 1 to Model 2? What is this change in R-squared uniquely attributable to? Does this change seem to have a practical meaning or value? Discuss.
6) For the remainder of the explanatory variables, add them into Model 2 one at a time so that the model becomes 1 variable larger at each step. Note the R-squared value and the change in R-squared between each subsequent model. Which explanatory variables seem to contribute a lot (or a practical amount) to predicting Y, and which explanatory variables contribute little or nothing?
7) Re-fit a multiple regression model using only those explanatory variables from task 6 that seem to contribute a lot, or a practical amount, to predicting Y. Call this the Final Model. Report the prediction equation for the Final Model, interpret the coefficients, and report the R-squared statistic. Does this model seem to be meaningful, in a larger medical scope of things, for predicting Quetelet? Remember, a regression model is also information about the relationships between variables, so it should have meaning and be part of the data’s story. Discuss. Is this modeling done? Or is there something else you would want to do to model this data? Write up your synthesis description of what this data set seems to be saying (up to this point) and where we should go from here.
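For tasks 2 and 3, the mechanics might look something like the base R sketch below. It runs on simulated stand-in data, not the Nutrition Study file: the data frame nutrition, its column names (Age, Calories, Fiber, Fat, Quetelet) and every number in it are assumptions made up for illustration. With the real data you would substitute the actual 10 continuous explanatory variables.

```r
set.seed(10)
n <- 315

# Simulated stand-in for the assignment data.
# ASSUMPTION: all column names and values are hypothetical.
nutrition <- data.frame(
  Age      = round(runif(n, 19, 83)),
  Calories = rnorm(n, 1800, 600),
  Fiber    = rnorm(n, 13, 5)
)
nutrition$Fat      <- 0.04 * nutrition$Calories + rnorm(n, 0, 12)
nutrition$Quetelet <- 22 + 0.05 * nutrition$Fat - 0.10 * nutrition$Fiber +
  rnorm(n, 0, 5)

y_name  <- "Quetelet"
x_names <- setdiff(names(nutrition), y_name)

# Task 2: Pearson correlation of each X with Y, sorted by absolute size.
r_xy <- sapply(x_names, function(x) cor(nutrition[[x]], nutrition[[y_name]]))
r_xy <- r_xy[order(abs(r_xy), decreasing = TRUE)]
print(round(r_xy, 3))

# One scatterplot per X, with r in the title and the least-squares line added.
old_par <- par(mfrow = c(2, 2))
for (x in x_names) {
  plot(nutrition[[x]], nutrition[[y_name]], xlab = x, ylab = y_name,
       main = sprintf("%s vs %s (r = %.2f)", y_name, x, r_xy[[x]]))
  abline(lm(nutrition[[y_name]] ~ nutrition[[x]]), col = "red")
}
par(old_par)

# Task 3: correlation matrix of the explanatory variables, then a heat map.
r_xx <- cor(nutrition[x_names])
print(round(r_xx, 2))
heatmap(r_xx, symm = TRUE, scale = "none", margins = c(8, 8))
```

The heatmap() call reorders the variables by hierarchical clustering, which is one way to make correlated groups visible. The heat matrix shown in the correlation classroom may have been produced differently.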
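For tasks 4 and 5, a minimal sketch of fitting Model 1 and Model 2 and comparing their R-squared values is given below. The data are again simulated, and the variable names Fat and Fiber are hypothetical placeholders for whichever explanatory variables you end up choosing. Nothing here reflects results from the actual Nutrition Study data.

```r
set.seed(10)
n <- 315

# Simulated stand-in data. ASSUMPTION: names and values are hypothetical.
dat <- data.frame(Fat = rnorm(n, 77, 30), Fiber = rnorm(n, 13, 5))
dat$Quetelet <- 22 + 0.05 * dat$Fat - 0.10 * dat$Fiber + rnorm(n, 0, 5)

# Task 4: Model 1, simple linear regression of Y on a single X.
model1 <- lm(Quetelet ~ Fat, data = dat)
s1 <- summary(model1)
print(coef(model1))     # intercept (beta0) and slope (beta1) estimates
print(s1$r.squared)     # proportion of variance in Y explained by Model 1

# Slope test, H0: beta1 = 0 versus Ha: beta1 != 0.
# The row holds the estimate, standard error, t statistic, and p-value.
print(s1$coefficients["Fat", ])

# Task 5: Model 2 adds a second explanatory variable to Model 1.
model2 <- update(model1, . ~ . + Fiber)
s2 <- summary(model2)
print(coef(model2))

# Change in R-squared from Model 1 to Model 2.
delta_r2 <- s2$r.squared - s1$r.squared
print(c(model1 = s1$r.squared, model2 = s2$r.squared, change = delta_r2))

# Optional: nested F test for the added variable.
print(anova(model1, model2))
```

R-squared never decreases when a variable is added to a least-squares model, so the size of the change, not its sign, is what needs interpreting.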
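For tasks 6 and 7, adding variables one at a time and tracking R-squared could be sketched as follows. The data are simulated, the variable names and the entry order in add_order are hypothetical, and the 0.01 cutoff for a "practical" change in R-squared is an arbitrary assumption for illustration only. Deciding what counts as practical is part of the assignment.

```r
set.seed(10)
n <- 315

# Simulated stand-in data. ASSUMPTION: names and values are hypothetical.
dat <- data.frame(
  Fat     = rnorm(n, 77, 30),
  Fiber   = rnorm(n, 13, 5),
  Age     = round(runif(n, 19, 83)),
  Alcohol = rexp(n, rate = 1 / 3)
)
dat$Quetelet <- 20 + 0.08 * dat$Fat - 0.30 * dat$Fiber + rnorm(n, 0, 4)

y_name    <- "Quetelet"
add_order <- c("Fat", "Fiber", "Age", "Alcohol")  # order in which terms enter

# Task 6: grow the model by one variable per step and record R-squared.
r2 <- numeric(length(add_order))
for (k in seq_along(add_order)) {
  f <- reformulate(add_order[1:k], response = y_name)
  r2[k] <- summary(lm(f, data = dat))$r.squared
}
steps <- data.frame(added = add_order,
                    r_squared = r2,
                    change = c(r2[1], diff(r2)),
                    stringsAsFactors = FALSE)
print(steps, digits = 3)

# Task 7: re-fit using only the variables whose increment passed the cutoff.
# ASSUMPTION: 0.01 is an arbitrary illustrative threshold.
keep <- steps$added[steps$change >= 0.01]
final_model <- lm(reformulate(keep, response = y_name), data = dat)
print(summary(final_model))
```

The change attributed to each variable depends on the order in which the variables enter, because correlated explanatory variables share explained variance. Trying a different order is a reasonable check before settling on the Final Model.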