By Kuya Kevin
R can be pretty useful in doing statistics. It’s advisable for most classroom instructors to integrate this tool because it’s free and open-sourced. One of the most used tool in R is the lm() function to run a linear regression. (You can run the code ?lm for a full documentation of the function.
Say, that there have been a chemical factory accident and radiation levels (measured in Sievert) are measured at 7 separate distances from the factory. We wish to find out whether there is a linear relationship between the distance and radioactive levels; and create a prediction tool where we can forecast the radioactive levels for a given distance.
We have this data:
We begin by writing the data in R. We can set the vector names as x and y for us to simply call those vectors:
x <- c(10,20,50,100,200,500,1000) y <- c(155,121,110,84,45,10,4)
It has been of practice that we assign the linear regression to a variable. In our case, we name our variable “model”. The code for running the linear regression will be:
This basically tells R to find the least squares regression model and store it in “model”. To print the summary of the regression, we simply code:
And the results are:
Call: lm(formula = RadiationLevels ~ Distance) Residuals: 1 2 3 4 5 6 7 45.019 12.350 5.342 -14.004 -39.696 -34.774 25.763 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 111.3112 16.5231 6.737 0.00109 ** Distance -0.1331 0.0383 -3.475 0.01776 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 34.21 on 5 degrees of freedom Multiple R-squared: 0.7072, Adjusted R-squared: 0.6486 F-statistic: 12.07 on 1 and 5 DF, p-value: 0.01776
Interpreting the results
The first thing we look for in running the regression is the model itself. It can be done by using the coefficients estimate. With this, the model will be:
Radioactive level = 111.3112 - 0.1331*Distance
Testing the Overall Model
Once we have established the model, the second thing we need to assess is the validity of the overall model. This can be done by testing the F-stat of the overall model. Since we are using R, we simply look at the p-value and check whether it is lower than our set level of significance.
In this case, the p-value 0.01776 is lower than our significance level 0.05. With this fact, we have gathered sufficient evidence to conclude that the model is predicting significant proportion of the variation of radioactive levels.
The next thing that needs to be done if we find the model as significant is to interpret the coefficient of determination. It gives us the proportion of variation accurately explained by the model in all our observations. In this case, we interpret R-squared as “70.72% of the variation in the dependent variable are accurately predicted by the model”.
After interpreting the results of the entire regression and testing whether the model is significant, we proceed by analyzing the residuals. We can do this by simply plotting the residuals via plot() function.
The first result is the “Residual vs Fit Plot”. The thing we need to check here is that there should be no discernible pattern with the plots. Given that residuals represent the randomness in a regression, it should not form any correlation with any predictor. If a pattern is noticed, we will need a transformation in any of the variable involved since this indicates non-linearity of the relationship between the dependent and independent variable.
The second plot shows the normal QQ plot of the residuals. As an assumption in running a regression, the residuals should be approximately normal. If the points deviate significantly in the normal QQ line, this assumption will be violated making the regression results problematic.
The third plot will show the scale location. This tests the homoskedasticity assumption of regression. The main indicator is the scale line. If it’s approximately flat, we can conclude that the variation is distributed almost equally along the regression. A sloped line indicates that the distribution is not homogenous implying problems on heteroskedasticity.
The last plot will show the effects of any outlier. If some points are outside the leverage plots, we can say that an outlier significantly affected the whole model.
To show all graphs all at once, we can use the par(mfrow=c(2,2)) script:
Once all this test are satisfied, it is only by then that we conclude the results as valid.