Simple Regression and Correlation Blog

In this blog post, you’ll learn how to implement or perform a simple linear regression and correlation coefficient.

SIMPLE LINEAR REGRESSION

Simple Linear Regression is a statistical method that is used to estimate or predict the relationship between two variables:

• Dependent variable (criterion variable) is the main factor you are trying to understand and predict.
• Independent variables (explanatory variables, or predictors) are the factors that might influence the dependent variable.

ASSUMPTIONS ASSOCIATED WITH A LINEAR REGRESSION:

1. Linearity: The relationship between X and the mean of Y is linear.

2. Homoscedasticity: The variance of residual is the same for any value of X.

3. Independence: Observations are independent of each other.

4. Normality: For any fixed value of X, Y is normally distributed.

CORRELATION COEFFICIENT

Correlation Coefficient is a statistical method that is used to test and measures the statistical relationship, or association, between two continuous variables.

ASSUMPTIONS ASSOCIATED WITH CORRELATION COEFFICIENT:

1. Two variables should be measured on a continuous scale.

2. Two continuous variables should be paired.

3. There should be independence of cases.

4. There should be a linear relationship between your two continuous variables.

EXPLORATION THROUGH SCATTER PLOT

A scatter plot is a special type of graph used to visually inspect the data to see whether X and Y are linearly related. It is designed to show the relationship between two variables in a simple regression analysis.

1. Linear – a relationship is linear when the points on a scatterplot follow a somewhat straight line pattern.

2. Non-linear – a relationship is non-linear when the points on a scatterplot follow a pattern but not a straight line.

3. No Correlation – a relationship has no correlation when the points on a scatterplot do not show any pattern.

A correlation exists between two variables when one of them is related to the other in some way. A scatterplot can identify several different types of relationships between two variables

1. Positive Correlation – as the x variable increases, so does the y variable.

2. Negative Correlation – as the x variable increase, the y variable decreases.

3. No Correlation – there is no evident relationship between the two variables.

IMPLEMENTING SIMPLE LINEAR REGRESSION

To perform the Simple Linear Regression in Excel, you need a free Microsoft add-in called the Analysis ToolPak. (Load the Data Analysis ToolPak in Excel).

Example: Car Plant Electricity Usage

The manager of a car plant wishes to investigate how the plant’s electricity usage depends upon the plant’s production. Calculate the relationship between the variable.

The scatterplot for Car Plant Electronic Usage shows a positive linear relationship, with no extreme outliers or potentially influential observations.

Step 1: On the Data tab, in the Analysis group, click the Data Analysis button. Select Regression and click OK.

Step 2: In the Regression dialog box, configure the following settings:

• Select your Input Y Range:  which is your dependent variable, You can do this in two ways: either select the data in the worksheet or type the location of your data into the “Input Y Range box.
• Select your Input X Range:  which is your independent variable, select the data in the worksheet or type the location of your data into the “Input X Range box.”
• Check the Labels box if there are headers at the top of your x and y ranges.
• Select the location where you want your Output range

Step 3: Click OK and observe the regression analysis output created by Excel.

Regression analysis output: Regression Statistics

1. Multiple R –  is the absolute value of the correlation coefficient of the two variables (X and Y) being evaluated. The larger the absolute value, the stronger the relationship. The Multiple-R value represents the simple correlation and is 0.896, which indicates a high degree of correlation.

2. R-Square –  It is the Coefficient of Determination, it indicates how well the model or regression line “fits” the data. The higher the R-squared, the better the model fits your data. In this case, 80.21% is calculated, which is very large.

3. Adjusted R Square – Adjusted R-Squared is best for analyzing multiple regression output.

4. Standard Error – The standard error in the regression output is an important number to understand when interpreting regression data. It is the average distance that the observed values fall from the regression line. Approximately 95% of the observations should fall within +/- two standard error of the regression which is also a quick approximation of a 95% prediction interval.

5. Observations – It is simply the number of observations in your model.

Regression analysis output: ANOVA

1. df – stands for degrees of freedom.

2. SS – is the sum of the squares. The smaller the Residual SS compared with the Total SS, the better your model fits the data.

3. MS –  or the Mean Square.

4. F – is the F statistic or F-test for the null hypothesis. It is used to test whether any of the independent variables in a multiple linear regression model are significant.

5. Significance F – gives an idea of how reliable your results are. If Significance F is less than 0.05, your model is better. If it is greater than 0.05, you’d probably better choose another independent variable.

This table indicates that the regression model predicts the dependent variable significantly well. Since the value of the Significance F (0.0008) is less than 0.05, the model is better and indicates that, overall, the regression model statistically significantly predicts the outcome variable.

Regression analysis output: Coefficient

Testing the Hypothesis of No Linear Relationship

1. Null Hypothesis: H0 : β = 0 (There is no linear relationship)

2. Alternative Hypothesis:  Ha : β   0 (There is a linear relationship)

The p-value tests the null hypothesis that the coefficient is equal to zero (no effect). A p-value less than the significance level of 0.05 indicates that you can reject the null hypothesis; otherwise, the null hypothesis is not rejected. In our case, the calculated p-value is 0.000082, which indicates the null hypothesis is rejected and concludes that there is a significant linear relationship between x and y. In general,  this data provides very strong evidence to conclude that the distribution of electricity usage depends upon the level of production.

Regression analysis output: Residuals

If you want to add Residuals, click the Residual checkbox in the Regression dialog box. The residuals show you how far away the actual data points from the predicted data points.

IMPLEMENTING CORRELATION COEFFICIENT

To perform Correlation, you need a free Microsoft add-in called the Analysis ToolPak. (Load the Data Analysis ToolPak in Excel).

Example:

The table below shows the height (x) in inches, and the pulse rate (y) per minute, for 9 people. Find the correlation coefficient and interpret your result.

The scatterplot shows no correlation, means that there is no evident relationship between the two variables.

Step 1: On the Data tab, in the Analysis group, click the Data Analysis button. Select Correlation and click OK.

Step 2: In the Correlation dialog box, specify your Input Range and Output Range.

Step 3: Click OK and observe the Correlation analysis output created by Excel.

Interpreting Correlation Analysis Results

A correlation of 1 shows a perfect positive correlation, a correlation of -1 shows a perfect negative correlation, and a correlation of 0 shows no relationship between the movement of the two variables. The correlation results (-0.15) appear to be extremely weak, which indicates that there is no relationship between the variables.

1. Jay-ar says: