Least Square Regression Line

Linear Regression

If our data shows a linear relationship between \(X\) and \(Y\), then the straight line that best describes the relationship is the regression line. The regression line is given by \(\hat{Y}=a+bX\).

Finding the value of b

The value of \(b\) can be calculated using either of the following formulae:

  • \(b=\frac{n\sum(x_iy_i)-(\sum x_i)(\sum y_i)}{n\sum(x_i^2)-(\sum x_i)^2}\)
  • \(b=\rho\frac{\sigma_Y}{\sigma_X}\), where \(\rho\) is the Pearson correlation coefficient, and \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\). A sketch using the first formula follows this list.
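
As a minimal NumPy sketch of the first formula (the sample data here is illustrative, chosen to match the sklearn example further below):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])
n = len(x)

# b = (n*sum(x*y) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
print(b)  # 0.8 for this data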

Finding the value of a

\(a=\bar{y}-b\cdot\bar{x}\), where \(\bar{x}\) is the mean of \(X\) and \(\bar{y}\) is the mean of \(Y\).
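
Continuing the sketch above, \(a\) follows directly from the two sample means:

a = np.mean(y) - b * np.mean(x)
print(a)  # 0.6 for this data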

Coefficient of determination (\(R^2\))

The coefficient of determination can be computed as \(R^2 = \frac{SSR}{SST}=1-\frac{SSE}{SST}\), where:

  • \(SST\) is the total sum of squares: \(SST=\sum (y_i-\bar{y})^2\)
  • \(SSR\) is the regression sum of squares: \(SSR=\sum (\hat{y_i}-\bar{y})^2\)
  • \(SSE\) is the error sum of squares: \(SSE=\sum (y_i-\hat{y_i})^2\)

If \(SSE\) is small relative to \(SST\) (equivalently, if \(R^2\) is close to 1), the regression line explains most of the variation in \(Y\), and the fit is good.
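
Continuing the same illustrative sketch, these sums of squares can be computed directly (y_hat holds the fitted values \(\hat{y_i}\)):

y_hat = a + b * x                        # fitted values on the regression line
sst = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
ssr = np.sum((y_hat - np.mean(y)) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)           # error sum of squares
print(ssr / sst, 1 - sse / sst)          # both equal R^2 (0.64 for this data)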

Linear Regression in Python

We can use the fit method of the sklearn.linear_model.LinearRegression class.

from sklearn import linear_model
import numpy as np

xl = [1, 2, 3, 4, 5]
x = np.asarray(xl).reshape(-1, 1)  # sklearn expects a 2-D array of shape (n_samples, n_features)
y = [2, 1, 4, 3, 5]

lm = linear_model.LinearRegression()
lm.fit(x, y)  # estimates a (lm.intercept_) and b (lm.coef_) by least squares

print(f'a = {lm.intercept_}')
print(f'b = {lm.coef_[0]}')
print("Where Y=a+b*X")
Output:

a = 0.5999999999999996
b = 0.8000000000000002
Where Y=a+b*X
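
Up to floating-point rounding, this agrees with the hand-computed values above. As a follow-up sketch, the fitted model can predict \(\hat{Y}\) for new values of \(X\) via predict (the input 6 below is illustrative):

x_new = np.asarray([6]).reshape(-1, 1)  # new X value, same 2-D shape as the training data
print(lm.predict(x_new))  # approximately [5.4], i.e. 0.6 + 0.8 * 6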