Day 8 - Least Square Regression Line

Least Square Regression Line

Problem

A group of five students enrolls in Statistics immediately after taking a Math aptitude test. Each student's Math aptitude test score, \(x\), and Statistics course grade, \(y\), can be expressed as the following list \((x,y)\) of points:

  • \((95, 85)\)
  • \((85, 95)\)
  • \((80, 70)\)
  • \((70, 65)\)
  • \((60, 70)\)

If a student scored an 80 on the Math aptitude test, what grade would we expect them to achieve in Statistics? Determine the equation of the best-fit line using the least squares method, then compute and print the value of \(y\) when \(x=80\).
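For reference, the solution below relies on the usual closed form of the least squares line \(y=a+bx\). This is a brief sketch of the formulas, assuming \(p\) denotes the Pearson correlation coefficient, \(\sigma_X\) and \(\sigma_Y\) the population standard deviations, and \(\bar{x}\), \(\bar{y}\) the means (notation chosen here for illustration):

$$b=p\cdot\frac{\sigma_Y}{\sigma_X},\qquad a=\bar{y}-b\,\bar{x}$$

For the data above, \(\bar{x}=78\) and \(\bar{y}=77\), which gives \(b=\frac{470}{730}\approx 0.644\) and \(a\approx 26.781\).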

X = [95, 85, 80, 70, 60]
Y = [85, 95, 70, 65, 70]
n = len(X)

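def cov(X, Y, n):
    # Population covariance of X and Y
    x_mean = 1/n*sum(X)
    y_mean = 1/n*sum(Y)
    return 1/n*sum([(X[i]-x_mean)*(Y[i]-y_mean) for i in range(n)])

def stdv(X, mu_x, n):
    # Population standard deviation of X around its mean mu_x
    return (sum([(x - mu_x)**2 for x in X]) / n)**0.5

def pearson_1(X, Y, n):
    # Pearson correlation coefficient: cov(X, Y) / (std_x * std_y)
    std_x = stdv(X, 1/n*sum(X), n)
    std_y = stdv(Y, 1/n*sum(Y), n)
    return cov(X, Y, n)/(std_x*std_y)


# Slope of the regression line: b = pearson * std_y / std_x
b = pearson_1(X, Y, n)*stdv(Y, sum(Y)/n, n)/stdv(X, sum(X)/n, n)
# Intercept: a = y_mean - b * x_mean
a = sum(Y)/n - b*sum(X)/n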

print(f"If a student scored 80 on the math test, he would most likely score a {round(a+80*b,3)} in statistics")
If a student scored 80 on the math test, he would most likely score a 78.288 in statistics
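
As a cross-check, the same slope can be obtained without going through the Pearson coefficient, since \(p\cdot\sigma_Y/\sigma_X\) simplifies to the sum of cross-deviations divided by the sum of squared \(x\)-deviations. The sketch below is only an illustration; the helper name `fit_line` is hypothetical and not part of the original solution.

```python
# Cross-check of the regression line using the direct slope formula.
X = [95, 85, 80, 70, 60]
Y = [85, 95, 70, 65, 70]

def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope

intercept, slope = fit_line(X, Y)
prediction = intercept + slope * 80
print(round(prediction, 3))  # 78.288
```

Both approaches should agree up to floating-point rounding, because the factors of \(1/n\) and the standard deviations cancel in the slope.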

Pearson correlation coefficient

Problem

The regression line of \(y\) on \(x\) is \(3x+4y+8=0\), and the regression line of \(x\) on \(y\) is \(4x+3y+7=0\). What is the value of the Pearson correlation coefficient?

Mathematical explanation

The two regression lines form the system:

$$ \left\{\begin{array}{ll} 3x+4y+8=0 & (1)\\ 4x+3y+7=0 & (2) \end{array} \right. $$

Solving (1) for \(y\) and (2) for \(x\) gives:

$$ \left\{\begin{array}{ll} y=-2-\frac{3}{4}x & (1)\\ x=-\frac{7}{4}-\frac{3}{4}y & (2) \end{array} \right. $$

So the slope of the regression of \(y\) on \(x\) is \(b_1=-\frac{3}{4}\), and the slope of the regression of \(x\) on \(y\) is \(b_2=-\frac{3}{4}\).
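
The rearrangement above can be sketched in code for any line written as \(Ax+By+C=0\). The function names `slope_y_on_x` and `slope_x_on_y` are hypothetical helpers introduced for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def slope_y_on_x(A, B):
    # A*x + B*y + C = 0 solved for y gives y = -(C/B) - (A/B)*x
    return Fraction(-A, B)

def slope_x_on_y(A, B):
    # A*x + B*y + C = 0 solved for x gives x = -(C/A) - (B/A)*y
    return Fraction(-B, A)

b1 = slope_y_on_x(3, 4)  # from 3x + 4y + 8 = 0
b2 = slope_x_on_y(4, 3)  # from 4x + 3y + 7 = 0
print(b1, b2)  # -3/4 -3/4
```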

We now apply the formula for the Pearson coefficient:

  • let \(p\) be the Pearson correlation coefficient
  • let \(\sigma_X\) be the standard deviation of \(x\)
  • let \(\sigma_Y\) be the standard deviation of \(y\)

We then have

$$ \left\{\begin{array}{ll} p=b_1\left(\frac{\sigma_X}{\sigma_Y}\right) & (1)\\ p=b_2\left(\frac{\sigma_Y}{\sigma_X}\right) & (2) \end{array} \right. $$
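
These two relations can be justified from the standard definitions of the regression slopes and of the correlation coefficient, written here with \(\operatorname{cov}(X,Y)\) for the covariance (a symbol not used elsewhere in this post):

$$b_1=\frac{\operatorname{cov}(X,Y)}{\sigma_X^2},\qquad b_2=\frac{\operatorname{cov}(X,Y)}{\sigma_Y^2},\qquad p=\frac{\operatorname{cov}(X,Y)}{\sigma_X\sigma_Y}$$

$$b_1\left(\frac{\sigma_X}{\sigma_Y}\right)=\frac{\operatorname{cov}(X,Y)}{\sigma_X\sigma_Y}=p,\qquad b_2\left(\frac{\sigma_Y}{\sigma_X}\right)=\frac{\operatorname{cov}(X,Y)}{\sigma_X\sigma_Y}=p$$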

Multiplying these two equations together, the standard deviations cancel and we get

$$p^2=b_1\cdot b_2$$
$$p^2=\left(-\frac{3}{4}\right)\left(-\frac{3}{4}\right)$$
$$p^2=\frac{9}{16}$$

Finally we get \(p=-\frac{3}{4}\) or \(p=\frac{3}{4}\).

Since both regression slopes are negative, \(X\) and \(Y\) are negatively correlated, so \(p=-\frac{3}{4}\).
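
The final step can be sketched numerically as follows. The function `pearson_from_slopes` is a hypothetical helper written for this example; it assumes the two slopes share the same sign, which is always the case for a pair of regression lines fitted to the same data.

```python
import math

def pearson_from_slopes(b1, b2):
    """Recover the Pearson coefficient from the two regression slopes."""
    product = b1 * b2  # equals p squared
    if product < 0:
        raise ValueError("regression slopes must share the same sign")
    magnitude = math.sqrt(product)
    # The coefficient takes the sign of the slopes
    return -magnitude if b1 < 0 else magnitude

p = pearson_from_slopes(-3 / 4, -3 / 4)
print(p)  # -0.75
```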