Fundamentals of Linear Regression
Definition
Linear regression is a method for relating two phenomena. It identifies a rule from real data that can then be used to predict new values. There are algorithms and packages that perform linear regression in a simple way, but we don’t always understand what’s going on behind the curtains. The purpose of this presentation is to expose the mathematical foundations of linear regression.
The Method
The method can be divided into three steps:
- Initial view of data;
- Determination of parameters;
- Generation of the curve from the parameters found.
Initial view of data
The objective of this step is to verify whether the data follow approximately linear behavior. If they do not, the linear regression method should not be applied. Let’s consider the following dataset.
We can plot these data and check their behavior. The image below shows approximately linear behavior, so applying the linear regression technique is justified.
In this repository I have provided Python code that fits a curve using linear regression. As I said, the mathematical foundations are hidden, and we have no idea of what happens internally. The code generates the image below.
Determining the parameters
To determine the parameters a and b of the line y = a + bx, we will use the least squares method. In this method, a and b are chosen so that the sum of the squared differences between the observed values of Y and the values given by the fitted line for the same values of X is minimal. Mathematically this idea can be represented as:
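Assuming the fitted line is written as $y = a + bx$, with $a$ the linear coefficient (intercept) and $b$ the angular coefficient (slope), and with $(x_i, y_i)$ denoting the $n$ observations, the quantity to be minimized can be written as follows. The symbol $S$ is introduced here only for convenience.

$$
S(a, b) = \sum_{i=1}^{n} \left( y_i - a - b\,x_i \right)^2
$$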
To find the minimum, we differentiate the above expression with respect to the parameters a and b and set each derivative equal to zero.
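Writing the sum of squares as $S(a, b) = \sum_{i=1}^{n} (y_i - a - b\,x_i)^2$ (with $y = a + bx$ as the assumed form of the line), the two conditions can be sketched as:

$$
\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} \left( y_i - a - b\,x_i \right) = 0 \qquad \text{(I)}
$$

$$
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i \left( y_i - a - b\,x_i \right) = 0 \qquad \text{(II)}
$$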
Knowing that n is the number of observations, we can write equations (I) and (II) as equations (III) and (IV) as follows.
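Under the assumption that the line is $y = a + bx$, distributing the sums in the two zero-derivative conditions and using $\sum_{i=1}^{n} a = n\,a$ gives the pair of equations usually called the normal equations:

$$
\sum_{i=1}^{n} y_i = n\,a + b \sum_{i=1}^{n} x_i \qquad \text{(III)}
$$

$$
\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 \qquad \text{(IV)}
$$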
To determine the parameter b, we divide the first equation by n and isolate a in order to substitute it in the second equation:
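With the means $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ (notation assumed here), dividing $\sum y_i = n\,a + b \sum x_i$ by $n$ gives $\bar{y} = a + b\,\bar{x}$, so that:

$$
a = \bar{y} - b\,\bar{x}
$$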
Substituting in the second equation:
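As a sketch of this step, inserting $a = \bar{y} - b\,\bar{x}$ into $\sum x_i y_i = a \sum x_i + b \sum x_i^2$ and using $\sum x_i = n\,\bar{x}$ (where $\bar{x}$ and $\bar{y}$ are the assumed notation for the means of X and Y):

$$
\sum_{i=1}^{n} x_i y_i = \left( \bar{y} - b\,\bar{x} \right) n\,\bar{x} + b \sum_{i=1}^{n} x_i^2
$$

Collecting the terms in $b$ gives:

$$
b = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}
$$

The denominator is zero only when all values of X are identical, in which case no unique line exists.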
The linear coefficient a of the line then follows from the relation already obtained above, $a = \bar{y} - b\,\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the means of X and Y.
We can calculate these parameters directly from the data, without using the scipy package, with the code below:
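A minimal sketch of such a calculation in plain Python follows. It is not the repository's own code: the function name `fit_line` and the sample data are hypothetical, and the line is assumed to be y = a + bx, with b computed from the closed-form least squares expression and a from a = mean(y) - b * mean(x).

```python
def fit_line(x, y):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Sums that appear in the normal equations
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)

    denom = sum_xx - n * x_mean ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b = (sum_xy - n * x_mean * y_mean) / denom  # angular coefficient (slope)
    a = y_mean - b * x_mean                     # linear coefficient (intercept)
    return a, b


# Hypothetical data, chosen to lie exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(x, y)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because the sample points lie exactly on a line, the recovered coefficients should match that line up to floating-point rounding, which makes the function easy to sanity-check before using it on real data.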
And finally, after obtaining the linear coefficient a and the angular coefficient b, we can plot the fit curve.
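One possible way to generate the fitted curve is sketched below. The helper names `predict` and `plot_fit` and the sample coefficients are hypothetical, and the plotting helper assumes matplotlib is installed; it is imported only when the plot is actually requested.

```python
def predict(a, b, xs):
    """Evaluate the fitted line y = a + b*x at each x in xs."""
    return [a + b * x for x in xs]


def plot_fit(x, y, a, b):
    """Scatter the observations and draw the fitted line (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily, only needed for plotting

    xs = [min(x), max(x)]  # two points are enough to draw a straight line
    plt.scatter(x, y, label="observed data")
    plt.plot(xs, predict(a, b, xs), color="red",
             label=f"y = {a:.3f} + {b:.3f}x")
    plt.xlabel("X")
    plt.ylabel("Y")
    plt.legend()
    plt.show()


# Hypothetical coefficients and x values, for illustration only
a, b = 1.0, 2.0
fitted = predict(a, b, [0.0, 1.0, 2.0, 3.0])
print(fitted)
```

Calling `plot_fit(x, y, a, b)` with the observed data and the computed coefficients would then display the points together with the fitted line.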