In this post we are going to demystify the learning algorithm of linear regression. We are going to to analyze simplest univariate case with single feature X where in previous example was temperature and output was cricket chirps/sec. Lets use same data with crickets to build *learning algorithm* and see if it produces similar hypothesis as in excel.

As you may already know from this example, we need to find linear equation parameters **θ**_{0} and **θ**_{1}, to fit line most optimally on given data set:

y = θ_{0} + θ_{1}** ****x**

x here is feature (temperature) and y – output value (chirps/sec). So how we are going to find parameters **θ**_{0} and **θ**_{1}? The whole point of learning algorithm is doing this iteratively. We need to find optimal **θ**_{0}, **θ**_{1} parameter values so that approximation line error from plotted training set is minimal. By doing successive corrections to randomly selected parameters we can find optimal solution. From statistics you probably know the *Least Mean Square *(LMS) algorithm. It uses gradient based method of steepest descent.

First of all we construct a cost function **J()** which calculates square error between suggested output and real output:

We need to minimize J(θ0, θ1) by adjusting θ0, θ1 parameters in several iterations. First of all we need to find derivatives of cost function for both theta parameters. And then update each parameter with some learning rate α.

repeat until converge {

}

Without proof **gradient descent** algorithm has to perform following update for every theta:

repeat until converge{

}

As you will see on every iteration, θ0, θ1 parameters on each iteration will get closer and closer to optimized values by finding local minimum in cost function J. For this example we will initiate α parameter and number of iterations by intuition.

In this and future example I am going to use Python based environment. For windows users it is best to download Anakonda package which already includes latest Python release, SciPi and other 400 popular Python packages including special machine learning tools (which we might use in future). It is open source, so everyone can get hands on it. It is quite new to me, so this is gonna be a learning process for me too. It also comes with powerful GUI spyder.

Lets start writing Python program gradually. First of all we need to read cricket data from file. I have saved it as CSV from excel:

First column is outputs (Y) ans second is inputs/features (X).

First of all we need to write couple functions: **cost** and **gradient_descent**:

def cost(x,y,theta): m = y.size #number of training examples predicted = x.dot(theta).flatten() sqErr = (predicted - y) ** 2 J = (1.0) / (2 * m) * sqErr.sum() return J def gradient_descent(x, y, theta, alpha, iterations): #gradient descent algorithm to find optimal theta values m = y.size J_theta_log = np.zeros(shape=(iterations+1, 3)) #store initial values in to log J_theta_log[0, 0] = cost(x, y, theta) J_theta_log[0,1] = theta[0][0] J_theta_log[0,2] = theta[1][0] # theta_log = np.zeros(shape=(iterations,2)) for i in range(iterations): #split equation in to several parts predicted = x.dot(theta).flatten() err1 = (predicted - y) * x[:, 0] err2 = (predicted - y) * x[:, 1] theta[0][0] = theta[0][0] - alpha * (1.0 / m) * err1.sum() theta[1][0] = theta[1][0] - alpha * (1.0 / m) * err2.sum() J_theta_log[i+1, 0] = cost(x, y, theta) J_theta_log[i+1,1] = theta[0][0] J_theta_log[i+1,2] = theta[1][0] return theta, J_theta_log

In gradient_descent function we also save cost and theta values to show in graph how they travel towards minimum point.

After less than 20 iterations we already got local minimum point:

*machine learned* linear regression hypothesis looks like:

**y = 0.0026 + 0.2081 • x**

and this is how it looks on training data graph:

And final test is to run hypothesis with some test data:

At temperature = 85F, predicted chirp frequency 17.687319

At temperature = 50F, predicted chirp frequency 10.405367

You may notice that machine learned formula parameters are somewhat different from excel trend formula, but when you plot data on top of it, you can see that red dots lie on top of trend.

We didn’t implemented feature x normalization to reduce calculation overhead. But this is more applicable to learning with multiple features where we would like to make them similar in scale. Next time we will continue to more complex linear regression having more than one feature in training example. Download python code here [linear regression.zip] to run in Anaconda.

Great information. I personally would appreciate more details on how cost function and gradient descent works. This part seems really condensed.

This is a simple optimization problem. I am planning on making short post about this.

Actually I have made a post some time ago where similar optimization problem is solved: http://www.scienceprog.com/gauss-seidel-optimization-routines/

Gradient Descent algorithm is similar.If you have any further questions don’t hesitate to ask.