
Day 7: Gradient Descent for multiple linear regression



Today, we will put together what we've previously learned to implement gradient descent for multiple linear regression using vectorization.


To recap


Parameters: w_1, ..., w_n, and b


Model: f_{w,b}(x) = w · x + b = w_1*x_1 + w_2*x_2 + ... + w_n*x_n + b


Cost function: J(w,b) = (1/(2m)) * Σ_{i=0}^{m-1} (f_{w,b}(x^(i)) - y^(i))^2

Normal Equation

Before moving on, a quick note on an alternative way to find w and b for linear regression. This method is called the normal equation (a small sketch follows this section):

  • only used for linear regression

  • solves for w and b directly, without iteration

Disadvantages:

  • doesn't generalize to other learning algorithms

  • slow when the number of features is large (> 10,000)

What you need to know:

  • the normal equation method may be used in machine learning libraries that implement linear regression

  • gradient descent is the recommended method for finding the parameters w, b
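
For reference, here is a minimal sketch of what solving the normal equation could look like in NumPy. The data below is a made-up toy example (not from the course), the bias b is handled by appending a column of ones to X, and np.linalg.lstsq is used instead of an explicit matrix inverse for numerical stability:

import numpy as np

# hypothetical toy data: 3 examples, 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])
y = np.array([5.0, 4.0, 7.0])

# append a column of ones so the last parameter plays the role of the bias b
X_aug = np.c_[X, np.ones(X.shape[0])]

# normal equation: theta = (X^T X)^(-1) X^T y, solved here via least squares
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

w, b = theta[:-1], theta[-1]
print(f"w = {w}, b = {b:0.2f}")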


Vector-vector dot product

The dot product is a mainstay of linear algebra and NumPy. For two vectors a and b of length n, it is defined as:

a · b = Σ_{i=0}^{n-1} a_i * b_i

The dot product multiplies the values in two vectors element-wise and then sums the result. The vector dot product requires the dimensions of the two vectors to be the same.

Let's implement our own version of the dot product below, using a for loop, as a function which returns the dot product of two vectors (assuming both a and b have the same shape):


import numpy as np

def my_dot(a, b):
    """Compute the dot product of two 1-D vectors using a loop."""
    x = 0
    for i in range(a.shape[0]):
        x = x + a[i] * b[i]
    return x

# test 1-D
a = np.array([1, 2, 3, 4])
b = np.array([-1, 4, 3, 2])
my_dot(a, b)
# result of my_dot(a, b) = 24

Note that the dot product is expected to return a scalar value.


Let's try the same operations using np.dot:

# test 1-D
a = np.array([1, 2, 3, 4])
b = np.array([-1, 4, 3, 2])
c = np.dot(a, b)
c = np.dot(b, a)
# result of both c would be 24
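
Since vectorization is the theme of this post, below is a rough sketch of how one might time the looped my_dot against the vectorized np.dot on large vectors (the vector size is an arbitrary choice, and the exact timings will vary by machine):

import time
import numpy as np

np.random.seed(1)
a = np.random.rand(10_000_000)   # large vectors make the speed difference visible
b = np.random.rand(10_000_000)

tic = time.time()
c = np.dot(a, b)                 # vectorized version
toc = time.time()
print(f"np.dot duration: {1000*(toc-tic):.2f} ms")

tic = time.time()
c = my_dot(a, b)                 # my_dot, the loop version defined above
toc = time.time()
print(f"my_dot duration: {1000*(toc-tic):.2f} ms")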

Compute Cost with Multiple Variables

The equation for the cost function with multiple variables is:

J(w,b) = (1/(2m)) * Σ_{i=0}^{m-1} (f_{w,b}(x^(i)) - y^(i))^2

where:

f_{w,b}(x^(i)) = w · x^(i) + b

In contrast to the previous single-feature version, w and x^(i) are vectors rather than scalars, supporting multiple features. Below is an implementation of the above equations:

def compute_cost(X, y, w, b):
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b          # prediction for example i
        cost = cost + (f_wb_i - y[i]) ** 2    # squared error
    cost = cost / (2 * m)
    return cost
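
Since the goal of this post is vectorization, here is a sketch of a fully vectorized alternative that computes the same cost without an explicit loop (the name compute_cost_vectorized is my own; it assumes X is an (m, n) array, y an (m,) array, and w an (n,) array, as above):

def compute_cost_vectorized(X, y, w, b):
    m = X.shape[0]
    f_wb = X @ w + b                          # predictions for all m examples at once
    cost = np.sum((f_wb - y) ** 2) / (2 * m)  # sum of squared errors over 2m
    return cost
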
Gradient descent with Multiple Variables

Please note that I use the terms partial derivative, derivative term, and derivative symbol interchangeably in these posts, as was done in the course; they refer to the same thing. Strictly speaking, in mathematics "partial derivative" refers to functions of more than one variable, while "derivative" refers to functions of a single variable.

Gradient descent for multiple variables repeats the following updates until convergence:

w_j = w_j - α * ∂J(w,b)/∂w_j    for j = 0, ..., n-1
b = b - α * ∂J(w,b)/∂b

where n is the number of features, the parameters w_j and b are updated simultaneously, and where:

∂J(w,b)/∂w_j = (1/m) * Σ_{i=0}^{m-1} (f_{w,b}(x^(i)) - y^(i)) * x_j^(i)
∂J(w,b)/∂b  = (1/m) * Σ_{i=0}^{m-1} (f_{w,b}(x^(i)) - y^(i))
Let's implement the equations above (there are many ways to implement them; this is one version):

# let's first compute the partial derivative terms
def compute_gradient(X, y, w, b):
    m, n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.
    
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]   # prediction error for example i
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    
    return dj_db, dj_dw
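
As with the cost, a vectorized version of the gradient computation removes both loops. This is a sketch under the same shape assumptions as above (the name compute_gradient_vectorized is mine, not from the course):

def compute_gradient_vectorized(X, y, w, b):
    m = X.shape[0]
    err = X @ w + b - y            # (m,) vector of prediction errors
    dj_dw = X.T @ err / m          # (n,) gradient with respect to w
    dj_db = np.sum(err) / m        # scalar gradient with respect to b
    return dj_db, dj_dw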

With the derivative terms computed, let's implement gradient descent:

import copy
import math

def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    J_history = []
    w = copy.deepcopy(w_in)   # avoid modifying the caller's w
    b = b_in

    for i in range(num_iters):
        dj_db, dj_dw = gradient_function(X, y, w, b)

        # update the parameters simultaneously
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

        # save the cost at each iteration (capped to avoid excessive memory use)
        if i < 100000:
            J_history.append(cost_function(X, y, w, b))

        # print the cost ten times over the course of the run
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f}")

    return w, b, J_history

To test the implementation:

Note that the code below shows how gradient descent would be run, but no actual data is defined in this post; it is for reference purposes only (X_train, y_train, and w_init would need to be defined first, as sketched after the expected results).


# initialize parameters
initial_w = np.zeros_like(w_init)
initial_b = 0.

# set gradient descent settings
iterations = 1000
alpha = 5.0e-7

# run gradient descent
w_final, b_final, J_hist = gradient_descent(X_train, y_train, initial_w, initial_b, compute_cost, compute_gradient, alpha, iterations)

print(f"b, w found by gradient descent: {b_final:0.2f}, {w_final}")
m, _ = X_train.shape
for i in range(m):
    print(f"prediction: {np.dot(X_train[i], w_final) + b_final:0.2f}, target value: {y_train[i]}")
# expected result:
# b, w found by gradient descent: -0.00, [ 0.2   0.   -0.01 -0.07]
# prediction: 426.19, target value: 460
# prediction: 286.17, target value: 232
# prediction: 171.47, target value: 178
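
For completeness, here is a sketch of how the missing pieces (X_train, y_train, w_init) might be defined before running the cell above. The numbers below are placeholder values I'm using for illustration, so they won't necessarily reproduce the expected results shown:

import numpy as np

# hypothetical training data: 3 examples, 4 features
# (size in sqft, number of bedrooms, number of floors, age in years)
X_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [ 852, 2, 1, 35]])
y_train = np.array([460, 232, 178])      # target price in 1000s of dollars

# w_init is only used here to set the shape of the initial parameter vector
w_init = np.zeros(X_train.shape[1])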

Our example result shows that the predictions are not very accurate compared to the target values; we'll explore how to improve on this in tomorrow's post.
