Gradient Boosting Algorithm for Regression Problem
In this post, we're going to look at how the Gradient Boosting algorithm works in a regression problem.
Gradient Boosting is one of several boosting algorithms, along with Adaptive Boosting (AdaBoost) and Extreme Gradient Boosting (XGBoost).
The following are the inputs required before applying the algorithm.
- training data
- a differentiable loss function L(y, F(x)), where y is the actual value and F(x) is the predicted value
- the number of trees (M)
For the sake of clarity, let's use the following example instances as our training data (y is the dependent variable).
A | B | C | y
=================
A1 | B1 | C1 | 50
A2 | B2 | C2 | 70
A3 | B3 | C3 | 30
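To make the worked example easier to follow, here is a minimal Python sketch of the toy data and the loss function. The variable names are my own, and since the feature values A1..C3 are placeholders, only y is kept as numbers.

import numpy as np

# dependent variable y for the three training instances above
y = np.array([50.0, 70.0, 30.0])

def loss(y_true, y_pred):
    # squared-error loss for a single instance: L(y, F(x)) = 0.5 * (y - F(x))^2
    return 0.5 * (y_true - y_pred) ** 2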
Let’s execute the algorithm.
Initialize the model with a constant value
As the very first step, we build a simple model that just returns a constant value C. This C is the prediction value that minimizes the loss function over the training data.
Formally, we'd like to solve the following optimization problem.
F0(x) = argmin(C) SUM(i=1 to n) L(yi, C)
Let’s perform this step on our training data.
Recall that the loss function for a single instance is defined as L(yi, F(xi)) = 0.5 * (yi - F(xi))^2. The sum of the loss function over our training data therefore becomes the following.
sum_loss_func = L(y1, F(x1)) + L(y2, F(x2)) + L(y3, F(x3))
sum_loss_func = 0.5 * (y1 - F(x1))^2 + 0.5 * (y2 - F(x2))^2 + 0.5 * (y3 - F(x3))^2
Since the prediction value is the same constant for every training instance, we can replace F(x1), F(x2), F(x3) with the constant C. Plugging the actual values into the above equation gives the following.
sum_loss_func = 0.5 * (50 - C)^2 + 0.5 * (70 - C)^2 + 0.5 * (30 - C)^2
We can then solve the optimization problem by taking the first derivative of sum_loss_func with respect to C and setting the result to zero, as follows.
d(sum_loss_func) = -(50 - C) - (70 - C) - (30 - C) = -150 + 3C
----------------
dC
Setting the result to zero yields 150 = 3C, so C = 150 / 3 = 50.
Our initial model is therefore constructed so that it always returns 50 for every data point.
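As a quick sanity check, under squared-error loss the optimal constant is simply the mean of y. A minimal sketch, reusing the y array defined earlier:

# the constant that minimizes the summed squared-error loss is the mean of y
F0 = y.mean()                        # (50 + 70 + 30) / 3 = 50.0

# the initial model predicts this constant for every instance
predictions = np.full_like(y, F0)    # array([50., 50., 50.])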
Construct the decision tree regressor
In this step, we iterate M times (the number of trees specified before) and perform the following actions.
a) Compute so-called pseudo-residuals
For each instance in the training data, we compute the following.
residual_i_m = - [ dL(yi, F(xi)) / dF(xi) ] where F(x) = Fm-1(x) for i = 1, ..., n
If we take a look at the above residual_i_m, it's simply the difference between the actual and the predicted value. Here's how we'd derive it.
Know that L(y, F(x)) = 0.5 * (y - F(x))^2.
Taking the derivative of the loss function with respect to F(x) yields dL(y, F(x)) / dF(x) = -(y - F(x)).
Moving the minus sign to the LHS yields -dL(y, F(x)) / dF(x) = y - F(x).
Or to be more specific, we'll finally have -dL(y, Fm-1(x)) / dFm-1(x) = y - Fm-1(x).
The last line of the above snippet concludes that residual_i_m is the same as yi - Fm-1(xi) for an instance i.
Applying this step to our training data yields the following result. In the table below, residual_1 holds the pseudo-residuals for the first tree regressor.
A | B | C | y | F0(x) | residual_1
========================================
A1 | B1 | C1 | 50 | 50 | 0
A2 | B2 | C2 | 70 | 50 | 20
A3 | B3 | C3 | 30 | 50 | -20
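Because the negative gradient of the squared-error loss is just y - F(x), the pseudo-residuals for the first iteration can be computed in one line. A minimal sketch, continuing from the arrays above:

# pseudo-residuals for m = 1: residual_i = y_i - F0(x_i)
residual_1 = y - predictions         # array([  0.,  20., -20.])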
b) Fit a base learner to pseudo-residuals
In this step, we train a weak model (a decision tree regressor) on the training data, where residual_i_m becomes the dependent variable. Let's call the model h_m(x).
For instance, let's apply our h_m(x) to the training data above. In this case, h_m(x) becomes h_1(x) since it's our first weak model.
Note that the values of h_1(x) below are dummy values.
A | B | C | y | F0(x) | residual_1 | h_1(x)
===============================================
A1 | B1 | C1 | 50 | 50 | 0 | 0.5
A2 | B2 | C2 | 70 | 50 | 20 | 18
A3 | B3 | C3 | 30 | 50 | -20 | -18
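In practice this weak learner is usually a shallow regression tree fit on the original features, with the pseudo-residuals as the target. A minimal sketch using scikit-learn; the numeric encoding X of the categorical features A, B, C is made up for illustration, so its predictions won't match the dummy h_1(x) values in the table.

from sklearn.tree import DecisionTreeRegressor

# hypothetical numeric encoding of the features A, B, C
X = np.array([[1.0, 1.0, 1.0],
              [2.0, 2.0, 2.0],
              [3.0, 3.0, 3.0]])

# weak learner h_1(x): a shallow tree trained on the pseudo-residuals
h_1 = DecisionTreeRegressor(max_depth=1)
h_1.fit(X, residual_1)
h_1_pred = h_1.predict(X)            # plays the role of the h_1(x) column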
c) Compute multiplier for the weak model
In this step, we solve the following optimization problem. We'd like to find the multiplier (gamma) that minimizes the sum of loss functions.
gamma_min = argmin(gamma) SUM(i=1 to n) L(yi, Fm-1(xi) + gamma * h_m(xi))
Let's apply this step to our training data (the one with the h_m(x) column).
sum_loss_func = L(y1, F0(x1) + gamma * h_1(x1)) + L(y2, F0(x2) + gamma * h_1(x2)) + L(y3, F0(x3) + gamma * h_1(x3))
sum_loss_func = 0.5 * (y1 - (F0(x1) + gamma * h_1(x1)))^2 \
+ 0.5 * (y2 - (F0(x2) + gamma * h_1(x2)))^2 \
+ 0.5 * (y3 - (F0(x3) + gamma * h_1(x3)))^2
sum_loss_func = 0.5 * (50 - (50 + gamma*0.5))^2 \
+ 0.5 * (70 - (50 + gamma*18))^2 \
+ 0.5 * (30 - (50 + gamma*(-18)))^2
sum_loss_func = 0.5 * (-gamma*0.5)^2 \
              + 0.5 * (20 - gamma*18)^2 \
              + 0.5 * (-20 + gamma*18)^2
Afterwards, let's take the first derivative of sum_loss_func with respect to gamma and set the equation to zero.
d(sum_loss_func) = (-gamma*0.5)*(-0.5) + (20 - gamma*18)*(-18) + (-20 + gamma*18)*(18)
----------------
d(gamma)
d(sum_loss_func) = gamma*0.25 - 360 + gamma*324 - 360 + gamma*324
----------------
d(gamma)
d(sum_loss_func) = gamma*648.25 - 720
----------------
d(gamma)
Setting the above to zero yields 720 = 648.25*gamma, therefore gamma = 720 / 648.25 ≈ 1.11
From the above calculation, we get gamma_min equal to approximately 1.11.
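For squared-error loss this optimization has a closed form: setting the derivative to zero gives gamma = SUM(residual_i * h_m(xi)) / SUM(h_m(xi)^2). A minimal sketch that uses the dummy h_1(x) values from the table, so the numbers match the derivation above:

# dummy h_1(x) values from the table above
h_1_values = np.array([0.5, 18.0, -18.0])

# closed-form solution of d(sum_loss_func)/d(gamma) = 0
gamma_min = np.sum(residual_1 * h_1_values) / np.sum(h_1_values ** 2)
print(gamma_min)                     # 720 / 648.25 ≈ 1.11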
d) Update the model
In this step, we calculate the following.
Fm(x) = Fm-1(x) + gamma_min * h_m(x)
Applying the above formula to our training data, we calculate F1(x) = F0(x) + gamma_min * h_1(x) and get the following.
A | B | C | y | F0(x) | residual_1 | h_1(x) | F1(x)
======================================================================
A1 | B1 | C1 | 50 | 50 | 0 | 0.5 | 50 + 1.11 * 0.5
A2 | B2 | C2 | 70 | 50 | 20 | 18 | 50 + 1.11 * 18
A3 | B3 | C3 | 30 | 50 | -20 | -18 | 50 + 1.11 * (-18)
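Note that every prediction in F1(x) has moved toward its actual y value, which is exactly what one boosting step should do. The update itself is a single element-wise operation; a minimal sketch, continuing from the arrays above:

# F1(x) = F0(x) + gamma_min * h_1(x)
F1 = predictions + gamma_min * h_1_values
print(F1)                            # approximately [50.56, 69.99, 30.01] with the exact gamma of 720/648.25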
e) Repeat steps a) to d) until M base learners are constructed
Output the final model
After constructing M base learners on the pseudo-residuals, we finally arrive at our final model, that is, FM(x).
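To tie the steps together, here is a compact sketch of the whole procedure under squared-error loss. The function names and the one-gamma-per-tree formulation are my own simplification; production implementations (for example scikit-learn's GradientBoostingRegressor) additionally use a learning rate and other refinements.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M, max_depth=1):
    F0 = y.mean()                          # initialize the model with a constant value
    F = np.full(len(y), F0)
    trees, gammas = [], []
    for m in range(M):
        residuals = y - F                  # a) pseudo-residuals (squared-error loss)
        h = DecisionTreeRegressor(max_depth=max_depth)
        h.fit(X, residuals)                # b) fit a base learner to the pseudo-residuals
        h_pred = h.predict(X)
        # c) multiplier that minimizes the summed loss (assumes h_pred is not all zeros)
        gamma = np.sum(residuals * h_pred) / np.sum(h_pred ** 2)
        F = F + gamma * h_pred             # d) update the model
        trees.append(h)
        gammas.append(gamma)
    return F0, trees, gammas               # together these define the final model FM(x)

def gradient_boost_predict(F0, trees, gammas, X):
    F = np.full(X.shape[0], F0)
    for h, gamma in zip(trees, gammas):
        F = F + gamma * h.predict(X)
    return F

# example usage with the toy X and y from the sketches above
F0, trees, gammas = gradient_boost_fit(X, y, M=2)
print(gradient_boost_predict(F0, trees, gammas, X))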