Gradient Boosting Algorithm for Regression Problem
In this post, we're going to look at how the Gradient Boosting algorithm works in a regression problem.
Gradient Boosting is one of several boosting algorithms, along with Adaptive Boosting (AdaBoost) and Extreme Gradient Boosting (XGBoost).
The following are the inputs required before applying the algorithm.
- training data
- a differentiable loss function L(y, F(x)), where y is the actual value and F(x) is the predicted value
- the number of trees (M)
For the sake of clarity, let's use the following example instances as our training data (y is the dependent variable).
A | B | C | y
=================
A1 | B1 | C1 | 50
A2 | B2 | C2 | 70
A3 | B3 | C3 | 30
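To make the worked example easier to follow, here is a minimal Python sketch of the toy data and the loss function. The variable names are my own, and since the feature values A1..C3 are placeholders, only y is kept as numbers.

import numpy as np

# dependent variable y for the three training instances above
y = np.array([50.0, 70.0, 30.0])

def loss(y_true, y_pred):
    # squared-error loss for a single instance: L(y, F(x)) = 0.5 * (y - F(x))^2
    return 0.5 * (y_true - y_pred) ** 2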
Let’s execute the algorithm.
Initialize the model with a constant value
As the very first step, we build a simple model that just returns a constant value C. This C is the prediction value that minimizes the loss function over the training data.
Formally, we'd like to solve the following optimization problem.
F0(x) = argmin(C) SUM(i=1 to n) L(yi, C)
Let’s perform this step on our training data.
Recall that the loss function for a single instance is defined as L(yi, F(xi)) = 0.5 * (yi - F(xi))^2. The sum of the loss function over our training data therefore becomes the following.
sum_loss_func = L(y1, F(x1)) + L(y2, F(x2)) + L(y3, F(x3))
sum_loss_func = 0.5 * (y1 - F(x1))^2 + 0.5 * (y2 - F(x2))^2 + 0.5 * (y3 - F(x3))^2
Since the prediction value is the same constant for every training instance, we can replace F(x1), F(x2), F(x3) with the constant C. Plugging the actual values into the above equation gives the following.
sum_loss_func = 0.5 * (50 - C)^2 + 0.5 * (70 - C)^2 + 0.5 * (30 - C)^2
We can then solve the optimization problem by taking the first derivative of sum_loss_func with respect to C and setting the result to zero, as follows.
d(sum_loss_func) = -(50 - C) - (70 - C) - (30 - C) = -150 + 3C
----------------
dC
Setting the result to zero yields 150 = 3C, so C = 150 / 3 = 50.
Our initial model is therefore constructed so that it always returns 50 for every data point.
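As a quick sanity check, under squared-error loss the optimal constant is simply the mean of y. A minimal sketch, reusing the y array defined earlier:

# the constant that minimizes the summed squared-error loss is the mean of y
F0 = y.mean()                        # (50 + 70 + 30) / 3 = 50.0

# the initial model predicts this constant for every instance
predictions = np.full_like(y, F0)    # array([50., 50., 50.])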
Construct the decision tree regressor
In this step, we iterate M times (the number of trees specified before) and perform the following actions.
a) Compute so-called pseudo-residuals
For each instance in the training data, we compute the following.
residual_i_m = - [ dL(yi, F(xi)) / dF(xi) ] where F(x) = Fm-1(x) for i = 1, ..., n
If we take a look at the above residual_i_m, it's simply the difference between the actual and the predicted value. Here's how we'd derive it.
Know that L(y, F(x)) = 0.5 * (y - F(x))^2.
Taking the derivative of the loss function with respect to F(x) yields dL(y, F(x)) / dF(x) = -(y - F(x)).
Moving the minus sign to the LHS yields -dL(y, F(x)) / dF(x) = y - F(x).
Or to be more specific, we'll finally have -dL(y, Fm-1(x)) / dFm-1(x) = y - Fm-1(x).
The last line of the above snippet concludes that residual_i_m is the same as yi - Fm-1(xi) for an instance i.
Applying this step to our training data yields the following result. In the table below, residual_1 holds the pseudo-residuals for the first tree regressor.
A | B | C | y | F0(x) | residual_1
========================================
A1 | B1 | C1 | 50 | 50 | 0
A2 | B2 | C2 | 70 | 50 | 20
A3 | B3 | C3 | 30 | 50 | -20
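Because the negative gradient of the squared-error loss is just y - F(x), the pseudo-residuals for the first iteration can be computed in one line. A minimal sketch, continuing from the arrays above:

# pseudo-residuals for m = 1: residual_i = y_i - F0(x_i)
residual_1 = y - predictions         # array([  0.,  20., -20.])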
b) Fit a base learner to pseudo-residuals
In this step, we train a weak model (a decision tree regressor) on the training data, where residual_i_m becomes the dependent variable. Let's call the model h_m(x).
For instance, let's apply our h_m(x) to the training data above. In this case, h_m(x) becomes h_1(x) since it's our first weak model.
Note that the values of h_1(x) below are dummy values.
A | B | C | y | F0(x) | residual_1 | h_1(x)
===============================================
A1 | B1 | C1 | 50 | 50 | 0 | 0.5
A2 | B2 | C2 | 70 | 50 | 20 | 18
A3 | B3 | C3 | 30 | 50 | -20 | -18
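In practice this weak learner is usually a shallow regression tree fit on the original features, with the pseudo-residuals as the target. A minimal sketch using scikit-learn; the numeric encoding X of the categorical features A, B, C is made up for illustration, so its predictions won't match the dummy h_1(x) values in the table.

from sklearn.tree import DecisionTreeRegressor

# hypothetical numeric encoding of the features A, B, C
X = np.array([[1.0, 1.0, 1.0],
              [2.0, 2.0, 2.0],
              [3.0, 3.0, 3.0]])

# weak learner h_1(x): a shallow tree trained on the pseudo-residuals
h_1 = DecisionTreeRegressor(max_depth=1)
h_1.fit(X, residual_1)
h_1_pred = h_1.predict(X)            # plays the role of the h_1(x) column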
c) Compute multiplier for the weak model
In this step, we solve the following optimization problem. We'd like to find the multiplier (gamma) that minimizes the sum of loss functions.
gamma_min = argmin(gamma) SUM(i=1 to n) L(yi, Fm-1(xi) + gamma * h_m(xi))
Let's apply this step to our training data (the one with the h_m(x) column).
sum_loss_func = L(y1, F0(x1) + gamma * h_1(x1)) + L(y2, F0(x2) + gamma * h_1(x2)) + L(y3, F0(x3) + gamma * h_1(x3))
sum_loss_func = 0.5 * (y1 - (F0(x1) + gamma * h_1(x1)))^2 \
+ 0.5 * (y2 - (F0(x2) + gamma * h_1(x2)))^2 \
+ 0.5 * (y3 - (F0(x3) + gamma * h_1(x3)))^2
sum_loss_func = 0.5 * (50 - (50 + gamma*0.5))^2 \
+ 0.5 * (70 - (50 + gamma*18))^2 \
+ 0.5 * (30 - (50 + gamma*(-18)))^2
sum_loss_func = 0.5 * (-gamma*0.5)^2 \
              + 0.5 * (20 - gamma*18)^2 \
              + 0.5 * (-20 + gamma*18)^2
Afterwards, let's take the first derivative of sum_loss_func with respect to gamma and set the equation to zero.
d(sum_loss_func) = (-gamma*0.5)*(-0.5) + (20 - gamma*18)*(-18) + (-20 + gamma*18)*(18)
----------------
d(gamma)
d(sum_loss_func) = gamma*0.25 - 360 + gamma*324 - 360 + gamma*324
----------------
d(gamma)
d(sum_loss_func) = gamma*648.25 - 720
----------------
d(gamma)
Setting the above to zero yields 720 = 648.25*gamma, therefore gamma = 720 / 648.25 ≈ 1.11
From the above calculation, we get gamma_min equal to approximately 1.11.
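For squared-error loss this optimization has a closed form: setting the derivative to zero gives gamma = SUM(residual_i * h_m(xi)) / SUM(h_m(xi)^2). A minimal sketch that uses the dummy h_1(x) values from the table, so the numbers match the derivation above:

# dummy h_1(x) values from the table above
h_1_values = np.array([0.5, 18.0, -18.0])

# closed-form solution of d(sum_loss_func)/d(gamma) = 0
gamma_min = np.sum(residual_1 * h_1_values) / np.sum(h_1_values ** 2)
print(gamma_min)                     # 720 / 648.25 ≈ 1.11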
d) Update the model
In this step, we calculate the following.
Fm(x) = Fm-1(x) + gamma_min * h_m(x)
Applying the above formula to our training data, we calculate F1(x) = F0(x) + gamma_min * h_1(x) and get the following.
A | B | C | y | F0(x) | residual_1 | h_1(x) | F1(x)
======================================================================
A1 | B1 | C1 | 50 | 50 | 0 | 0.5 | 50 + 1.11 * 0.5
A2 | B2 | C2 | 70 | 50 | 20 | 18 | 50 + 1.11 * 18
A3 | B3 | C3 | 30 | 50 | -20 | -18 | 50 + 1.11 * (-18)
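Note that every prediction in F1(x) has moved toward its actual y value, which is exactly what one boosting step should do. The update itself is a single element-wise operation; a minimal sketch, continuing from the arrays above:

# F1(x) = F0(x) + gamma_min * h_1(x)
F1 = predictions + gamma_min * h_1_values
print(F1)                            # approximately [50.56, 69.99, 30.01] with the exact gamma of 720/648.25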
e) Repeat steps a) to d) until M base learners are constructed
Output the final model
After constructing M base learners on the pseudo-residuals, we finally arrive at our final model, that is, FM(x).
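To tie the steps together, here is a compact sketch of the whole procedure under squared-error loss. The function names and the one-gamma-per-tree formulation are my own simplification; production implementations (for example scikit-learn's GradientBoostingRegressor) additionally use a learning rate and other refinements.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M, max_depth=1):
    F0 = y.mean()                          # initialize the model with a constant value
    F = np.full(len(y), F0)
    trees, gammas = [], []
    for m in range(M):
        residuals = y - F                  # a) pseudo-residuals (squared-error loss)
        h = DecisionTreeRegressor(max_depth=max_depth)
        h.fit(X, residuals)                # b) fit a base learner to the pseudo-residuals
        h_pred = h.predict(X)
        # c) multiplier that minimizes the summed loss (assumes h_pred is not all zeros)
        gamma = np.sum(residuals * h_pred) / np.sum(h_pred ** 2)
        F = F + gamma * h_pred             # d) update the model
        trees.append(h)
        gammas.append(gamma)
    return F0, trees, gammas               # together these define the final model FM(x)

def gradient_boost_predict(F0, trees, gammas, X):
    F = np.full(X.shape[0], F0)
    for h, gamma in zip(trees, gammas):
        F = F + gamma * h.predict(X)
    return F

# example usage with the toy X and y from the sketches above
F0, trees, gammas = gradient_boost_fit(X, y, M=2)
print(gradient_boost_predict(F0, trees, gammas, X))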