XGBoost Algorithm for Classification Problem
Let’s use a simple example data to demonstrate how XGBoost algorithm works on a classification problem.
X | Y
======================
1 | false
2 | true
3 | false
4 | true
From the above dataset, we have one independent variable (X) and one dependent variable (Y). The task is to classify whether an input X belongs to class false or true.
In addition, XGBoost has several hyperparameters to tune. For the sake of this demonstration, we’re going to set these hyperparameters (a configuration sketch in code follows the list):
n_estimators: 2
max_depth: 2
learning_rate: 1
gamma: 0.5
min_child_weight: 0
reg_lambda: 0
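For reference, here’s a minimal sketch of how this configuration could be expressed with the xgboost Python package (XGBClassifier). The base_score parameter is an extra assumption on my part; it corresponds to the initial prediction F0(x) used in the next step.
import numpy as np
from xgboost import XGBClassifier

# The example dataset: false -> 0, true -> 1.
X = np.array([[1], [2], [3], [4]])
y = np.array([0, 1, 0, 1])

# Hyperparameters used throughout this walkthrough.
model = XGBClassifier(
    n_estimators=2,
    max_depth=2,
    learning_rate=1,
    gamma=0.5,
    min_child_weight=0,
    reg_lambda=0,
    base_score=0.5,  # matches the initial prediction F0(x) below
)
model.fit(X, y)
print(model.predict_proba(X))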
Step 1. Set an initial prediction
Let’s set the initial prediction (F0(x)) to be 0.5. In other words, regardless of the value of X, the predicted Y will be 0.5.
With the above dataset, we can see that the empirical probability of being true for X equal to 2 or 4 is one, while the probability of being true for X equal to 1 or 3 is zero.
Step 2. Calculate the residuals
In this step we calculate the difference between the actual value (Y) and the predicted value (F0(x)). Note that we represent Y in terms of the probability of being classified as true, i.e. false becomes 0 and true becomes 1.
Here’s the result.
X | Y | F0(x) | res_0
==============================================================
1 | 0 | 0.5 | -0.5
2 | 1 | 0.5 | 0.5
3 | 0 | 0.5 | -0.5
4 | 1 | 0.5 | 0.5
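As a quick check, here’s a tiny Python sketch (variable names are just for illustration) that reproduces these residuals.
# Residuals of the initial prediction F0(x) = 0.5.
y = [0, 1, 0, 1]            # actual labels, expressed as probabilities of true
f0 = [0.5, 0.5, 0.5, 0.5]   # initial prediction for every row

res_0 = [yi - p for yi, p in zip(y, f0)]
print(res_0)  # [-0.5, 0.5, -0.5, 0.5]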
Step 3. Train model on the residuals
Here, we build one or more tree-based models that are trained on [X, res].
The following shows how a tree model is constructed.
A) Generate split candidates
In this step we find one or more split candidates: with N rows (sorted by X) we would have N - 1 split candidates, each threshold being the midpoint between two consecutive values of X.
With our dataset example, here’s the result.
===== Candidate Split 1 =====
X | Y | F0(x) | res_0
==============================================================
1 | 0 | 0.5 | -0.5
————————————————————————————————————————————————————————————————————————> threshold = (1 + 2) / 2 = 1.5
2 | 1 | 0.5 | 0.5
3 | 0 | 0.5 | -0.5
4 | 1 | 0.5 | 0.5
===== Candidate Split 2 =====
X | Y | F0(x) | res_0
==============================================================
1 | 0 | 0.5 | -0.5
2 | 1 | 0.5 | 0.5
————————————————————————————————————————————————————————————————————————> threshold = (2 + 3) / 2 = 2.5
3 | 0 | 0.5 | -0.5
4 | 1 | 0.5 | 0.5
===== Candidate Split 3 =====
X | Y | F0(x) | res_0
==============================================================
1 | 0 | 0.5 | -0.5
2 | 1 | 0.5 | 0.5
3 | 0 | 0.5 | -0.5
————————————————————————————————————————————————————————————————————————> threshold = (3 + 4) / 2 = 3.5
4 | 1 | 0.5 | 0.5
The threshold will be used to create two child nodes. For instance, Candidate Split 1 with threshold 1.5 will have a root node with the condition X < 1.5.
Here’s the complete parent-child nodes for each candidate.
===== Candidate Split 1 =====
Parent node -> X < 1.5
Left child node -> [-0.5]
Right child node -> [0.5, -0.5, 0.5]
===== Candidate Split 2 =====
Parent node -> X < 2.5
Left child node -> [-0.5, 0.5]
Right child node -> [-0.5, 0.5]
===== Candidate Split 3 =====
Parent node -> X < 3.5
Left child node -> [-0.5, 0.5, -0.5]
Right child node -> [0.5]
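The following sketch (the helper logic is mine, not XGBoost internals) generates the same thresholds and child partitions.
# Midpoint thresholds between consecutive X values, and the residuals
# that fall into each child node.
X = [1, 2, 3, 4]
res_0 = [-0.5, 0.5, -0.5, 0.5]

thresholds = [(a + b) / 2 for a, b in zip(X, X[1:])]  # [1.5, 2.5, 3.5]

for t in thresholds:
    left = [r for x, r in zip(X, res_0) if x < t]
    right = [r for x, r in zip(X, res_0) if x >= t]
    print(f"X < {t}: left = {left}, right = {right}")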
B) Select the best split
From step (A) we generated one or more split candidates. The question is: how do we select the best one?
In XGBoost, we select the best split by leveraging what are called the Similarity Score and Gain. A Similarity Score is calculated for each node, and the Gain of a split is computed from the Similarity Scores of the parent node and its two child nodes. We select the candidate with the maximum Gain.
Here are the formulas.
Gain = Left_similarity_score + Right_similarity_score - Root_similarity_score
Similarity score = SUM(residual_i)^2 / [SUM(prev_proba_i x (1 - prev_proba_i)) + lambda]
For this demonstration, let’s set lambda to zero, consistent with reg_lambda = 0 above.
With the above formulas, let’s apply them on our split candidates. Note that the Root_similarity_score is zero here: the root node contains all four residuals, so its score is (-0.5 + 0.5 - 0.5 + 0.5)^2 / [4 x (0.5 x (1 - 0.5)) + 0] = 0.
===== Candidate Split 1 =====
Parent node -> X < 1.5
Left child node -> [-0.5]
Right child node -> [0.5, -0.5, 0.5]
Left_similarity_score = (-0.5)^2 / [(0.5 * (1 - 0.5)) + 0] = 1
Right_similarity_score = (0.5 - 0.5 + 0.5)^2 / [(0.5 * (1 - 0.5)) + (0.5 * (1 - 0.5)) + (0.5 * (1 - 0.5)) + 0] = 0.33
Gain = 1 + 0.33 - 0 = 1.33
===== Candidate Split 2 =====
Parent node -> X < 2.5
Left child node -> [-0.5, 0.5]
Right child node -> [-0.5, 0.5]
Left_similarity_score = (-0.5 + 0.5)^2 / [(0.5 * (1 - 0.5)) + (0.5 * (1 - 0.5)) + 0] = 0
Right_similarity_score = (-0.5 + 0.5)^2 / [(0.5 * (1 - 0.5)) + (0.5 * (1 - 0.5)) + 0] = 0
Gain = 0 + 0 - 0 = 0
===== Candidate Split 3 =====
Parent node -> X < 3.5
Left child node -> [-0.5, 0.5, -0.5]
Right child node -> [0.5]
Left_similarity_score = (-0.5 + 0.5 - 0.5)^2 / [(0.5 * (1 - 0.5)) + (0.5 * (1 - 0.5)) + (0.5 * (1 - 0.5)) + 0] = 0.33
Right_similarity_score = (0.5)^2 / [(0.5 * (1 - 0.5)) + 0] = 1
Gain = 0.33 + 1 - 0 = 1.33
According to the above calculation, the maximum Gain is 1.33, achieved by both Candidate Split 1 and Candidate Split 3. Let’s choose Candidate Split 3.
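Here’s a short sketch (the helper functions are illustrative, not the library’s API) that reproduces the Similarity Scores and Gains above; every previous probability is 0.5 in this first iteration.
def similarity(residuals, prev_proba=0.5, lam=0):
    # SUM(residual_i)^2 / [SUM(prev_proba_i * (1 - prev_proba_i)) + lambda]
    return sum(residuals) ** 2 / (len(residuals) * prev_proba * (1 - prev_proba) + lam)

def gain(left, right, parent_similarity=0):
    return similarity(left) + similarity(right) - parent_similarity

print(gain([-0.5], [0.5, -0.5, 0.5]))      # Candidate Split 1 -> 1.33
print(gain([-0.5, 0.5], [-0.5, 0.5]))      # Candidate Split 2 -> 0.0
print(gain([-0.5, 0.5, -0.5], [0.5]))      # Candidate Split 3 -> 1.33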
C) Continue searching the best split until max_depth is reached
Since max_depth is set to 2, our tree model should have at most two levels. Therefore, let’s perform the same task as in (B).
We’ll start from Candidate Split 3 since we chose it in the previous step.
Here’s Candidate Split 3 again. Recall that with N rows we’d have N - 1 split candidates. Since the right child node consists of only one row, there will be no split candidates for it. On the other hand, we’ll get two candidates from the left child node.
===== Candidate Split 3 =====
X | Y | F0(x) | res_0
==============================================================
1 | 0 | 0.5 | -0.5
2 | 1 | 0.5 | 0.5
3 | 0 | 0.5 | -0.5
————————————————————————————————————————————————————————————————————————> threshold = (3 + 4) / 2 = 3.5
4 | 1 | 0.5 | 0.5
Parent node -> X < 3.5
Left child node -> [-0.5, 0.5, -0.5]
Right child node -> [0.5]
Left_similarity_score = (-0.5 + 0.5 - 0.5)^2 / [(0.5 * (1 - 0.5)) + (0.5 * (1 - 0.5)) + (0.5 * (1 - 0.5)) + 0] = 0.33
Right_similarity_score = (0.5)^2 / [(0.5 * (1 - 0.5)) + 0] = 1
Gain = 0.33 + 1 - 0 = 1.33
So, let’s grow the tree from the left child node! We start by generating the split candidates.
===== Candidate Split 3_1 =====
X | Y | F0(x) | res_0
==============================================================
1 | 0 | 0.5 | -0.5
————————————————————————————————————————————————————————————————————————> threshold = (1 + 2) / 2 = 1.5
2 | 1 | 0.5 | 0.5
3 | 0 | 0.5 | -0.5
===== Candidate Split 3_2 =====
X | Y | F0(x) | res_0
==============================================================
1 | 0 | 0.5 | -0.5
2 | 1 | 0.5 | 0.5
————————————————————————————————————————————————————————————————————————> threshold = (2 + 3) / 2 = 2.5
3 | 0 | 0.5 | -0.5
Next, similar to the previous step, we select the best split candidate by calculating the Gain and Similarity Score.
Note that when calculating Gain, the Root_similarity_score is simply the same as the Left_similarity_score of Candidate Split 3, which is 0.33.
===== Candidate Split 3_1 =====
Parent node -> X < 1.5
Left child node -> [-0.5]
Right child node -> [0.5, -0.5]
Left_similarity_score = (-0.5)^2 / [(0.5 * (1 - 0.5)) + 0] = 1
Right_similarity_score = (0.5 - 0.5)^2 / [(0.5 * (1 - 0.5)) + (0.5 * (1 - 0.5)) + 0] = 0
Gain = 1 + 0 - 0.33 = 0.67
===== Candidate Split 3_2 =====
Parent node -> X < 2.5
Left child node -> [-0.5, 0.5]
Right child node -> [-0.5]
Left_similarity_score = (-0.5 + 0.5)^2 / [(0.5 * (1 - 0.5)) + (0.5 * (1 - 0.5)) + 0] = 0
Right_similarity_score = (-0.5)^2 / [(0.5 * (1 - 0.5)) + 0] = 1
Gain = 0 + 1 - 0.33 = 0.67
We got the same Gain for both candidates. Well, let’s just select Candidate Split 3_1.
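Reusing the same idea, a quick sketch confirms both second-level Gains (note the non-zero parent Similarity Score of 0.33 being subtracted).
def similarity(residuals, prev_proba=0.5, lam=0):
    return sum(residuals) ** 2 / (len(residuals) * prev_proba * (1 - prev_proba) + lam)

parent = similarity([-0.5, 0.5, -0.5])                        # 0.33
print(similarity([-0.5]) + similarity([0.5, -0.5]) - parent)  # Split 3_1 -> 0.67
print(similarity([-0.5, 0.5]) + similarity([-0.5]) - parent)  # Split 3_2 -> 0.67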
Since we have reached the max_depth, we’ll stop building the tree.
Our final tree is shown by the following.
IF X < 3.5 THEN
    IF X < 1.5 THEN
        [-0.5]
    ELSE
        [0.5, -0.5]
ELSE
    [0.5]
D) Tree pruning
In this step, we reduce the size of the decision tree by removing sections that have little power to classify data points.
Below is the pruning rule.
Start from the lowest-level parent node:
IF (Gain - gamma) < 0 THEN
    Remove that parent node & check the same for the upper parent node
ELSE
    Don’t remove the parent node & stop pruning
Using our decision tree, here’s how the pruning is performed.
The lowest-level parent node (X < 1.5) has a Gain of 0.67 and a Similarity Score of 0.33.
Calculating Gain - gamma gives 0.67 - 0.5 = 0.17.
Since 0.17 > 0, we don’t remove the node and stop pruning.
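In code, the pruning check for this node boils down to the following sketch (values taken from the calculations above).
gain = 0.67   # Gain of the lowest-level parent node (X < 1.5)
gamma = 0.5   # the gamma hyperparameter we set earlier

if gain - gamma < 0:
    print("prune this node and check its parent")
else:
    print("keep the node and stop pruning")  # 0.17 > 0, so we keep it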
E) Calculate Output Values
An Output Value is calculated for every leaf of the decision tree.
The formula is given by the following:
Output Value = SUM(residual_i) / [SUM(prev_proba_i x (1 - prev_proba_i)) + lambda]
Recall that our previous tree has leaves with values of [0.5, -0.5], [0.5] and [-0.5].
Output value for [0.5, -0.5] => (0.5 - 0.5) / [(0.5 * (1 - 0.5)) + (0.5 * (1 - 0.5)) + 0] = 0
Output value for [-0.5] => (-0.5) / [(0.5 * (1 - 0.5)) + 0] = -2
Output value for [0.5] => (0.5) / [(0.5 * (1 - 0.5)) + 0] = 2
This step simply replaces the leaf values (which may have multiple values of residuals) with the Output Values.
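A small sketch (again, the helper is illustrative) that reproduces the three Output Values:
def output_value(residuals, prev_proba=0.5, lam=0):
    # SUM(residual_i) / [SUM(prev_proba_i * (1 - prev_proba_i)) + lambda]
    return sum(residuals) / (len(residuals) * prev_proba * (1 - prev_proba) + lam)

print(output_value([0.5, -0.5]))  #  0
print(output_value([-0.5]))       # -2
print(output_value([0.5]))        #  2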
To conclude, here’s our first decision tree (DT_0), constructed by fitting it on [X, res_0].
IF X < 3.5 THEN
    IF X < 1.5 THEN
        -2
    ELSE
        0
ELSE
    2
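Expressed as a plain Python function (purely for illustration), DT_0 looks like this:
def dt_0(x):
    if x < 3.5:
        if x < 1.5:
            return -2   # leaf for X = 1
        return 0        # leaf for X = 2, 3
    return 2            # leaf for X = 4

print([dt_0(x) for x in [1, 2, 3, 4]])  # [-2, 0, 0, 2]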
F) Predict the new values using DT_0
Before predicting with DT_0, we need to convert the initial probability (which is 0.5) to the log(odds). This can be achieved by taking the log of (p / (1 - p)), where p is the initial prediction. Therefore, the log(odds) of the initial prediction is log(1), or zero.
To predict the new values, we use the following formula.
log(odds)_F1(x) = log(odds)_F0(x) + learning_rate * (DT_0(x))
F1(x) is retrieved by computing e^log(odds)_F1(x) / [1 + e^log(odds)_F1(x)]
Let’s apply this step to our dataset. Recall that learning_rate is set to 1.
X | Y | F0(x) | res_0 | F1(x) | res_1
=============================================================================================================
1 | 0 | 0.5 | -0.5 | e^-2 / [1 + e^-2] | 0 - F1(x)
2 | 1 | 0.5 | 0.5 | e^0 / [1 + e^0] | 1 - F1(x)
3 | 0 | 0.5 | -0.5 | e^0 / [1 + e^0] | 0 - F1(x)
4 | 1 | 0.5 | 0.5 | e^2 / [1 + e^2] | 1 - F1(x)
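Here’s a sketch of this update step; dt_0 is the illustrative function from above, redefined here so the snippet runs on its own.
import math

def dt_0(x):
    return -2 if x < 1.5 else (0 if x < 3.5 else 2)

learning_rate = 1
log_odds_f0 = math.log(0.5 / (1 - 0.5))  # log(1) = 0

for x, y in zip([1, 2, 3, 4], [0, 1, 0, 1]):
    log_odds_f1 = log_odds_f0 + learning_rate * dt_0(x)
    f1 = math.exp(log_odds_f1) / (1 + math.exp(log_odds_f1))
    print(x, round(f1, 3), round(y - f1, 3))  # F1(x) and res_1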
Step 4. Repeat Step 3 until n_estimators is reached
After we get res_1, we can build another decision tree fitted on [X, res_1], and repeat until n_estimators (the number of decision trees fitted on residuals) is reached.
To conclude, the final XGBoost model is given by the following.
log(odds)_Fn(x) = log(odds)_F0(x) + [learning_rate * (DT_0(x))] + [learning_rate * (DT_1(x))] + [learning_rate * (DT_2(x))] + . . .
Fn(x) = e^log(odds)_Fn(x) / [1 + e^log(odds)_Fn(x)]
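Putting it all together, the final prediction could be sketched like this (only DT_0 was built in this walkthrough, so the list of trees is illustrative).
import math

def dt_0(x):
    return -2 if x < 1.5 else (0 if x < 3.5 else 2)

def predict_proba(x, trees, learning_rate=1, initial_proba=0.5):
    log_odds = math.log(initial_proba / (1 - initial_proba))
    log_odds += sum(learning_rate * tree(x) for tree in trees)
    return math.exp(log_odds) / (1 + math.exp(log_odds))

print(predict_proba(4, [dt_0]))  # e^2 / (1 + e^2) ~ 0.88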