Kruskal-Wallis Test Statistic Formula Derivation When No Tied Values Exist
In the previous post, I mentioned that the general formula for the H statistic is the following (Source: Wikipedia - Kruskal–Wallis one-way analysis of variance):
H = (N - 1) * SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
-----------------------------------------------
SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2
Where:
g: the number of groups
n_i: the number of observations in group i
r_i_j: the rank (among all observations) of observation j from group i
N: the total number of observations across all groups
r_i_hat: the average rank of all observations in group i, which is given by (1/n_i) * (SUM(j=1 to n_i) r_i_j)
r_hat: the average of all the r_i_j, which is given by 0.5 * (N + 1)
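To make the notation concrete, here is a minimal Python sketch that computes each quantity and the general H statistic. The data in `groups` is made up for illustration and deliberately contains no repeated values; the helper name `rank_of` is also just a local choice. Exact fractions are used so that later comparisons are not blurred by floating-point rounding.

```python
from fractions import Fraction

# Hypothetical sample data: three groups, no repeated values anywhere.
groups = [[2.9, 3.0, 2.5], [3.8, 2.7, 4.0, 2.4], [2.8, 3.4, 3.7, 2.2]]

# Rank every observation among the pooled data (1 = smallest).
pooled = sorted(x for grp in groups for x in grp)
rank_of = {x: i + 1 for i, x in enumerate(pooled)}  # only valid without ties

ranks = [[rank_of[x] for x in grp] for grp in groups]      # r_i_j
n = [len(grp) for grp in ranks]                            # n_i
N = sum(n)                                                 # N
r_i_hat = [Fraction(sum(grp), len(grp)) for grp in ranks]  # average rank per group
r_hat = Fraction(sum(sum(grp) for grp in ranks), N)        # average of all ranks

# General H statistic: (N - 1) * between-group sum / total sum of squares.
numerator = (N - 1) * sum(n_i * (m - r_hat) ** 2 for n_i, m in zip(n, r_i_hat))
denominator = sum((r - r_hat) ** 2 for grp in ranks for r in grp)
H = numerator / denominator

print(N, r_hat, H, float(H))
```

For this made-up data the script gives N = 11, r_hat = 6 (which matches 0.5 * (N + 1)) and H = 23/66, roughly 0.348.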
In addition, if the combined observations contain no tied values (no two observations are equal), the test statistic can alternatively be expressed as follows:
H = [[12 / (N * (N + 1))] * SUM(i=1 to g) n_i * (r_i_hat)^2] - 3 * (N + 1) --> We're going to prove this
In this post, we’re going to look at how to derive the above formula (H statistic without tied values).
Start from the denominator
We’ll start from the denominator. Let’s take a look at what the denominator becomes if there are no tied values.
Expanding the denominator yields the following.
SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2
= SUM(i=1 to g) SUM(j=1 to n_i) [(r_i_j)^2 - 2*(r_i_j)*(r_hat) + (r_hat)^2] (A)
Re-structure the formula for rank average
Next, we’ll leverage the formula of r_hat, which is (N + 1) / 2. Since r_hat is also the sum of all ranks divided by N, it becomes the following.
(N + 1) / 2 = [SUM(i=1 to g) SUM(j=1 to n_i) r_i_j] / N
N * (N + 1) / 2 = SUM(i=1 to g) SUM(j=1 to n_i) r_i_j (B)
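As a quick sanity check of (B), the sketch below splits the ranks 1..N into groups and adds them up group by group. The value of N, the group sizes, and the random seed are arbitrary choices for illustration; the point is that the double sum does not depend on how the ranks are distributed across groups.

```python
import random

N = 10
all_ranks = list(range(1, N + 1))   # with no ties, the ranks are exactly 1..N
random.seed(0)                      # arbitrary seed, only for reproducibility
random.shuffle(all_ranks)

# Hypothetical split into g = 3 groups of sizes 3, 3 and 4.
groups = [all_ranks[:3], all_ranks[3:6], all_ranks[6:]]

# SUM(i=1 to g) SUM(j=1 to n_i) r_i_j
double_sum = sum(sum(grp) for grp in groups)

print(double_sum, N * (N + 1) // 2)  # both sides of (B)
```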
Express (A) in terms of (B)
Next, we’ll express each term in (A) in terms of (B).
SUM(i=1 to g) SUM(j=1 to n_i) [(r_i_j)^2 - 2*(r_i_j)*(r_hat) + (r_hat)^2]
= [SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j)^2] \
- [SUM(i=1 to g) SUM(j=1 to n_i) 2 * (r_i_j) * (r_hat)] \
+ [SUM(i=1 to g) SUM(j=1 to n_i) (r_hat)^2] (C)
The term [SUM(i=1 to g) SUM(j=1 to n_i) (r_hat)^2] simply states that (r_hat)^2 appears N times. Therefore, the term becomes N * (r_hat)^2 or N * (N + 1)^2 / 4.
Meanwhile, in the term [SUM(i=1 to g) SUM(j=1 to n_i) 2 * (r_i_j) * (r_hat)], the constant 2 * r_hat can be factored out and the remaining double sum replaced using (B). The term becomes 2 * r_hat * N * (N + 1) / 2 or 0.5 * N * (N + 1)^2.
Last but not least, consider the term [SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j)^2]. With no tied values, the ranks are exactly the integers 1 to N, so the term is basically in the form of 1^2 + 2^2 + 3^2 + ... + n^2. Recall that such a sum of squares can be expressed as n * (n + 1) * (2n + 1) / 6 where n = N.
Therefore, (C) can be expressed as the following.
[N * (N + 1) * (2N + 1) / 6] - [0.5 * N * (N + 1)^2] + [N * (N + 1)^2 / 4]
Putting the three terms over the common denominator 12 and simplifying yields the following.
{ [2 * N * (N + 1) * (2N + 1)] - [6 * N * (N + 1)^2] + [3 * N * (N + 1)^2] } / 12
{ [N + 1] * [2N(2N+1) - 6N(N+1) + 3N(N+1)] } / 12
{ [N + 1] * [N^2 - N] } / 12
{ [N + 1] * N * [N - 1] } / 12 (D)
To conclude the process above, when there are no tied values the denominator can be expressed by (D).
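The closed forms used for the three terms of (C), and the final result (D), can be checked numerically. The sketch below is only a brute-force check over a range of N, not a proof; the function name `denominator_terms` is a made-up helper. Because the denominator sums over every observation, the grouping is irrelevant here and the ranks can be treated as one list 1..N.

```python
from fractions import Fraction

def denominator_terms(N):
    """Return the three terms of (C), computed directly from the ranks 1..N."""
    r_hat = Fraction(N + 1, 2)
    ranks = range(1, N + 1)
    sum_sq = sum(r * r for r in ranks)          # SUM SUM (r_i_j)^2
    cross = sum(2 * r * r_hat for r in ranks)   # SUM SUM 2 * r_i_j * r_hat
    const = sum(r_hat ** 2 for _ in ranks)      # SUM SUM (r_hat)^2
    return sum_sq, cross, const

for N in range(2, 50):
    sum_sq, cross, const = denominator_terms(N)
    # Closed forms used in the derivation.
    assert sum_sq == Fraction(N * (N + 1) * (2 * N + 1), 6)
    assert cross == Fraction(N * (N + 1) ** 2, 2)
    assert const == Fraction(N * (N + 1) ** 2, 4)
    # The combined result (D).
    assert sum_sq - cross + const == Fraction((N + 1) * N * (N - 1), 12)

print("closed forms and (D) hold for N = 2..49")
```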
Plug (D) into the H statistic formula
Let’s plug (D) into the H statistic formula.
H = (N - 1) * SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
---------------------------------------
{ [N + 1] * N * [N - 1] } / 12
H = 12 * SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
---------------------------------------
{ [N + 1] * N }
Continuing the process yields the following.
H = [12 / (N(N+1))] * [SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2]
H = [12 / (N(N+1))] * [SUM(i=1 to g) n_i * {(r_i_hat)^2 - 2*r_i_hat*r_hat + (r_hat)^2}]
H = [12 / (N(N+1))] * [SUM(i=1 to g) {n_i * (r_i_hat)^2 - n_i * 2 * r_i_hat * r_hat + n_i * (r_hat)^2}]
H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2} - {SUM(i=1 to g) n_i * 2 * r_i_hat * r_hat} + {SUM(i=1 to g) n_i * (r_hat)^2}] (E)
Express (E) in terms of N (the total number of observations across all groups)
Recall the following before proceeding to the next step:
(X) The formula of r_i_hat is (1/n_i) * (SUM(j=1 to n_i) r_i_j)
(Y) The formula of r_hat is (N + 1) / 2
(Z) The formula from (B) is N * (N + 1) / 2 = SUM(i=1 to g) SUM(j=1 to n_i) r_i_j
We’ll expand the following terms so that they are expressed in terms of N:
(P) SUM(i=1 to g) n_i * 2 * r_i_hat * r_hat
(Q) SUM(i=1 to g) n_i * (r_hat)^2
Now let’s take a look at (P) first.
SUM(i=1 to g) n_i * 2 * r_i_hat * r_hat
Using (X) and (Y) to replace r_i_hat and r_hat yields the following:
SUM(i=1 to g) n_i * 2 * (1/n_i) * (SUM(j=1 to n_i) r_i_j) * (N + 1) / 2
SUM(i=1 to g) (SUM(j=1 to n_i) r_i_j) * (N + 1)
(N + 1) * SUM(i=1 to g) (SUM(j=1 to n_i) r_i_j)
Using (Z), the above can be expressed with the following:
N * (N + 1)^2 / 2 (F)
Next, let’s take a look at (Q).
SUM(i=1 to g) n_i * (r_hat)^2
Using (Y) to replace r_hat yields the following:
SUM(i=1 to g) n_i * (N + 1)^2 / 4
[(N + 1)^2 / 4] * SUM(i=1 to g) n_i
Note that SUM(i=1 to g) n_i is simply N, the total number of observations across all groups.
Therefore, we get:
N * (N + 1)^2 / 4 (G)
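Results (F) and (G) say that (P) and (Q) depend only on N, not on how the ranks are split into groups. The following sketch checks this on one hypothetical grouping of the ranks 1..9; the grouping itself is arbitrary and chosen only so the group sizes differ.

```python
from fractions import Fraction

# Hypothetical ranks 1..9 split into three groups of unequal size (no ties).
ranks = [[1, 5], [2, 3, 9], [4, 6, 7, 8]]
n = [len(grp) for grp in ranks]                            # n_i
N = sum(n)
r_hat = Fraction(N + 1, 2)                                 # (Y)
r_i_hat = [Fraction(sum(grp), len(grp)) for grp in ranks]  # (X)

# (P) and (Q), computed term by term from their definitions.
P = sum(n_i * 2 * m * r_hat for n_i, m in zip(n, r_i_hat))
Q = sum(n_i * r_hat ** 2 for n_i in n)

print(P == Fraction(N * (N + 1) ** 2, 2))   # compares (P) with (F)
print(Q == Fraction(N * (N + 1) ** 2, 4))   # compares (Q) with (G)
```

Rearranging the same nine ranks into different groups changes each r_i_hat but leaves both P and Q unchanged.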
Plug (F) and (G) into (E)
Finally, plug (F) and (G) into (E).
H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2} - {N * (N + 1)^2 / 2} + {N * (N + 1)^2 / 4}]
H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2} - {N * (N + 1)^2 / 4}]
H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2}] - [12 / (N * (N + 1))] * [{N * (N + 1)^2 / 4}]
H = [[12 / (N * (N + 1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2}]] - [3 * (N + 1)]
Done.
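As a final check, the sketch below compares the general formula with the no-ties formula on many random groupings of the ranks 1..N. The function names `h_general` and `h_no_ties`, the seed, and the choice of three groups are all assumptions made for this illustration. Exact fractions are used so the two results can be compared with `==`.

```python
from fractions import Fraction
import random

def h_general(ranks):
    """General H statistic from a list of groups of ranks."""
    n = [len(g) for g in ranks]
    N = sum(n)
    r_hat = Fraction(sum(map(sum, ranks)), N)
    num = (N - 1) * sum(k * (Fraction(sum(g), k) - r_hat) ** 2
                        for g, k in zip(ranks, n))
    den = sum((r - r_hat) ** 2 for g in ranks for r in g)
    return num / den

def h_no_ties(ranks):
    """H statistic via the derived formula; assumes the ranks are 1..N with no ties."""
    n = [len(g) for g in ranks]
    N = sum(n)
    s = sum(k * Fraction(sum(g), k) ** 2 for g, k in zip(ranks, n))
    return Fraction(12, N * (N + 1)) * s - 3 * (N + 1)

random.seed(1)  # arbitrary seed, only for reproducibility
for _ in range(200):
    N = random.randint(4, 30)
    perm = random.sample(range(1, N + 1), N)             # shuffled ranks 1..N
    cut1, cut2 = sorted(random.sample(range(1, N), 2))   # two distinct cut points
    ranks = [perm[:cut1], perm[cut1:cut2], perm[cut2:]]  # three non-empty groups
    assert h_general(ranks) == h_no_ties(ranks)

print("both formulas agree on all random groupings")
```

Note that `h_no_ties` relies on the ranks being exactly 1..N. If the data contains ties and average ranks are assigned, the denominator is no longer given by (D), so the two functions should not be expected to agree.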