Kruskal-Wallis Test Statistic Formula Derivation When No Tied Values Exist

In the previous post, I mentioned that the general formula of the H statistic is the following (Source: Wikipedia - Kruskal–Wallis one-way analysis of variance):

H = (N - 1) *     SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
              -----------------------------------------------
              SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2

Where:

  • g: the number of groups
  • n_i: the number of observations in group i
  • r_i_j: the rank (among all observations) of observation j from group i
  • N: the total number of observations across all groups
  • r_i_hat: the average rank of the observations in group i, given by (1/n_i) * (SUM(j=1 to n_i) r_i_j)
  • r_hat: the average of all the r_i_j, given by 0.5 * (N + 1)

In addition, if the combined observations contain no tied values (no two observations are equal), the test statistic can alternatively be expressed as follows:

H = [[12 / (N * (N + 1))] * SUM(i=1 to g) n_i * (r_i_hat)^2] - 3 * (N + 1)    --> We're going to prove this

In this post, we’re going to look at how to derive the above formula (H statistic without tied values).
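
Before working through the algebra, it can help to see both formulas agree on a small data set. The following is a minimal sketch, not a reference implementation: the data values and the helper names (rank_groups, h_general, h_no_ties) are made up for illustration, and the ranking helper assumes no two observations are equal. Exact fractions are used so that the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction

def rank_groups(groups):
    """Replace each observation by its rank (1..N) among all observations.
    Assumes no two observations are equal (no ties)."""
    pooled = sorted(x for grp in groups for x in grp)
    assert len(set(pooled)) == len(pooled), "this sketch assumes no ties"
    rank_of = {x: r for r, x in enumerate(pooled, start=1)}
    return [[rank_of[x] for x in grp] for grp in groups]

def h_general(ranked):
    """General formula: (N - 1) * between-group sum / total sum of squares."""
    N = sum(len(grp) for grp in ranked)
    r_hat = Fraction(N + 1, 2)
    num = sum(len(grp) * (Fraction(sum(grp), len(grp)) - r_hat) ** 2
              for grp in ranked)
    den = sum((r - r_hat) ** 2 for grp in ranked for r in grp)
    return (N - 1) * num / den

def h_no_ties(ranked):
    """Shortcut formula that is valid when there are no tied values."""
    N = sum(len(grp) for grp in ranked)
    s = sum(len(grp) * Fraction(sum(grp), len(grp)) ** 2 for grp in ranked)
    return Fraction(12, N * (N + 1)) * s - 3 * (N + 1)

# Hypothetical observations for three groups (all values distinct).
groups = [[2.9, 3.0, 2.5], [3.8, 2.7, 4.0, 2.4], [2.8, 3.4, 3.7]]
ranked = rank_groups(groups)

# Both formulas should give exactly the same value.
assert h_general(ranked) == h_no_ties(ranked)
print(float(h_general(ranked)))
```

The rest of the post explains why the two functions always return the same value when the ranks are exactly 1, 2, ..., N.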


Start from the denominator

We’ll start from the denominator. Let’s take a look at what the denominator would be if there are no tied values.

Expanding the denominator yields the following.

SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2

SUM(i=1 to g) SUM(j=1 to n_i) [(r_i_j)^2 - 2*(r_i_j)*(r_hat) + (r_hat)^2]        (A)

Re-structure the formula for rank average

Next, we’ll use the fact that r_hat, the average of all N ranks, equals (N + 1) / 2. Writing that average out as the sum of all ranks divided by N gives the following.

(N + 1) / 2 = [SUM(i=1 to g) SUM(j=1 to n_i) r_i_j] / N

N * (N + 1) / 2 = SUM(i=1 to g) SUM(j=1 to n_i) r_i_j            (B)
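
Identity (B) can be checked numerically. The snippet below is a small sketch under the no-ties assumption; the way the ranks 1..10 are split across three groups is arbitrary and chosen only for illustration.

```python
# Hypothetical split of the ranks 1..10 into three groups (no ties).
ranks_by_group = [[5, 6, 2], [9, 3, 10, 1], [4, 7, 8]]

N = sum(len(grp) for grp in ranks_by_group)

# The double sum over groups i and observations j of r_i_j.
double_sum = sum(sum(grp) for grp in ranks_by_group)

# (B): the sum of all ranks equals N * (N + 1) / 2,
# so the average rank r_hat equals (N + 1) / 2.
assert double_sum == N * (N + 1) // 2
assert double_sum / N == (N + 1) / 2
```

How the ranks are distributed across the groups does not matter here, because the double sum always adds up every rank from 1 to N exactly once.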

Express (A) in terms of (B)

Next, we’ll split (A) into three separate sums so that each one can be evaluated with the help of (B).

SUM(i=1 to g) SUM(j=1 to n_i) [(r_i_j)^2 - 2*(r_i_j)*(r_hat) + (r_hat)^2]

[SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j)^2] \
- [SUM(i=1 to g) SUM(j=1 to n_i) 2 * (r_i_j) * (r_hat)] \
+ [SUM(i=1 to g) SUM(j=1 to n_i) (r_hat)^2]                     (C)

The term [SUM(i=1 to g) SUM(j=1 to n_i) (r_hat)^2] simply states that (r_hat)^2 appears N times. Therefore, the term becomes N * (r_hat)^2 or N * (N + 1)^2 / 4.

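Meanwhile, in the term [SUM(i=1 to g) SUM(j=1 to n_i) 2 * (r_i_j) * (r_hat)], the factor 2 * r_hat is constant and can be pulled out of the sum. Replacing the remaining double sum with (B) gives 2 * r_hat * N * (N + 1) / 2, which equals 0.5 * N * (N + 1)^2.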

Last but not least, because there are no tied values, the ranks are exactly 1, 2, ..., N, so the term [SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j)^2] is simply 1^2 + 2^2 + 3^2 + ... + N^2. Recall that a sum of squares 1^2 + 2^2 + ... + n^2 can be expressed as n * (n + 1) * (2n + 1) / 6, here with n = N.

Therefore, (C) can be expressed as the following.

[N * (N + 1) * (2N + 1) / 6] - [0.5 * N * (N + 1)^2] + [N * (N + 1)^2 / 4]

Putting the three terms over the common denominator 12 and simplifying yields the following.

{ [2 * N * (N + 1) * (2N + 1)] - [6 * N * (N + 1)^2] + [3 * N * (N + 1)^2] } / 12

{ [N + 1] * [2N(2N+1) - 6N(N+1) + 3N(N+1)] } / 12

{ [N + 1] * [N^2 - N] } / 12

{ [N + 1] * N * [N - 1] } / 12           (D)

To summarize, when there are no tied values, the denominator equals (D).
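
As a sanity check on (D), the sketch below compares the denominator computed directly from the ranks 1..N with the closed form (N + 1) * N * (N - 1) / 12. The function names are illustrative only, and exact fractions are used because r_hat is a half-integer whenever N is even.

```python
from fractions import Fraction

def denominator_direct(N):
    """Sum of (r - r_hat)^2 over the ranks r = 1..N (no ties)."""
    r_hat = Fraction(N + 1, 2)
    return sum((r - r_hat) ** 2 for r in range(1, N + 1))

def denominator_closed_form(N):
    """Closed form (D): (N + 1) * N * (N - 1) / 12."""
    return Fraction((N + 1) * N * (N - 1), 12)

# The two expressions should agree for every N that is tried.
for N in range(2, 51):
    assert denominator_direct(N) == denominator_closed_form(N)
```

Note that the grouping does not appear at all: with no ties, the denominator depends only on N.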


Plug (D) into the H statistic formula

Let’s plug (D) into the H statistic formula. The factor (N - 1) in front cancels with the [N - 1] in the denominator.

H = (N - 1) * SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
              ---------------------------------------
                  { [N + 1] * N * [N - 1] } / 12

H = 12 * SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
         ---------------------------------------
                    { [N + 1] * N }
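
The cancellation above can also be checked numerically. The following sketch compares the general formula with the form obtained after plugging in (D), over randomly generated groupings of the ranks 1..N. The helper names and the random grouping scheme are assumptions made for this illustration.

```python
import random
from fractions import Fraction

def random_ranked_groups(N, g, rng):
    """Shuffle the ranks 1..N and cut them into g non-empty groups."""
    ranks = list(range(1, N + 1))
    rng.shuffle(ranks)
    cuts = sorted(rng.sample(range(1, N), g - 1))
    bounds = [0] + cuts + [N]
    return [ranks[a:b] for a, b in zip(bounds, bounds[1:])]

def between_group_sum(ranked, N):
    """SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2."""
    r_hat = Fraction(N + 1, 2)
    return sum(len(grp) * (Fraction(sum(grp), len(grp)) - r_hat) ** 2
               for grp in ranked)

def h_general(ranked):
    """General formula with the denominator computed from the ranks."""
    N = sum(len(grp) for grp in ranked)
    r_hat = Fraction(N + 1, 2)
    den = sum((r - r_hat) ** 2 for grp in ranked for r in grp)
    return (N - 1) * between_group_sum(ranked, N) / den

def h_after_plugging_d(ranked):
    """Form obtained after replacing the denominator with (D)."""
    N = sum(len(grp) for grp in ranked)
    return 12 * between_group_sum(ranked, N) / (N * (N + 1))

rng = random.Random(0)  # fixed seed so the check is repeatable
for _ in range(200):
    N = rng.randint(3, 30)
    g = rng.randint(2, min(5, N))
    ranked = random_ranked_groups(N, g, rng)
    assert h_general(ranked) == h_after_plugging_d(ranked)
```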

Expanding the square and splitting the sum into three separate sums yields the following.

H = [12 / (N(N+1))] * [SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2]

H = [12 / (N(N+1))] * [SUM(i=1 to g) n_i * {(r_i_hat)^2 - 2*r_i_hat*r_hat + (r_hat)^2}]

H = [12 / (N(N+1))] * [SUM(i=1 to g) {n_i * (r_i_hat)^2 - n_i * 2 * r_i_hat * r_hat + n_i * (r_hat)^2}]

H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2} - {SUM(i=1 to g) n_i * 2 * r_i_hat * r_hat} + {SUM(i=1 to g) n_i * (r_hat)^2}]           (E)

Express (E) in terms of N (the total number of observations across all groups)

Recall the following before proceeding to the next step:

(X) The formula of r_i_hat is (1/n_i) * (SUM(j=1 to n_i) r_i_j)
(Y) The formula of r_hat is (N + 1) / 2
(Z) The formula from (B) is N * (N + 1) / 2 = SUM(i=1 to g) SUM(j=1 to n_i) r_i_j

We’ll expand the following two terms of (E) so that they are expressed in terms of N only:

(P) SUM(i=1 to g) n_i * 2 * r_i_hat * r_hat
(Q) SUM(i=1 to g) n_i * (r_hat)^2

Now let’s take a look at (P) first.

SUM(i=1 to g) n_i * 2 * r_i_hat * r_hat

Using (X) and (Y) to replace r_i_hat and r_hat yields the following:

SUM(i=1 to g) n_i * 2 * (1/n_i) * (SUM(j=1 to n_i) r_i_j) * (N + 1) / 2

SUM(i=1 to g) (SUM(j=1 to n_i) r_i_j) * (N + 1)

(N + 1) * SUM(i=1 to g) (SUM(j=1 to n_i) r_i_j)

Using (Z) to replace the double sum with N * (N + 1) / 2, the above becomes:

N * (N + 1)^2 / 2           (F)

Next, let’s take a look at (Q).

SUM(i=1 to g) n_i * (r_hat)^2

Using (Y) to replace r_hat yields the following:

SUM(i=1 to g) n_i * (N + 1)^2 / 4

[(N + 1)^2 / 4] * SUM(i=1 to g) n_i

Note that SUM(i=1 to g) n_i is simply N, the total number of observations across all groups.

Therefore, we get:

N * (N + 1)^2 / 4          (G)
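
The two results (F) and (G) can be checked against a concrete grouping. This is a minimal sketch: the split of the ranks 1..10 into three groups is hypothetical, and the names P and Q simply mirror the labels used above.

```python
from fractions import Fraction

# Hypothetical split of the ranks 1..10 into three groups (no ties).
ranks_by_group = [[5, 6, 2], [9, 3, 10, 1], [4, 7, 8]]

N = sum(len(grp) for grp in ranks_by_group)
r_hat = Fraction(N + 1, 2)

# (P): SUM(i=1 to g) n_i * 2 * r_i_hat * r_hat
P = sum(len(grp) * 2 * Fraction(sum(grp), len(grp)) * r_hat
        for grp in ranks_by_group)

# (Q): SUM(i=1 to g) n_i * (r_hat)^2
Q = sum(len(grp) * r_hat ** 2 for grp in ranks_by_group)

# (F) and (G): both terms depend only on N, not on the grouping.
assert P == Fraction(N * (N + 1) ** 2, 2)
assert Q == Fraction(N * (N + 1) ** 2, 4)
```

Since (P) is exactly twice (Q), the two terms combine into a single -N * (N + 1)^2 / 4 once they are substituted into (E).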

Plug (F) and (G) into (E)

Finally, we plug (F) and (G) into (E).

H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2} - {N * (N + 1)^2 / 2} + {N * (N + 1)^2 / 4}]

H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2} - {N * (N + 1)^2 / 4}]

H = [12 / (N(N+1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2}] - [12 / (N * (N + 1))] * [{N * (N + 1)^2 / 4}]

H = [[12 / (N * (N + 1))] * [{SUM(i=1 to g) n_i * (r_i_hat)^2}]] - [3 * (N + 1)]

Done.