Hypothesis Testing with the Kruskal-Wallis Test

The Kruskal-Wallis test is a non-parametric statistical test that is used to evaluate whether the medians of two or more groups are different. Since the test is non-parametric, it doesn’t assume that the data comes from a particular distribution.

The test statistic calculated in this test is called the H statistic.


Hypotheses

  • H0: the population medians of all groups are equal
  • H1: at least one population median differs from the others

Assumptions

  • This test is an extension of the Mann-Whitney U test and is commonly used to evaluate differences between three or more groups
  • The observations to evaluate should be on an ordinal, interval, or ratio scale
  • The observations should be independent: there should be no relationship between the members within each group or between the groups
  • All groups should have the same shape distributions (number of peaks, symmetry, skewness)

Test statistic formula

The general formula of the H statistic is the following (Source: Wikipedia - Kruskal–Wallis one-way analysis of variance):

H = (N - 1) *     SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
              -----------------------------------------------
              SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2

Where:

  • n_i: the number of observations in group i
  • r_i_j: the rank (among all observations) of observation j from group i
  • N: the total number of observations across all groups
  • r_i_hat: the average rank of all observations in group i which is given by (1/n_i) * (SUM(j=1 to n_i) r_i_j)
  • r_hat: the average of all the r_i_j which is given by 0.5 * (N + 1)
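
To make the formula concrete, here is a minimal Python sketch of it. The helper name kruskal_h_general and the use of scipy.stats.rankdata are my own choices for illustration, not part of the formula itself.

import numpy as np
from scipy.stats import rankdata

def kruskal_h_general(groups):
    """H statistic via the general (ratio-of-sums) formula."""
    combined = np.concatenate(groups)
    ranks = rankdata(combined)            # ranks among all observations (ties averaged)
    N = len(combined)
    r_bar = ranks.mean()                  # equals 0.5 * (N + 1)

    # Split the ranks back into their original groups
    sizes = [len(g) for g in groups]
    group_ranks = np.split(ranks, np.cumsum(sizes)[:-1])

    numerator = sum(len(r) * (r.mean() - r_bar) ** 2 for r in group_ranks)
    denominator = sum(((r - r_bar) ** 2).sum() for r in group_ranks)
    return (N - 1) * numerator / denominator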

In addition, if the combined observations contain no tied values, you might see the test statistic expressed alternatively as follows:

H = [[12 / (N(N + 1))] * SUM(i=1 to g) n_i * (r_i_hat)^2] - 3 * (N + 1)

Notice

I showed how to derive the above formula (the H statistic when there are no tied values) in this post.
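
For completeness, here is a sketch of the simplified no-ties formula as well (again, the helper name is my own). For data without tied values it returns the same value as the general form.

import numpy as np
from scipy.stats import rankdata

def kruskal_h_no_ties(groups):
    """H statistic via the simplified formula, valid when there are no tied values."""
    combined = np.concatenate(groups)
    ranks = rankdata(combined)
    N = len(combined)

    sizes = [len(g) for g in groups]
    group_ranks = np.split(ranks, np.cumsum(sizes)[:-1])

    term = sum(len(r) * r.mean() ** 2 for r in group_ranks)
    return 12.0 / (N * (N + 1)) * term - 3.0 * (N + 1)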


How to run the test

For the sake of clarity, let’s use the following example data for the demonstration:

Group A: 30, 40, 50, 60
Group B: 10, 20, 70, 80
Group C: 100, 200, 300, 400

Step 1. Combine the observations from all the groups and sort in ascending order

Combined observations: 30, 40, 50, 60, 10, 20, 70, 80, 100, 200, 300, 400
Sorted observations: 10, 20, 30, 40, 50, 60, 70, 80, 100, 200, 300, 400

Step 2. Assign ranks to the sorted observations

Sorted observations: 10, 20, 30, 40, 50, 60, 70, 80, 100, 200, 300, 400
Ranks: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
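
Steps 1 and 2 can be reproduced with scipy.stats.rankdata, which ranks each observation among all combined observations (the variable names below are my own):

from scipy.stats import rankdata

group_a = [30, 40, 50, 60]
group_b = [10, 20, 70, 80]
group_c = [100, 200, 300, 400]

combined = group_a + group_b + group_c
print(sorted(combined))    # [10, 20, 30, 40, 50, 60, 70, 80, 100, 200, 300, 400]
print(rankdata(combined))  # [ 3.  4.  5.  6.  1.  2.  7.  8.  9. 10. 11. 12.]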

If there are duplicate (tied) values, ranks are assigned in the following way:

  • Assign normal ranks like the one in the above example
  • Take the average of the ranks for the duplicate values

Here’s a simple example.

Sorted observations: 10, 10, 10, 30, 40, 40, 50
Normal ranks: 1, 2, 3, 4, 5, 6, 7
Averaged ranks: 2, 2, 2, 4, 5.5, 5.5, 7
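
scipy.stats.rankdata applies exactly this averaging rule by default (its default method is 'average'):

from scipy.stats import rankdata

observations = [10, 10, 10, 30, 40, 40, 50]
print(rankdata(observations))  # [2.  2.  2.  4.  5.5 5.5 7. ]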

Step 3. Calculate the H statistic

Recall that the H statistic is given by the following:

H = (N - 1) *     SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
              -----------------------------------------------
              SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2

In this case, g is the number of groups which is 3 (Group A, B, and C).

Before evaluating the full formula, let's calculate the expressions inside it.

The first one is r_i_hat, the average rank of all observations in group i. It equals (1/n_i) * (SUM(j=1 to n_i) r_i_j).

Applying the formula to our data yields the following.

r_i_j: the rank (among all observations) of observation j from group i

Group A
r_1_hat = (1/4) * (r_1_1 + r_1_2 + r_1_3 + r_1_4)
r_1_hat = (1/4) * (3 + 4 + 5 + 6) = 1/4 * 18 = 4.5

Group B
r_2_hat = (1/4) * (r_2_1 + r_2_2 + r_2_3 + r_2_4)
r_2_hat = (1/4) * (1 + 2 + 7 + 8) = 1/4 * 18 = 4.5

Group C
r_3_hat = (1/4) * (r_3_1 + r_3_2 + r_3_3 + r_3_4)
r_3_hat = (1/4) * (9 + 10 + 11 + 12) = 1/4 * 42 = 10.5

The second one is r_hat, the average of all the r_i_j. It equals 0.5 * (N + 1).

Applying the formula to our data yields the following.

r_hat = 0.5 * (12 + 1) = 0.5 * 13 = 6.5
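
These intermediate quantities are easy to double-check in Python (the variable names below are my own):

import numpy as np

rank_a = np.array([3, 4, 5, 6])     # ranks of Group A's observations
rank_b = np.array([1, 2, 7, 8])     # ranks of Group B's observations
rank_c = np.array([9, 10, 11, 12])  # ranks of Group C's observations

N = 12
print(rank_a.mean(), rank_b.mean(), rank_c.mean())  # 4.5 4.5 10.5
print(0.5 * (N + 1))                                # 6.5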

With all the above results, let’s compute the H statistic.

H = (N - 1) *     SUM(i=1 to g) n_i * (r_i_hat - r_hat)^2
              -----------------------------------------------
              SUM(i=1 to g) SUM(j=1 to n_i) (r_i_j - r_hat)^2

H = (12 - 1) * [(4 * (4.5 - 6.5)^2) + (4 * (4.5 - 6.5)^2) + (4 * (10.5 - 6.5)^2)]
               ------------------------------------------------------------------
                 [(3 - 6.5)^2 + (4 - 6.5)^2 + ... + (11 - 6.5)^2 + (12 - 6.5)^2]

H = 11 *                             [(4 * 4) + (4 * 4) + (4 * 16)]
         -----------------------------------------------------------------------------------------
         [12.25 + 6.25 + 2.25 + 0.25 + 30.25 + 20.25 + 0.25 + 2.25 + 6.25 + 12.25 + 20.25 + 30.25]

H = 11 * (96 / 143)

H = 7.3846
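
We can confirm this value with scipy.stats.kruskal, which computes the same H statistic (plus a p-value) directly from the raw observations:

from scipy.stats import kruskal

group_a = [30, 40, 50, 60]
group_b = [10, 20, 70, 80]
group_c = [100, 200, 300, 400]

statistic, p_value = kruskal(group_a, group_b, group_c)
print(statistic)  # 7.3846...
print(p_value)    # ~0.0249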

Step 4. State the conclusion.

After computing the H statistic, we compare the value to the critical value of the chi-squared distribution with g - 1 degrees of freedom (g is the number of groups; in our example above, g is 3, so there are 2 degrees of freedom) at a chosen alpha level. This critical value can be retrieved from a chi-squared distribution table. Let's denote this critical value as Hc.

If H is greater than Hc, reject the null hypothesis. Otherwise, we fail to reject it: there is not enough evidence to conclude that the population medians differ. In our example, with 2 degrees of freedom and alpha = 0.05, Hc is about 5.991; since H = 7.3846 > 5.991, we reject the null hypothesis.
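
A minimal sketch of this decision rule with scipy.stats.chi2 (alpha = 0.05 is just an illustrative choice):

from scipy.stats import chi2

H = 7.3846       # the statistic computed above
g = 3            # number of groups
alpha = 0.05     # illustrative significance level

critical_value = chi2.ppf(1 - alpha, df=g - 1)  # about 5.991 for 2 degrees of freedom
print(critical_value)
print("Reject H0" if H > critical_value else "Fail to reject H0")  # Reject H0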


References

  • Wikipedia: Kruskal–Wallis one-way analysis of variance
  • statisticshowto: Kruskal Wallis H Test: Definition, Examples & Assumptions