This page uses content from Wikipedia and is licensed under CC BY-SA.

# Kendall rank correlation coefficient

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the ordinal association between two measured quantities. A tau test is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.

It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938,[1] though Gustav Fechner had proposed a similar measure in the context of time series in 1897.[2]

Intuitively, the Kendall correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different for a correlation of -1) rank between the two variables.

Both Kendall's ${\displaystyle \tau }$ and Spearman's ${\displaystyle \rho }$ can be formulated as special cases of a more general correlation coefficient.

## Definition

Let (x1, y1), (x2, y2), …, (xn, yn) be a set of observations of the joint random variables X and Y, such that all the values of (${\displaystyle x_{i}}$) and (${\displaystyle y_{i}}$) are unique. Any pair of observations ${\displaystyle (x_{i},y_{i})}$ and ${\displaystyle (x_{j},y_{j})}$, where ${\displaystyle i<j}$, is said to be concordant if the ranks for both elements (more precisely, the sort order by x and by y) agree: that is, if both ${\displaystyle x_{i}>x_{j}}$ and ${\displaystyle y_{i}>y_{j}}$, or if both ${\displaystyle x_{i}<x_{j}}$ and ${\displaystyle y_{i}<y_{j}}$. They are said to be discordant if ${\displaystyle x_{i}>x_{j}}$ and ${\displaystyle y_{i}<y_{j}}$, or if ${\displaystyle x_{i}<x_{j}}$ and ${\displaystyle y_{i}>y_{j}}$. If ${\displaystyle x_{i}=x_{j}}$ or ${\displaystyle y_{i}=y_{j}}$, the pair is neither concordant nor discordant.

The Kendall τ coefficient is defined as:

${\displaystyle \tau ={\frac {({\text{number of concordant pairs}})-({\text{number of discordant pairs}})}{n(n-1)/2}}.}$[3]

### Properties

The denominator is the total number of pair combinations, so the coefficient must be in the range −1 ≤ τ ≤ 1.

• If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the coefficient has value 1.
• If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other) the coefficient has value −1.
• If X and Y are independent, then we would expect the coefficient to be approximately zero.
• An explicit expression for Kendall's rank coefficient is ${\displaystyle \tau ={\frac {2}{n(n-1)}}\sum _{i<j}\operatorname {sgn}(x_{i}-x_{j})\operatorname {sgn}(y_{i}-y_{j})}$.
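
The coefficient can be computed directly by summing sign products over all pairs, as in the explicit expression above. A minimal, self-contained Python sketch (function names are illustrative, and no tie correction is applied):

```python
from itertools import combinations

def sgn(v):
    """Sign function: -1, 0, or +1."""
    return (v > 0) - (v < 0)

def kendall_tau(x, y):
    """Kendall's tau via the explicit sign formula (no tie correction)."""
    n = len(x)
    # Sum sgn(x_i - x_j) * sgn(y_i - y_j) over all pairs with i < j.
    s = sum(sgn(x[i] - x[j]) * sgn(y[i] - y[j])
            for i, j in combinations(range(n), 2))
    return 2 * s / (n * (n - 1))

# Identical orderings give tau = 1; reversed orderings give tau = -1.
print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
print(kendall_tau([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
```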

## Hypothesis test

The Kendall rank coefficient is often used as a test statistic in a statistical hypothesis test to establish whether two variables may be regarded as statistically dependent. This test is non-parametric, as it does not rely on any assumptions on the distributions of X or Y or the distribution of (X,Y).

Under the null hypothesis of independence of X and Y, the sampling distribution of τ has an expected value of zero. The precise distribution cannot be characterized in terms of common distributions, but may be calculated exactly for small samples; for larger samples, it is common to use an approximation to the normal distribution, with mean zero and variance

${\displaystyle {\frac {2(2n+5)}{9n(n-1)}}}$.[4]

## Accounting for ties

A pair ${\displaystyle \{(x_{i},y_{i}),(x_{j},y_{j})\}}$ is said to be tied if ${\displaystyle x_{i}=x_{j}}$ or ${\displaystyle y_{i}=y_{j}}$; a tied pair is neither concordant nor discordant. When tied pairs arise in the data, the coefficient may be modified in a number of ways to keep it in the range [−1, 1]:

### Tau-a

The Tau-a statistic measures the strength of association in cross tabulations, where both variables must be ordinal. Tau-a makes no adjustment for ties. It is defined as:

${\displaystyle \tau _{A}={\frac {n_{c}-n_{d}}{n_{0}}}}$

where nc, nd and n0 are defined as in the next section.

### Tau-b

The Tau-b statistic, unlike Tau-a, makes adjustments for ties.[5] Values of Tau-b range from −1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.

The Kendall Tau-b coefficient is defined as:

${\displaystyle \tau _{B}={\frac {n_{c}-n_{d}}{\sqrt {(n_{0}-n_{1})(n_{0}-n_{2})}}}}$

where

${\displaystyle {\begin{aligned}n_{0}&=n(n-1)/2\\n_{1}&=\sum _{i}t_{i}(t_{i}-1)/2\\n_{2}&=\sum _{j}u_{j}(u_{j}-1)/2\\n_{c}&={\text{Number of concordant pairs}}\\n_{d}&={\text{Number of discordant pairs}}\\t_{i}&={\text{Number of tied values in the }}i^{\text{th}}{\text{ group of ties for the first quantity}}\\u_{j}&={\text{Number of tied values in the }}j^{\text{th}}{\text{ group of ties for the second quantity}}\end{aligned}}}$

Be aware that some statistical packages, e.g. SPSS, use alternative formulas for computational efficiency, with double the 'usual' number of concordant and discordant pairs.[6]
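The quantities above translate directly into code: the tie counts n1 and n2 follow from the sizes of the tie groups in each variable. A Python sketch using naive O(n²) pair counting (the function name is illustrative):

```python
from collections import Counter
from itertools import combinations

def kendall_tau_b(x, y):
    """Tau-b with tie corrections; naive O(n^2) pair counting."""
    n = len(x)
    sgn = lambda v: (v > 0) - (v < 0)
    # n_c - n_d as a sum of sign products over all pairs (ties contribute 0).
    s = sum(sgn(x[i] - x[j]) * sgn(y[i] - y[j])
            for i, j in combinations(range(n), 2))
    n0 = n * (n - 1) // 2
    # n1, n2: tied pairs within each quantity, from tie-group sizes.
    n1 = sum(t * (t - 1) // 2 for t in Counter(x).values())
    n2 = sum(u * (u - 1) // 2 for u in Counter(y).values())
    return s / ((n0 - n1) * (n0 - n2)) ** 0.5

print(kendall_tau_b([1, 1, 2, 3], [1, 2, 2, 3]))  # 0.8
```

With no ties present, n1 = n2 = 0 and the result coincides with the basic tau above.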

### Tau-c

Tau-c (also called Stuart-Kendall Tau-c)[7] is more suitable than Tau-b for the analysis of data based on non-square (i.e. rectangular) contingency tables.[7][8] Use Tau-b if the underlying scale of both variables has the same number of possible values (before ranking) and Tau-c if they differ: for instance, one variable might be scored on a 5-point scale (very good, good, average, bad, very bad), whereas the other might be based on a finer 10-point scale.

The Kendall Tau-c coefficient is defined as:[8]

${\displaystyle \tau _{C}={\frac {2(n_{c}-n_{d})}{n^{2}{\frac {(m-1)}{m}}}}}$

where

${\displaystyle {\begin{aligned}n_{c}&={\text{Number of concordant pairs}}\\n_{d}&={\text{Number of discordant pairs}}\\r&={\text{Number of rows}}\\c&={\text{Number of columns}}\\m&=\min(r,c)\end{aligned}}}$
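
Since Tau-c is defined on a contingency table, the pair counts can be computed cell by cell: for a given cell, cells below and to the right contribute concordant pairs, while cells below and to the left contribute discordant ones. A minimal Python sketch (the function name is illustrative; the nested loops are fine for small tables):

```python
def kendall_tau_c(table):
    """Stuart-Kendall Tau-c from an r x c contingency table (list of rows)."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    m = min(r, c)
    nc = nd = 0
    for i in range(r):
        for j in range(c):
            for k in range(i + 1, r):
                # Cells below-right: concordant with cell (i, j).
                for l in range(j + 1, c):
                    nc += table[i][j] * table[k][l]
                # Cells below-left: discordant with cell (i, j).
                for l in range(j):
                    nd += table[i][j] * table[k][l]
    return 2 * m * (nc - nd) / (n * n * (m - 1))

# A diagonal table is perfectly concordant.
print(kendall_tau_c([[10, 0], [0, 10]]))  # 1.0
```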

## Significance tests

When two quantities are statistically independent, the distribution of ${\displaystyle \tau }$ is not easily characterizable in terms of known distributions. However, for ${\displaystyle \tau _{A}}$ the following statistic, ${\displaystyle z_{A}}$, is approximately distributed as a standard normal when the variables are statistically independent:

${\displaystyle z_{A}={3(n_{c}-n_{d}) \over {\sqrt {n(n-1)(2n+5)/2}}}}$

Thus, to test whether two variables are statistically dependent, one computes ${\displaystyle z_{A}}$, and finds the cumulative probability for a standard normal distribution at ${\displaystyle -|z_{A}|}$. For a 2-tailed test, multiply that number by two to obtain the p-value. If the p-value is below a given significance level, one rejects the null hypothesis (at that significance level) that the quantities are statistically independent.
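The test just described can be sketched with the standard library alone, using the identity 2Φ(−|z|) = erfc(|z|/√2) for the standard normal CDF Φ (the function name is illustrative, and ties are assumed absent):

```python
import math

def kendall_pvalue(x, y):
    """Two-tailed p-value for tau_A under the normal approximation (no ties)."""
    n = len(x)
    s = 0  # n_c - n_d
    for i in range(n):
        for j in range(i + 1, n):
            s += ((x[i] > x[j]) - (x[i] < x[j])) * ((y[i] > y[j]) - (y[i] < y[j]))
    z = 3 * s / math.sqrt(n * (n - 1) * (2 * n + 5) / 2)
    # Two-tailed p-value: 2 * Phi(-|z|) == erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))
```

For ten perfectly concordant observations this gives a p-value well below 0.001, while weakly associated data yields a large p-value.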

When ties are present, several adjustments to ${\displaystyle z_{A}}$ are required. The following statistic, ${\displaystyle z_{B}}$, has the same distribution as ${\displaystyle \tau _{B}}$, and is again approximately standard normal when the quantities are statistically independent:

${\displaystyle z_{B}={n_{c}-n_{d} \over {\sqrt {v}}}}$

where

${\displaystyle {\begin{array}{ccl}v&=&(v_{0}-v_{t}-v_{u})/18+v_{1}+v_{2}\\v_{0}&=&n(n-1)(2n+5)\\v_{t}&=&\sum _{i}t_{i}(t_{i}-1)(2t_{i}+5)\\v_{u}&=&\sum _{j}u_{j}(u_{j}-1)(2u_{j}+5)\\v_{1}&=&\sum _{i}t_{i}(t_{i}-1)\sum _{j}u_{j}(u_{j}-1)/(2n(n-1))\\v_{2}&=&\sum _{i}t_{i}(t_{i}-1)(t_{i}-2)\sum _{j}u_{j}(u_{j}-1)(u_{j}-2)/(9n(n-1)(n-2))\end{array}}}$
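
A useful sanity check on the variance above: when every tie group has size 1, the terms ${\displaystyle v_{t},v_{u},v_{1},v_{2}}$ vanish and ${\displaystyle v=n(n-1)(2n+5)/18}$, so ${\displaystyle z_{B}}$ reduces to ${\displaystyle z_{A}}$. A Python sketch of the computation (the function name is illustrative):

```python
from collections import Counter
import math

def kendall_z_b(x, y):
    """z_B statistic with the tie-corrected variance v."""
    n = len(x)
    s = 0  # n_c - n_d
    for i in range(n):
        for j in range(i + 1, n):
            s += ((x[i] > x[j]) - (x[i] < x[j])) * ((y[i] > y[j]) - (y[i] < y[j]))
    t = Counter(x).values()  # tie-group sizes in x
    u = Counter(y).values()  # tie-group sizes in y
    v0 = n * (n - 1) * (2 * n + 5)
    vt = sum(ti * (ti - 1) * (2 * ti + 5) for ti in t)
    vu = sum(uj * (uj - 1) * (2 * uj + 5) for uj in u)
    v1 = (sum(ti * (ti - 1) for ti in t) * sum(uj * (uj - 1) for uj in u)
          / (2 * n * (n - 1)))
    v2 = (sum(ti * (ti - 1) * (ti - 2) for ti in t)
          * sum(uj * (uj - 1) * (uj - 2) for uj in u)
          / (9 * n * (n - 1) * (n - 2)))
    v = (v0 - vt - vu) / 18 + v1 + v2
    return s / math.sqrt(v)
```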

## Algorithms

The direct computation of the numerator ${\displaystyle n_{c}-n_{d}}$ involves two nested iterations, as characterized by the following pseudo-code:

numer := 0
for i := 2..N do
    for j := 1..(i-1) do
        numer := numer + sign(x[i] - x[j]) * sign(y[i] - y[j])
return numer


Although quick to implement, this algorithm is ${\displaystyle O(n^{2})}$ in complexity and becomes very slow on large samples. A more sophisticated algorithm[9] built upon the Merge Sort algorithm can be used to compute the numerator in ${\displaystyle O(n\cdot \log {n})}$ time.

Begin by ordering the data points, sorting by the first quantity, ${\displaystyle x}$, and secondarily (among ties in ${\displaystyle x}$) by the second quantity, ${\displaystyle y}$. With this initial ordering, ${\displaystyle y}$ is not sorted, and the core of the algorithm consists of computing how many swaps a Bubble Sort would take to sort this initial ${\displaystyle y}$. An enhanced Merge Sort, with ${\displaystyle O(n\log n)}$ complexity, can be applied to compute the number of swaps, ${\displaystyle S(y)}$, that a Bubble Sort would require to sort ${\displaystyle y}$. Then the numerator for ${\displaystyle \tau }$ is computed as:

${\displaystyle n_{c}-n_{d}=n_{0}-n_{1}-n_{2}+n_{3}-2S(y),}$

where ${\displaystyle n_{3}}$ is computed like ${\displaystyle n_{1}}$ and ${\displaystyle n_{2}}$, but with respect to the joint ties in ${\displaystyle x}$ and ${\displaystyle y}$.

A Merge Sort partitions the data to be sorted, ${\displaystyle y}$, into two roughly equal halves, ${\displaystyle y_{\mathrm {left} }}$ and ${\displaystyle y_{\mathrm {right} }}$, sorts each half recursively, and then merges the two sorted halves into a fully sorted vector. The number of Bubble Sort swaps is equal to:

${\displaystyle S(y)=S(y_{\mathrm {left} })+S(y_{\mathrm {right} })+M(Y_{\mathrm {left} },Y_{\mathrm {right} })}$

where ${\displaystyle Y_{\mathrm {left} }}$ and ${\displaystyle Y_{\mathrm {right} }}$ are the sorted versions of ${\displaystyle y_{\mathrm {left} }}$ and ${\displaystyle y_{\mathrm {right} }}$, and ${\displaystyle M(\cdot ,\cdot )}$ characterizes the Bubble Sort swap-equivalent for a merge operation. ${\displaystyle M(\cdot ,\cdot )}$ is computed as depicted in the following pseudo-code:

function M(L[1..n], R[1..m])
    i := 1
    j := 1
    nSwaps := 0
    while i <= n and j <= m do
        if R[j] < L[i] then
            nSwaps := nSwaps + n - i + 1
            j := j + 1
        else
            i := i + 1
    return nSwaps


A side effect of the above steps is that both a sorted version of ${\displaystyle x}$ and a sorted version of ${\displaystyle y}$ are obtained. With these, the factors ${\displaystyle t_{i}}$ and ${\displaystyle u_{j}}$ used to compute ${\displaystyle \tau _{B}}$ are easily obtained in a single linear-time pass through the sorted arrays.
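The merge-based computation described above can be sketched in Python: `merge_count` returns the sorted sequence together with the Bubble Sort swap count, and `kendall_numerator` combines it with the tie counts ${\displaystyle n_{1},n_{2},n_{3}}$ (function names are illustrative):

```python
def merge_count(y):
    """Return (sorted copy of y, Bubble Sort swaps needed to sort y)."""
    n = len(y)
    if n <= 1:
        return list(y), 0
    mid = n // 2
    left, s_left = merge_count(y[:mid])
    right, s_right = merge_count(y[mid:])
    merged, swaps = [], s_left + s_right
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            swaps += len(left) - i  # right[j] hops over each remaining left element
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, swaps

def tie_pairs(seq):
    """Sum t*(t-1)/2 over runs of equal values in an already-sorted sequence."""
    total, i = 0, 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        total += (j - i) * (j - i - 1) // 2
        i = j
    return total

def kendall_numerator(x, y):
    """n_c - n_d in O(n log n) via the merge-sort swap count."""
    n = len(x)
    pairs = sorted(zip(x, y))           # sort by x, breaking ties by y
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n0 = n * (n - 1) // 2
    n1 = tie_pairs(xs)                  # pairs tied in x
    n3 = tie_pairs(pairs)               # pairs tied in both x and y
    ys_sorted, swaps = merge_count(ys)  # S(y): Bubble Sort swaps to sort y
    n2 = tie_pairs(ys_sorted)           # pairs tied in y
    return n0 - n1 - n2 + n3 - 2 * swaps
```

For data with no ties, the result agrees with the naive double loop; for example, fully reversed rankings of length ${\displaystyle n}$ give ${\displaystyle -n(n-1)/2}$.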

## References

1. ^ Kendall, M. (1938). "A New Measure of Rank Correlation". Biometrika. 30 (1–2): 81–89. doi:10.1093/biomet/30.1-2.81. JSTOR 2332226.
2. ^ Kruskal, W.H. (1958). "Ordinal Measures of Association". Journal of the American Statistical Association. 53 (284): 814–861. doi:10.2307/2281954. JSTOR 2281954. MR 0100941.
3. ^ Nelsen, R.B. (2001) [1994], "Kendall tau metric", in Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN 978-1-55608-010-4
4. ^ Prokhorov, A.V. (2001) [1994], "Kendall coefficient of rank correlation", in Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN 978-1-55608-010-4
5. ^ Agresti, A. (2010). Analysis of Ordinal Categorical Data (Second ed.). New York: John Wiley & Sons. ISBN 978-0-470-08289-8.
6. ^ IBM (2016). IBM SPSS Statistics 24 Algorithms. IBM. p. 168. Retrieved 31 August 2017.
7. ^ a b Berry, K. J.; Johnston, J. E.; Zahran, S.; Mielke, P. W. (2009). "Stuart's tau measure of effect size for ordinal variables: Some methodological considerations". Behavior Research Methods. 41 (4): 1144–1148. doi:10.3758/brm.41.4.1144. PMID 19897822.
8. ^ a b Stuart, A. (1953). "The Estimation and Comparison of Strengths of Association in Contingency Tables". Biometrika. 40 (1–2): 105–110. doi:10.2307/2333101. JSTOR 2333101.
9. ^ Knight, W. (1966). "A Computer Method for Calculating Kendall's Tau with Ungrouped Data". Journal of the American Statistical Association. 61 (314): 436–439. doi:10.2307/2282833. JSTOR 2282833.