This page uses content from Wikipedia and is licensed under CC BY-SA.
In computer science and statistics, the Jaro–Winkler distance is a string metric measuring an edit distance between two sequences. It is a variant proposed in 1990 by William E. Winkler of the Jaro distance metric (1989, Matthew A. Jaro).
The Jaro–Winkler distance uses a prefix scale which gives more favourable ratings to strings that match from the beginning for a set prefix length $\ell$.
The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 1 equates to no similarity and 0 is an exact match. The Jaro–Winkler similarity is given by 1 − Jaro–Winkler distance.
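The Jaro similarity $sim_j$ of two given strings $s_1$ and $s_2$ is

$$sim_j = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) & \text{otherwise} \end{cases}$$

where $|s_i|$ is the length of the string $s_i$, $m$ is the number of matching characters, and $t$ is the number of transpositions.
Two characters from $s_1$ and $s_2$ respectively are considered matching only if they are the same and not farther than $\left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1$ characters apart.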
Each character of $s_1$ is compared with all its matching characters in $s_2$. The number of matching characters that appear in a different sequence order, divided by 2, defines the number of transpositions $t$. For example, in comparing CRATE with TRACE, only 'R', 'A' and 'E' are matching characters, i.e. $m = 3$. Although 'C' and 'T' appear in both strings, they are farther apart than 1 (the result of $\lfloor 5/2 \rfloor - 1$). Therefore, $t = 0$. In DwAyNE versus DuANE the matching letters are already in the same order D-A-N-E, so no transpositions are needed.
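The matching and transposition rules above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function name `jaro_similarity` is hypothetical, and the greedy choice of the first unmatched character inside the window is an assumption that follows common implementations.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity of s1 and s2, following the definition above."""
    # Two equal characters match only if their positions differ by at most
    # floor(max(|s1|, |s2|) / 2) - 1 (clamped at 0 for very short strings).
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)

    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0  # number of matching characters

    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        # Assumption: pair each character with the first unused match in range.
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break

    if m == 0:
        # Per the formula, no matches (including empty input) gives 0.
        return 0.0

    # Matched characters in the order they occur in each string.
    seq1 = [c for c, used in zip(s1, matched1) if used]
    seq2 = [c for c, used in zip(s2, matched2) if used]

    # Transpositions: out-of-order matched characters, divided by 2.
    t = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3
```

For CRATE and TRACE this gives $m = 3$ and $t = 0$, so the similarity is $\frac{1}{3}\left(\frac{3}{5} + \frac{3}{5} + \frac{3}{3}\right) \approx 0.733$.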
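Jaro–Winkler similarity uses a prefix scale $p$ which gives more favorable ratings to strings that match from the beginning for a set prefix length $\ell$. Given two strings $s_1$ and $s_2$, their Jaro–Winkler similarity $sim_w$ is:

$$sim_w = sim_j + \ell p (1 - sim_j)$$

where $sim_j$ is the Jaro similarity of $s_1$ and $s_2$, $\ell$ is the length of the common prefix at the start of the strings, up to a maximum of 4 characters, and $p$ is a constant scaling factor that controls how much the score is adjusted upwards for a common prefix. $p$ should not exceed 0.25, since otherwise the similarity could become larger than 1; the standard value in Winkler's work is $p = 0.1$.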
The Jaro–Winkler distance $d_w$ is defined as $d_w = 1 - sim_w$, where $sim_w$ is the Jaro–Winkler similarity.
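The prefix adjustment and the resulting distance can be sketched as follows. The function names are hypothetical, and the Jaro similarity is passed in as the argument `sim_j` so that the sketch stays self-contained; the defaults `p=0.1` and `max_prefix=4` are the commonly used values and should be treated as assumptions.

```python
def common_prefix_length(s1: str, s2: str, max_prefix: int = 4) -> int:
    """Length of the shared prefix of s1 and s2, capped at max_prefix."""
    length = 0
    for a, b in zip(s1, s2):
        if a != b or length == max_prefix:
            break
        length += 1
    return length


def jaro_winkler_similarity(s1: str, s2: str, sim_j: float,
                            p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the Winkler prefix boost to a precomputed Jaro similarity."""
    # Keeping p * max_prefix <= 1 keeps the result within [0, 1].
    if p < 0 or p * max_prefix > 1:
        raise ValueError("p must satisfy 0 <= p <= 1 / max_prefix")
    prefix = common_prefix_length(s1, s2, max_prefix)
    return sim_j + prefix * p * (1 - sim_j)


def jaro_winkler_distance(s1: str, s2: str, sim_j: float,
                          p: float = 0.1, max_prefix: int = 4) -> float:
    """Distance is one minus the Jaro–Winkler similarity."""
    return 1 - jaro_winkler_similarity(s1, s2, sim_j, p, max_prefix)
```

For DwAyNE and DuANE, the Jaro similarity is $\frac{1}{3}\left(\frac{4}{6} + \frac{4}{5} + \frac{4}{4}\right) \approx 0.822$ and the common prefix is the single letter D, so with $p = 0.1$ the Jaro–Winkler similarity is $0.84$ and the distance is $0.16$.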
Although often referred to as a distance metric, the Jaro–Winkler distance is not a metric in the mathematical sense of that term because it does not obey the triangle inequality. The Jaro–Winkler distance also does not satisfy the identity axiom $d(x, y) = 0 \iff x = y$.
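There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance, the Levenshtein distance allows deletion, insertion and substitution; the Damerau–Levenshtein distance additionally allows the transposition of two adjacent characters; the longest common subsequence (LCS) distance allows only insertion and deletion; and the Hamming distance allows only substitution, so it applies only to strings of the same length.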
Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.
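The idea of a parameterizable edit distance can be illustrated with a small dynamic-programming sketch in which each operation has its own cost, and an infinite cost forbids the operation. The function name and parameters are hypothetical, and the costs here are position-independent, so this does not cover alignment algorithms such as Smith–Waterman.

```python
import math


def weighted_edit_distance(a: str, b: str,
                           ins_cost: float = 1.0,
                           del_cost: float = 1.0,
                           sub_cost: float = 1.0) -> float:
    """Cheapest way to turn a into b with per-operation costs.

    Passing math.inf as a cost disallows that operation.
    """
    # prev[j] is the cost of turning the current prefix of a into b[:j].
    # Built by repeated addition so that an infinite cost never yields NaN.
    prev = [0.0]
    for _ in b:
        prev.append(prev[-1] + ins_cost)

    for i in range(1, len(a) + 1):
        cur = [prev[0] + del_cost]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            cur.append(min(
                prev[j] + del_cost,                          # delete a[i-1]
                cur[j - 1] + ins_cost,                       # insert b[j-1]
                prev[j - 1] + (0.0 if same else sub_cost),   # keep or substitute
            ))
        prev = cur
    return prev[-1]
```

With unit costs this behaves like the Levenshtein distance; setting `sub_cost=math.inf` leaves only insertion and deletion, and setting both `ins_cost` and `del_cost` to `math.inf` leaves only substitution, which returns infinity for strings of different lengths.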