Performance plays an important role in record linkage. Record linkage problems scale quadratically with the size of the file(s). The number of record pairs can be enormous and so are the number of comparisons. Over the years, smart techniques are developed to reduce the number of record pairs. The Python Record Linkage Toolkit offers several effective techniques for making record pairs. Nevertheless, the Python Record Linkage Toolkit is NOT developed with speed in mind. Understandability, usability and extensibility are the main focus points in the development. This makes the toolkit very useful for linking small or medium sized files.
Okeee… There is not only bad news. The performance of many algorithms is good. Nevertheless, there are plenty of options to increase the performance of your record linkage implementation. Some methods are discussed in this topic.
Do you know more tricks? Let us know!
Block on multiple columns¶
Blocking is an effective way to increase the performance of your record linkage. If the performance of your implementation is still poor, decrease the number of pairs by blocking on multiple variables. This implies that the record pair is agrees on two or more variables. In the following example, the record pairs agree on the given name and surname.
p = recordlinkage.BlockIndex(left_on=['first_name', 'surname'], right_on=['name', 'surname']) pairs = p.index(dfA, dfB)
You might exclude more links then desired. This can be solved by repeating the process with a different combinations of blocking variables. In the end, merge the links of the two, so called, index passes.
p_surname = recordlinkage.BlockIndex(left_on=['first_name', 'surname'], right_on=['name', 'surname']) p_age = recordlinkage.BlockIndex(left_on=['first_name', 'age'], right_on=['name', 'age']) pairs_surname = p_surname.index(dfA, dfB) pairs_age = p_age.index(dfA, dfB) # make a union of the candidate links of both classes pairs_surname.union(pairs_age)
Sorted Neighbourhood indexing supports, besides the sorted neighbourhood, additional blocking on variables.
Make record pairs¶
The structure of the Python Record Linkage Toolkit has a drawback for the performance. In the indexation step (the step in which record pairs are selected), only the index of both records is stored. The entire records are not stored. This results in less memory usage. The drawback is that the records need to be queried from the data.
Compare only discriminating variables¶
Not all variables may be worth comparing in a record linkage. Some variables do not discriminate the links of the non-links or do have only minor effects. These variables can be excluded. Only discriminating and informative should be included.
Prevent string comparisons¶
String similarity measures and phonetic encodings are computationally expensive. Phonetic encoding takes place on the original data, while string simililatiry measures are applied on the record pairs. After phonetic encoding of the string variables, exact comparing can be used instead of computing the string similarity of all record pairs. If the number of candidate pairs is much larger than the number of records in both datasets together, then consider using phonetic encoding of string variables instead of string comparison.
Comparing strings is computationally expensive. The Python Record Linkage
Toolkit uses the package
jellyfish for string comparisons. The package has
two implementations, a C and a Python implementation. Ensure yourself of
having the C-version installed (
import jellyfish.cjellyfish should not
raise an exception).
There can be a large difference in the performance of different string comparison algorithms. The Jaro and Jaro-Winkler methods are faster than the Levenshtein distance and much faster than the Damerau-Levenshtein distance.
Indexing with large files¶
Sometimes, the input files are very large. In that case, it can be hard
to make an index without running out of memory in the indexing step or
in the comparing step.
recordlinkage has a method to deal with large
files. It is fast, although is not primary developed to be fast. SQL
databases may outperform this method. It is especially developed for the
useability. The idea was to spllit the input files into small blocks.
For each block the record pairs are computed. Then iterate over the
blocks. Consider full indexing:
import recordlinkage import numpy cl = recordlinkage.FullIndex() for dfB_subset in numpy.split(dfB): # a subset of record pairs pairs_subset = cl.index(dfA, dfB_subset) # Your analysis on pairs_subset here