The Python Record Linkage Toolkit contains several built-in algorithms for making and comparing record pairs. Sometimes these built-in algorithms do not fit your needs. The toolkit makes it easy to plug in other algorithms. This section describes how to implement custom algorithms for making and comparing record pairs. If you think your algorithm might help others, consider sharing it!
To run the examples below, import pandas and recordlinkage, and load the two datasets that make up the FEBRL4 sample dataset.
import pandas

import recordlinkage as rl
from recordlinkage.datasets import load_febrl4

df_a, df_b = load_febrl4()
Custom index algorithms
The Python Record Linkage Toolkit contains multiple algorithms to pair records (index algorithms), such as full indexing, blocking and sorted neighbourhood indexing. This section explains how to implement a custom algorithm for making record pairs.
To create a custom algorithm, make a subclass of recordlinkage.base.BaseIndexator. In your subclass, override the _link_index method. This method accepts two pandas DataFrames as arguments. Based on these DataFrames, your method must create pairs and return them as a pandas.MultiIndex whose level names are the index names of DataFrame A and DataFrame B, respectively.
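To make the expected return value concrete, here is a minimal sketch that uses only pandas and no toolkit classes. The toy frames and the index names id_a and id_b are hypothetical; the sketch only illustrates a MultiIndex whose level names are taken from the two DataFrame indexes.

```python
import pandas as pd

# Two toy frames; the index names are hypothetical stand-ins for record ids.
df_a = pd.DataFrame(
    {"given_name": ["william", "anna"]},
    index=pd.Index(["a1", "a2"], name="id_a"),
)
df_b = pd.DataFrame(
    {"given_name": ["wendy", "walt", "bob"]},
    index=pd.Index(["b1", "b2", "b3"], name="id_b"),
)

# Pair every record in A with every record in B and name the levels
# after the index names of the two frames.
pairs = pd.MultiIndex.from_product(
    [df_a.index, df_b.index],
    names=[df_a.index.name, df_b.index.name],
)

print(list(pairs.names))
print(len(pairs))
```

A custom _link_index method would return an object of this shape, usually restricted to a much smaller set of pairs.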
The algorithm for linking DataFrames can also be used for finding duplicates. In that case, DataFrame B is a copy of DataFrame A. Under the hood, the Pairs class removes pairs like (record_i, record_i) and keeps only one of (record_i, record_j) and (record_j, record_i). As a result, only unique combinations are returned. If you have a specific algorithm for finding duplicates, you can override the _dedup_index method instead. This method accepts only one argument (DataFrame A), and the internal base class does not filter the pairs as described above.
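The filtering of self-pairs and mirrored pairs can be illustrated with plain pandas. This is a sketch of the idea only, not the toolkit's internal implementation. It assumes the record ids can be ordered, and the level names rec_id_1 and rec_id_2 are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"given_name": ["william", "wendy", "walt"]},
    index=pd.Index([10, 11, 12], name="rec_id"),
)

# Pair the frame with itself: 3 x 3 = 9 pairs, including (i, i) and
# both (i, j) and (j, i).
all_pairs = pd.MultiIndex.from_product(
    [df.index, df.index], names=["rec_id_1", "rec_id_2"]
)

# Keep only pairs whose first id is strictly smaller than the second.
# This drops (i, i) and keeps exactly one of (i, j) and (j, i).
left = all_pairs.get_level_values(0)
right = all_pairs.get_level_values(1)
unique_pairs = all_pairs[left < right]

print(unique_pairs.tolist())
```

For n records this leaves n * (n - 1) / 2 pairs, which is 3 for the toy frame above.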
Let’s make an algorithm that pairs records in which the given name in both records starts with the letter ‘w’.
from recordlinkage.base import BaseIndexator


class FirstLetterWIndex(BaseIndexator):
    """Custom class for indexing"""

    def _link_index(self, df_a, df_b):
        """Make pairs of all records where the given name starts with the letter 'w'."""

        # Select records with names starting with a 'w' (missing names never match).
        name_a_startswith_w = df_a[df_a['given_name'].str.startswith('w', na=False)]
        name_b_startswith_w = df_b[df_b['given_name'].str.startswith('w', na=False)]

        # Make a product of the two numpy arrays
        return pandas.MultiIndex.from_product(
            [name_a_startswith_w.index.values, name_b_startswith_w.index.values],
            names=[df_a.index.name, df_b.index.name]
        )
indexer = FirstLetterWIndex()
candidate_pairs = indexer.index(df_a, df_b)

print('Returns a', type(candidate_pairs).__name__)
print('Number of candidate record pairs starting with the letter w:', len(candidate_pairs))

Returns a MultiIndex
Number of candidate record pairs starting with the letter w: 6072
The custom index class below does not restrict the first letter to ‘w’; instead, the first letter is an argument (named letter). The letter is stored when the class is initialized.
class FirstLetterIndex(BaseIndexator):
    """Custom class for indexing"""

    def __init__(self, letter):
        super(FirstLetterIndex, self).__init__()

        # the letter to save
        self.letter = letter

    def _link_index(self, df_a, df_b):
        """Make record pairs in which both given names start with the stored letter."""

        # Select records with names starting with 'letter' (missing names never match).
        a_startswith_letter = df_a[df_a['given_name'].str.startswith(self.letter, na=False)]
        b_startswith_letter = df_b[df_b['given_name'].str.startswith(self.letter, na=False)]

        # Make a product of the two numpy arrays
        return pandas.MultiIndex.from_product(
            [a_startswith_letter.index.values, b_startswith_letter.index.values],
            names=[df_a.index.name, df_b.index.name]
        )
indexer = FirstLetterIndex('w')
candidate_pairs = indexer.index(df_a, df_b)

print('Number of record pairs (letter w):', len(candidate_pairs))

Number of record pairs (letter w): 6072

for letter in 'wxa':
    indexer = FirstLetterIndex(letter)
    candidate_pairs = indexer.index(df_a, df_b)

    print('Number of record pairs (letter %s):' % letter, len(candidate_pairs))

Number of record pairs (letter w): 6072
Number of record pairs (letter x): 132
Number of record pairs (letter a): 172431
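Because the index is a full product of the matching records in both files, the number of candidate pairs equals the product of the two match counts. The sketch below checks this on toy data. The helper count_first_letter_pairs is hypothetical and not part of the toolkit.

```python
import pandas as pd


def count_first_letter_pairs(names_a, names_b, letter):
    """Hypothetical helper: size of the cross product of matching records."""
    n_a = names_a.str.startswith(letter, na=False).sum()
    n_b = names_b.str.startswith(letter, na=False).sum()
    return int(n_a * n_b)


# Toy given names; None stands for a missing value and never matches.
names_a = pd.Series(["william", "wendy", None, "anna"])
names_b = pd.Series(["walt", "xavier", "will"])

for letter in "wxa":
    print(letter, count_first_letter_pairs(names_a, names_b, letter))
```

Counting this way is a cheap check on how many pairs a letter would produce before the full MultiIndex is built.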
Custom compare algorithms
The Python Record Linkage Toolkit holds algorithms to compare strings, numeric values and dates. These functions may not be sufficient for your record linkage. The internal framework of the toolkit makes it easy to implement custom algorithms to compare records.
A custom algorithm is a function that accepts two arguments: one pandas.Series with the values of the variable in the first file and one pandas.Series with the values of the variable in the second file. The function returns a numpy.array or list with the similarity/comparison values.
def compare_zipcodes(s1, s2):
    """
    If the zipcodes in both records are identical, the similarity
    is 1. If the first two values agree and the last two don't, then
    the similarity is 0.5. Otherwise, the similarity is 0.
    """

    # check if the zipcodes are identical (return 1 or 0)
    sim = (s1 == s2).astype(float)

    # check the first 2 numbers of the distinct comparisons
    sim[(sim == 0) & (s1.str[0:2] == s2.str[0:2])] = 0.5

    return sim
# Make an index of record pairs
pcl = rl.BlockIndex('given_name')
candidate_pairs = pcl.index(df_a, df_b)

crl = rl.Compare(candidate_pairs, df_a, df_b)

# default algorithm
crl.string('surname', 'surname', method='jarowinkler', name='sim_surname')

# custom zipcodes algorithm
# crl.compare(callable_algorithm, 'label_A', 'label_B', name='sim_name')
crl.compare(compare_zipcodes, 'postcode', 'postcode', name='sim_postcode')

crl.vectors['sim_postcode'].value_counts()

0.0    71229
0.5     3166
1.0     2854
Name: sim_postcode, dtype: int64
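To see the three possible scores on a small scale, the function can be called directly on two toy Series, without the toolkit. The function body is repeated here only to keep the sketch self-contained, and the postcodes are made up.

```python
import pandas as pd


def compare_zipcodes(s1, s2):
    """1.0 for identical zipcodes, 0.5 if only the first two characters agree, else 0.0."""
    sim = (s1 == s2).astype(float)
    sim[(sim == 0) & (s1.str[0:2] == s2.str[0:2])] = 0.5
    return sim


# Made-up postcodes stored as strings, one row per candidate pair.
s1 = pd.Series(["2600", "2615", "3000"])
s2 = pd.Series(["2600", "2602", "4000"])

scores = compare_zipcodes(s1, s2)
print(scores.tolist())  # [1.0, 0.5, 0.0]
```

The first pair is identical, the second pair shares only the prefix 26, and the third pair shares nothing.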
As you can see, the column labels are passed as arguments. The first label argument is a column label, or a list of column labels, found in the first DataFrame (postcode in this example). The second label argument is a column label, or a list of column labels, found in the second DataFrame (postcode in this example). The recordlinkage.Compare class selects the columns with the given labels before passing them to the custom function. The compare method of the recordlinkage.Compare class passes any additional (keyword) arguments on to the custom function.
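Forwarded keyword arguments make it possible to parameterize a custom function. The sketch below is a hypothetical variation on the zipcode comparison: the names compare_prefix, n_chars and partial are assumptions made for illustration. The function is called directly on toy Series here; with the toolkit, the extra keyword arguments would be supplied through the compare method as described above.

```python
import pandas as pd


def compare_prefix(s1, s2, n_chars=2, partial=0.5):
    """Hypothetical comparison: 1.0 if equal, `partial` if the first
    `n_chars` characters agree, 0.0 otherwise."""
    sim = (s1 == s2).astype(float)
    same_prefix = s1.str[:n_chars] == s2.str[:n_chars]
    sim[(sim == 0) & same_prefix] = partial
    return sim


s1 = pd.Series(["2600", "2615", "3000"])
s2 = pd.Series(["2600", "2602", "3999"])

# Default parameters: a two-character prefix scores 0.5.
default_scores = compare_prefix(s1, s2)

# Looser prefix and a lower partial score, passed as keyword arguments.
loose_scores = compare_prefix(s1, s2, n_chars=1, partial=0.25)

print(default_scores.tolist())  # [1.0, 0.5, 0.0]
print(loose_scores.tolist())    # [1.0, 0.25, 0.25]
```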
Warning: Do not change the order of the pairs in the MultiIndex.
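The sketch below illustrates why the order matters. It assumes that returned values are matched to record pairs by position, which is necessarily the case for a list or NumPy array. Both functions and the toy pairs are hypothetical.

```python
import pandas as pd

# Three toy candidate pairs with made-up record ids.
pairs = pd.MultiIndex.from_tuples(
    [("a1", "b1"), ("a2", "b2"), ("a3", "b3")], names=["id_a", "id_b"]
)
s1 = pd.Series(["2600", "2615", "2600"], index=pairs)
s2 = pd.Series(["2600", "2602", "2600"], index=pairs)


def exact_ok(s1, s2):
    # Values stay in the order of the incoming pairs.
    return (s1 == s2).astype(float).to_numpy()


def exact_reordered(s1, s2):
    # Deliberate mistake: sorting changes the order of the pairs.
    return (s1 == s2).astype(float).sort_values().to_numpy()


print(exact_ok(s1, s2).tolist())         # [1.0, 0.0, 1.0]
print(exact_reordered(s1, s2).tolist())  # [0.0, 1.0, 1.0]
```

With the reordered result, the identical pair (a1, b1) would be scored 0.0 and the differing pair (a2, b2) would be scored 1.0.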
The Python Record Linkage Toolkit supports the comparison of more than two columns. This is especially useful in situations with multi-dimensional data (for example geographical coordinates) and situations where fields can be swapped.
The FEBRL4 dataset has two columns filled with address information (address_1 and address_2). In a naive approach, one compares address_1 of file A with address_1 of file B and address_2 of file A with address_2 of file B. If the values for address_1 and address_2 are swapped during the record generating process, the naive approach considers the addresses to be distinct. In a more robust approach, address_1 of file A is compared with both address_1 and address_2 of file B, and address_2 of file A is likewise compared with both address_1 and address_2 of file B. This is done with the single function given below.
def compare_addresses(s1_1, s1_2, s2_1, s2_2):
    """
    Compare addresses. Compare address_1 of file A with
    address_1 and address_2 of file B. The same for address_2
    of file A.
    """

    return ((s1_1 == s2_1) | (s1_2 == s2_2) | (s1_1 == s2_2) | (s1_2 == s2_1)).astype(float)
crl = rl.Compare(candidate_pairs, df_a, df_b)

# naive
crl.exact('address_1', 'address_1', name='sim_address_1')
crl.exact('address_2', 'address_2', name='sim_address_2')

# better
crl.compare(compare_addresses, ['address_1', 'address_2'], ['address_1', 'address_2'], name='sim_address')

crl.vectors.mean()

sim_address_1    0.02488
sim_address_2    0.02025
sim_address      0.03566
dtype: float64
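The effect of swapped fields is easier to see on a handful of rows. The sketch below calls the address comparison directly on made-up toy Series, with the function repeated for self-containment. Row 0 is identical in both files, row 1 does not match at all, and row 2 has its two address fields swapped.

```python
import pandas as pd


def compare_addresses(s1_1, s1_2, s2_1, s2_2):
    """1.0 if any address field of file A equals any address field of file B."""
    return ((s1_1 == s2_1) | (s1_2 == s2_2) |
            (s1_1 == s2_2) | (s1_2 == s2_1)).astype(float)


# Made-up address fields for three candidate pairs.
a_address_1 = pd.Series(["main st", "oak ave", "elm rd"])
a_address_2 = pd.Series(["unit 1", "flat 2", "lot 3"])
b_address_1 = pd.Series(["main st", "flat 9", "lot 3"])
b_address_2 = pd.Series(["unit 1", "pine ct", "elm rd"])

# Naive field-by-field comparison.
naive_1 = (a_address_1 == b_address_1).astype(float)
naive_2 = (a_address_2 == b_address_2).astype(float)

# Comparison that also checks the swapped combinations.
swapped_ok = compare_addresses(a_address_1, a_address_2, b_address_1, b_address_2)

print(naive_1.tolist())     # [1.0, 0.0, 0.0]
print(naive_2.tolist())     # [1.0, 0.0, 0.0]
print(swapped_ok.tolist())  # [1.0, 0.0, 1.0]
```

Only the combined function recognizes row 2 as a match, which is consistent with sim_address having a higher mean than either naive column in the output above.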