Custom algorithms

The Python Record Linkage Toolkit contains several built-in algorithms for making and comparing record pairs. Sometimes these built-in algorithms do not fit your needs. The toolkit makes it easy to plug in other algorithms. This section describes how to implement custom algorithms for making and comparing record pairs. If you think your algorithm might help others, consider sharing it!

To run the examples below, import pandas and recordlinkage, and load the two datasets that make up the sample dataset FEBRL4.

In [2]:
import pandas

import recordlinkage
from recordlinkage.datasets import load_febrl4

df_a, df_b = load_febrl4()
In [3]:
df_a
Out[3]:
rec_id        given_name  surname    street_number  address_1            address_2            suburb            postcode  state  date_of_birth  soc_sec_id
rec-1070-org  michaela    neumann    8              stanley street       miami                winston hills     4223      nsw    19151111       5304218
rec-1016-org  courtney    painter    12             pinkerton circuit    bega flats           richlands         4560      vic    19161214       4066625
rec-4405-org  charles     green      38             salkauskas crescent  kela                 dapto             4566      nsw    19480930       4365168
rec-1288-org  vanessa     parr       905            macquoid place       broadbridge manor    south grafton     2135      sa     19951119       9239102
rec-3585-org  mikayla     malloney   37             randwick road        avalind              hoppers crossing  4552      vic    19860208       7207688
...           ...         ...        ...            ...                  ...                  ...               ...       ...    ...            ...
rec-2153-org  annabel     grierson   97             mclachlan crescent   lantana lodge        broome            2480      nsw    19840224       7676186
rec-1604-org  sienna      musolino   22             smeaton circuit      pangani              mckinnon          2700      nsw    19890525       4971506
rec-1003-org  bradley     matthews   2              jondol place         horseshoe ck         jacobs well       7018      sa     19481122       8927667
rec-4883-org  brodee      egan       88             axon street          greenslopes          wamberal          2067      qld    19121113       6039042
rec-66-org    koula       houweling  3              mileham street       old airdmillan road  williamstown      2350      nsw    19440718       6375537

5000 rows × 10 columns

Custom index algorithms

The Python Record Linkage Toolkit contains multiple algorithms to pair records (index algorithms) such as full indexing, blocking and sorted neighbourhood indexing. This section explains how to make and implement a custom algorithm to make record pairs.

To create a custom algorithm, make a subclass of recordlinkage.base.BaseIndexator. In your subclass, override the _link_index method. This method accepts two pandas DataFrames as arguments. Based on these DataFrames, your method must create pairs and return them as a pandas.MultiIndex whose level names are the index names of DataFrame A and DataFrame B respectively.

The algorithm for linking data frames can be used for finding duplicates as well. In this situation, DataFrame B is a copy of DataFrame A. Under the hood, the Pairs class removes pairs like (record_i, record_i) and one of each mirrored pair (record_i, record_j) and (record_j, record_i). As a result, only unique combinations are returned. If you have a specific algorithm for finding duplicates, you can override the _dedup_index method. This method accepts only one argument (DataFrame A), and the internal base class then does not filter the combinations as explained above.
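
The filtering described above can be illustrated with plain pandas. The sketch below is not the toolkit's internal code. It is a minimal illustration, assuming a small made-up DataFrame and a hypothetical helper named unique_pairs, of how self-pairs and mirrored pairs can be left out when one DataFrame is paired with itself.

```python
import numpy
import pandas

# Small made-up frame; the record ids are purely illustrative.
df = pandas.DataFrame(
    {'given_name': ['william', 'wendy', 'walter']},
    index=pandas.Index(['rec-0', 'rec-1', 'rec-2'], name='rec_id'),
)


def unique_pairs(frame):
    """Hypothetical helper: all pairs of distinct records, each pair once."""
    n = len(frame)

    # Positions of all (i, j) combinations with i < j. This skips pairs such
    # as (record_i, record_i) and keeps only one of (i, j) and (j, i).
    pos_a, pos_b = numpy.triu_indices(n, k=1)

    return pandas.MultiIndex.from_arrays(
        [frame.index.values[pos_a], frame.index.values[pos_b]],
        names=[frame.index.name + '_1', frame.index.name + '_2'],
    )


pairs = unique_pairs(df)
print(list(pairs))
```

With n records this yields n * (n - 1) / 2 pairs instead of the n * n pairs of a full product.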

Let’s make an algorithm that pairs records in which the given name in both records starts with the letter ‘w’.

In [4]:
from recordlinkage.base import BaseIndexator

class FirstLetterWIndex(BaseIndexator):
    """Custom class for indexing"""

    def _link_index(self, df_a, df_b):
        """Make pairs of all records where the given name starts with the letter 'w'."""

        # Select records with names starting with a 'w'. Missing names
        # (NaN) are treated as not matching via na=False.
        name_a_startswith_w = df_a[df_a['given_name'].str.startswith('w', na=False)]
        name_b_startswith_w = df_b[df_b['given_name'].str.startswith('w', na=False)]

        # Make a product of the two numpy arrays
        return pandas.MultiIndex.from_product(
            [name_a_startswith_w.index.values, name_b_startswith_w.index.values],
            names=[df_a.index.name, df_b.index.name]
        )
In [5]:
indexer = FirstLetterWIndex()
candidate_pairs = indexer.index(df_a, df_b)

print('Returns a', type(candidate_pairs).__name__)
print('Number of candidate record pairs starting with the letter w:', len(candidate_pairs))
Returns a MultiIndex
Number of candidate record pairs starting with the letter w: 6072

The custom index class below does not restrict the first letter to ‘w’. Instead, the first letter is an argument (named letter) that is set when the class is instantiated.

In [6]:
class FirstLetterIndex(BaseIndexator):
    """Custom class for indexing"""

    def __init__(self, letter):
        super().__init__()

        # the letter to save
        self.letter = letter

    def _link_index(self, df_a, df_b):
        """Make record pairs that agree on the first letter of the given name."""

        # Select records with names starting with the given letter.
        # Missing names (NaN) are treated as not matching via na=False.
        a_startswith_letter = df_a[df_a['given_name'].str.startswith(self.letter, na=False)]
        b_startswith_letter = df_b[df_b['given_name'].str.startswith(self.letter, na=False)]

        # Make a product of the two numpy arrays
        return pandas.MultiIndex.from_product(
            [a_startswith_letter.index.values, b_startswith_letter.index.values],
            names=[df_a.index.name, df_b.index.name]
        )
In [7]:
indexer = FirstLetterIndex('w')
candidate_pairs = indexer.index(df_a, df_b)

print('Number of record pairs (letter w):', len(candidate_pairs))
Number of record pairs (letter w): 6072
In [8]:
for letter in 'wxa':

    indexer = FirstLetterIndex(letter)
    candidate_pairs = indexer.index(df_a, df_b)

    print('Number of record pairs (letter %s):' % letter, len(candidate_pairs))
Number of record pairs (letter w): 6072
Number of record pairs (letter x): 132
Number of record pairs (letter a): 172431
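
Because the index is a full product of the two selections, the number of candidate pairs for a letter equals the number of matching records in file A times the number of matching records in file B. The sketch below checks this on two tiny made-up DataFrames with plain pandas, without any toolkit classes. The frames and the helper name pairs_for_letter are illustrative assumptions.

```python
import pandas

# Made-up data; one name is missing to show how missing values are handled.
df_x = pandas.DataFrame(
    {'given_name': ['wendy', 'anna', 'will', None]},
    index=pandas.Index(['a-0', 'a-1', 'a-2', 'a-3'], name='rec_id_a'),
)
df_y = pandas.DataFrame(
    {'given_name': ['walter', 'amy', 'alex']},
    index=pandas.Index(['b-0', 'b-1', 'b-2'], name='rec_id_b'),
)


def pairs_for_letter(df_a, df_b, letter):
    """Hypothetical helper mirroring the product step of the index class."""
    # na=False treats missing names as "does not start with the letter".
    sel_a = df_a[df_a['given_name'].str.startswith(letter, na=False)]
    sel_b = df_b[df_b['given_name'].str.startswith(letter, na=False)]
    return pandas.MultiIndex.from_product(
        [sel_a.index, sel_b.index],
        names=[df_a.index.name, df_b.index.name],
    )


pairs_w = pairs_for_letter(df_x, df_y, 'w')  # 2 records in A x 1 record in B
pairs_a = pairs_for_letter(df_x, df_y, 'a')  # 1 record in A x 2 records in B
print(len(pairs_w), len(pairs_a))
```

The same reasoning explains why a common first letter produces far more candidate pairs than a rare one: the pair count grows with the product of the two group sizes.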

Custom compare algorithms

The Python Record Linkage Toolkit holds algorithms to compare strings, numeric values and dates. These functions may not be sufficient for your record linkage. The internal framework of the toolkit makes it easy to implement custom algorithms to compare records. A custom function can be passed to the compare_vectorized method (see example below).

A custom algorithm is a function that accepts at least two arguments. The first argument is a pandas.Series with information on the variable in the first file and the second argument is a pandas.Series with information on the variable in the second file. The custom function has to return a pandas.Series, numpy.array or list with the similarity/comparison values.

The following code is a custom function to compare zipcodes. The function returns 1.0 for record pairs that agree on the zipcode and 0.0 for record pairs that disagree on the zipcode. If the zipcodes disagree but the first two digits are identical, the function returns 0.5.

In [9]:
def compare_zipcodes(s1, s2):
    """
    If the zipcodes in both records are identical, the similarity
    is 1. If the zipcodes differ but the first two digits agree, then
    the similarity is 0.5. Otherwise, the similarity is 0.
    """

    # check if the zipcodes are identical (1.0 if equal, 0.0 otherwise)
    sim = (s1 == s2).astype(float)

    # for the pairs that differ, check the first 2 digits
    sim[(sim == 0) & (s1.str[0:2] == s2.str[0:2])] = 0.5

    return sim
In [10]:
# Make an index of record pairs
pcl = recordlinkage.BlockIndex('given_name')
candidate_pairs = pcl.index(df_a, df_b)

comparer = recordlinkage.Compare()
comparer.compare_vectorized(compare_zipcodes, 'postcode', 'postcode', label='sim_postcode')
features = comparer.compute(candidate_pairs, df_a, df_b)

features['sim_postcode'].value_counts()
Out[10]:
0.0    71229
0.5     3166
1.0     2854
Name: sim_postcode, dtype: int64

As you can see, the labels of the columns are passed as arguments. The first label argument is a column label, or a list of column labels, found in the first DataFrame (postcode in this example). The second label argument is a column label, or a list of column labels, found in the second DataFrame (also postcode in this example). The recordlinkage.Compare class selects the columns with the given labels before passing them to the custom function. The compare_vectorized method passes any additional (keyword) arguments on to the custom function.
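
To make the selection step concrete, the sketch below mimics it with plain pandas on made-up data. It is an assumption about how such a step could look, not the toolkit's actual implementation: for every pair in a MultiIndex it looks up the postcode on each side, and then calls a comparison function with an extra keyword argument. The function compare_prefix and its n_digits argument are hypothetical names introduced for this illustration.

```python
import pandas

df_left = pandas.DataFrame(
    {'postcode': ['4223', '4560', '2135']},
    index=pandas.Index(['a-0', 'a-1', 'a-2'], name='rec_id_a'),
)
df_right = pandas.DataFrame(
    {'postcode': ['4223', '4566', '7018']},
    index=pandas.Index(['b-0', 'b-1', 'b-2'], name='rec_id_b'),
)

# Candidate pairs: one (left id, right id) tuple per comparison.
pairs = pandas.MultiIndex.from_tuples(
    [('a-0', 'b-0'), ('a-1', 'b-1'), ('a-2', 'b-2')],
    names=['rec_id_a', 'rec_id_b'],
)


def compare_prefix(s1, s2, n_digits=2):
    """Hypothetical comparison: 1.0 if equal, 0.5 if first n_digits agree."""
    sim = (s1 == s2).astype(float)
    sim[(sim == 0) & (s1.str[:n_digits] == s2.str[:n_digits])] = 0.5
    return sim


# Look up the column values for each side of every pair, in pair order.
s1 = df_left.loc[pairs.get_level_values(0), 'postcode'].reset_index(drop=True)
s2 = df_right.loc[pairs.get_level_values(1), 'postcode'].reset_index(drop=True)

# The extra keyword argument is simply forwarded to the comparison function.
sim = compare_prefix(s1, s2, n_digits=3)
sim.index = pairs
print(sim)
```

In this sketch the similarity values are attached to the pairs by position, which illustrates why a comparison function should return its values in the same order as the pairs it received.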

Warning: Do not change the order of the pairs in the MultiIndex.

Multicolumn comparisons

The Python Record Linkage Toolkit supports the comparison of more than two columns. This is especially useful in situations with multi-dimensional data (for example geographical coordinates) and situations where fields can be swapped.

The FEBRL4 dataset has two columns filled with address information (address_1 and address_2). In a naive approach, one compares address_1 of file A with address_1 of file B and address_2 of file A with address_2 of file B. If the values for address_1 and address_2 are swapped during the record generating process, the naive approach considers the addresses to be distinct. In a more advanced approach, address_1 of file A is compared with address_1 and address_2 of file B. Variable address_2 of file A is compared with address_1 and address_2 of file B. This is done with the single function given below.

In [11]:
def compare_addresses(s1_1, s1_2, s2_1, s2_2):
    """
    Compare addresses. Compare address_1 of file A with
    address_1 and address_2 of file B. The same for address_2
    of file A.

    """

    return ((s1_1 == s2_1) | (s1_2 == s2_2) | (s1_1 == s2_2) | (s1_2 == s2_1)).astype(float)
In [12]:
comparer = recordlinkage.Compare()

# naive
comparer.exact('address_1', 'address_1', label='sim_address_1')
comparer.exact('address_2', 'address_2', label='sim_address_2')

# better
comparer.compare_vectorized(
    compare_addresses,
    ('address_1', 'address_2'), ('address_1', 'address_2'),
    label='sim_address'
)

features = comparer.compute(candidate_pairs, df_a, df_b)

features.mean()
Out[12]:
sim_address_1    0.02488
sim_address_2    0.02025
sim_address      0.03566
dtype: float64
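
A small check on made-up values shows how the multicolumn function behaves when the two address fields are swapped. This is a sketch with plain pandas Series rather than a run on FEBRL4. The sample rows are invented, and the function is repeated so that the block runs on its own.

```python
import pandas


def compare_addresses(s1_1, s1_2, s2_1, s2_2):
    """Return 1.0 if any address field of file A equals any of file B."""
    return ((s1_1 == s2_1) | (s1_2 == s2_2) |
            (s1_1 == s2_2) | (s1_2 == s2_1)).astype(float)


# Row 0: same order, row 1: fields swapped, row 2: no field in common.
a_address_1 = pandas.Series(['stanley street', 'jondol place', 'axon street'])
a_address_2 = pandas.Series(['miami', 'horseshoe ck', 'greenslopes'])
b_address_1 = pandas.Series(['stanley street', 'horseshoe ck', 'mileham street'])
b_address_2 = pandas.Series(['miami', 'jondol place', 'pangani'])

# Naive comparison of address_1 only versus the swap-aware comparison.
naive = (a_address_1 == b_address_1).astype(float)
swap_aware = compare_addresses(a_address_1, a_address_2, b_address_1, b_address_2)

print(naive.tolist())
print(swap_aware.tolist())
```

The swapped row is missed by the naive comparison but found by the multicolumn function, which is consistent with sim_address having a higher mean than sim_address_1 and sim_address_2 in the output above.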