Custom algorithms

The Python Record Linkage Toolkit contains several built-in algorithms for making record pairs and comparing record pairs. Sometimes, these built-in algorithms do not fit your needs. With the Python Record Linkage Toolkit, it is easy to use other algorithms. This section describes how to implement custom algorithms for making and comparing record pairs. If you think your algorithm might help others, consider sharing it!

To run the examples below, import pandas, recordlinkage and the two datasets belonging to the sample dataset FEBRL4.

In [2]:
import pandas

import recordlinkage as rl
from recordlinkage.datasets import load_febrl4

df_a, df_b = load_febrl4()

Custom index algorithms

The Python Record Linkage Toolkit contains multiple algorithms to pair records (index algorithms) such as full indexing, blocking and sorted neighbourhood indexing. This section explains how to make and implement a custom algorithm to make record pairs.

To create a custom algorithm, you have to make a subclass of recordlinkage.base.BaseIndexator. In your subclass, you override the _link_index method. This method accepts two pandas DataFrames as arguments. Based on these DataFrames, your method must create pairs and return them as a pandas.MultiIndex in which the MultiIndex names are the index names of DataFrame A and DataFrame B respectively.

The algorithm for linking data frames can be used for finding duplicates as well. In this situation, DataFrame B is a copy of DataFrame A. Under the hood, the Pairs class removes pairs like (record_i, record_i) and keeps only one of (record_i, record_j) and (record_j, record_i), so only unique combinations are returned. If you have a dedicated algorithm for finding duplicates, you can override the _dedup_index method instead (a sketch follows the first example below). This method accepts only one argument (DataFrame A), and the internal base class does not filter out such combinations for you.

Let’s make an algorithm that pairs all records whose given names both start with the letter ‘w’.

In [3]:
from recordlinkage.base import BaseIndexator

class FirstLetterWIndex(BaseIndexator):
    """Custom class for indexing"""

    def _link_index(self, df_a, df_b):
        """Make pairs of all records where the given name start with the letter 'W'."""

        # Select records with names starting with a w.
        name_a_startswith_w = df_a[df_a['given_name'].str.startswith('w') == True]
        name_b_startswith_w = df_b[df_b['given_name'].str.startswith('w') == True]

        # Make a product of the two numpy arrays
        return pandas.MultiIndex.from_product(
            [name_a_startswith_w.index.values, name_b_startswith_w.index.values],
            names=[df_a.index.name, df_b.index.name]
        )
In [4]:
indexer = FirstLetterWIndex()
candidate_pairs = indexer.index(df_a, df_b)

print ('Returns a', type(candidate_pairs).__name__)
print ('Number of candidate record pairs starting with the letter w:', len(candidate_pairs))
Returns a MultiIndex
Number of candidate record pairs starting with the letter w: 6072
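
The indexing class above can also be adapted for deduplication. The class below is a minimal sketch, not part of the toolkit (the name FirstLetterWDedupIndex is made up): it overrides _dedup_index and removes self-pairs and mirrored pairs itself by ordering the record identifiers, since the base class does not do this for you in that case.

class FirstLetterWDedupIndex(BaseIndexator):
    """Illustrative class for finding duplicates within one DataFrame."""

    def _dedup_index(self, df_a):
        """Pair records whose given name starts with the letter 'w'."""

        # Select records with a given name starting with a 'w'.
        startswith_w = df_a[df_a['given_name'].str.startswith('w') == True]

        # Product of the selection with itself.
        pairs = pandas.MultiIndex.from_product(
            [startswith_w.index.values, startswith_w.index.values]
        )

        # Keep each combination once and drop pairs like (record_i, record_i);
        # the base class does not filter these when _dedup_index is overridden.
        return pairs[pairs.get_level_values(0) < pairs.get_level_values(1)]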

The custom index class below does not restrict the first letter to ‘w’; instead, the first letter is an argument (named letter). This letter is set when the class is initialized.

In [5]:
class FirstLetterIndex(BaseIndexator):
    """Custom class for indexing"""

    def __init__(self, letter):
        super(FirstLetterIndex, self).__init__()

        # the letter that record pairs must agree on
        self.letter = letter

    def _link_index(self, df_a, df_b):
        """Make record pairs that agree on the first letter of the given name."""

        # Select records with given names starting with the chosen letter.
        a_startswith_letter = df_a[df_a['given_name'].str.startswith(self.letter) == True]
        b_startswith_letter = df_b[df_b['given_name'].str.startswith(self.letter) == True]

        # Make a product of the two record indexes
        return pandas.MultiIndex.from_product(
            [a_startswith_letter.index.values, b_startswith_letter.index.values],
            names=[df_a.index.name, df_b.index.name]
        )
In [6]:
indexer = FirstLetterIndex('w')
candidate_pairs = indexer.index(df_a, df_b)

print ('Number of record pairs (letter w):', len(candidate_pairs))
Number of record pairs (letter w): 6072
In [7]:
for letter in 'wxa':

    indexer = FirstLetterIndex(letter)
    candidate_pairs = indexer.index(df_a, df_b)

    print ('Number of record pairs (letter %s):' % letter, len(candidate_pairs))
Number of record pairs (letter w): 6072
Number of record pairs (letter x): 132
Number of record pairs (letter a): 172431

Custom compare algorithms

The Python Record Linkage Toolkit contains algorithms to compare strings, numeric values and dates. These functions may not be sufficient for your record linkage. The internal framework of the toolkit makes it easy to implement custom algorithms to compare records.

A custom algorithm is a function that accepts two arguments, one pandas.Series with information on the variable in the first file and one pandas.Series with information on the variable in the second file. The function returns a pandas.Series, numpy.array or list with the similarity/comparison values.

In [8]:
def compare_zipcodes(s1, s2):
    """
    If the zipcodes in both records are identical, the similarity
    is 1. If the zipcodes are not identical but the first two
    characters agree, the similarity is 0.5. Otherwise, the
    similarity is 0.
    """

    # check if the zipcodes are identical (returns 1.0 or 0.0)
    sim = (s1 == s2).astype(float)

    # for non-identical zipcodes, compare the first two characters
    sim[(sim == 0) & (s1.str[0:2] == s2.str[0:2])] = 0.5

    return sim
In [9]:
# Make an index of record pairs
pcl = rl.BlockIndex('given_name')
candidate_pairs = pcl.index(df_a, df_b)


crl = rl.Compare(candidate_pairs, df_a, df_b)

# default algorithm
crl.string('surname', 'surname', method='jarowinkler', name='sim_surname')

# custom zipcodes algorithm
# crl.compare(callable_algorithm, 'label_A', 'label_B', name='sim_name')
crl.compare(compare_zipcodes, 'postcode', 'postcode', name='sim_postcode')

crl.vectors['sim_postcode'].value_counts()
Out[9]:
0.0    71229
0.5     3166
1.0     2854
Name: sim_postcode, dtype: int64

As you can see, the column labels are passed as arguments. The first argument is a column label, or a list of column labels, found in the first DataFrame (postcode in this example). The second argument is a column label, or a list of column labels, found in the second DataFrame (also postcode in this example). The recordlinkage.Compare class selects the columns with the given labels before passing them to the custom algorithm/function. The compare method of the recordlinkage.Compare class passes additional (keyword) arguments on to the custom function.
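
Based on this behaviour, additional keyword arguments can make a custom function configurable. The sketch below is illustrative only: compare_zipcodes_partial and partial_sim are made-up names, not part of the toolkit.

def compare_zipcodes_partial(s1, s2, partial_sim=0.5):
    """Like compare_zipcodes, but the score for a partial match
    (only the first two characters agree) is an argument."""

    sim = (s1 == s2).astype(float)
    sim[(sim == 0) & (s1.str[0:2] == s2.str[0:2])] = partial_sim

    return sim

# partial_sim is not consumed by the compare method itself,
# so it is forwarded to compare_zipcodes_partial
crl.compare(compare_zipcodes_partial, 'postcode', 'postcode',
            partial_sim=0.25, name='sim_postcode_partial')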

Warning: Do not change the order of the pairs in the MultiIndex.

Multicolumn comparisons

The Python Record Linkage Toolkit supports the comparison of more than two columns. This is especially useful in situations with multi-dimensional data (for example geographical coordinates) and situations where fields can be swapped.

The FEBRL4 dataset has two columns filled with address information (address_1 and address_2). In a naive approach, one compares address_1 of file A with address_1 of file B and address_2 of file A with address_2 of file B. If the values for address_1 and address_2 are swapped during the record generating process, the naive approach considers the addresses to be distinct. In a more advanced approach, address_1 of file A is compared with address_1 and address_2 of file B. Variable address_2 of file A is compared with address_1 and address_2 of file B. This is done with the single function given below.

In [10]:
def compare_addresses(s1_1, s1_2, s2_1, s2_2):
    """
    Compare addresses. Compare address_1 of file A with
    address_1 and address_2 of file B, and the same for
    address_2 of file A.

    """

    return ((s1_1 == s2_1) | (s1_2 == s2_2) | (s1_1 == s2_2) | (s1_2 == s2_1)).astype(float)
In [11]:
crl = rl.Compare(candidate_pairs, df_a, df_b)

# naive
crl.exact('address_1', 'address_1', name='sim_address_1')
crl.exact('address_2', 'address_2', name='sim_address_2')

# better
crl.compare(compare_addresses, ['address_1', 'address_2'], ['address_1', 'address_2'], name='sim_address')

crl.vectors.mean()
Out[11]:
sim_address_1    0.02488
sim_address_2    0.02025
sim_address      0.03566
dtype: float64
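
The same multicolumn mechanism fits the multi-dimensional data mentioned above, such as geographical coordinates. The sketch below is illustrative only: FEBRL4 has no coordinate columns, so compare_coordinates and the labels 'lat' and 'lon' are assumptions, and the call is left commented out.

import numpy

def compare_coordinates(lat1, lon1, lat2, lon2, radius=0.1):
    """Similarity is 1 when two points are within `radius` degrees
    of each other (plain Euclidean distance), 0 otherwise."""

    distance = numpy.sqrt((lat1 - lat2) ** 2 + (lon1 - lon2) ** 2)
    return (distance <= radius).astype(float)

# Illustrative only; FEBRL4 does not contain coordinate columns:
# crl.compare(compare_coordinates, ['lat', 'lon'], ['lat', 'lon'], name='sim_location')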