Indexing

Migrating from recordlinkage<=0.8.2 to recordlinkage>=0.9? See the Migrating section at the end of this page.

The indexing module is used to make pairs of records. These pairs are called candidate links or candidate matches. Several indexing algorithms are available, such as blocking and sorted neighbourhood indexing. See the following references for background information about indexing.

[christen2012] Christen, P. (2012). Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media.
[christen2008] Christen, P. (2008). Febrl - A Freely Available Record Linkage System with a Graphical User Interface.
class FullIndex

Class to generate a ‘full’ index.

A full index is an index with all possible combinations of record pairs. When linking, this indexation method generates the cartesian product of both DataFrames. When deduplicating DataFrame A, it returns the pairs defined by the upper triangular matrix of A x A.

Note

This indexation method can be slow for large DataFrames. The number of comparisons scales quadratically with the number of records. Also, not all classifiers work well with large numbers of record pairs where most of the pairs are distinct.
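The two cases described above can be illustrated with a minimal, library-independent sketch. It is not the toolkit's implementation: the function name full_index_pairs is hypothetical, plain lists of tuples stand in for the pandas.MultiIndex, and the sketch assumes that self-pairs are excluded when deduplicating.

```python
from itertools import combinations, product


def full_index_pairs(labels_a, labels_b=None):
    """Return every candidate pair as a list of (label, label) tuples.

    Hypothetical helper for illustration only.
    """
    if labels_b is None:
        # Deduplication: each unordered pair once, without self-pairs
        # (the strict upper triangle of A x A; an assumption of this sketch).
        return list(combinations(labels_a, 2))
    # Linking: the cartesian product of both sets of labels.
    return list(product(labels_a, labels_b))


link_pairs = full_index_pairs(["a1", "a2", "a3"], ["b1", "b2"])
dedup_pairs = full_index_pairs(["a1", "a2", "a3", "a4"])

print(len(link_pairs))   # 3 * 2 = 6
print(len(dedup_pairs))  # 4 * 3 / 2 = 6
```

The pair counts grow as len(A) * len(B) for linking and n * (n - 1) / 2 for deduplication, which is the quadratic scaling mentioned in the note.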

index(x, x_link=None)

Make an index of record pairs.

Use a custom function to make record pairs of one or two dataframes. Each function should return a pandas.MultiIndex with record pairs.

Parameters:
  • x (pandas.DataFrame) – A pandas DataFrame. When x_link is None, the algorithm makes record pairs within the DataFrame. When x_link is not empty, the algorithm makes pairs between x and x_link.
  • x_link (pandas.DataFrame, optional) – A second DataFrame to link with the DataFrame x.
Returns:

pandas.MultiIndex – A pandas.MultiIndex with record pairs. Each record pair contains the index labels of two records.

class BlockIndex(on=None, left_on=None, right_on=None)

Make candidate record pairs that agree on one or more variables.

Returns all record pairs that agree on the given variable(s). This method is known as blocking. Blocking is an effective way to make a subset of the record space (A * B).

Parameters:
  • on (label, optional) – A column name or a list of column names. These columns are used to block on.
  • left_on (label, optional) – A column name or a list of column names of dataframe A. These columns are used to block on. This argument is ignored when argument ‘on’ is given.
  • right_on (label, optional) – A column name or a list of column names of dataframe B. These columns are used to block on. This argument is ignored when argument ‘on’ is given.

Examples

In the following example, the record pairs are made for two historical datasets with census data. The datasets are named census_data_1980 and census_data_1990.

>>> import recordlinkage
>>> indexer = recordlinkage.BlockIndex(on='first_name')
>>> indexer.index(census_data_1980, census_data_1990)
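To make the idea behind blocking concrete, here is a small sketch that uses only the standard library. It is an illustration under assumptions, not the toolkit's code: block_pairs and the toy records are hypothetical, records are plain dicts keyed by record label, and the result is a list of tuples rather than a pandas.MultiIndex.

```python
from collections import defaultdict


def block_pairs(records_a, records_b, key):
    """Pair each record in A with every record in B that has the same key value.

    Hypothetical helper; records_a and records_b map a record label
    to a dict of field values.
    """
    # Group the labels of B by their blocking-key value.
    blocks = defaultdict(list)
    for label_b, record in records_b.items():
        blocks[record[key]].append(label_b)

    # Look up the matching block for every record in A.
    pairs = []
    for label_a, record in records_a.items():
        for label_b in blocks.get(record[key], []):
            pairs.append((label_a, label_b))
    return pairs


# Toy data, invented for this example.
census_1980 = {
    "a1": {"first_name": "anna"},
    "a2": {"first_name": "john"},
    "a3": {"first_name": "jon"},
}
census_1990 = {
    "b1": {"first_name": "john"},
    "b2": {"first_name": "anna"},
    "b3": {"first_name": "mary"},
}

pairs = block_pairs(census_1980, census_1990, "first_name")
print(pairs)  # [('a1', 'b2'), ('a2', 'b1')]
```

Only 2 of the 9 possible pairs survive. Record a3 ("jon") is not paired with b1 ("john"), which shows how exact blocking drops pairs that differ by a small spelling variation.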
index(x, x_link=None)

Make an index of record pairs.

Use a custom function to make record pairs of one or two dataframes. Each function should return a pandas.MultiIndex with record pairs.

Parameters:
  • x (pandas.DataFrame) – A pandas DataFrame. When x_link is None, the algorithm makes record pairs within the DataFrame. When x_link is not empty, the algorithm makes pairs between x and x_link.
  • x_link (pandas.DataFrame, optional) – A second DataFrame to link with the DataFrame x.
Returns:

pandas.MultiIndex – A pandas.MultiIndex with record pairs. Each record pair contains the index labels of two records.

class SortedNeighbourhoodIndex(on=None, left_on=None, right_on=None, window=3, sorting_key_values=None, block_on=[], block_left_on=[], block_right_on=[])

Make candidate record pairs with the SortedNeighbourhood algorithm.

This algorithm returns record pairs that agree on the sorting key, as well as record pairs in their neighbourhood. A large window size results in more record pairs. A window size of 1 returns the blocking index.

The Sorted Neighbourhood Index method works well when the data contain a relatively large number of spelling mistakes. Blocking fails in that situation because it excludes too many records on minor spelling mistakes.

Parameters:
  • on (label) – The column name to use as the sorting key.
  • window (int, optional) – The width of the window, default is 3.
  • sorting_key_values (array, optional) – A list of sorting key values.
  • block_on (label) – Additional columns to use standard blocking on.
  • block_left_on (label) – Additional columns of the left dataframe to use standard blocking on.
  • block_right_on (label) – Additional columns of the right dataframe to use standard blocking on.

Examples

In the following example, the record pairs are made for two historical datasets with census data. The datasets are named census_data_1980 and census_data_1990.

>>> import recordlinkage
>>> indexer = recordlinkage.SortedNeighbourhoodIndex(on='first_name', window=9)
>>> indexer.index(census_data_1980, census_data_1990)
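The effect of the window can be sketched without the library. The code below is one common formulation of sorted neighbourhood indexing, written under assumptions and not taken from the toolkit: it ranks the distinct sorting-key values of both datasets and pairs records whose ranks differ by at most (window - 1) / 2. The function name, the restriction to odd windows, and the toy records are all hypothetical.

```python
def sorted_neighbourhood_pairs(records_a, records_b, key, window=3):
    """Pair records whose sorting-key values are close in sorted order.

    Hypothetical helper; records map a record label to a dict of fields.
    """
    if window < 1 or window % 2 == 0:
        # Assumption of this sketch: the window is centred on a key value.
        raise ValueError("window must be a positive odd integer")

    # Rank every distinct key value that occurs in either dataset.
    values = sorted(
        {rec[key] for rec in records_a.values()}
        | {rec[key] for rec in records_b.values()}
    )
    rank = {value: position for position, value in enumerate(values)}
    reach = (window - 1) // 2

    # Nested loops keep the sketch short; they are not meant to be efficient.
    pairs = []
    for label_a, rec_a in records_a.items():
        for label_b, rec_b in records_b.items():
            if abs(rank[rec_a[key]] - rank[rec_b[key]]) <= reach:
                pairs.append((label_a, label_b))
    return pairs


# Toy data, invented for this example.
census_1980 = {
    "a1": {"first_name": "anna"},
    "a2": {"first_name": "john"},
    "a3": {"first_name": "jon"},
}
census_1990 = {
    "b1": {"first_name": "john"},
    "b2": {"first_name": "anna"},
    "b3": {"first_name": "mary"},
}

exact = sorted_neighbourhood_pairs(census_1980, census_1990, "first_name", window=1)
wider = sorted_neighbourhood_pairs(census_1980, census_1990, "first_name", window=3)

print(exact)                 # [('a1', 'b2'), ('a2', 'b1')]
print(("a3", "b1") in wider)  # True
```

With window=1 the result equals exact blocking on first_name. With window=3 the pair (a3, b1) is added because "jon" and "john" are adjacent in the sorted key list, at the cost of extra pairs such as (a1, b1).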
index(x, x_link=None)

Make an index of record pairs.

Use a custom function to make record pairs of one or two dataframes. Each function should return a pandas.MultiIndex with record pairs.

Parameters:
  • x (pandas.DataFrame) – A pandas DataFrame. When x_link is None, the algorithm makes record pairs within the DataFrame. When x_link is not empty, the algorithm makes pairs between x and x_link.
  • x_link (pandas.DataFrame, optional) – A second DataFrame to link with the DataFrame x.
Returns:

pandas.MultiIndex – A pandas.MultiIndex with record pairs. Each record pair contains the index labels of two records.

class RandomIndex(n, replace=True, random_state=None)

Class to generate random pairs of records.

This class returns random pairs of records with or without replacement. Use the random_state parameter to seed the algorithm and reproduce results. Making record pairs this way is useful for training unsupervised learning models for record linkage.

Parameters:
  • n (int) – The number of record pairs to return. In case replace=False, the integer n should be bounded by 0 < n <= n_max where n_max is the maximum number of pairs possible.
  • replace (bool, optional) – Whether the sample of record pairs is with or without replacement. Default: True
  • random_state (int or numpy.random.RandomState, optional) – Seed for the random number generator (if int), or numpy RandomState object.
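The interplay of n, replace and the seed can be shown with a short standard-library sketch. It covers the linking case only and is not the toolkit's implementation: random_pairs is a hypothetical name, Python's random.Random stands in for numpy.random.RandomState, and n_max is taken to be len(A) * len(B).

```python
import random


def random_pairs(labels_a, labels_b, n, replace=True, seed=None):
    """Draw n random (label_a, label_b) pairs between two lists of labels.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n_max = len(labels_a) * len(labels_b)

    if replace:
        # With replacement: the same pair may be drawn more than once.
        flat = [rng.randrange(n_max) for _ in range(n)]
    else:
        if not 0 < n <= n_max:
            raise ValueError("n must satisfy 0 < n <= n_max when replace=False")
        # Without replacement: n distinct positions in the A x B grid.
        flat = rng.sample(range(n_max), n)

    # Decode each flat position into (row in A, row in B).
    width = len(labels_b)
    return [(labels_a[i // width], labels_b[i % width]) for i in flat]


labels_a = ["a1", "a2", "a3"]
labels_b = ["b1", "b2"]

first = random_pairs(labels_a, labels_b, n=4, replace=False, seed=42)
second = random_pairs(labels_a, labels_b, n=4, replace=False, seed=42)

print(first == second)       # True: the same seed reproduces the sample
print(len(set(first)) == 4)  # True: no pair is repeated
```

Sampling flat positions rather than materialising all pairs keeps memory use independent of n_max, which matters when both inputs are large.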
index(x, x_link=None)

Make an index of record pairs.

Use a custom function to make record pairs of one or two dataframes. Each function should return a pandas.MultiIndex with record pairs.

Parameters:
  • x (pandas.DataFrame) – A pandas DataFrame. When x_link is None, the algorithm makes record pairs within the DataFrame. When x_link is not empty, the algorithm makes pairs between x and x_link.
  • x_link (pandas.DataFrame, optional) – A second DataFrame to link with the DataFrame x.
Returns:

pandas.MultiIndex – A pandas.MultiIndex with record pairs. Each record pair contains the index labels of two records.

Migrating

Version 0.9 of the Python Record Linkage Toolkit introduces a new indexing API with a different syntax. In the new API, each algorithm has its own class. The following examples show how to migrate a blocking index:

Old (linking):

cl = recordlinkage.Pairs(df_a, df_b)
cl.block('given_name')

New (linking):

cl = recordlinkage.BlockIndex('given_name')
cl.index(df_a, df_b)

Old (deduplication):

cl = recordlinkage.Pairs(df_a)
cl.block('given_name')

New (deduplication):

cl = recordlinkage.BlockIndex('given_name')
cl.index(df_a)