Comparing

A set of informative, discriminating and independent features is important for classifying record pairs correctly as matching or distinct. The recordlinkage.Compare class and its methods can be used to compare record pairs. Several comparison methods are included, such as string similarity measures, numerical measures and distance measures.

class Compare(block_size=1000000)

Class to compare record pairs efficiently.

Class to compare the attributes of candidate record pairs. The Compare class has methods like string, exact and numeric to initialise the comparing of the records. The compute method is used to start the actual comparing.

Parameters: block_size (int) – The maximum size of data blocks. Default 1,000,000.

Example

Consider two historical datasets with census data to link. The datasets are named census_data_1980 and census_data_1990. The MultiIndex candidate_pairs contains the record pairs to compare. The record pairs are compared on the first name, last name, sex, date of birth, address, place, and income:

import recordlinkage

# initialise class
comp = recordlinkage.Compare()

# initialise similarity measurement algorithms
comp.string('first_name', 'name', method='jarowinkler')
comp.string('lastname', 'lastname', method='jarowinkler')
comp.exact('dateofbirth', 'dob')
comp.exact('sex', 'sex')
comp.string('address', 'address', method='levenshtein')
comp.exact('place', 'place')
comp.numeric('income', 'income')

# the method .compute() returns the DataFrame with the feature vectors.
comp.compute(candidate_pairs, census_data_1980, census_data_1990)

exact(s1, s2, agree_value=1, disagree_value=0, missing_value=0, label=None)

Compare the record pairs exactly.

This method initialises the exact similarity measurement between values. The similarity is 1 in case of agreement and 0 otherwise.

Parameters:
  • s1 (str or int) – Field name to compare in left DataFrame.
  • s2 (str or int) – Field name to compare in right DataFrame.
  • agree_value (float, str, numpy.dtype) – The value when two records are identical. Default 1. If ‘values’ is passed, then the value of the record pair is passed.
  • disagree_value (float, str, numpy.dtype) – The value when two records are not identical. Default 0.
  • missing_value (float, str, numpy.dtype) – The value for a comparison with a missing value. Default 0.
  • label (label) – The label of the column in the resulting dataframe.
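
The agree, disagree and missing values described above can be illustrated with a small pure-Python sketch. This is not the toolkit's implementation: the function name exact_similarity is hypothetical, None stands in for a missing value, and the handling of ‘values’ follows the parameter description above.

```python
def exact_similarity(left, right, agree_value=1, disagree_value=0, missing_value=0):
    """Compare two values exactly; None stands in for a missing value."""
    if left is None or right is None:
        return missing_value
    if left == right:
        # 'values' returns the agreeing value itself instead of a fixed score
        return left if agree_value == "values" else agree_value
    return disagree_value


# Three record pairs compared on a 'sex' field: agree, disagree, missing.
pairs = [("F", "F"), ("F", "M"), (None, "M")]
scores = [exact_similarity(a, b, missing_value=0.5) for a, b in pairs]
print(scores)
```

Setting missing_value to something other than 0 or 1 keeps missing comparisons distinguishable from real disagreements in the resulting feature vectors.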

string(s1, s2, method='levenshtein', threshold=None, missing_value=0, label=None)

Compute the (partial) similarity between string values.

This method initialises the similarity measurement between string values. The implemented algorithms are: ‘jaro’, ‘jarowinkler’, ‘levenshtein’, ‘damerau_levenshtein’, ‘qgram’, ‘cosine’, ‘smith_waterman’ and ‘lcs’. In case of agreement, the similarity is 1 and in case of complete disagreement it is 0. The Python Record Linkage Toolkit uses the jellyfish package for the Jaro, Jaro-Winkler, Levenshtein and Damerau-Levenshtein algorithms.

Parameters:
  • s1 (str or int) – The name or position of the column in the left DataFrame.
  • s2 (str or int) – The name or position of the column in the right DataFrame.
  • method (str, default 'levenshtein') – An approximate string comparison method. Options are [‘jaro’, ‘jarowinkler’, ‘levenshtein’, ‘damerau_levenshtein’, ‘qgram’, ‘cosine’, ‘smith_waterman’, ‘lcs’]. Default: ‘levenshtein’
  • threshold (float, tuple of floats) – A threshold value. All approximate string comparisons greater than or equal to this threshold are set to 1, all others to 0.
  • missing_value (numpy.dtype) – The value for a comparison with a missing value. Default 0.
  • label (label) – The label of the column in the resulting dataframe.
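
As an illustration of how a similarity between 0 and 1 and the threshold interact, here is a minimal pure-Python sketch of a normalised Levenshtein similarity. It assumes the edit distance is normalised by the length of the longer string; the toolkit relies on jellyfish and may normalise differently. The names levenshtein_distance and string_similarity are hypothetical.

```python
def levenshtein_distance(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(a, b, threshold=None, missing_value=0):
    """Similarity in [0, 1]; binarised when a threshold is given."""
    if a is None or b is None:
        return missing_value
    longest = max(len(a), len(b))
    # Assumption: normalise the distance by the longer string's length.
    similarity = 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest
    if threshold is not None:
        return 1 if similarity >= threshold else 0
    return similarity


print(string_similarity("kitten", "sitting"))
print(string_similarity("kitten", "sitting", threshold=0.85))
```

Without a threshold the feature keeps the partial similarity; with one, it collapses to a binary agree/disagree value.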

numeric(s1, s2, method='linear', offset, scale, origin=0, missing_value=0, label=None)

Compute the (partial) similarity between numeric values.

This method initialises the similarity measurement between numeric values. The implemented algorithms are: ‘step’, ‘linear’, ‘exp’, ‘gauss’ or ‘squared’. In case of agreement, the similarity is 1 and in case of complete disagreement it is 0. The implementation is similar to numeric comparing in ElasticSearch, a full-text search tool. The parameters are explained in the image below (source: ElasticSearch, The Definitive Guide).

[Image: Decay functions, like in ElasticSearch]

Parameters:
  • s1 (str or int) – The name or position of the column in the left DataFrame.
  • s2 (str or int) – The name or position of the column in the right DataFrame.
  • method (str) – The metric used. Options ‘step’, ‘linear’, ‘exp’, ‘gauss’ or ‘squared’. Default ‘linear’.
  • offset (float) – The offset. See image above.
  • scale (float) – The scale of the numeric comparison method. See the image above. This argument is not available for the ‘step’ algorithm.
  • origin (float) – The shift of bias between the values. See image above. Default 0.
  • missing_value (numpy.dtype) – The value if one or both records have a missing value on the compared field. Default 0.
  • label (label) – The label of the column in the resulting dataframe.
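
The roles of offset, scale and origin can be sketched in pure Python. The curves below assume the ElasticSearch convention with a decay of 0.5, so every method except ‘step’ returns 0.5 at a distance of offset + scale; the toolkit's own formulas may differ in detail. The function name numeric_similarity and the default values given to offset and scale are assumptions made for this illustration.

```python
import math


def numeric_similarity(a, b, method="linear", offset=0.0, scale=1.0,
                       origin=0.0, missing_value=0):
    """Decay-style similarity between two numbers (illustrative sketch)."""
    if a is None or b is None:
        return missing_value
    # Distance beyond the offset; anything within the offset counts as agreement.
    d = max(0.0, abs(a - b - origin) - offset)
    if method == "step":
        return 1.0 if d == 0 else 0.0  # scale is not used by 'step'
    if method == "linear":
        return max(0.0, 1.0 - d / (2.0 * scale))
    if method == "squared":
        return max(0.0, 1.0 - (d / (math.sqrt(2.0) * scale)) ** 2)
    if method == "exp":
        return 2.0 ** (-d / scale)
    if method == "gauss":
        return 2.0 ** (-((d / scale) ** 2))
    raise ValueError("unknown method: %r" % (method,))


# Incomes 3000 apart with offset=1000 and scale=2000 sit exactly one scale
# beyond the offset.
for name in ("step", "linear", "squared", "exp", "gauss"):
    print(name, numeric_similarity(50000, 53000, method=name, offset=1000, scale=2000))
```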

Note

Numeric comparing can be an efficient way to compare date/time variables. This can be done by comparing the timestamps.
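
A minimal sketch of the idea in this note, using only the standard library: two datetimes are converted to POSIX timestamps and the gap is scored with a Gaussian-style decay. The function name timestamp_similarity, the scale_days parameter and the choice of decay curve are assumptions for illustration, not part of the toolkit.

```python
from datetime import datetime, timezone


def timestamp_similarity(left, right, scale_days=30.0):
    """Gaussian-style decay on the gap between two datetimes (0.5 at scale_days)."""
    gap_seconds = abs(left.timestamp() - right.timestamp())
    gap_days = gap_seconds / 86400.0
    return 2.0 ** (-((gap_days / scale_days) ** 2))


# Timezone-aware datetimes keep the timestamps independent of the local clock.
a = datetime(1980, 5, 1, tzinfo=timezone.utc)
b = datetime(1980, 5, 31, tzinfo=timezone.utc)
print(timestamp_similarity(a, b))
```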

geo(lat1, lng1, lat2, lng2, method='linear', offset, scale, origin=0, missing_value=0, label=None)

Compute the (partial) similarity between WGS84 coordinate values.

Compare the great-circle (haversine) distance between two WGS84 coordinates. The similarity algorithms are ‘step’, ‘linear’, ‘exp’, ‘gauss’ or ‘squared’. The similarity functions are the same as in recordlinkage.comparing.Compare.numeric().

Parameters:
  • lat1 (str or int) – The name or position of the column in the left DataFrame.
  • lng1 (str or int) – The name or position of the column in the left DataFrame.
  • lat2 (str or int) – The name or position of the column in the right DataFrame.
  • lng2 (str or int) – The name or position of the column in the right DataFrame.
  • method (str) – The metric used. Options ‘step’, ‘linear’, ‘exp’, ‘gauss’ or ‘squared’. Default ‘linear’.
  • offset (float) – The offset. See Compare.numeric.
  • scale (float) – The scale of the numeric comparison method. See Compare.numeric. This argument is not available for the ‘step’ algorithm.
  • origin (float) – The shift of bias between the values. See Compare.numeric.
  • missing_value (numpy.dtype) – The value for a comparison with a missing value. Default 0.
  • label (label) – The label of the column in the resulting dataframe.
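
The haversine distance that underlies this comparison can be sketched as follows. The code assumes a spherical Earth with a mean radius of 6371 km and applies a linear decay that reaches 0.5 at offset + scale; both choices, and the names haversine_km and geo_similarity, are assumptions for illustration rather than the toolkit's implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() guards against rounding pushing the square root just above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def geo_similarity(lat1, lng1, lat2, lng2, offset=0.0, scale=1.0):
    """Linear decay on the haversine distance (0.5 at offset + scale)."""
    d = max(0.0, haversine_km(lat1, lng1, lat2, lng2) - offset)
    return max(0.0, 1.0 - d / (2.0 * scale))


print(haversine_km(0.0, 0.0, 0.0, 90.0))
print(geo_similarity(52.0, 5.0, 52.0, 5.0, offset=1.0, scale=10.0))
```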

date(s1, s2, swap_month_day=0.5, swap_months='default', missing_value=0, label=None)

Compute the (partial) similarity between date values.

Parameters:
  • s1 (str or int) – The name or position of the column in the left DataFrame.
  • s2 (str or int) – The name or position of the column in the right DataFrame.
  • swap_month_day (float) – The value if the month and day are swapped.
  • swap_months (list of tuples) – A list of tuples with common errors caused by translating months into numbers, e.g. October is month 10. The format of the tuples is (month_good, month_bad, value). Default: swap_months = [(6, 7, 0.5), (7, 6, 0.5), (9, 10, 0.5), (10, 9, 0.5)]
  • missing_value (numpy.dtype) – The value for a comparison with a missing value. Default 0.
  • label (label) – The label of the column in the resulting dataframe.
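
The partial-agreement rules described by swap_month_day and swap_months can be sketched in pure Python. This assumes that partial credit is only given when the years agree and, for month confusions, when the days agree as well; the toolkit's exact rules may differ. The name date_similarity is hypothetical.

```python
from datetime import date

DEFAULT_SWAP_MONTHS = [(6, 7, 0.5), (7, 6, 0.5), (9, 10, 0.5), (10, 9, 0.5)]


def date_similarity(left, right, swap_month_day=0.5,
                    swap_months=DEFAULT_SWAP_MONTHS, missing_value=0):
    """Score two dates: 1 if equal, partial credit for common entry errors."""
    if left is None or right is None:
        return missing_value
    if left == right:
        return 1
    if left.year != right.year:
        return 0
    # Day and month entered in the wrong order, e.g. 3 April vs 4 March.
    if left.month == right.day and left.day == right.month:
        return swap_month_day
    # Month confused with a neighbouring month while the day agrees.
    if left.day == right.day:
        for month_good, month_bad, value in swap_months:
            if left.month == month_good and right.month == month_bad:
                return value
    return 0


print(date_similarity(date(2000, 3, 4), date(2000, 4, 3)))
print(date_similarity(date(1990, 6, 15), date(1990, 7, 15)))
```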

compare_vectorized(comp_func, labels_left, labels_right, *args, **kwargs)

Compute the similarity between values with a callable.

This method initialises the comparing of values with a custom function/callable. The function/callable should accept numpy.ndarray objects.

Example

>>> comp = recordlinkage.Compare()
>>> comp.compare_vectorized(custom_callable, 'first_name', 'name')
>>> comp.compute(PAIRS, DATAFRAME1, DATAFRAME2)

Parameters:
  • comp_func (function) – A comparison function. This function can be a built-in function or a user-defined comparison function. The function should accept numpy.ndarray objects as its first two arguments.
  • labels_left (label, pandas.Series, pandas.DataFrame) – The labels, Series or DataFrame to compare.
  • labels_right (label, pandas.Series, pandas.DataFrame) – The labels, Series or DataFrame to compare.
  • *args – Additional arguments to pass to callable comp_func.
  • **kwargs – Additional keyword arguments to pass to callable comp_func. (keyword ‘label’ is reserved.)
  • label ((list of) label(s)) – The name of the feature and the name of the column. IMPORTANT: This argument is a keyword argument.
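
A custom callable of the kind described above might look like the following sketch. The function same_initial and its ignore_case keyword are hypothetical. To stay free of dependencies it iterates over any pair of equal-length sequences and returns a plain list; in practice it would receive array-like columns, and returning an array of the same length is an assumption about what the toolkit expects.

```python
def same_initial(values_left, values_right, ignore_case=True):
    """Return 1 where both strings start with the same character, else 0."""
    result = []
    for left, right in zip(values_left, values_right):
        # Treat non-strings (e.g. missing values) and empty strings as disagreement.
        if not isinstance(left, str) or not isinstance(right, str) or not left or not right:
            result.append(0)
            continue
        a, b = left[0], right[0]
        if ignore_case:
            a, b = a.lower(), b.lower()
        result.append(1 if a == b else 0)
    return result


print(same_initial(["Anna", "bob", None], ["alice", "Rob", "x"]))
```

Such a function would be registered with comp.compare_vectorized(same_initial, 'first_name', 'name', ignore_case=True, label='same_initial'), where ignore_case is forwarded to the callable through **kwargs and label names the resulting column.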

compute(pairs, x, x_link=None)

Compare the records of each record pair.

Calling this method starts the comparing of records.

Parameters:
  • pairs (pandas.MultiIndex) – A pandas MultiIndex with the record pairs to compare. The indices in the MultiIndex are indices of the DataFrame(s) to link.
  • x (pandas.DataFrame) – The DataFrame to link. If x_link is given, the comparing is a linking problem. If x_link is not given, the problem is one of deduplication.
  • x_link (pandas.DataFrame, optional) – The second DataFrame.
Returns:

pandas.DataFrame – A pandas DataFrame with feature vectors, i.e. the result of comparing each record pair.
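
Conceptually, computing the feature vectors means applying every registered comparison to every candidate pair. The sketch below shows this with plain dictionaries instead of DataFrames. The function compute_features, its comparisons argument and the sample records are all hypothetical and only mirror the linking versus deduplication behaviour described for x and x_link.

```python
def compute_features(pairs, x, x_link=None, comparisons=()):
    """Build one feature vector per record pair (illustrative sketch)."""
    # Without x_link the records are compared with themselves (deduplication).
    right = x if x_link is None else x_link
    features = {}
    for left_id, right_id in pairs:
        left_record, right_record = x[left_id], right[right_id]
        features[(left_id, right_id)] = [
            func(left_record.get(left_field), right_record.get(right_field))
            for left_field, right_field, func in comparisons
        ]
    return features


def exact(a, b):
    """1 if both values are present and equal, else 0."""
    return int(a is not None and a == b)


census_a = {0: {"first_name": "anna", "sex": "F"}, 1: {"first_name": "bob", "sex": "M"}}
census_b = {10: {"name": "anna", "sex": "F"}, 11: {"name": "rob", "sex": "M"}}

# Linking: each pair holds one index from census_a and one from census_b.
linked = compute_features(
    [(0, 10), (1, 11)], census_a, census_b,
    comparisons=[("first_name", "name", exact), ("sex", "sex", exact)],
)
# Deduplication: both indices refer to census_a.
deduplicated = compute_features([(0, 1)], census_a, comparisons=[("sex", "sex", exact)])
print(linked)
print(deduplicated)
```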

Migrating

Version 0.10 of the Python Record Linkage Toolkit uses a new API to compare record pairs. The new API uses a different syntax. Records are now compared after calling the compute method. Also, the Compare class is no longer initialized with the data and the record pairs. The data and record pairs are passed to the compute method. The old procedure still works but will be removed in the future.

Old (linking):

c = recordlinkage.Compare(candidate_links, df_a, df_b)

c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
c.exact('sex', 'gender')
c.date('dob', 'date_of_birth')
c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
c.exact('place', 'placename')
c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)

# The comparison vectors
c.vectors

New (linking):

c = recordlinkage.Compare()

c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
c.exact('sex', 'gender')
c.date('dob', 'date_of_birth')
c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
c.exact('place', 'placename')
c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)

# The comparison vectors
feature_vectors = c.compute(candidate_links, df_a, df_b)

Old (deduplication):

c = recordlinkage.Compare(candidate_links, df_a)

c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
c.exact('sex', 'gender')
c.date('dob', 'date_of_birth')
c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
c.exact('place', 'placename')
c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)

# The comparison vectors
c.vectors

New (deduplication):

c = recordlinkage.Compare()

c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
c.exact('sex', 'gender')
c.date('dob', 'date_of_birth')
c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
c.exact('place', 'placename')
c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)

# The comparison vectors
feature_vectors = c.compute(candidate_links, df_a)