Comparing

The recordlinkage.Compare class and its methods can be used to compare record pairs. Several comparison methods are included, such as string similarity measures, numeric measures and distance measures.

class Compare(pairs, df_a=None, df_b=None, low_memory=False, block_size=1000000, njobs=1, **kwargs)

Compare record pairs with the tools in this class.

Class to compare the attributes of candidate record pairs. The Compare class has several methods to compare data such as string similarity measures, numeric metrics and exact comparison methods.

Parameters:
  • pairs (pandas.MultiIndex) – A MultiIndex of candidate record pairs.
  • df_a (pandas.DataFrame) – The first dataframe.
  • df_b (pandas.DataFrame) – The second dataframe.
  • low_memory (bool) – Reduce the amount of memory used by the Compare class. Default False.
  • block_size (int) – The maximum size of data blocks. Default 1,000,000.
pairs

pandas.MultiIndex – The candidate record pairs.

df_a

pandas.DataFrame – The first DataFrame.

df_b

pandas.DataFrame – The second DataFrame.

vectors

pandas.DataFrame – The DataFrame with comparison data.

Examples

In the following example, the record pairs of two historical datasets with census data are compared. The datasets are named census_data_1980 and census_data_1990. The candidate_pairs are the record pairs to compare. The record pairs are compared on the first name, last name, sex, date of birth, address, place, and income.

>>> import recordlinkage
>>> comp = recordlinkage.Compare(
    candidate_pairs, census_data_1980, census_data_1990
    )
>>> comp.string('first_name', 'name', method='jarowinkler')
>>> comp.string('lastname', 'lastname', method='jarowinkler')
>>> comp.exact('dateofbirth', 'dob')
>>> comp.exact('sex', 'sex')
>>> comp.string('address', 'address', method='levenshtein')
>>> comp.exact('place', 'place')
>>> comp.numeric('income', 'income')
>>> print(comp.vectors.head())

The vectors attribute is the DataFrame with the comparison data. It can be accessed at any time, including between comparison calls.

exact(s1, s2, agree_value=1, disagree_value=0, missing_value=0, name=None, store=True)

Compare the record pairs exactly.

Parameters:
  • s1 (label, pandas.Series) – The column label in the first dataframe, or a Series or DataFrame with the fields to compare.
  • s2 (label, pandas.Series) – The column label in the second dataframe, or a Series or DataFrame with the fields to compare.
  • agree_value (float, str, numpy.dtype) – The value when two records are identical. Default 1. If ‘values’ is passed, the value of the record pair itself is returned.
  • disagree_value (float, str, numpy.dtype) – The value when two records are not identical. Default 0.
  • missing_value (float, str, numpy.dtype) – The value for a comparison with a missing value. Default 0.
  • name (label) – The name of the feature and the name of the column.
  • store (bool) – Store the result in the dataframe. Default True.
Returns:

pandas.Series – A pandas series with the result of comparing each record pair.
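The interaction of agree_value, disagree_value and missing_value can be sketched for a single pair of values in plain Python. This is only an illustration of the parameters described above, not the library's implementation. The helper names is_missing and exact_similarity are hypothetical, and the handling of ‘values’ is an assumption based on the parameter description.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def exact_similarity(a, b, agree_value=1, disagree_value=0, missing_value=0):
    """Hypothetical per-pair version of an exact comparison."""
    if is_missing(a) or is_missing(b):
        return missing_value
    if a == b:
        # Assumption: agree_value='values' returns the shared value itself.
        return a if agree_value == 'values' else agree_value
    return disagree_value


print(exact_similarity('M', 'M'))
print(exact_similarity('M', 'F'))
print(exact_similarity('M', None, missing_value=0.5))
```

The real method works on whole columns at once and returns a pandas.Series; the sketch only shows which of the three values is chosen for one pair.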

string(s1, s2, method='levenshtein', threshold=None, missing_value=0, name=None, store=True)

Compare strings.

Parameters:
  • s1 (label, pandas.Series) – The column label in the first dataframe, or a Series or DataFrame with the fields to compare.
  • s2 (label, pandas.Series) – The column label in the second dataframe, or a Series or DataFrame with the fields to compare.
  • method (str) – An approximate string comparison method. Options are [‘jaro’, ‘jarowinkler’, ‘levenshtein’, ‘damerau_levenshtein’, ‘qgram’, ‘cosine’]. Default: ‘levenshtein’.
  • threshold (float, tuple of floats) – A threshold value. All approximate string comparisons greater than or equal to this threshold are 1; all others are 0.
  • missing_value (numpy.dtype) – The value for a comparison with a missing value. Default 0.
  • name (label) – The name of the feature and the name of the column.
  • store (bool) – Store the result in the dataframe. Default True.
Returns:

pandas.Series – A pandas series with similarity values. Values are between 0 and 1, inclusive.
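To make the role of threshold concrete, the following sketch computes a Levenshtein-based similarity for one pair of strings and optionally turns it into 0 or 1. It is a minimal illustration, not the library's code. The function names are hypothetical, and normalising the edit distance by the length of the longer string is an assumption made here so that the score falls between 0 and 1.

```python
def levenshtein_distance(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(s, t, threshold=None, missing_value=0):
    """Hypothetical per-pair similarity with an optional threshold."""
    if s is None or t is None:
        return missing_value
    longest = max(len(s), len(t))
    # Assumption: normalise by the longer string so the score lies in [0, 1].
    score = 1.0 if longest == 0 else 1.0 - levenshtein_distance(s, t) / longest
    if threshold is None:
        return score
    # Scores at or above the threshold become 1, all others 0.
    return 1 if score >= threshold else 0


print(string_similarity('smith', 'smyth'))
print(string_similarity('smith', 'smyth', threshold=0.9))
```

Without a threshold the raw score is returned, which suits classifiers that accept continuous features; with a threshold the feature becomes binary.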

numeric(s1, s2, method='linear', offset, scale, origin=0, missing_value=0, name=None, store=True)

Compute the similarity of numeric values.

This method returns the similarity of two numeric values. The implemented algorithms are ‘step’, ‘linear’, ‘exp’, ‘gauss’ and ‘squared’. In case of agreement, the similarity is 1, and in case of complete disagreement it is 0. The implementation is similar to numeric comparison in Elasticsearch, a full-text search tool. The parameters are explained in the image below (source: Elasticsearch, The Definitive Guide).

Figure: Decay functions, as in Elasticsearch.
Parameters:
  • s1 (label, pandas.Series) – The column label in the first dataframe, or a Series or DataFrame with the fields to compare.
  • s2 (label, pandas.Series) – The column label in the second dataframe, or a Series or DataFrame with the fields to compare.
  • method (str) – The metric used. Options ‘step’, ‘linear’, ‘exp’, ‘gauss’ or ‘squared’. Default ‘linear’.
  • offset (float) – The offset. See image above.
  • scale (float) – The scale of the numeric comparison method. See the image above. This argument is not available for the ‘step’ algorithm.
  • origin (float) – The shift of bias between the values. See image above. Default 0.
  • missing_value (numpy.dtype) – The value if one or both records have a missing value on the compared field. Default 0.
  • name (label) – The name of the feature and the name of the column.
  • store (bool) – Store the result in the dataframe. Default True.
Returns:

pandas.Series – A pandas series with the result of comparing each record pair.
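The roles of offset, scale and origin can be illustrated with a small per-pair sketch of the five decay shapes. This is not the library's implementation. The function name numeric_similarity is hypothetical, and the sketch assumes, following Elasticsearch's default decay of 0.5, that the similarity drops to 0.5 at a distance of offset + scale.

```python
def numeric_similarity(a, b, method='linear', offset=0.0, scale=1.0,
                       origin=0.0, missing_value=0):
    """Hypothetical decay-based similarity of two numbers."""
    if a is None or b is None:
        return missing_value
    # Distance beyond the offset; within the offset the values fully agree.
    d = max(0.0, abs(a - b - origin) - offset)
    if method == 'step':
        return 1.0 if d == 0 else 0.0
    # Assumption: each curve below passes through 0.5 at d == scale.
    if method == 'linear':
        return max(0.0, 1.0 - d / (2.0 * scale))
    if method == 'squared':
        return max(0.0, 1.0 - 0.5 * (d / scale) ** 2)
    if method == 'exp':
        return 2.0 ** (-d / scale)
    if method == 'gauss':
        return 2.0 ** (-((d / scale) ** 2))
    raise ValueError('unknown method: %r' % (method,))


# Incomes that differ by 3000, with a 1000 offset and a 2000 scale.
for name in ('step', 'linear', 'squared', 'exp', 'gauss'):
    print(name, numeric_similarity(30000, 33000, name, offset=1000, scale=2000))
```

In this sketch ‘step’ ignores scale entirely, which is consistent with the note above that scale is not available for the ‘step’ algorithm.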

Note

Numeric comparing can be an efficient way to compare date/time variables. This can be done by comparing the timestamps.

geo(lat1, lng1, lat2, lng2, method='linear', offset, scale, origin=0, missing_value=0, name=None, store=True)

Compute the similarity of two WGS84 coordinates.

Compare two WGS84 coordinates using the haversine (great-circle) distance between them. The similarity algorithms are ‘step’, ‘linear’, ‘exp’, ‘gauss’ or ‘squared’. The similarity functions are the same as in recordlinkage.comparing.Compare.numeric().

Parameters:
  • lat1 (pandas.Series, numpy.array, label/string) – Series with the latitudes of the first coordinates.
  • lng1 (pandas.Series, numpy.array, label/string) – Series with the longitudes of the first coordinates.
  • lat2 (pandas.Series, numpy.array, label/string) – Series with the latitudes of the second coordinates.
  • lng2 (pandas.Series, numpy.array, label/string) – Series with the longitudes of the second coordinates.
  • method (str) – The metric used. Options ‘step’, ‘linear’, ‘exp’, ‘gauss’ or ‘squared’. Default ‘linear’.
  • offset (float) – The offset. See Compare.numeric.
  • scale (float) – The scale of the numeric comparison method. See Compare.numeric. This argument is not available for the ‘step’ algorithm.
  • origin (float) – The shift of bias between the values. See Compare.numeric.
  • missing_value (numpy.dtype) – The value for a comparison with a missing value. Default 0.
  • name (label) – The name of the feature and the name of the column.
  • store (bool) – Store the result in the dataframe. Default True.
Returns:

pandas.Series – A pandas series with the result of comparing each record pair.
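As a rough illustration of the haversine distance mentioned above, the sketch below computes the great-circle distance between two coordinates and feeds it into a linear decay. It is not the library's implementation. The function names, the mean Earth radius of 6371 km, the use of kilometres as the unit, and the choice of a 0.5 similarity at offset + scale are all assumptions of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def geo_similarity(lat1, lng1, lat2, lng2, offset=0.0, scale=1.0):
    """Hypothetical linear-decay similarity on the haversine distance."""
    d = max(0.0, haversine_km(lat1, lng1, lat2, lng2) - offset)
    return max(0.0, 1.0 - d / (2.0 * scale))


# Amsterdam and Utrecht (approximate coordinates), with a 50 km scale.
print(haversine_km(52.37, 4.90, 52.09, 5.12))
print(geo_similarity(52.37, 4.90, 52.09, 5.12, scale=50.0))
```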

date(s1, s2, swap_month_day=0.5, swap_months='default', missing_value=0, name=None, store=True)

Compare two dates.

Parameters:
  • s1 (pandas.Series, numpy.array, label/string) – Dates. This can be a Series, DatetimeIndex or DataFrame (with columns ‘year’, ‘month’ and ‘day’).
  • s2 (pandas.Series, numpy.array, label/string) – Dates. This can be a Series, DatetimeIndex or DataFrame (with columns ‘year’, ‘month’ and ‘day’).
  • swap_month_day (float) – The value if the month and day are swapped. Default 0.5.
  • swap_months (list of tuples) – A list of tuples describing common errors made when translating months into numbers (for example, October is month 10). The format of each tuple is (month_good, month_bad, value). Default: swap_months = [(6, 7, 0.5), (7, 6, 0.5), (9, 10, 0.5), (10, 9, 0.5)]
  • missing_value (numpy.dtype) – The value for a comparison with a missing value. Default 0.
  • name (label) – The name of the feature and the name of the column.
  • store (bool) – Store the result in the dataframe. Default True.
Returns:

pandas.Series – A pandas series with the result of comparing each record pair.
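The swap_month_day and swap_months parameters can be sketched for a single pair of dates as follows. This is an illustration of the parameter descriptions above, not the library's implementation. The function name date_similarity is hypothetical, and two details are assumptions: the first date is treated as the "good" one, and a month swap only counts when the year and day agree.

```python
from datetime import date

# The default listed in the parameter description above.
DEFAULT_SWAP_MONTHS = [(6, 7, 0.5), (7, 6, 0.5), (9, 10, 0.5), (10, 9, 0.5)]


def date_similarity(d1, d2, swap_month_day=0.5,
                    swap_months=DEFAULT_SWAP_MONTHS, missing_value=0):
    """Hypothetical per-pair date comparison."""
    if d1 is None or d2 is None:
        return missing_value
    if d1 == d2:
        return 1
    if d1.year == d2.year:
        # Month and day written in the opposite order, e.g. 3/4 versus 4/3.
        if d1.month == d2.day and d1.day == d2.month:
            return swap_month_day
        # Assumption: a month mix-up requires the same year and day.
        if d1.day == d2.day:
            for month_good, month_bad, value in swap_months:
                if d1.month == month_good and d2.month == month_bad:
                    return value
    return 0


print(date_similarity(date(1990, 3, 4), date(1990, 4, 3)))
print(date_similarity(date(1990, 6, 15), date(1990, 7, 15)))
print(date_similarity(date(1990, 1, 15), date(1990, 2, 15)))
```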

clear_memory()

Clear memory.

Clear some memory when low_memory was set to True.

compare(comp_func, labels_a, labels_b, *args, **kwargs)

Compare two records.

Core method to compare record pairs. This method takes a function and data from both records in the record pair. The data is compared with the compare function. The built-in methods also use this function.

Example

>>> comp = recordlinkage.Compare(PAIRS, DATAFRAME1, DATAFRAME2)
>>> comp.exact('first_name', 'name')
>>> # same as
>>> comp.compare(recordlinkage._compare_exact, 'first_name', 'name')
Parameters:
  • comp_func (function) – A comparison function. This function can be a built-in function or a user defined comparison function.
  • labels_a (label, pandas.Series, pandas.DataFrame) – The labels, Series or DataFrame to compare.
  • labels_b (label, pandas.Series, pandas.DataFrame) – The labels, Series or DataFrame to compare.
  • name (label) – The name of the feature and the name of the column.
  • store (bool) – Store the result in the dataframe. Default True.
Returns:

pandas.Series – A pandas series with the result of comparing each record pair.
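Conceptually, the core step looks up the values of both records in each candidate pair and hands them to the comparison function. The plain-Python sketch below illustrates that idea with a user-defined function. It is not the library's code: compare_pairs and same_initial are hypothetical names, records are modelled as dictionaries rather than DataFrames, and the function is applied pair by pair here, whereas the library passes whole columns to comp_func.

```python
def compare_pairs(comp_func, pairs, records_a, records_b,
                  label_a, label_b, *args, **kwargs):
    """Hypothetical stand-in for the core compare step."""
    results = {}
    for id_a, id_b in pairs:
        # Fetch the field of each record in the candidate pair.
        value_a = records_a[id_a].get(label_a)
        value_b = records_b[id_b].get(label_b)
        # Extra arguments are forwarded to the comparison function.
        results[(id_a, id_b)] = comp_func(value_a, value_b, *args, **kwargs)
    return results


def same_initial(a, b, missing_value=0):
    """User-defined comparison: 1 if both strings start with the same letter."""
    if not a or not b:
        return missing_value
    return 1 if a[0].lower() == b[0].lower() else 0


records_a = {1: {'first_name': 'Anna'}, 2: {'first_name': 'Bob'}}
records_b = {7: {'name': 'anne'}, 8: {'name': 'Rob'}}
pairs = [(1, 7), (2, 8)]

scores = compare_pairs(same_initial, pairs, records_a, records_b,
                       'first_name', 'name')
print(scores)
```

A function written for the real compare method would need to accept and return pandas objects instead of single values.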

run()

Run in a batch.

This method is deprecated. Use comparing.Compare(..., low_memory=False) instead for better performance.