0. Preprocessing

Preprocessing your data, for example by cleaning and standardising it, can improve record linkage accuracy. The Python Record Linkage Toolkit provides several preprocessing and standardising functions in the submodule recordlinkage.preprocessing. Import them as follows:

from recordlinkage.preprocessing import clean, phonetic

Cleaning

The Python Record Linkage Toolkit has several cleaning functions, of which recordlinkage.preprocessing.clean() is the most generic. Pandas itself is also very useful for (string) data cleaning. See the pandas documentation on this topic: Working with Text Data.

recordlinkage.preprocessing.clean(s, lowercase=True, replace_by_none='[^ \\-\\_A-Za-z0-9]+', replace_by_whitespace='[\\-\\_]', strip_accents=None, remove_brackets=True, encoding='utf-8', decode_error='strict')

Clean string variables.

Clean strings in the Series by removing unwanted tokens, whitespace and brackets.

Parameters:
  • s (pandas.Series) – A Series to clean.

  • lowercase (bool, optional) – Convert strings in the Series to lowercase. Default True.

  • replace_by_none (str, optional) – The matches of this regular expression are replaced by ‘’.

  • replace_by_whitespace (str, optional) – The matches of this regular expression are replaced by a whitespace.

  • remove_brackets (bool, optional) – Remove all content between brackets and the brackets themselves. Default True.

  • strip_accents ({'ascii', 'unicode', None}, optional) – Remove accents during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.

  • encoding (str, optional) – If bytes are given, this encoding is used to decode. Default is ‘utf-8’.

  • decode_error ({'strict', 'ignore', 'replace'}, optional) – Instruction on what to do if a byte Series is given that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.

Example

>>> import pandas
>>> from recordlinkage.preprocessing import clean
>>>
>>> names = ['Mary-ann',
...          'Bob :)',
...          'Angel',
...          'Bob (alias Billy)',
...          None]
>>> s = pandas.Series(names)
>>> print(clean(s))
0    mary ann
1         bob
2       angel
3         bob
4         NaN
dtype: object
Returns:

pandas.Series – A cleaned Series of strings.
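
The difference between the two strip_accents modes can be illustrated with a small standard-library sketch. It assumes the modes behave like the common NFKD-based approaches; the helper names strip_accents_ascii and strip_accents_unicode are hypothetical and are not part of the toolkit's API.

```python
import unicodedata


def strip_accents_ascii(text):
    # Hypothetical helper: decompose characters, then drop everything
    # that cannot be encoded as ASCII (including letters without an
    # ASCII decomposition).
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def strip_accents_unicode(text):
    # Hypothetical helper: decompose characters, then drop only the
    # combining marks, so letters without a decomposition are kept.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for name in ["José", "Łukasz"]:
    print(name, strip_accents_ascii(name), strip_accents_unicode(name))
```

In this sketch both helpers turn 'José' into 'Jose', but the ASCII variant silently drops 'Ł', which has no ASCII decomposition, while the Unicode variant keeps it. Whether the toolkit behaves identically on such characters is an assumption worth checking on your own data.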

recordlinkage.preprocessing.phonenumbers(s)

Clean phone numbers by removing all non-numeric characters (except +).

Parameters:

s (pandas.Series) – A Series to clean.

Returns:

pandas.Series – A Series with cleaned phone numbers.
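
The rule described above (keep digits and +, drop everything else) can be sketched for a single string with the standard library. This illustrates the documented behaviour only, not the toolkit's implementation; strip_phone is a hypothetical helper name.

```python
import re


def strip_phone(value):
    # Hypothetical helper: remove every run of characters that is
    # neither a digit nor a plus sign.
    return re.sub(r"[^0-9+]+", "", value)


raw_numbers = ["+31 (0)6 1234-5678", "030 123.45.67"]
cleaned = [strip_phone(number) for number in raw_numbers]
print(cleaned)
```

Note that such a rule only removes formatting characters. It does not normalise country codes or trunk prefixes, so '+31 (0)6 ...' and '06 ...' would still yield different strings.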

recordlinkage.preprocessing.value_occurence(s)

Count the number of times each value occurs.

This function returns the count for each row, in contrast with pandas.value_counts, which returns one count per distinct value.

Returns:

pandas.Series – A Series with value counts.
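
To make the per-row behaviour concrete, here is a minimal standard-library sketch. It works on a plain list rather than a pandas.Series, and the helper name occurrence_per_row is hypothetical.

```python
from collections import Counter


def occurrence_per_row(values):
    # Hypothetical helper: count each distinct value once, then look the
    # count up for every row so the result has the same length as the input.
    counts = Counter(values)
    return [counts[value] for value in values]


cities = ["Utrecht", "Amsterdam", "Utrecht", "Delft", "Utrecht"]
per_row = occurrence_per_row(cities)
print(per_row)
```

The result has one entry per input row, which makes it straightforward to attach as an extra column, for instance to flag rare values.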

Phonetic encoding

Phonetic algorithms index words by their pronunciation, so that words that sound alike receive the same code. The best-known algorithm is Soundex. The Python Record Linkage Toolkit supports multiple algorithms through the recordlinkage.preprocessing.phonetic() function.

Note

Apply phonetic algorithms before the indexing and comparing steps. In most situations, this results in better performance.

recordlinkage.preprocessing.phonetic(s, method, concat=True, encoding='utf-8', decode_error='strict')

Convert names or strings into phonetic codes.

The implemented algorithms are soundex, nysiis, metaphone or match_rating.

Parameters:
  • s (pandas.Series) – A pandas.Series with string values (often names) to encode.

  • method (str) – The algorithm that is used to phonetically encode the values. The possible options are “soundex”, “nysiis”, “metaphone” or “match_rating”.

  • concat (bool, optional) – Remove whitespace before phonetic encoding. Default True.

  • encoding (str, optional) – If bytes are given, this encoding is used to decode. Default is ‘utf-8’.

  • decode_error ({'strict', 'ignore', 'replace'}, optional) – Instruction on what to do if a byte Series is given that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.

Returns:

pandas.Series – A Series with phonetically encoded values.

preprocessing.phonetic_algorithms = ['soundex', 'nysiis', 'metaphone', 'match_rating']
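
To show why phonetic codes are useful before indexing, the following sketch implements classic American Soundex in plain Python and groups similar-sounding names under one key. It is an illustration only: the function soundex_sketch is hypothetical, is not the toolkit's implementation, and may differ from the output of phonetic() in edge cases.

```python
from collections import defaultdict

# Consonant groups of classic American Soundex.
SOUNDEX_CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex_sketch(name):
    # Hypothetical helper: keep ASCII letters only, in lowercase.
    letters = [ch for ch in name.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets the previous code
    return (code + "000")[:4]  # pad with zeros, truncate to four characters


# Group names by code, as a blocking key would.
blocks = defaultdict(list)
for name in ["Smith", "Smyth", "Robert", "Rupert"]:
    blocks[soundex_sketch(name)].append(name)
print(dict(blocks))
```

Names that share a code end up in the same group, so spelling variants such as 'Smith' and 'Smyth' can still be paired when the code is used as a blocking key.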