Standardising

Cleaning and standardising your data may increase record linkage accuracy. The Python Record Linkage Toolkit contains several tools for data cleaning and standardising, including phonetic encoding algorithms and string cleaning tools. These tools are included in the submodule standardise. (Import example: from recordlinkage.standardise import clean, phonenumbers)

clean(s, lowercase=True, replace_by_none='[^ \\-\\_A-Za-z0-9]+', replace_by_whitespace='[\\-\\_]', strip_accents=None, remove_brackets=True, encoding='utf-8', decode_error='strict')

Clean string variables.

Clean strings in the Series by removing unwanted tokens, whitespace and brackets.

Parameters:
  • s (pandas.Series) – A Series to clean.
  • lowercase (bool, optional) – Convert strings in the Series to lowercase. Default True.
  • replace_by_none (str, optional) – The matches of this regular expression are replaced by an empty string (‘’), i.e. removed.
  • replace_by_whitespace (str, optional) – The matches of this regular expression are replaced by a whitespace.
  • remove_brackets (bool, optional) – Remove all content between brackets and the brackets themselves. Default True.
  • strip_accents ({'ascii', 'unicode', None}, optional) – Remove accents during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any character. None (default) does nothing.
  • encoding (string, optional) – If bytes are given, this encoding is used to decode. Default is ‘utf-8’.
  • decode_error ({'strict', 'ignore', 'replace'}, optional) – Instruction on what to do if a byte Series is given that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.

Example

>>> import pandas
>>> from recordlinkage.standardise import clean
>>>
>>> names = ['Mary-ann', 'Bob :)', 'Angel', 'Bob (alias Billy)', None]
>>> s = pandas.Series(names)
>>> print(clean(s))
0    mary ann
1         bob
2       angel
3         bob
4         NaN
dtype: object
Returns: pandas.Series – A cleaned Series of strings.
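The effect of the default arguments can be illustrated with a standard-library sketch that processes one string at a time. This is not the toolkit's implementation: the helper name clean_one, the bracket pattern, the order of the steps and the final whitespace collapsing are assumptions made for illustration. Only the two default regular expressions are taken from the signature above.

```python
import re

# Default patterns copied from the clean() signature.
REPLACE_BY_NONE = r"[^ \-\_A-Za-z0-9]+"
REPLACE_BY_WHITESPACE = r"[\-\_]"
# Assumed pattern for "content between brackets": (), [] and {}.
BRACKETS = r"\(.*?\)|\[.*?\]|\{.*?\}"


def clean_one(value):
    """Hypothetical helper: approximate clean() for a single string."""
    if value is None:
        return None  # missing values stay missing
    value = value.lower()                              # lowercase=True
    value = re.sub(BRACKETS, "", value)                # remove_brackets=True
    value = re.sub(REPLACE_BY_NONE, "", value)         # replace_by_none
    value = re.sub(REPLACE_BY_WHITESPACE, " ", value)  # replace_by_whitespace
    value = re.sub(r"\s+", " ", value)                 # collapse repeated whitespace (assumed)
    return value.strip()


names = ["Mary-ann", "Bob :)", "Angel", "Bob (alias Billy)", None]
cleaned = [clean_one(n) for n in names]
print(cleaned)
```

With these assumptions, the sketch reproduces the values shown in the example above; the real function works on a whole pandas.Series at once.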
phonenumbers(s)

Clean phone numbers by removing all non-numeric characters (except +).

Parameters: s (pandas.Series) – A Series to clean.
Returns: pandas.Series – A Series with cleaned phone numbers.
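The rule described here (keep digits and the plus sign, drop everything else) can be sketched with the standard library. The helper name clean_phonenumber and the sample numbers are hypothetical; the toolkit function itself operates on a pandas.Series and may differ in detail.

```python
import re


def clean_phonenumber(value):
    """Hypothetical helper: keep only digits and '+' in one string."""
    if value is None:
        return None  # missing values stay missing
    return re.sub(r"[^0-9+]", "", value)


raw_numbers = ["+31 (0)6-1234 5678", "020 123.45.67", None]
cleaned_numbers = [clean_phonenumber(v) for v in raw_numbers]
print(cleaned_numbers)
```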
value_occurence(s)

Count the number of times each value occurs.

This function returns the count for each row, so the result has one entry per input row. This is in contrast with pandas.value_counts, which returns one count per distinct value.

Returns: pandas.Series – A Series with value counts.
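The difference between a per-value count and a per-row count can be shown with a small standard-library sketch. The sample values are made up, and collections.Counter stands in for the pandas operations; this is an illustration of the idea, not the toolkit's code.

```python
from collections import Counter

values = ["ann", "bob", "ann", "carl", "ann"]

# One count per distinct value (a value_counts-style summary).
per_value = Counter(values)

# One count per row, aligned with the input (the behaviour described above).
per_row = [per_value[v] for v in values]

print(dict(per_value))
print(per_row)
```

A per-row count is convenient in record linkage because it can be stored next to the original column, for example to weight agreement on rare values more heavily than agreement on common ones.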
phonetic(s, method, concat=True, encoding='utf-8', decode_error='strict')

Convert names or strings into phonetic codes.

The implemented algorithms are soundex, nysiis, metaphone and match_rating.

Parameters:
  • s (pandas.Series) – A Series of strings (often names) to encode.
  • method (string) – The algorithm that is used to phonetically encode the values. The possible options are “soundex”, “nysiis”, “metaphone” or “match rating”.
  • concat (bool, optional) – Remove whitespace before phonetic encoding. Default True.
  • encoding (string, optional) – If bytes are given, this encoding is used to decode. Default is ‘utf-8’.
  • decode_error ({'strict', 'ignore', 'replace'}, optional) – Instruction on what to do if a byte Series is given that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
Returns: pandas.Series – A Series with phonetically encoded values.

Note

The ‘soundex’ and ‘nysiis’ algorithms use the package ‘jellyfish’. It can be installed with pip (pip install jellyfish).
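To show what a phonetic code looks like, here is a minimal pure-Python sketch of the classic American Soundex rules. It is an illustration only: the function name soundex_sketch is hypothetical, and the output of the toolkit (which delegates to jellyfish) may differ for edge cases.

```python
# Map consonants to Soundex digits; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex_sketch(name):
    """Hypothetical helper: classic 4-character American Soundex code."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()          # the first letter is kept as-is
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded += code          # skip runs of the same digit
        previous = code              # a vowel resets the run
    return (encoded + "000")[:4]     # pad or truncate to 4 characters


for name in ["Robert", "Rupert", "Ashcraft", "Tymczak"]:
    print(name, soundex_sketch(name))
```

Names that sound alike, such as Robert and Rupert, receive the same code, which is what makes phonetic encoding useful for matching misspelled names.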