Classification

Classification is the step in the record linkage process where record pairs are classified into matches, non-matches and possible matches [Christen2012]. Classification algorithms can be supervised or unsupervised (roughly speaking: with or without training data). Many of the algorithms need training data to classify the record pairs. Training data consists of record pairs for which it is known whether they are matches or not.

See also

[Christen2012] Christen, Peter. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media.
exception LearningError

Learning error

class Classifier

Base class for classification of record pairs.

This class contains methods for training the classifier. It distinguishes between different types of training, such as supervised and unsupervised learning.

learn(comparison_vectors, match_index, return_type='index')

Train the classifier.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors.
  • match_index (pandas.MultiIndex) – The true matches.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).
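
The three return formats can be illustrated with plain pandas. This is a minimal sketch that does not call the classifier itself; the record identifiers, the column names and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical candidate pairs: (record id in file A, record id in file B).
pairs = pd.MultiIndex.from_tuples(
    [("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a3", "b1")],
    names=["id_a", "id_b"],
)

# Hypothetical comparison vectors: one row per pair, one column per compared field.
comparison_vectors = pd.DataFrame(
    {"name": [1, 0, 1, 0], "city": [1, 0, 1, 1]},
    index=pairs,
)

# Training data: the pairs that are known to be true matches.
match_index = pd.MultiIndex.from_tuples(
    [("a1", "b1"), ("a2", "b2")], names=["id_a", "id_b"]
)

# 'series': one label per candidate pair, 1 for a match and 0 for a non-match.
as_series = pd.Series(
    comparison_vectors.index.isin(match_index).astype(int),
    index=comparison_vectors.index,
)

# 'array': the same labels without the pair index.
as_array = as_series.to_numpy()

# 'index': only the pairs that are labelled as matches.
as_index = as_series[as_series == 1].index
```

The 'index' format is the most compact, but it drops the non-matching pairs; the 'series' format keeps the label aligned with every candidate pair.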

predict(comparison_vectors, return_type='index')

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).

prob(comparison_vectors, return_type='series')

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type ('series' or 'array') – Return a pandas series or numpy array. Default ‘series’.
Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

class KMeansClassifier(match_cluster_center=None, nonmatch_cluster_center=None, *args, **kwargs)

KMeans classifier.

The K-means clustering algorithm (wikipedia) partitions candidate record pairs into matches and non-matches. Each comparison vector belongs to the cluster with the nearest mean. The K-means algorithm does not need training data, but it needs two starting points (one for the matches and one for the non-matches). The K-means clustering problem is NP-hard.

Parameters:
  • match_cluster_center (list, numpy.array) – The center of the match cluster. The length of the list/array must equal the number of comparison variables.
  • nonmatch_cluster_center (list, numpy.array) – The center of the nonmatch (distinct) cluster. The length of the list/array must equal the number of comparison variables.
classifier

sklearn.cluster.KMeans – The KMeans clustering class in sklearn.

match_cluster_center

list, numpy.array – The center of the match cluster.

nonmatch_cluster_center

list, numpy.array – The center of the nonmatch (distinct) cluster.

Note

There are much better methods for linking records than the K-means clustering algorithm. However, this algorithm does not need training data and is useful for making an initial partition.

learn(comparison_vectors, return_type='index')

Train the K-means classifier.

The K-means classifier is unsupervised and therefore does not need labels. It partitions the data into two sets: links and non-links. The starting points of the cluster centers are 0.05 for the non-matches and 0.95 for the matches.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The prediction (see also the argument ‘return_type’)
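
The idea of two clusters with fixed starting points can be sketched with NumPy. This is an illustration of the technique (Lloyd's algorithm with two centers), not the class's implementation, which according to the attribute listing wraps sklearn.cluster.KMeans. The function name two_means and the example vectors are hypothetical.

```python
import numpy as np

def two_means(vectors, match_center, nonmatch_center, n_iter=20):
    """Lloyd's algorithm with two clusters and given starting centers."""
    x = np.asarray(vectors, dtype=float)
    # Row 0 is the non-match center, row 1 is the match center.
    centers = np.array([nonmatch_center, match_center], dtype=float)
    for _ in range(n_iter):
        # Euclidean distance from every comparison vector to both centers.
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)  # 0 = non-match cluster, 1 = match cluster
        for k in (0, 1):
            if np.any(labels == k):  # keep the old center if a cluster is empty
                centers[k] = x[labels == k].mean(axis=0)
    return labels, centers

# Hypothetical comparison vectors with three similarity scores per pair.
vectors = [
    [1.0, 1.0, 0.9],
    [0.9, 1.0, 1.0],
    [0.0, 0.1, 0.0],
    [0.2, 0.0, 0.0],
    [1.0, 0.8, 1.0],
]

# Starting points as described above: 0.95 for matches, 0.05 for non-matches.
labels, centers = two_means(vectors, [0.95] * 3, [0.05] * 3)
```

Because the starting points are fixed rather than random, the cluster labelled 1 can be read as the match cluster without inspecting the centers afterwards.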

predict(comparison_vectors, return_type='index')

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).

class LogisticRegressionClassifier(coefficients=None, intercept=None)

Logistic Regression Classifier.

This classifier is an application of the logistic regression model (wikipedia). The classifier partitions candidate record pairs into matches and non-matches.

Parameters:
  • coefficients (list, numpy.array) – The coefficients of the logistic regression.
  • intercept (float) – The intercept of the logistic regression.
classifier

sklearn.linear_model.LogisticRegression – The Logistic regression classifier in sklearn.

coefficients

list – The coefficients of the logistic regression.

intercept

float – The intercept of the logistic regression.

learn(comparison_vectors, match_index, return_type='index')

Train the classifier.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors.
  • match_index (pandas.MultiIndex) – The true matches.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).

predict(comparison_vectors, return_type='index')

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).

prob(comparison_vectors, return_type='series')

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type ('series' or 'array') – Return a pandas series or numpy array. Default ‘series’.
Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
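
How a logistic regression model turns coefficients and an intercept into a match probability can be sketched in a few lines. The coefficient and intercept values below are invented for illustration, and match_probability is a hypothetical helper, not a method of the class.

```python
import math

def match_probability(comparison_vector, coefficients, intercept):
    """Logistic model: P(match) = 1 / (1 + exp(-(intercept + w . x)))."""
    z = intercept + sum(c * x for c, x in zip(coefficients, comparison_vector))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model parameters, one coefficient per comparison variable.
coefficients = [2.0, 1.5, 3.0]
intercept = -4.0

# A pair that agrees on every field and a pair that agrees on none.
p_agree = match_probability([1, 1, 1], coefficients, intercept)
p_disagree = match_probability([0, 0, 0], coefficients, intercept)

# A simple decision rule: call the pair a match when P(match) exceeds 0.5.
is_match = p_agree > 0.5
```

A negative intercept reflects that most candidate pairs are non-matches; each agreeing field then pushes the probability up by an amount set by its coefficient.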

class NaiveBayesClassifier(alpha=1.0)

Naive Bayes Classifier.

The Naive Bayes classifier (wikipedia) partitions candidate record pairs into matches and non-matches. The classifier is based on probabilistic principles. The Naive Bayes classification method has been shown to be mathematically equivalent to the Fellegi and Sunter model.

Parameters: log_prior (list, numpy.array) – The log probability of each class.
classifier

sklearn.linear_model.LogisticRegression – The Logistic regression classifier in sklearn.

coefficients

list – The coefficients of the logistic regression.

intercept

float – The interception value.

Parameters: alpha (float) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
learn(comparison_vectors, match_index, return_type='index')

Train the classifier.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors.
  • match_index (pandas.MultiIndex) – The true matches.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).

predict(comparison_vectors, return_type='index')

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).

prob(comparison_vectors, return_type='series')

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type ('series' or 'array') – Return a pandas series or numpy array. Default ‘series’.
Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
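
The role of the smoothing parameter alpha in a Naive Bayes classifier can be illustrated on binary comparison vectors. This is a hand-written sketch of a Bernoulli Naive Bayes model, not the class's implementation; the function names and the training data are hypothetical.

```python
import math

def fit_naive_bayes(vectors, labels, alpha=1.0):
    """Estimate class priors and smoothed per-field agreement probabilities."""
    model = {}
    n_fields = len(vectors[0])
    for cls in (0, 1):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        n = len(rows)
        # Smoothed P(field agrees | class). With alpha=0 an unseen outcome
        # gets probability 0, which breaks the logarithm below.
        p_agree = [
            (sum(r[j] for r in rows) + alpha) / (n + 2 * alpha)
            for j in range(n_fields)
        ]
        model[cls] = {"prior": n / len(vectors), "p_agree": p_agree}
    return model

def match_probability(model, vector):
    """P(match | vector) under the conditional independence assumption."""
    log_joint = {}
    for cls, params in model.items():
        total = math.log(params["prior"])
        for x, p in zip(vector, params["p_agree"]):
            total += math.log(p if x == 1 else 1.0 - p)
        log_joint[cls] = total
    # Normalise the two joint probabilities into a posterior for class 1.
    return 1.0 / (1.0 + math.exp(log_joint[0] - log_joint[1]))

# Hypothetical training data: two compared fields, 1 = agree, 0 = disagree.
vectors = [[1, 1], [1, 1], [1, 0], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0]]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

model = fit_naive_bayes(vectors, labels, alpha=1.0)
p_high = match_probability(model, [1, 1])
p_low = match_probability(model, [0, 0])
```

In this data no non-match agrees on the first field; the smoothing keeps that probability at 1/7 instead of 0, so a single agreement cannot force the posterior to exactly 1.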

class SVMClassifier

Support Vector Machines

The Support Vector Machine classifier (wikipedia) partitions candidate record pairs into matches and non-matches. This implementation is a non-probabilistic binary linear classifier. Support vector machines are supervised learning models. Therefore, the SVM classifier needs training data.

learn(comparison_vectors, match_index, return_type='index')

Train the classifier.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors.
  • match_index (pandas.MultiIndex) – The true matches.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).

predict(comparison_vectors, return_type='index')

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).

class FellegiSunter(random_decision_rule=False)

Fellegi and Sunter framework.

Base class for probabilistic classification of record pairs with the Fellegi and Sunter (1969) framework.

learn(comparison_vectors, match_index, return_type='index')

Train the classifier.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors.
  • match_index (pandas.MultiIndex) – The true matches.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).

predict(comparison_vectors, return_type='index')

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).

prob(comparison_vectors, return_type='series')

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type ('series' or 'array') – Return a pandas series or numpy array. Default ‘series’.
Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
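
In the Fellegi and Sunter framework, each compared field contributes a weight based on how much more likely an agreement is among matches than among non-matches, and the summed weight is compared with two thresholds. The sketch below shows that decision rule on binary comparison vectors. The m and u probabilities, the thresholds and the function names are assumptions chosen for illustration; they are not taken from this class.

```python
import math

# Hypothetical per-field probabilities of agreement.
m = [0.9, 0.8, 0.95]  # P(field agrees | pair is a match)
u = [0.1, 0.2, 0.05]  # P(field agrees | pair is a non-match)

def match_weight(vector):
    """Sum of log2 likelihood ratios over all compared fields."""
    total = 0.0
    for x, m_j, u_j in zip(vector, m, u):
        if x == 1:
            total += math.log2(m_j / u_j)  # agreement weight (positive)
        else:
            total += math.log2((1 - m_j) / (1 - u_j))  # disagreement weight (negative)
    return total

def classify(vector, lower=-3.0, upper=3.0):
    """Three-way decision with two hypothetical thresholds."""
    weight = match_weight(vector)
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"

all_agree = classify([1, 1, 1])
all_disagree = classify([0, 0, 0])
mixed = classify([1, 1, 0])
```

Pairs whose weight falls between the two thresholds are the possible matches; in practice these are the pairs that would be sent to clerical review.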

class ECMClassifier(*args, **kwargs)

Expectation/Conditional Maximisation classifier.

[EXPERIMENTAL] Expectation/Conditional Maximisation algorithm used as classifier. This probabilistic record linkage algorithm is used in combination with the Fellegi and Sunter model.

learn(comparison_vectors, init='jaro', return_type='index')

Train the algorithm.

Train the Expectation-Maximisation classifier. In the context of record linkage, this method is well known as the ECM algorithm.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • params_init (dict) – A dictionary with initial parameters of the ECM algorithm (optional).
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).
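
The unsupervised estimation step can be sketched as a plain Expectation-Maximisation loop for a two-class model with conditionally independent binary fields. This is a simplified stand-in, not the ECM implementation of this class; the function name, the starting values and the example data are hypothetical.

```python
def em_two_class(vectors, n_iter=50, p=0.5, m_init=0.9, u_init=0.1, eps=1e-6):
    """Estimate P(match), m and u probabilities from unlabelled binary vectors."""
    n = len(vectors)
    n_fields = len(vectors[0])
    m = [m_init] * n_fields  # P(field agrees | match)
    u = [u_init] * n_fields  # P(field agrees | non-match)

    def clip(value):
        # Keep probabilities away from 0 and 1 to avoid 0/0 in the E-step.
        return min(max(value, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for v in vectors:
            like_m, like_u = p, 1.0 - p
            for x, m_j, u_j in zip(v, m, u):
                like_m *= m_j if x == 1 else 1.0 - m_j
                like_u *= u_j if x == 1 else 1.0 - u_j
            g.append(like_m / (like_m + like_u))

        # M-step: re-estimate the parameters from the weighted counts.
        total = sum(g)
        p = clip(total / n)
        m = [clip(sum(gi * v[j] for gi, v in zip(g, vectors)) / total)
             for j in range(n_fields)]
        u = [clip(sum((1.0 - gi) * v[j] for gi, v in zip(g, vectors)) / (n - total))
             for j in range(n_fields)]
    return p, m, u, g

# Hypothetical binary comparison vectors without any labels.
vectors = (
    [[1, 1, 1]] * 4 + [[1, 1, 0], [1, 0, 1]]
    + [[0, 0, 0]] * 8 + [[0, 0, 1]] * 2 + [[0, 1, 0]] * 2
)

p, m, u, g = em_two_class(vectors)
```

The starting values decide which of the two classes ends up as the match class, which is one reason the result depends on the sample being suitable for the model.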

predict(comparison_vectors, return_type='index', *args, **kwargs)

Predict the class of record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type ('index' (default), 'series', 'array') – The format to return the classification result. The argument value ‘index’ will return the pandas.MultiIndex of the matches. The argument value ‘series’ will return a pandas.Series with zeros (distinct) and ones (matches). The argument value ‘array’ will return a numpy.ndarray with zeros and ones.
Returns:

pandas.MultiIndex, pandas.Series or numpy.ndarray – The classification result in the format selected by return_type (for 'series' and 'array', matches are labelled 1 and non-matches 0).

Note

Prediction is risky for this unsupervised learning method. Make sure that the sample is representative of the population.

prob(comparison_vectors)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type ('series' or 'array') – Return a pandas series or numpy array. Default ‘series’.
Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.