3. Classification

Classifiers

Classification is the step in the record linkage process where record pairs are classified into matches, non-matches and possible matches [Christen2012]. Classification algorithms can be supervised (trained on labelled record pairs) or unsupervised (no training data required).

See also

[Christen2012] Christen, Peter. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media.

Supervised

class recordlinkage.LogisticRegressionClassifier(coefficients=None, intercept=None, **kwargs)

Logistic Regression Classifier.

This classifier is an application of the logistic regression model (wikipedia). The classifier partitions candidate record pairs into matches and non-matches.

This algorithm is also known as Deterministic Record Linkage.

The LogisticRegressionClassifier classifier uses the sklearn.linear_model.LogisticRegression classification algorithm from SciKit-learn as kernel.

Parameters:
kernel

sklearn.linear_model.LogisticRegression – The kernel of the classifier. The kernel is sklearn.linear_model.LogisticRegression from SciKit-learn.

coefficients

list – The coefficients of the logistic regression.

intercept

float – The intercept of the logistic regression.
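
To make the role of the coefficients and the intercept concrete, here is a minimal sketch in plain Python of how a logistic regression model turns one comparison vector into a match probability. It does not call the toolkit. The function name match_probability, the example values and the 0.5 cut-off are assumptions chosen for illustration.

```python
import math


def match_probability(comparison_vector, coefficients, intercept):
    """Logistic model: sigmoid of the intercept plus the weighted feature sum."""
    score = intercept + sum(c * x for c, x in zip(coefficients, comparison_vector))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical model for three comparison features (e.g. name, city, birth date).
coefficients = [2.0, 1.0, 3.0]
intercept = -4.0

agree_all = match_probability([1, 1, 1], coefficients, intercept)
agree_none = match_probability([0, 0, 0], coefficients, intercept)

# Label a pair as a match (1) when the probability exceeds 0.5, else 0.
labels = [int(p > 0.5) for p in (agree_all, agree_none)]
```

A large positive coefficient means that agreement on that feature strongly raises the match probability, while the intercept sets the baseline for a pair that agrees on nothing.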

fit(comparison_vectors, match_index=None)

Train the classifier.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
  • match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

When finding links within a single dataset (for example, deduplication), ensure that the training record pairs are from the lower triangular part of the dataset/matrix.
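
In deduplication, the pairs (i, j) and (j, i) describe the same comparison, so only one of them should appear in the training data. The sketch below shows one way to normalise pairs, assuming that "lower triangular" means the first record identifier is larger than the second. The helper name to_lower_triangular is hypothetical and not part of the toolkit.

```python
def to_lower_triangular(pairs):
    """Reorder each pair so the larger identifier comes first and drop repeats."""
    seen = set()
    result = []
    for a, b in pairs:
        if a == b:
            continue  # a record compared with itself is not a candidate pair
        pair = (a, b) if a > b else (b, a)
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# (1, 3) and (3, 1) describe the same comparison within one dataset.
training_pairs = [(1, 3), (3, 1), (2, 0), (4, 4)]
normalised = to_lower_triangular(training_pairs)
```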

fit_predict(comparison_vectors, match_index=None)

Train the classifier and predict the class of the same record pairs.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors.
  • match_index (pandas.MultiIndex) – The true matches.
  • return_type (str) – Deprecated. Set the option instead: recordlinkage.set_option('classification.return_type', 'index').
Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs)

[DEPRECATED] Use ‘fit_predict’.

predict(comparison_vectors)

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
  • return_type (str) – Deprecated. Set the option instead: recordlinkage.set_option('classification.return_type', 'index').
Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(comparison_vectors, return_type=None)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type (str) – Deprecated. (default ‘series’)
Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

class recordlinkage.NaiveBayesClassifier(binarize=None, alpha=0.0001, use_col_names=True, **kwargs)

Naive Bayes Classifier.

The Naive Bayes classifier (wikipedia) partitions candidate record pairs into matches and non-matches. The classifier is based on probabilistic principles. The Naive Bayes classification method has a close mathematical connection with the Fellegi and Sunter model.

Note

The NaiveBayesClassifier classifier differs from the Naive Bayes models in SciKit-learn. With binary input vectors, the NaiveBayesClassifier behaves like sklearn.naive_bayes.BernoulliNB.

Parameters:
  • binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to consist of multilevel vectors.
  • alpha (float) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing). Default 1e-4.
  • use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
kernel

sklearn.naive_bayes.BernoulliNB – The kernel of the classifier. The kernel is sklearn.naive_bayes.BernoulliNB from SciKit-learn.

log_p

float – Log match probability as described in the FS framework.

log_m_probs

np.ndarray – Log probability P(x_i==1|Match) as described in the FS framework.

log_u_probs

np.ndarray – Log probability P(x_i==1|Non-match) as described in the FS framework.

log_weights

np.ndarray – Log weights as described in the FS framework.

p

float – Match probability as described in the FS framework.

m_probs

np.ndarray – Probability P(x_i==1|Match) as described in the FS framework.

u_probs

np.ndarray – Probability P(x_i==1|Non-match) as described in the FS framework.

weights

np.ndarray – Weights as described in the FS framework.
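
The binarize and alpha parameters can be illustrated for the binary case with a short sketch in plain Python. It thresholds similarity scores into 0/1 features and then estimates P(x_i==1|Match) and P(x_i==1|Non-match) with additive smoothing. The helper names, the example data and the exact smoothing formula (the usual Bernoulli form) are assumptions. The toolkit's own estimator may differ in detail.

```python
def binarize_vector(vector, threshold):
    """Map each feature to 1 if it is above the threshold, else 0."""
    return [int(value > threshold) for value in vector]


def smoothed_probs(vectors, alpha):
    """Estimate P(x_i == 1) per feature with additive smoothing."""
    n = len(vectors)
    n_features = len(vectors[0])
    return [
        (sum(v[i] for v in vectors) + alpha) / (n + 2 * alpha)
        for i in range(n_features)
    ]


# Hypothetical similarity scores for labelled training pairs (two features).
match_vectors = [[0.9, 1.0], [0.8, 0.7], [1.0, 0.2]]
nonmatch_vectors = [[0.1, 0.0], [0.3, 0.9], [0.0, 0.1]]

threshold = 0.5
alpha = 1e-4
m_probs = smoothed_probs([binarize_vector(v, threshold) for v in match_vectors], alpha)
u_probs = smoothed_probs([binarize_vector(v, threshold) for v in nonmatch_vectors], alpha)
```

The smoothing term keeps every estimate strictly between 0 and 1, so the logarithms used for log_m_probs and log_u_probs stay finite even when a feature always agrees or never agrees in the training data.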

fit(X, *args, **kwargs)

Train the classifier.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
  • match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

When finding links within a single dataset (for example, deduplication), ensure that the training record pairs are from the lower triangular part of the dataset/matrix.

fit_predict(comparison_vectors, match_index=None)

Train the classifier and predict the class of the same record pairs.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors.
  • match_index (pandas.MultiIndex) – The true matches.
  • return_type (str) – Deprecated. Set the option instead: recordlinkage.set_option('classification.return_type', 'index').
Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs)

[DEPRECATED] Use ‘fit_predict’.

log_m_probs

Log probability P(x_i==1|Match) as described in the FS framework.

log_p

Log match probability as described in the FS framework.

log_u_probs

Log probability P(x_i==1|Non-match) as described in the FS framework.

log_weights

Log weights as described in the FS framework.

m_probs

Probability P(x_i==1|Match) as described in the FS framework.

p

Match probability as described in the FS framework.

predict(comparison_vectors)

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
  • return_type (str) – Deprecated. Set the option instead: recordlinkage.set_option('classification.return_type', 'index').
Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(comparison_vectors, return_type=None)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type (str) – Deprecated. (default ‘series’)
Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

u_probs

Probability P(x_i==1|Non-match) as described in the FS framework.

weights

Weights as described in the FS framework.

class recordlinkage.SVMClassifier(*args, **kwargs)

Support Vector Machines Classifier

The Support Vector Machine classifier (wikipedia) partitions candidate record pairs into matches and non-matches. This implementation is a non-probabilistic binary linear classifier. Support vector machines are supervised learning models and therefore need training data.

The SVMClassifier classifier uses the sklearn.svm.LinearSVC classification algorithm from SciKit-learn as kernel.

Parameters:
  • **kwargs – Arguments to pass to sklearn.svm.LinearSVC.
kernel

sklearn.svm.LinearSVC – The kernel of the classifier. The kernel is sklearn.svm.LinearSVC from SciKit-learn.
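
A non-probabilistic binary linear classifier decides the class from the sign of a linear score rather than from a probability. The following sketch in plain Python shows that decision rule only. The weights, the bias and the function names are hypothetical, and the sketch does not train anything or call sklearn.svm.LinearSVC.

```python
def decision_value(comparison_vector, weights, bias):
    """Signed score of a linear classifier; the sign decides the class."""
    return sum(w * x for w, x in zip(weights, comparison_vector)) + bias


def classify(comparison_vector, weights, bias):
    """Return 1 (match) for a positive score and 0 (non-match) otherwise."""
    return int(decision_value(comparison_vector, weights, bias) > 0)


# Hypothetical separating hyperplane for three comparison features.
weights = [1.5, 0.5, 2.0]
bias = -2.0

labels = [classify(v, weights, bias) for v in ([1, 1, 1], [1, 0, 0], [0, 1, 0])]
```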

fit(comparison_vectors, match_index=None)

Train the classifier.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
  • match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

When finding links within a single dataset (for example, deduplication), ensure that the training record pairs are from the lower triangular part of the dataset/matrix.

fit_predict(comparison_vectors, match_index=None)

Train the classifier and predict the class of the same record pairs.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors.
  • match_index (pandas.MultiIndex) – The true matches.
  • return_type (str) – Deprecated. Set the option instead: recordlinkage.set_option('classification.return_type', 'index').
Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs)

[DEPRECATED] Use ‘fit_predict’.

predict(comparison_vectors)

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
  • return_type (str) – Deprecated. Set the option instead: recordlinkage.set_option('classification.return_type', 'index').
Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(*args, **kwargs)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type (str) – Deprecated. (default ‘series’)
Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

Unsupervised

class recordlinkage.ECMClassifier(init='jaro', binarize=None, max_iter=100, atol=0.0001, use_col_names=True, *args, **kwargs)

Expectation/Conditional Maximisation classifier (Unsupervised).

The Expectation/Conditional Maximisation algorithm is used to classify record pairs. This probabilistic record linkage algorithm is used in combination with the Fellegi and Sunter model. The classifier doesn’t need training data (unsupervised).

Parameters:
  • init (str) – Initialisation method for the algorithm. Options are: ‘jaro’ and ‘random’. Default ‘jaro’.
  • max_iter (int) – The maximum number of iterations of the EM algorithm. Default 100.
  • binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
  • atol (float) – The convergence tolerance. If the difference in the parameters between two consecutive iterations is smaller than this value, the algorithm is considered to have converged. Default 1e-4.
  • use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
kernel

recordlinkage.algorithms.em_sklearn.ECM – The kernel of the classifier.

log_p

float – Log match probability as described in the FS framework.

log_m_probs

np.ndarray – Log probability P(x_i==1|Match) as described in the FS framework.

log_u_probs

np.ndarray – Log probability P(x_i==1|Non-match) as described in the FS framework.

log_weights

np.ndarray – Log weights as described in the FS framework.

p

float – Match probability as described in the FS framework.

m_probs

np.ndarray – Probability P(x_i==1|Match) as described in the FS framework.

u_probs

np.ndarray – Probability P(x_i==1|Non-match) as described in the FS framework.

weights

np.ndarray – Weights as described in the FS framework.

References

Herzog, Thomas N, Fritz J Scheuren and William E Winkler. 2007. Data quality and record linkage techniques. Vol. 1 Springer.

Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.

Collins, M. “The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm”. http://www.cs.columbia.edu/~mcollins/em.pdf
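
The interplay of max_iter and atol can be seen in a compact expectation-maximisation loop for binary comparison vectors. This is a plain Python sketch of the general technique under the conditional independence assumption, not the toolkit's ECM kernel. The function name em_fellegi_sunter, the fixed starting values and the example data are assumptions.

```python
def em_fellegi_sunter(vectors, max_iter=100, atol=1e-4):
    """Fit p, m_probs and u_probs to binary comparison vectors with EM."""
    n = len(vectors)
    n_features = len(vectors[0])
    # Assumed starting values; the toolkit's 'jaro' and 'random' options may differ.
    p, m, u = 0.1, [0.9] * n_features, [0.1] * n_features

    def likelihood(vector, probs):
        result = 1.0
        for x, q in zip(vector, probs):
            result *= q if x == 1 else 1.0 - q
        return result

    for _ in range(max_iter):
        # E-step: posterior match probability of every record pair.
        g = []
        for v in vectors:
            pm = p * likelihood(v, m)
            pu = (1.0 - p) * likelihood(v, u)
            g.append(pm / (pm + pu))

        # M-step: re-estimate the parameters from the posteriors.
        total = sum(g)
        new_p = total / n
        new_m = [sum(gi * v[i] for gi, v in zip(g, vectors)) / total
                 for i in range(n_features)]
        new_u = [sum((1.0 - gi) * v[i] for gi, v in zip(g, vectors)) / (n - total)
                 for i in range(n_features)]

        # Converged when no parameter moved by more than atol.
        change = max([abs(new_p - p)]
                     + [abs(a - b) for a, b in zip(new_m, m)]
                     + [abs(a - b) for a, b in zip(new_u, u)])
        p, m, u = new_p, new_m, new_u
        if change < atol:
            break
    # g holds the posteriors from the last E-step.
    return p, m, u, g


# Hypothetical unlabelled comparison vectors for three features.
vectors = (
    [[1, 1, 1]] * 4 + [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
    + [[0, 0, 0]] * 6 + [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
)
p, m_probs, u_probs, posteriors = em_fellegi_sunter(vectors)
```

No match labels are passed in. The pairs that agree on every feature end up with a high posterior and the pairs that agree on nothing with a low one, which is what makes the method usable without training data.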

fit_predict(comparison_vectors, match_index=None)

Train the classifier and predict the class of the same record pairs.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors.
  • match_index (pandas.MultiIndex) – The true matches.
  • return_type (str) – Deprecated. Set the option instead: recordlinkage.set_option('classification.return_type', 'index').
Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs)

[DEPRECATED] Use ‘fit_predict’.

log_m_probs

Log probability P(x_i==1|Match) as described in the FS framework.

log_p

Log match probability as described in the FS framework.

log_u_probs

Log probability P(x_i==1|Non-match) as described in the FS framework.

log_weights

Log weights as described in the FS framework.

m_probs

Probability P(x_i==1|Match) as described in the FS framework.

p

Match probability as described in the FS framework.

predict(comparison_vectors)

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
  • return_type (str) – Deprecated. Set the option instead: recordlinkage.set_option('classification.return_type', 'index').
Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(comparison_vectors, return_type=None)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type (str) – Deprecated. (default ‘series’)
Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

u_probs

Probability P(x_i==1|Non-match) as described in the FS framework.

weights

Weights as described in the FS framework.

fit(X, *args, **kwargs)

Train the classifier.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
  • match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

When finding links within a single dataset (for example, deduplication), ensure that the training record pairs are from the lower triangular part of the dataset/matrix.

class recordlinkage.KMeansClassifier(match_cluster_center=None, nonmatch_cluster_center=None, **kwargs)

KMeans classifier.

The K-means clustering algorithm (wikipedia) partitions candidate record pairs into matches and non-matches. Each comparison vector belongs to the cluster with the nearest mean.

The K-means algorithm is an unsupervised learning algorithm and doesn’t need training data for fitting. The algorithm is calibrated for two clusters: a match cluster and a non-match cluster. The centers of these clusters can be given as arguments or set automatically.

The KMeansClassifier classifier uses the sklearn.cluster.KMeans clustering algorithm from SciKit-learn as kernel.

Parameters:
  • match_cluster_center (list, numpy.array) – The center of the match cluster. The length of the list/array must equal the number of comparison variables. If None, the match cluster center is set automatically. Default None.
  • nonmatch_cluster_center (list, numpy.array) – The center of the nonmatch (distinct) cluster. The length of the list/array must equal the number of comparison variables. If None, the non-match cluster center is set automatically. Default None.
  • **kwargs – Additional arguments to pass to sklearn.cluster.KMeans.
kernel

sklearn.cluster.KMeans – The kernel of the classifier. The kernel is sklearn.cluster.KMeans from SciKit-learn.

match_cluster_center

numpy.array – The center of the match cluster.

nonmatch_cluster_center

numpy.array – The center of the nonmatch (distinct) cluster.

Note

There are better methods for linking records than the k-means clustering algorithm. This algorithm can be useful for an (unsupervised) initial partition.
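
The two-cluster idea can be sketched in plain Python: start from a match center and a non-match center, assign each comparison vector to the nearest one, and move the centers to the mean of their members until the assignment stops changing. The function name two_means, the starting centers and the data are assumptions. The sketch does not use sklearn.cluster.KMeans.

```python
def two_means(vectors, match_center, nonmatch_center, max_iter=100):
    """Cluster comparison vectors around a match and a non-match center."""
    # Index 0 holds the non-match center, index 1 the match center.
    centers = [list(nonmatch_center), list(match_center)]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = []
    for _ in range(max_iter):
        # Assign every vector to the nearest center (ties go to non-match).
        new_labels = [
            int(squared_distance(v, centers[1]) < squared_distance(v, centers[0]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Move each center to the mean of its assigned vectors.
        for k in (0, 1):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


vectors = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
labels, centers = two_means(vectors, match_center=[1, 1, 1], nonmatch_center=[0, 0, 0])
```

Starting the match center at full agreement and the non-match center at full disagreement fixes which cluster is which, so the labels do not need to be interpreted afterwards.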

prob(*args, **kwargs)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
  • return_type (str) – Deprecated. (default ‘series’)
Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

fit(comparison_vectors, match_index=None)

Train the classifier.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
  • match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

When finding links within a single dataset (for example, deduplication), ensure that the training record pairs are from the lower triangular part of the dataset/matrix.

fit_predict(comparison_vectors, match_index=None)

Train the classifier and predict the class of the same record pairs.

Parameters:
  • comparison_vectors (pandas.DataFrame) – The comparison vectors.
  • match_index (pandas.MultiIndex) – The true matches.
  • return_type (str) – Deprecated. Set the option instead: recordlinkage.set_option('classification.return_type', 'index').
Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs)

[DEPRECATED] Use ‘fit_predict’.

predict(comparison_vectors)

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:
  • comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
  • return_type (str) – Deprecated. Set the option instead: recordlinkage.set_option('classification.return_type', 'index').
Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

User-defined algorithms

User-defined classifiers can use recordlinkage.base.BaseClassifier as their base class. SciKit-learn based models may want to subclass recordlinkage.adapters.SKLearnAdapter as well.

class recordlinkage.base.BaseClassifier

Base class for classification of record pairs.

This class contains methods for training the classifier. It distinguishes different types of training, such as supervised and unsupervised learning.

Probabilistic models can use the Fellegi and Sunter base class. This class is used for the recordlinkage.ECMClassifier and the recordlinkage.NaiveBayesClassifier.

class recordlinkage.classifiers.FellegiSunter(use_col_names=True, *args, **kwargs)

Fellegi and Sunter (1969) framework.

Meta class for probabilistic classification algorithms. The Fellegi and Sunter class is used for the recordlinkage.NaiveBayesClassifier and recordlinkage.ECMClassifier.

Parameters:
  • use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
log_p

float – Log match probability as described in the FS framework.

log_m_probs

np.ndarray – Log probability P(x_i==1|Match) as described in the FS framework.

log_u_probs

np.ndarray – Log probability P(x_i==1|Non-match) as described in the FS framework.

log_weights

np.ndarray – Log weights as described in the FS framework.

p

float – Match probability as described in the FS framework.

m_probs

np.ndarray – Probability P(x_i==1|Match) as described in the FS framework.

u_probs

np.ndarray – Probability P(x_i==1|Non-match) as described in the FS framework.

weights

np.ndarray – Weights as described in the FS framework.

References

Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.
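
The relationship between p, m_probs, u_probs and the weights can be sketched for binary comparison vectors as follows. The code uses the textbook Fellegi and Sunter agreement and disagreement weights and Bayes' rule under conditional independence. The function names and the parameter values are hypothetical, and the way the toolkit stores its weights attribute may differ.

```python
import math


def log_weight(vector, m_probs, u_probs):
    """Sum of per-feature log weights for a binary comparison vector."""
    total = 0.0
    for x, m, u in zip(vector, m_probs, u_probs):
        if x == 1:
            total += math.log(m / u)  # agreement weight
        else:
            total += math.log((1.0 - m) / (1.0 - u))  # disagreement weight
    return total


def match_posterior(vector, p, m_probs, u_probs):
    """P(Match | vector) by Bayes' rule under conditional independence."""
    odds = p / (1.0 - p) * math.exp(log_weight(vector, m_probs, u_probs))
    return odds / (1.0 + odds)


# Hypothetical parameters for two comparison features.
p = 0.1
m_probs = [0.9, 0.8]
u_probs = [0.1, 0.2]

w_agree = log_weight([1, 1], m_probs, u_probs)
w_disagree = log_weight([0, 0], m_probs, u_probs)
posterior_agree = match_posterior([1, 1], p, m_probs, u_probs)
```

With these values, full agreement multiplies the prior odds of 1:9 by 36, which gives a posterior match probability of 0.8.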

Examples

Unsupervised learning with the ECM algorithm. [See example on Github.](https://github.com/J535D165/recordlinkage/examples/unsupervised_learning.py)

Network

The Python Record Linkage Toolkit provides network/graph analysis tools for classification of record pairs into matches and distinct pairs. The toolkit provides the functionality for one-to-one linking and one-to-many linking. It is also possible to detect all connected components, which is useful in data deduplication.

class recordlinkage.OneToOneLinking(method='greedy')

[EXPERIMENTAL] One-to-one linking

A record from dataset A can match at most one record from dataset B. For example, (a1, a2) are records from A and (b1, b2) are records from B. A linkage of (a1, b1), (a1, b2), (a2, b1), (a2, b2) is not one-to-one connected. One of the results of one-to-one linking can be (a1, b1), (a2, b2).

Parameters:
  • method (str) – The method to solve the problem. Only ‘greedy’ is supported at the moment.

Note

This class is experimental and might change in future versions.
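
One plausible reading of a greedy strategy is to walk through the links in order and keep a link only when neither of its records has been used yet. The plain Python sketch below applies that idea to the example above. The function name greedy_one_to_one is hypothetical, and the toolkit's own 'greedy' method may behave differently.

```python
def greedy_one_to_one(links):
    """Keep a link only if neither of its records has been used before."""
    used_a, used_b = set(), set()
    result = []
    for a, b in links:
        if a not in used_a and b not in used_b:
            result.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return result


links = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
one_to_one = greedy_one_to_one(links)
```

The outcome depends on the order of the links, so sorting them by descending match score beforehand would favour the strongest links.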

class recordlinkage.OneToManyLinking(level=0, method='greedy')

[EXPERIMENTAL] One-to-many linking

A record from dataset A can link to multiple records from dataset B, but a record from B can link to only one record of dataset A. Use the level argument to switch A and B.

Parameters:
  • level (int) – The level of the MultiIndex to have the one relations. The options are 0 or 1 (indicating the level of the MultiIndex). Default 0.
  • method (str) – The method to solve the problem. Only ‘greedy’ is supported at the moment.

Example

Consider a MultiIndex with record pairs constructed from datasets A and B. To link a record from B to at most one record of A, use the following syntax:

> from recordlinkage import OneToManyLinking
> one_to_many = OneToManyLinking(0)
> one_to_many.compute(links)

To link a record from A to at most one record of B, use:

> one_to_many = OneToManyLinking(1)
> one_to_many.compute(links)

Note

This class is experimental and might change in future versions.

class recordlinkage.ConnectedComponents

[EXPERIMENTAL] Connected record pairs

This class identifies connected record pairs. Connected components are especially useful for detecting duplicates in a single dataset.

Note

This class is experimental and might change in future versions.
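
To show what a connected component means for deduplication, the sketch below groups records that are linked directly or through other records, using a small union-find structure in plain Python. The function name connected_components and the return format (a list of record sets) are assumptions. The toolkit class may return its result in a different form.

```python
def connected_components(links):
    """Group records that are linked directly or through other records."""
    parent = {}

    def find(record):
        parent.setdefault(record, record)
        while parent[record] != record:
            parent[record] = parent[parent[record]]  # path halving
            record = parent[record]
        return record

    # Merge the groups of the two records in every link.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the records under their group representative.
    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return sorted(groups.values(), key=lambda group: sorted(group))


# Records 1, 2 and 3 are duplicates of each other even though (1, 3) is not a link.
links = [(1, 2), (2, 3), (4, 5)]
components = connected_components(links)
```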