3. Classification

Classifiers

Classification is the step in the record linkage process where record pairs are classified into matches, non-matches and possible matches [Christen2012]. Classification algorithms can be supervised (trained on labelled record pairs) or unsupervised (no training data needed).
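To make the three outcomes concrete, the sketch below assigns a record pair to one of them by comparing a similarity score with two cut-offs. The function name, the score scale and both threshold values are assumptions made for this illustration; they are not part of the toolkit's API.

```python
def classify_pair(score, lower=0.3, upper=0.7):
    """Return 'match', 'possible match' or 'non-match' for one score.

    `lower` and `upper` are hypothetical cut-offs chosen for the example.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    # scores between the two cut-offs are left for clerical review
    return "possible match"


# hypothetical similarity scores for three candidate record pairs
scores = {("a1", "b1"): 0.95, ("a1", "b2"): 0.50, ("a2", "b1"): 0.10}
labels = {pair: classify_pair(score) for pair, score in scores.items()}
```

Setting both cut-offs to the same value removes the middle class and gives a plain match/non-match decision.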

Supervised

class recordlinkage.LogisticRegressionClassifier(coefficients=None, intercept=None, **kwargs)

Logistic Regression Classifier.

This classifier is an application of the logistic regression model (wikipedia). The classifier partitions candidate record pairs into matches and non-matches.

This algorithm is also known as Deterministic Record Linkage.

The LogisticRegressionClassifier classifier uses the sklearn.linear_model.LogisticRegression classification algorithm from SciKit-learn as kernel.

Parameters:

coefficients (list, numpy.array) – The coefficients of the logistic regression.
intercept (float) – The intercept of the logistic regression.
**kwargs – Additional arguments to pass to sklearn.linear_model.LogisticRegression.

kernel

The kernel of the classifier. The kernel is sklearn.linear_model.LogisticRegression from SciKit-learn.

Type:: sklearn.linear_model.LogisticRegression

coefficients

The coefficients of the logistic regression.

Type:: list

intercept

The intercept of the logistic regression.

Type:: float
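As a worked illustration of what the coefficients and the intercept mean, the standard logistic regression model turns a comparison vector into a match probability by applying the logistic function to a weighted sum. The helper name and the numeric values below are assumptions for this example; the code does not call the toolkit.

```python
import math


def match_probability(comparison_vector, coefficients, intercept):
    """Logistic model: sigmoid(intercept + sum(coefficient_i * x_i))."""
    if len(comparison_vector) != len(coefficients):
        raise ValueError("one coefficient is needed per comparison variable")
    z = intercept + sum(c * x for c, x in zip(coefficients, comparison_vector))
    return 1.0 / (1.0 + math.exp(-z))


# hypothetical parameters for three comparison variables
coefficients = [2.0, 2.0, 1.5]
intercept = -3.0

p_all_agree = match_probability([1, 1, 1], coefficients, intercept)     # z = 2.5
p_all_disagree = match_probability([0, 0, 0], coefficients, intercept)  # z = -3.0
```

A pair is usually labelled a match when this probability exceeds 0.5, which is the same as the weighted sum being positive.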

fit(comparison_vectors, match_index=None)

Train the classifier.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.
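One way to read this note: when both levels of the pair index refer to the same dataset, each unordered pair should appear only once, with the first record label greater than the second. The sketch below filters a pandas.MultiIndex accordingly. The index values and level names are made up, and the interpretation of "lower triangular" as "first label > second label" is an assumption of this example.

```python
import pandas as pd

# hypothetical candidate pairs within one dataset, in both orders
pairs = pd.MultiIndex.from_tuples(
    [(0, 1), (1, 0), (2, 0), (2, 1), (1, 2)],
    names=["rec_1", "rec_2"],
)

first = pairs.get_level_values(0)
second = pairs.get_level_values(1)

# keep only pairs from the lower triangular part (row label > column label)
lower_triangular = pairs[first > second]
```

Pairs given in the other order, such as (0, 1), would have to be swapped rather than dropped if they are the only record of that link.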

fit_predict(comparison_vectors, match_index=None)

Train the classifier and predict the class of the same record pairs.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs): [DEPRECATED] Use ‘fit_predict’.

predict(comparison_vectors)

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:

comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(comparison_vectors, return_type=None)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:

comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)

Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

class recordlinkage.NaiveBayesClassifier(binarize=None, alpha=0.0001, use_col_names=True, **kwargs)

Naive Bayes Classifier.

The Naive Bayes classifier (wikipedia) partitions candidate record pairs into matches and non-matches. The classifier is based on probabilistic principles. The Naive Bayes classification method has a close mathematical connection with the Fellegi and Sunter model.

Note

The NaiveBayesClassifier classifier differs from the Naive Bayes models in SciKit-learn. With binary input vectors, the NaiveBayesClassifier behaves like sklearn.naive_bayes.BernoulliNB.

Parameters:

binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to consist of multilevel vectors.
alpha (float) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing). Default 1e-4.
use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
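The effect of a binarize threshold can be sketched in a few lines. The helper below is hypothetical and only illustrates the idea of mapping similarity values to 0/1. Treating values strictly greater than the threshold as 1 follows the convention of sklearn.preprocessing.binarize; whether the toolkit handles the boundary the same way is an assumption.

```python
def binarize_vector(vector, threshold):
    """Map each feature to 1 if it is strictly above the threshold, else 0."""
    return [1 if value > threshold else 0 for value in vector]


# hypothetical similarity scores for four comparison variables
similarities = [0.92, 0.40, 0.75, 1.0]
binary = binarize_vector(similarities, threshold=0.75)
```

Note that the value equal to the threshold (0.75) maps to 0 under this convention.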

fit(X, *args, **kwargs)

Train the classifier.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

fit_predict(comparison_vectors, match_index=None)

Train the classifier and predict the class of the same record pairs.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs): [DEPRECATED] Use ‘fit_predict’.

property log_m_probs: Log probability P(x_i=1|Match) as described in the FS framework

property log_p: Log match probability as described in the FS framework

property log_u_probs: Log probability P(x_i=1|Non-match) as described in the FS framework

property log_weights: Log weights as described in the FS framework

property m_probs: Probability P(x_i=1|Match) as described in the FS framework

property p: Match probability as described in the FS framework

predict(comparison_vectors)

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:

comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(comparison_vectors, return_type=None)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:

comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)

Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

property u_probs: Probability P(x_i=1|Non-match) as described in the FS framework

property weights: Weights as described in the FS framework

class recordlinkage.SVMClassifier(*args, **kwargs)

Support Vector Machines Classifier

The Support Vector Machine classifier (wikipedia) partitions candidate record pairs into matches and non-matches. This implementation is a non-probabilistic binary linear classifier. Support vector machines are supervised learning models and therefore need training data.

The SVMClassifier classifier uses the sklearn.svm.LinearSVC classification algorithm from SciKit-learn as kernel.

Parameters:: **kwargs – Arguments to pass to sklearn.svm.LinearSVC.
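A linear, non-probabilistic classifier decides on the sign of a weighted sum rather than on a probability. The sketch below shows that decision rule only; the weights and bias are hypothetical hand-picked values, whereas a trained sklearn.svm.LinearSVC would learn them from the training pairs.

```python
def linear_decision(comparison_vector, weights, bias):
    """Return 1 (match) if the point lies on the positive side of the hyperplane."""
    score = bias + sum(w * x for w, x in zip(weights, comparison_vector))
    return 1 if score > 0 else 0


# hypothetical hyperplane: at least two of three variables must agree
weights = [1.0, 1.0, 1.0]
bias = -1.5

vectors = [[1, 1, 0], [1, 0, 0], [1, 1, 1]]
labels = [linear_decision(v, weights, bias) for v in vectors]
```

The score itself is a signed distance-like quantity and should not be read as a match probability.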

kernel

The kernel of the classifier. The kernel is sklearn.svm.LinearSVC from SciKit-learn.

Type:: sklearn.svm.LinearSVC

fit(comparison_vectors, match_index=None)

Train the classifier.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

fit_predict(comparison_vectors, match_index=None)

Train the classifier and predict the class of the same record pairs.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs): [DEPRECATED] Use ‘fit_predict’.

predict(comparison_vectors)

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:

comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(*args, **kwargs)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:

comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)

Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

Unsupervised

class recordlinkage.ECMClassifier(init='jaro', binarize=None, max_iter=100, atol=0.0001, use_col_names=True, **kwargs)

Expectation/Conditional Maximisation classifier (Unsupervised).

Expectation/Conditional Maximisation algorithm used to classify record pairs. This probabilistic record linkage algorithm is used in combination with the Fellegi and Sunter model. This classifier doesn’t need training data (unsupervised).

Parameters:

init (str) – Initialisation method for the algorithm. Options are: ‘jaro’ and ‘random’. Default ‘jaro’.
max_iter (int) – The maximum number of iterations of the EM algorithm. Default 100.
binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
atol (float) – The tolerance on the change in parameters between iterations. If the difference between the parameters of two consecutive iterations is smaller than this value, the algorithm is considered to have converged. Default 1e-4.
use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
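The roles of max_iter and atol can be seen in a small stand-alone sketch. The code below is a plain EM loop for two classes with independent binary comparison variables. It is not the toolkit's ECM implementation, and the function name, the eps clipping and the toy data are assumptions of this example. The starting values must lie strictly between 0 and 1.

```python
def em_binary_mixture(vectors, m, u, p, max_iter=100, atol=1e-4, eps=1e-6):
    """Plain EM for two classes with independent binary features."""
    m, u = list(m), list(u)
    n, k = len(vectors), len(m)

    def clip(value):
        # keep probabilities away from 0 and 1 so no likelihood becomes zero
        return min(max(value, eps), 1.0 - eps)

    for _ in range(max_iter):
        # E-step: posterior match probability of every comparison vector
        g = []
        for x in vectors:
            like_m, like_u = p, 1.0 - p
            for xi, mi, ui in zip(x, m, u):
                like_m *= mi if xi else 1.0 - mi
                like_u *= ui if xi else 1.0 - ui
            g.append(like_m / (like_m + like_u))

        # M-step: re-estimate the parameters from the soft assignments
        total_m = sum(g)
        total_u = n - total_m
        new_m = [clip(sum(gi * x[j] for gi, x in zip(g, vectors)) / total_m)
                 for j in range(k)]
        new_u = [clip(sum((1.0 - gi) * x[j] for gi, x in zip(g, vectors)) / total_u)
                 for j in range(k)]
        new_p = clip(total_m / n)

        # stop once no parameter moved by atol or more
        old, new = m + u + [p], new_m + new_u + [new_p]
        change = max(abs(a - b) for a, b in zip(old, new))
        m, u, p = new_m, new_u, new_p
        if change < atol:
            break
    return m, u, p


# toy data: ten binary comparison vectors over three variables
vectors = [[1, 1, 1]] * 3 + [[1, 1, 0]] + [[0, 0, 0]] * 5 + [[0, 1, 0]]
m, u, p = em_binary_mixture(vectors, m=[0.9] * 3, u=[0.1] * 3, p=0.5)
```

The starting values play the part that the init parameter plays for the classifier: a poor start can let the two classes swap roles or converge to a different local optimum.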

References

Herzog, Thomas N, Fritz J Scheuren and William E Winkler. 2007. Data quality and record linkage techniques. Vol. 1 Springer.

Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.

Collins, M. “The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm”. http://www.cs.columbia.edu/~mcollins/em.pdf

fit_predict(comparison_vectors, match_index=None)

Train the classifier and predict the class of the same record pairs.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs): [DEPRECATED] Use ‘fit_predict’.

property log_m_probs: Log probability P(x_i=1|Match) as described in the FS framework

property log_p: Log match probability as described in the FS framework

property log_u_probs: Log probability P(x_i=1|Non-match) as described in the FS framework

property log_weights: Log weights as described in the FS framework

property m_probs: Probability P(x_i=1|Match) as described in the FS framework

property p: Match probability as described in the FS framework

predict(comparison_vectors)

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:

comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(comparison_vectors, return_type=None)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:

comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)

Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

property u_probs: Probability P(x_i=1|Non-match) as described in the FS framework

property weights: Weights as described in the FS framework

fit(X, *args, **kwargs)

Train the classifier.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

class recordlinkage.KMeansClassifier(match_cluster_center=None, nonmatch_cluster_center=None, **kwargs)

KMeans classifier.

The K-means clustering algorithm (wikipedia) partitions candidate record pairs into matches and non-matches. Each comparison vector belongs to the cluster with the nearest mean.

The K-means algorithm is an unsupervised learning algorithm, so it doesn’t need training data for fitting. The algorithm is calibrated for two clusters: a match cluster and a non-match cluster. The centers of these clusters can be given as arguments or set automatically.

The KMeansClassifier classifier uses the sklearn.cluster.KMeans clustering algorithm from SciKit-learn as kernel.

Parameters:

match_cluster_center (list, numpy.array) – The center of the match cluster. The length of the list/array must equal the number of comparison variables. If None, the match cluster center is set automatically. Default None.
nonmatch_cluster_center (list, numpy.array) – The center of the nonmatch (distinct) cluster. The length of the list/array must equal the number of comparison variables. If None, the non-match cluster center is set automatically. Default None.
**kwargs – Additional arguments to pass to sklearn.cluster.KMeans.
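The "nearest mean" rule can be sketched without any library. The helper below performs only the assignment step for two given centers; it does not update the centers as k-means would, and the function name, centers and vectors are assumptions for this example.

```python
import math


def assign_to_nearest_center(vectors, match_center, nonmatch_center):
    """Label each vector 1 if it is closer to the match center, else 0."""
    if len(match_center) != len(nonmatch_center):
        raise ValueError("both centers need one value per comparison variable")
    labels = []
    for vector in vectors:
        d_match = math.dist(vector, match_center)
        d_nonmatch = math.dist(vector, nonmatch_center)
        labels.append(1 if d_match < d_nonmatch else 0)
    return labels


# hypothetical centers for three comparison variables
match_center = [1.0, 1.0, 1.0]
nonmatch_center = [0.0, 0.0, 0.0]

vectors = [[0.9, 1.0, 0.8], [0.1, 0.0, 0.2], [1.0, 0.0, 1.0]]
labels = assign_to_nearest_center(vectors, match_center, nonmatch_center)
```

As the parameter descriptions require, each center has exactly as many entries as there are comparison variables.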

kernel

The kernel of the classifier. The kernel is sklearn.cluster.KMeans from SciKit-learn.

Type:: sklearn.cluster.KMeans

match_cluster_center

The center of the match cluster.

Type:: numpy.array

nonmatch_cluster_center

The center of the nonmatch (distinct) cluster.

Type:: numpy.array

Note

There are better methods for linking records than the k-means clustering algorithm. This algorithm can be useful for an (unsupervised) initial partition.

prob(*args, **kwargs)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:

comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)

Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

fit(comparison_vectors, match_index=None)

Train the classifier.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

fit_predict(comparison_vectors, match_index=None)

Train the classifier and predict the class of the same record pairs.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs): [DEPRECATED] Use ‘fit_predict’.

predict(comparison_vectors)

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:

comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

Adapters

Adapters can be used to wrap machine learning models from external packages like SciKit-learn and Keras. For example, this makes it possible to classify record pairs with a neural network developed in Keras.

class recordlinkage.adapters.SKLearnAdapter

SciKit-learn adapter for record pair classification.

SciKit-learn adapter for record pair classification with SciKit-learn models.

# import a SciKit-learn classifier
from sklearn.ensemble import RandomForestClassifier

# import BaseClassifier and the SciKit-learn adapter
from recordlinkage.base import BaseClassifier
from recordlinkage.adapters import SKLearnAdapter
from recordlinkage.datasets import binary_vectors

class RandomForest(SKLearnAdapter, BaseClassifier):

    def __init__(self, *args, **kwargs):
        super(RandomForest, self).__init__()

        # set the kernel
        self.kernel = RandomForestClassifier(*args, **kwargs)


# make a sample dataset
features, links = binary_vectors(10000, 2000, return_links=True)

# initialise the random forest
cl = RandomForest(n_estimators=20)
cl.fit(features, links)

# predict the matches
cl.predict(...)

class recordlinkage.adapters.KerasAdapter

Keras adapter for record pair classification.

Keras adapter for record pair classification with Keras models.

Example of a Keras model used for classification.

import tensorflow as tf
from tensorflow.keras import layers
from recordlinkage.base import BaseClassifier
from recordlinkage.adapters import KerasAdapter

class NNClassifier(KerasAdapter, BaseClassifier):
    """Neural network classifier."""
    def __init__(self):
        super(NNClassifier, self).__init__()

        model = tf.keras.Sequential()
        model.add(layers.Dense(16, input_dim=8, activation='relu'))
        model.add(layers.Dense(8, activation='relu'))
        model.add(layers.Dense(1, activation='sigmoid'))
        model.compile(
            # tf.train.AdamOptimizer only exists in TensorFlow 1.x
            optimizer=tf.keras.optimizers.Adam(0.001),
            loss='binary_crossentropy',
            metrics=['accuracy']
        )

        self.kernel = model

# initialise the model
cl = NNClassifier()
# fit the model to the data
cl.fit(X_train, links_true)
# predict the class of the data
cl.predict(X_pred)

User-defined algorithms

User-defined classification algorithms can subclass recordlinkage.base.BaseClassifier. Models based on SciKit-learn may want to subclass recordlinkage.adapters.SKLearnAdapter as well.

class recordlinkage.base.BaseClassifier

Base class for classification of records pairs.

This class contains methods for training the classifier. It distinguishes different types of training, such as supervised and unsupervised learning.

learn(*args, **kwargs): [DEPRECATED] Use ‘fit_predict’.

fit(comparison_vectors, match_index=None)

Train the classifier.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

fit_predict(comparison_vectors, match_index=None)

Train the classifier and predict the class of the same record pairs.

Parameters:

comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

predict(comparison_vectors)

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:

comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.

Returns:

pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(comparison_vectors, return_type=None)

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:

comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)

Returns:

pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

Probabilistic models can use the Fellegi and Sunter base class. This class is used for the recordlinkage.ECMClassifier and the recordlinkage.NaiveBayesClassifier.

class recordlinkage.classifiers.FellegiSunter(use_col_names=True, *args, **kwargs)

Fellegi and Sunter (1969) framework.

Meta class for probabilistic classification algorithms. The Fellegi and Sunter class is used for the recordlinkage.NaiveBayesClassifier and recordlinkage.ECMClassifier.

Parameters:: use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.

References

Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.

property log_p: Log match probability as described in the FS framework

property log_m_probs: Log probability P(x_i=1|Match) as described in the FS framework

property log_u_probs: Log probability P(x_i=1|Non-match) as described in the FS framework

property log_weights: Log weights as described in the FS framework

property p: Match probability as described in the FS framework

property m_probs: Probability P(x_i=1|Match) as described in the FS framework

property u_probs: Probability P(x_i=1|Non-match) as described in the FS framework

property weights: Weights as described in the FS framework

Examples

Unsupervised learning with the ECM algorithm. See the example on GitHub: https://github.com/J535D165/recordlinkage/examples/unsupervised_learning.py

Network

The Python Record Linkage Toolkit provides network/graph analysis tools for classification of record pairs into matches and distinct pairs. The toolkit provides the functionality for one-to-one linking and one-to-many linking. It is also possible to detect all connected components which is useful in data deduplication.

class recordlinkage.OneToOneLinking(method='greedy')

[EXPERIMENTAL] One-to-one linking

A record from dataset A can match at most one record from dataset B. For example, (a1, a2) are records from A and (b1, b2) are records from B. A linkage of (a1, b1), (a1, b2), (a2, b1), (a2, b2) is not one-to-one connected. One of the results of one-to-one linking can be (a1, b1), (a2, b2).

Parameters:: method (str) – The method to solve the problem. Only ‘greedy’ is supported at the moment.
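A greedy strategy for this problem can be sketched as follows: walk through the pairs in order and keep a pair only if neither of its records has been used yet. This is an illustration of the idea applied to the example above, written on plain tuples; it is not presented as the toolkit's implementation, and the function name is hypothetical.

```python
def greedy_one_to_one(links):
    """Keep each pair whose two records are both still unlinked."""
    used_a, used_b, kept = set(), set(), []
    for rec_a, rec_b in links:
        if rec_a not in used_a and rec_b not in used_b:
            kept.append((rec_a, rec_b))
            used_a.add(rec_a)
            used_b.add(rec_b)
    return kept


# the fully connected example from the description above
links = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
result = greedy_one_to_one(links)
```

The outcome of a greedy pass depends on the order of the pairs, so sorting the links by match quality first is a common refinement.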

Note

This class is experimental and might change in future versions.

compute(links)

Compute the one-to-one linking.

Parameters:: links (pandas.MultiIndex) – The pairs to apply linking to.
Returns:: pandas.MultiIndex – A one-to-one matched MultiIndex of record pairs.

class recordlinkage.OneToManyLinking(level=0, method='greedy')

[EXPERIMENTAL] One-to-many linking

A record from dataset A can link to multiple records from dataset B, but a record from B can link to only one record of dataset A. Use the level argument to switch A and B.

Parameters:

level (int) – The level of the MultiIndex to have the one relations. The options are 0 or 1 (indicating the level of the MultiIndex). Default 0.
method (str) – The method to solve the problem. Only ‘greedy’ is supported at the moment.

Example

Consider a MultiIndex with record pairs constructed from datasets A and B. To link a record from B to at most one record of A, use the following syntax:

>>> one_to_many = OneToManyLinking(0)
>>> one_to_many.compute(links)

To link a record from A to at most one record of B, use:

>>> one_to_many = OneToManyLinking(1)
>>> one_to_many.compute(links)

Note

This class is experimental and might change in future versions.

compute(links)

Compute the one-to-many matching.

Parameters:: links (pandas.MultiIndex) – The pairs to apply linking to.
Returns:: pandas.MultiIndex – A one-to-many matched MultiIndex of record pairs.

class recordlinkage.ConnectedComponents

[EXPERIMENTAL] Connected record pairs

This class identifies connected record pairs. Connected components are especially useful for detecting duplicates in a single dataset.

Note

This class is experimental and might change in future versions.

compute(links)

Return the connected components.

Parameters:: links (pandas.MultiIndex) – The record pairs (links) in which to find connected components.
Returns:: list of pandas.MultiIndex – A list with pandas.MultiIndex objects. Each MultiIndex object represents a set of connected record pairs.
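The grouping that this method describes can be illustrated with a small union-find sketch on plain tuples. It assumes that both positions of a pair refer to records of the same dataset, as in deduplication, and it returns lists of pairs rather than pandas.MultiIndex objects. The function name and the links are made up for this example.

```python
def connected_components(links):
    """Group record pairs so that pairs sharing a record end up together."""
    parent = {}

    def find(record):
        # follow parent pointers to the root, shortening the path on the way
        parent.setdefault(record, record)
        while parent[record] != record:
            parent[record] = parent[parent[record]]
            record = parent[record]
        return record

    # merge the two records of every pair into one set
    for rec_1, rec_2 in links:
        root_1, root_2 = find(rec_1), find(rec_2)
        if root_1 != root_2:
            parent[root_2] = root_1

    # collect the pairs of each set, keyed by the root record
    groups = {}
    for rec_1, rec_2 in links:
        groups.setdefault(find(rec_1), []).append((rec_1, rec_2))
    return list(groups.values())


# hypothetical duplicate links within one dataset
links = [(1, 2), (2, 3), (4, 5), (6, 7), (5, 6)]
components = connected_components(links)
```

Here records 1, 2 and 3 form one group of duplicates and records 4 to 7 form another, although 4 and 7 are never compared directly.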