3. Classification¶

Classifiers¶

Classification is the step in the record linkage process where record pairs are classified into matches, non-matches and possible matches [Christen2012]. Classification algorithms can be supervised or unsupervised (that is, with or without training data).

Supervised¶

class recordlinkage.LogisticRegressionClassifier(coefficients=None, intercept=None, **kwargs)¶

Logistic Regression Classifier.

This classifier is an application of the logistic regression model (wikipedia). The classifier partitions candidate record pairs into matches and non-matches.

This algorithm is also known as Deterministic Record Linkage.

The LogisticRegressionClassifier classifier uses the sklearn.linear_model.LogisticRegression classification algorithm from SciKit-learn as kernel.

Parameters:	coefficients (list, numpy.array) – The coefficients of the logistic regression. intercept (float) – The intercept value. **kwargs – Additional arguments to pass to `sklearn.linear_model.LogisticRegression`.

kernel¶

The kernel of the classifier. The kernel is sklearn.linear_model.LogisticRegression from SciKit-learn.

Type:	sklearn.linear_model.LogisticRegression

coefficients¶

The coefficients of the logistic regression.

Type:	list

intercept¶

The intercept value.

Type:	float
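
As a rough illustration of how the coefficients and the intercept act on a comparison vector, the sketch below evaluates the standard logistic function by hand. The function name `match_probability` and the parameter values are made up for this example; the classifier itself delegates the computation to its SciKit-learn kernel.

```python
import math


def match_probability(comparison_vector, coefficients, intercept):
    """Standard logistic model: sigmoid(intercept + sum(coef_i * x_i))."""
    score = intercept + sum(
        c * x for c, x in zip(coefficients, comparison_vector)
    )
    return 1.0 / (1.0 + math.exp(-score))


# hypothetical parameters for three comparison variables
coefficients = [2.0, 1.5, 3.0]
intercept = -4.0

# a pair agreeing on every variable versus a pair agreeing on none
p_agree = match_probability([1, 1, 1], coefficients, intercept)
p_disagree = match_probability([0, 0, 0], coefficients, intercept)
```

With these values, full agreement gives a probability above 0.5 and full disagreement a probability below 0.5, so the intercept effectively sets how much agreement is needed before a pair is called a match.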

fit(comparison_vectors, match_index=None)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with. match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs come from the lower triangular part of the dataset/matrix.

fit_predict(comparison_vectors, match_index=None)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors. match_index (pandas.MultiIndex) – The true matches. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs)¶: [DEPRECATED] Use ‘fit_predict’.

predict(comparison_vectors)¶

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:	comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(comparison_vectors, return_type=None)¶

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:	comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors. return_type (str) – Deprecated. (default ‘series’)
Returns:	pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

class recordlinkage.NaiveBayesClassifier(binarize=None, alpha=0.0001, use_col_names=True, **kwargs)¶

Naive Bayes Classifier.

The Naive Bayes classifier (wikipedia) partitions candidate record pairs into matches and non-matches. The classifier is based on probabilistic principles. The Naive Bayes classification method has a close mathematical connection with the Fellegi and Sunter model.

Note

The NaiveBayesClassifier classifier differs from the Naive Bayes models in SciKit-learn. With binary input vectors, the NaiveBayesClassifier behaves like sklearn.naive_bayes.BernoulliNB.

Parameters:

binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to consist of multilevel vectors.
alpha (float) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing). Default 1e-4.
use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.

fit(X, *args, **kwargs)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with. match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs come from the lower triangular part of the dataset/matrix.

fit_predict(comparison_vectors, match_index=None)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors. match_index (pandas.MultiIndex) – The true matches. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs)¶: [DEPRECATED] Use ‘fit_predict’.

log_m_probs¶: Log probability P(x_i=1|Match) as described in the FS framework

log_p¶: Log match probability as described in the FS framework

log_u_probs¶: Log probability P(x_i=1|Non-match) as described in the FS framework

log_weights¶: Log weights as described in the FS framework

m_probs¶: Probability P(x_i=1|Match) as described in the FS framework

p¶: Match probability as described in the FS framework

predict(comparison_vectors)¶

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:	comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(comparison_vectors, return_type=None)¶

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:	comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors. return_type (str) – Deprecated. (default ‘series’)
Returns:	pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

u_probs¶: Probability P(x_i=1|Non-match) as described in the FS framework

weights¶: Weights as described in the FS framework

class recordlinkage.SVMClassifier(*args, **kwargs)¶

Support Vector Machines Classifier

The Support Vector Machine classifier (wikipedia) partitions candidate record pairs into matches and non-matches. This implementation is a non-probabilistic binary linear classifier. Support vector machines are supervised learning models. Therefore, SVM classifiers need training data.

The SVMClassifier classifier uses the sklearn.svm.LinearSVC classification algorithm from SciKit-learn as kernel.

Parameters:	**kwargs – Arguments to pass to `sklearn.svm.LinearSVC`.

kernel¶

The kernel of the classifier. The kernel is sklearn.svm.LinearSVC from SciKit-learn.

Type:	sklearn.svm.LinearSVC

fit(comparison_vectors, match_index=None)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with. match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs come from the lower triangular part of the dataset/matrix.

fit_predict(comparison_vectors, match_index=None)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors. match_index (pandas.MultiIndex) – The true matches. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs)¶: [DEPRECATED] Use ‘fit_predict’.

predict(comparison_vectors)¶

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:	comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(*args, **kwargs)¶

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:	comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors. return_type (str) – Deprecated. (default ‘series’)
Returns:	pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

Unsupervised¶

class recordlinkage.ECMClassifier(init='jaro', binarize=None, max_iter=100, atol=0.0001, use_col_names=True, *args, **kwargs)¶

Expectation/Conditional Maximisation classifier (Unsupervised).

Expectation/Conditional Maximisation algorithm used to classify record pairs. This probabilistic record linkage algorithm is used in combination with the Fellegi and Sunter model. This classifier doesn’t need training data (unsupervised).

Parameters:

init (str) – Initialisation method for the algorithm. Options are: ‘jaro’ and ‘random’. Default ‘jaro’.
max_iter (int) – The maximum number of iterations of the EM algorithm. Default 100.
binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
atol (float) – The absolute tolerance on the parameter change between iterations. If the difference between the parameters of two consecutive iterations is smaller than this value, the algorithm is considered to have converged. Default 1e-4.
use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
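
To make the roles of `max_iter` and `atol` concrete, here is a minimal sketch of an expectation-maximisation loop for binary comparison vectors. It is plain EM under a conditional independence assumption, not the library's ECM implementation, and the function name `em_fellegi_sunter`, the starting values and the toy data are all assumptions made for illustration.

```python
def em_fellegi_sunter(vectors, m, u, p, max_iter=100, atol=1e-4):
    """Plain EM for a two-class model with independent binary features."""
    n, n_features = len(vectors), len(m)
    for _ in range(max_iter):
        # E-step: posterior match probability g for every comparison vector
        g = []
        for x in vectors:
            like_m, like_u = p, 1.0 - p
            for xi, mi, ui in zip(x, m, u):
                like_m *= mi if xi else 1.0 - mi
                like_u *= ui if xi else 1.0 - ui
            g.append(like_m / (like_m + like_u))

        # M-step: re-estimate p, m and u from the soft assignments
        total = sum(g)
        new_p = total / n
        new_m = [
            sum(gi * x[j] for gi, x in zip(g, vectors)) / total
            for j in range(n_features)
        ]
        new_u = [
            sum((1.0 - gi) * x[j] for gi, x in zip(g, vectors)) / (n - total)
            for j in range(n_features)
        ]

        # converged when no parameter moved by more than atol
        old_params = [p] + m + u
        new_params = [new_p] + new_m + new_u
        delta = max(abs(a - b) for a, b in zip(new_params, old_params))
        p, m, u = new_p, new_m, new_u
        if delta < atol:
            break
    return p, m, u


# toy data: 30 mostly agreeing pairs and 70 mostly disagreeing pairs
vectors = (
    [[1, 1, 1]] * 20 + [[1, 1, 0]] * 5 + [[1, 0, 1]] * 5
    + [[0, 0, 0]] * 60 + [[0, 0, 1]] * 5 + [[0, 1, 0]] * 5
)

# hypothetical starting values for the match/non-match parameters
p_hat, m_hat, u_hat = em_fellegi_sunter(
    vectors, m=[0.9, 0.9, 0.9], u=[0.1, 0.1, 0.1], p=0.5
)
```

No labels are passed in: the loop only needs starting values, which is why this family of methods counts as unsupervised.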

References

Herzog, Thomas N, Fritz J Scheuren and William E Winkler. 2007. Data quality and record linkage techniques. Vol. 1 Springer.

Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.

Collins, M. “The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm”. http://www.cs.columbia.edu/~mcollins/em.pdf

fit_predict(comparison_vectors, match_index=None)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors. match_index (pandas.MultiIndex) – The true matches. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs)¶: [DEPRECATED] Use ‘fit_predict’.

log_m_probs¶: Log probability P(x_i=1|Match) as described in the FS framework

log_p¶: Log match probability as described in the FS framework

log_u_probs¶: Log probability P(x_i=1|Non-match) as described in the FS framework

log_weights¶: Log weights as described in the FS framework

m_probs¶: Probability P(x_i=1|Match) as described in the FS framework

p¶: Match probability as described in the FS framework

predict(comparison_vectors)¶

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:	comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(comparison_vectors, return_type=None)¶

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:	comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors. return_type (str) – Deprecated. (default ‘series’)
Returns:	pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

u_probs¶: Probability P(x_i=1|Non-match) as described in the FS framework

weights¶: Weights as described in the FS framework

fit(X, *args, **kwargs)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with. match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs come from the lower triangular part of the dataset/matrix.

class recordlinkage.KMeansClassifier(match_cluster_center=None, nonmatch_cluster_center=None, **kwargs)¶

KMeans classifier.

The K-means clustering algorithm (wikipedia) partitions candidate record pairs into matches and non-matches. Each comparison vector belongs to the cluster with the nearest mean.

The K-means algorithm is an unsupervised learning algorithm. The algorithm doesn’t need training data for fitting. The algorithm is calibrated for two clusters: a match cluster and a non-match cluster. The centers of these clusters can be given as arguments or set automatically.

The KMeansClassifier classifier uses the sklearn.cluster.KMeans clustering algorithm from SciKit-learn as kernel.
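
The two-cluster idea can be sketched without any dependencies. The code below is a bare-bones version of Lloyd's algorithm with one match center and one non-match center; the function name `two_means` and the choice of all-ones and all-zeros starting centers are assumptions for illustration, and the actual classifier relies on sklearn.cluster.KMeans instead.

```python
def two_means(vectors, match_center, nonmatch_center, max_iter=100):
    """Lloyd's algorithm with two clusters; label 1 is the match cluster."""
    centers = [list(nonmatch_center), list(match_center)]  # index 1 = match
    labels = []
    for _ in range(max_iter):
        # assignment step: nearest center by squared Euclidean distance
        new_labels = [
            min(
                (0, 1),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centers[k])),
            )
            for x in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels

        # update step: each center becomes the mean of its members
        for k in (0, 1):
            members = [x for x, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


vectors = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0], [1, 0, 1]]
labels, centers = two_means(
    vectors, match_center=[1, 1, 1], nonmatch_center=[0, 0, 0]
)
```

Fixing which starting center is the match center is what lets the cluster labels be read as match and non-match afterwards.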

Parameters:

match_cluster_center (list, numpy.array) – The center of the match cluster. The length of the list/array must equal the number of comparison variables. If None, the match cluster center is set automatically. Default None.
nonmatch_cluster_center (list, numpy.array) – The center of the nonmatch (distinct) cluster. The length of the list/array must equal the number of comparison variables. If None, the non-match cluster center is set automatically. Default None.
**kwargs – Additional arguments to pass to sklearn.cluster.KMeans.

kernel¶

The kernel of the classifier. The kernel is sklearn.cluster.KMeans from SciKit-learn.

Type:	sklearn.cluster.KMeans

match_cluster_center¶

The center of the match cluster.

Type:	numpy.array

nonmatch_cluster_center¶

The center of the nonmatch (distinct) cluster.

Type:	numpy.array

Note

There are better methods for linking records than the k-means clustering algorithm. This algorithm can be useful for an (unsupervised) initial partition.

prob(*args, **kwargs)¶

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:	comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors. return_type (str) – Deprecated. (default ‘series’)
Returns:	pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

fit(comparison_vectors, match_index=None)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with. match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs come from the lower triangular part of the dataset/matrix.

fit_predict(comparison_vectors, match_index=None)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors. match_index (pandas.MultiIndex) – The true matches. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

learn(*args, **kwargs)¶: [DEPRECATED] Use ‘fit_predict’.

predict(comparison_vectors)¶

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:	comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

Adapters¶

Adapters can be used to wrap machine learning models from external packages like SciKit-learn and Keras. For example, this makes it possible to classify record pairs with a neural network developed in Keras.

class recordlinkage.adapters.SKLearnAdapter¶

SciKit-learn adapter for record pair classification.

SciKit-learn adapter for record pair classification with SciKit-learn models.

# import a SciKit-learn classifier
from sklearn.ensemble import RandomForestClassifier

# import BaseClassifier and the SciKit-learn adapter
from recordlinkage.base import BaseClassifier
from recordlinkage.adapters import SKLearnAdapter
from recordlinkage.datasets import binary_vectors

class RandomForest(SKLearnAdapter, BaseClassifier):

    def __init__(self, *args, **kwargs):
        super(RandomForest, self).__init__()

        # set the kernel
        self.kernel = RandomForestClassifier(*args, **kwargs)


# make a sample dataset
features, links = binary_vectors(10000, 2000, return_links=True)

# initialise the random forest
cl = RandomForest(n_estimators=20)
cl.fit(features, links)

# predict the matches
cl.predict(...)

class recordlinkage.adapters.KerasAdapter¶

Keras adapter for record pair classification.

Keras adapter for record pair classification with Keras models.

Example of a Keras model used for classification.

import tensorflow as tf
from tensorflow.keras import layers
from recordlinkage.base import BaseClassifier
from recordlinkage.adapters import KerasAdapter

class NNClassifier(KerasAdapter, BaseClassifier):
    """Neural network classifier."""
    def __init__(self):
        super(NNClassifier, self).__init__()

        model = tf.keras.Sequential()
        model.add(layers.Dense(16, input_dim=8, activation='relu'))
        model.add(layers.Dense(8, activation='relu'))
        model.add(layers.Dense(1, activation='sigmoid'))
        model.compile(
            optimizer=tf.keras.optimizers.Adam(0.001),
            loss='binary_crossentropy',
            metrics=['accuracy']
        )

        self.kernel = model

# initialise the model
cl = NNClassifier()
# fit the model to the data
cl.fit(X_train, links_true)
# predict the class of the data
cl.predict(X_pred)

User-defined algorithms¶

User-defined classifiers can subclass recordlinkage.base.BaseClassifier. SciKit-learn based models may want to subclass recordlinkage.adapters.SKLearnAdapter as well.

class recordlinkage.base.BaseClassifier¶

Base class for classification of record pairs.

This class contains methods for training the classifier. It distinguishes between different types of training, such as supervised and unsupervised learning.

learn(*args, **kwargs)¶: [DEPRECATED] Use ‘fit_predict’.

fit(comparison_vectors, match_index=None)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with. match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.

Note

When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs come from the lower triangular part of the dataset/matrix.

fit_predict(comparison_vectors, match_index=None)¶

Train the classifier.

Parameters:	comparison_vectors (pandas.DataFrame) – The comparison vectors. match_index (pandas.MultiIndex) – The true matches. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

predict(comparison_vectors)¶

Predict the class of the record pairs.

Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.

Parameters:	comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors. return_type (str) – Deprecated. Use recordlinkage.options instead. Use the option recordlinkage.set_option(‘classification.return_type’, ‘index’) instead.
Returns:	pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).

prob(comparison_vectors, return_type=None)¶

Compute the probabilities for each record pair.

For each pair of records, estimate the probability of being a match.

Parameters:	comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors. return_type (str) – Deprecated. (default ‘series’)
Returns:	pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

Probabilistic models can use the Fellegi and Sunter base class. This class is used for the recordlinkage.ECMClassifier and the recordlinkage.NaiveBayesClassifier.

class recordlinkage.classifiers.FellegiSunter(use_col_names=True, *args, **kwargs)¶

Fellegi and Sunter (1969) framework.

Meta class for probabilistic classification algorithms. The Fellegi and Sunter class is used for the recordlinkage.NaiveBayesClassifier and recordlinkage.ECMClassifier.

Parameters:	use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.

References

Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.

log_p¶: Log match probability as described in the FS framework

log_m_probs¶: Log probability P(x_i=1|Match) as described in the FS framework

log_u_probs¶: Log probability P(x_i=1|Non-match) as described in the FS framework

log_weights¶: Log weights as described in the FS framework

p¶: Match probability as described in the FS framework

m_probs¶: Probability P(x_i=1|Match) as described in the FS framework

u_probs¶: Probability P(x_i=1|Non-match) as described in the FS framework

weights¶: Weights as described in the FS framework
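
The relation between these quantities can be sketched for binary comparison vectors. The helper names `fs_log_weight` and `fs_match_probability` and the numbers below are hypothetical; the sketch follows the textbook Fellegi and Sunter formulation with conditionally independent features, and the way the library stores its weights (for example for multilevel vectors) may differ.

```python
import math


def fs_log_weight(x, m, u):
    """Log-likelihood ratio of one binary comparison outcome."""
    if x == 1:
        return math.log(m / u)  # agreement weight
    return math.log((1.0 - m) / (1.0 - u))  # disagreement weight


def fs_match_probability(vector, m_probs, u_probs, p):
    """Posterior match probability for one comparison vector."""
    like_match, like_nonmatch = p, 1.0 - p
    for x, m, u in zip(vector, m_probs, u_probs):
        like_match *= m if x == 1 else 1.0 - m
        like_nonmatch *= u if x == 1 else 1.0 - u
    return like_match / (like_match + like_nonmatch)


# made-up parameters for two comparison variables
m_probs = [0.9, 0.8]  # P(x_i=1|Match)
u_probs = [0.1, 0.2]  # P(x_i=1|Non-match)
p = 0.1               # prior match probability

vector = [1, 1]
total_weight = sum(
    fs_log_weight(x, m, u) for x, m, u in zip(vector, m_probs, u_probs)
)
prob_agree = fs_match_probability(vector, m_probs, u_probs, p)
```

The summed log weight is the log of the factor by which the prior odds are multiplied: here the prior odds 0.1/0.9 times 36 give posterior odds of 4, that is, a match probability of 0.8.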

Examples¶

Unsupervised learning with the ECM algorithm. [See example on Github.](https://github.com/J535D165/recordlinkage/examples/unsupervised_learning.py)

Network¶

The Python Record Linkage Toolkit provides network/graph analysis tools for classification of record pairs into matches and distinct pairs. The toolkit provides the functionality for one-to-one linking and one-to-many linking. It is also possible to detect all connected components, which is useful in data deduplication.

class recordlinkage.OneToOneLinking(method='greedy')¶

[EXPERIMENTAL] One-to-one linking

A record from dataset A can match at most one record from dataset B. For example, (a1, a2) are records from A and (b1, b2) are records from B. A linkage of (a1, b1), (a1, b2), (a2, b1), (a2, b2) is not one-to-one connected. One of the results of one-to-one linking can be (a1, b1), (a2, b2).

Parameters:	method (str) – The method to solve the problem. Only ‘greedy’ is supported at the moment.

Note

This class is experimental and might change in future versions.

compute(links)¶

Compute the one-to-one linking.

Parameters:	links (pandas.MultiIndex) – The pairs to apply linking to.
Returns:	pandas.MultiIndex – A one-to-one matched MultiIndex of record pairs.
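
One simple way to picture a greedy one-to-one strategy is to walk through the pairs in order and keep a pair only when neither record has been used yet. The sketch below works on plain tuples rather than a pandas.MultiIndex, and the function name `greedy_one_to_one` is hypothetical; the library's ‘greedy’ method may order or break ties differently.

```python
def greedy_one_to_one(links):
    """Keep a pair only if neither of its records was used before."""
    seen_a, seen_b, kept = set(), set(), []
    for a, b in links:
        if a not in seen_a and b not in seen_b:
            kept.append((a, b))
            seen_a.add(a)
            seen_b.add(b)
    return kept


# the fully connected example from the class description
links = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
result = greedy_one_to_one(links)
```

Because the outcome depends on the order of the pairs, sorting them by a similarity score before applying a rule like this is a common refinement.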

class recordlinkage.OneToManyLinking(level=0, method='greedy')¶

[EXPERIMENTAL] One-to-many linking

A record from dataset A can link multiple records from dataset B, but a record from B can link to only one record of dataset A. Use the level argument to switch A and B.

Parameters:	level (int) – The level of the MultiIndex that holds the ‘one’ side of the relation. The options are 0 or 1 (indicating the level of the MultiIndex). Default 0. method (str) – The method to solve the problem. Only ‘greedy’ is supported at the moment.

Example

Consider a MultiIndex with record pairs constructed from datasets A and B. To link a record from B to at most one record of A, use the following syntax:

> one_to_many = OneToManyLinking(0)
> one_to_many.compute(links)

To link a record from A to at most one record of B, use:

> one_to_many = OneToManyLinking(1)
> one_to_many.compute(links)

Note

This class is experimental and might change in future versions.

compute(links)¶

Compute the one-to-many matching.

Parameters:	links (pandas.MultiIndex) – The pairs to apply linking to.
Returns:	pandas.MultiIndex – A one-to-many matched MultiIndex of record pairs.
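
A greedy one-to-many rule can be pictured as deduplicating on one side of the pairs only. In the sketch below, `greedy_one_to_many` and its `unique_level` argument are hypothetical names: `unique_level` is the position (0 or 1) whose records may appear at most once, and no claim is made about how it maps onto the `level` argument of the class.

```python
def greedy_one_to_many(links, unique_level=1):
    """Keep the first pair seen for every record on `unique_level`."""
    seen, kept = set(), []
    for pair in links:
        key = pair[unique_level]
        if key not in seen:
            kept.append(pair)
            seen.add(key)
    return kept


links = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]

# every record on position 1 appears at most once
one_per_b = greedy_one_to_many(links, unique_level=1)

# every record on position 0 appears at most once
one_per_a = greedy_one_to_many(links, unique_level=0)
```

Records on the other position are free to repeat, which is what separates this from one-to-one linking.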

class recordlinkage.ConnectedComponents¶

[EXPERIMENTAL] Connected record pairs

This class identifies connected record pairs. Connected components are especially used in detecting duplicates in a single dataset.

Note

This class is experimental and might change in future versions.

compute(links)¶

Return the connected components.

Parameters:	links (pandas.MultiIndex) – The record pairs for which to compute the connected components.
Returns:	list of pandas.MultiIndex – A list with pandas.MultiIndex objects. Each MultiIndex object represents a set of connected record pairs.
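
For intuition, connected components over record pairs can be found with a small union-find structure. This sketch uses plain tuples instead of a pandas.MultiIndex and assumes that record identifiers are unique across both positions of a pair, as is the case when deduplicating a single dataset; the function name `connected_components` is hypothetical and this is not necessarily how the class computes its result.

```python
def connected_components(links):
    """Group record pairs whose records are transitively linked."""
    parent = {}

    def find(x):
        # return the representative of x, shortening the path on the way
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # union the two records of every pair
    for a, b in links:
        parent[find(a)] = find(b)

    # collect the pairs per representative, preserving input order
    groups = {}
    for a, b in links:
        groups.setdefault(find(a), []).append((a, b))
    return list(groups.values())


links = [(1, 2), (2, 3), (4, 5)]
components = connected_components(links)
```

Records 1, 2 and 3 end up in one group although 1 and 3 were never compared directly, which is the property that makes connected components useful for duplicate detection.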