3. Classification¶
Classifiers¶
Classification is the step in the record linkage process where record pairs are classified into matches, nonmatches and possible matches [Christen2012]. Classification algorithms can be supervised or unsupervised (with or without training data).
See also
[Christen2012]  Christen, Peter. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media. 
Supervised¶

class
recordlinkage.
LogisticRegressionClassifier
(coefficients=None, intercept=None, **kwargs)¶ Logistic Regression Classifier.
This classifier is an application of the logistic regression model (wikipedia). The classifier partitions candidate record pairs into matches and nonmatches.
This algorithm is also known as Deterministic Record Linkage.
The LogisticRegressionClassifier uses the sklearn.linear_model.LogisticRegression classification algorithm from scikit-learn as kernel.
Parameters:  coefficients (list, numpy.array) – The coefficients of the logistic regression.
 intercept (float) – The intercept value.
 **kwargs – Additional arguments to pass to sklearn.linear_model.LogisticRegression.

kernel
¶ The kernel of the classifier. The kernel is sklearn.linear_model.LogisticRegression from scikit-learn.
Type: sklearn.linear_model.LogisticRegression

fit
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
 match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.
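As an illustration of the note above, the following sketch normalises record pairs from a single dataset so that every pair lies in the lower triangular part of the comparison matrix. It assumes that "lower triangular" means the first identifier is larger than the second and that identifiers are mutually comparable. The helper name to_lower_triangular is hypothetical and not part of recordlinkage.

```python
def to_lower_triangular(pairs):
    """Return unique pairs (i, j) with i > j, dropping self-pairs.

    Assumption: 'lower triangular' means row identifier > column identifier.
    """
    normalised = set()
    for i, j in pairs:
        if i == j:
            continue  # a record is never a duplicate of itself
        normalised.add((max(i, j), min(i, j)))
    return sorted(normalised)


pairs = [(1, 3), (3, 1), (2, 2), (4, 0)]
lower_pairs = to_lower_triangular(pairs)
print(lower_pairs)  # [(3, 1), (4, 0)]
```

The resulting tuples could then be turned into a pandas.MultiIndex before being passed as match_index.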

fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier and predict the class of the record pairs.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors.
 match_index (pandas.MultiIndex) – The true matches.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.

predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, nonmatches and possible matches. The classifier has to be trained to call this method.
Parameters:  comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

prob
(comparison_vectors, return_type=None)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters:  comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
 return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
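To make the role of the coefficients and intercept parameters concrete, here is a minimal sketch of how a logistic model turns one comparison vector into a match probability. It is written in plain Python and does not call recordlinkage or scikit-learn. The function names, the example coefficients and the 0.5 threshold are assumptions chosen for illustration.

```python
import math


def match_probability(vector, coefficients, intercept):
    """Logistic model: sigmoid(intercept + sum(coefficient * feature))."""
    score = intercept + sum(c * x for c, x in zip(coefficients, vector))
    return 1.0 / (1.0 + math.exp(-score))


def classify(vector, coefficients, intercept, threshold=0.5):
    """Label a pair 1 (match) or 0 (nonmatch); the threshold is an assumption."""
    probability = match_probability(vector, coefficients, intercept)
    return 1 if probability >= threshold else 0


# hypothetical, hand-picked parameters for three comparison features
coefficients = [2.0, 2.0, 1.0]
intercept = -3.0

p_agree = match_probability([1, 1, 1], coefficients, intercept)
p_disagree = match_probability([0, 0, 1], coefficients, intercept)
print(round(p_agree, 4), round(p_disagree, 4))  # 0.8808 0.1192
```

Fixing the parameters by hand, instead of fitting them, corresponds to the deterministic use of the classifier mentioned above.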

class
recordlinkage.
NaiveBayesClassifier
(binarize=None, alpha=0.0001, use_col_names=True, **kwargs)¶ Naive Bayes Classifier.
The Naive Bayes classifier (wikipedia) partitions candidate record pairs into matches and nonmatches. The classifier is based on probabilistic principles. The Naive Bayes classification method has a close mathematical connection with the Fellegi and Sunter model.
Note
The NaiveBayesClassifier differs from the Naive Bayes models in scikit-learn. With binary input vectors, the NaiveBayesClassifier behaves like sklearn.naive_bayes.BernoulliNB.
Parameters:  binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to consist of multilevel vectors.
 alpha (float) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing). Default 1e-4.
 use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
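The effect of a binarize threshold can be sketched as follows. The sketch assumes the convention used by scikit-learn's binarize function, where values strictly greater than the threshold map to 1 and all other values map to 0. Whether the classifier applies exactly this rule is an assumption, and the helper below is hypothetical.

```python
def binarize(vectors, threshold):
    """Map each similarity value to 1 if it exceeds the threshold, else 0.

    Assumption: strict 'greater than', as in sklearn.preprocessing.binarize.
    """
    return [[1 if value > threshold else 0 for value in row] for row in vectors]


similarities = [[0.9, 0.2], [0.5, 0.51]]
binary_vectors = binarize(similarities, threshold=0.5)
print(binary_vectors)  # [[1, 0], [0, 1]]
```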

fit
(X, *args, **kwargs)¶ Train the classifier.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
 match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier and predict the class of the record pairs.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors.
 match_index (pandas.MultiIndex) – The true matches.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.

log_m_probs
¶ Log probability P(x_i=1|Match) as described in the FS framework

log_p
¶ Log match probability as described in the FS framework

log_u_probs
¶ Log probability P(x_i=1|Nonmatch) as described in the FS framework

log_weights
¶ Log weights as described in the FS framework

m_probs
¶ Probability P(x_i=1|Match) as described in the FS framework

p
¶ Match probability as described in the FS framework

predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, nonmatches and possible matches. The classifier has to be trained to call this method.
Parameters:  comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

prob
(comparison_vectors, return_type=None)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters:  comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
 return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

u_probs
¶ Probability P(x_i=1|Nonmatch) as described in the FS framework

weights
¶ Weights as described in the FS framework
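The attributes above (m_probs, u_probs and p) are estimated from labelled data when the classifier is trained. The following plain-Python sketch shows one way such estimates can be obtained from binary comparison vectors with additive smoothing. The exact smoothing formula used by NaiveBayesClassifier is not stated here, so the formula below is an assumption, and the function name is hypothetical.

```python
def estimate_m_u(vectors, labels, alpha=1e-4):
    """Estimate P(x_i=1|Match), P(x_i=1|Nonmatch) and the match share p.

    Assumption: additive smoothing of the form (count + alpha) / (n + 2 * alpha).
    """
    n_features = len(vectors[0])
    match_rows = [v for v, y in zip(vectors, labels) if y == 1]
    nonmatch_rows = [v for v, y in zip(vectors, labels) if y == 0]

    def smoothed_rate(rows, i):
        agreements = sum(row[i] for row in rows)
        return (agreements + alpha) / (len(rows) + 2 * alpha)

    m = [smoothed_rate(match_rows, i) for i in range(n_features)]
    u = [smoothed_rate(nonmatch_rows, i) for i in range(n_features)]
    p = len(match_rows) / len(vectors)
    return m, u, p


# toy training data: three matches followed by three nonmatches
vectors = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
labels = [1, 1, 1, 0, 0, 0]

m_est, u_est, p_est = estimate_m_u(vectors, labels)
print(m_est, u_est, p_est)
```

With alpha larger than zero, no estimate is exactly 0 or 1, which keeps the log probabilities finite.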

class
recordlinkage.
SVMClassifier
(*args, **kwargs)¶ Support Vector Machines Classifier
The Support Vector Machine classifier (wikipedia) partitions candidate record pairs into matches and nonmatches. This implementation is a nonprobabilistic binary linear classifier. Support vector machines are supervised learning models. Therefore, SVM classifiers need training data.
The SVMClassifier uses the sklearn.svm.LinearSVC classification algorithm from scikit-learn as kernel.
Parameters: **kwargs – Arguments to pass to sklearn.svm.LinearSVC.

kernel
¶ The kernel of the classifier. The kernel is sklearn.svm.LinearSVC from scikit-learn.
Type: sklearn.svm.LinearSVC

fit
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
 match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier and predict the class of the record pairs.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors.
 match_index (pandas.MultiIndex) – The true matches.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.

predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, nonmatches and possible matches. The classifier has to be trained to call this method.
Parameters:  comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

prob
(*args, **kwargs)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters:  comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
 return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

Unsupervised¶

class
recordlinkage.
ECMClassifier
(init='jaro', binarize=None, max_iter=100, atol=0.0001, use_col_names=True, *args, **kwargs)¶ Expectation/Conditional Maximisation classifier (Unsupervised).
Expectation/Conditional Maximisation algorithm used to classify record pairs. This probabilistic record linkage algorithm is used in combination with the Fellegi and Sunter model. This classifier doesn’t need training data (unsupervised).
Parameters:  init (str) – Initialisation method for the algorithm. Options are: ‘jaro’ and ‘random’. Default ‘jaro’.
 max_iter (int) – The maximum number of iterations of the EM algorithm. Default 100.
 binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
 atol (float) – The tolerance on the change of the parameters between iterations. If the difference between the parameters of two consecutive iterations is smaller than this value, the algorithm is considered to have converged. Default 1e-4.
 use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
References
Herzog, Thomas N, Fritz J Scheuren and William E Winkler. 2007. Data quality and record linkage techniques. Vol. 1 Springer.
Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.
Collins, M. “The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm”. http://www.cs.columbia.edu/~mcollins/em.pdf

fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier and predict the class of the record pairs.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors.
 match_index (pandas.MultiIndex) – The true matches.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.

log_m_probs
¶ Log probability P(x_i=1|Match) as described in the FS framework

log_p
¶ Log match probability as described in the FS framework

log_u_probs
¶ Log probability P(x_i=1|Nonmatch) as described in the FS framework

log_weights
¶ Log weights as described in the FS framework

m_probs
¶ Probability P(x_i=1|Match) as described in the FS framework

p
¶ Match probability as described in the FS framework

predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, nonmatches and possible matches. The classifier has to be trained to call this method.
Parameters:  comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

prob
(comparison_vectors, return_type=None)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters:  comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
 return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

u_probs
¶ Probability P(x_i=1|Nonmatch) as described in the FS framework

weights
¶ Weights as described in the FS framework

fit
(X, *args, **kwargs)¶ Train the classifier.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
 match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.
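The idea behind the unsupervised estimation can be sketched with a small expectation-maximisation loop for the Fellegi and Sunter model on binary comparison vectors. This is a simplified illustration in plain Python, not the implementation used by ECMClassifier. The starting values, the clipping constant and the function names are assumptions; max_iter and atol play the same role as the parameters of the same name above.

```python
def clip(value, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid divisions by zero."""
    return min(max(value, eps), 1.0 - eps)


def em_fellegi_sunter(vectors, m, u, p, max_iter=100, atol=1e-4):
    """Estimate m, u and p from unlabelled binary comparison vectors."""
    n, k = len(vectors), len(vectors[0])
    for _ in range(max_iter):
        # E-step: posterior match probability g for every vector
        g = []
        for x in vectors:
            like_m, like_u = p, 1.0 - p
            for xi, mi, ui in zip(x, m, u):
                like_m *= mi if xi else 1.0 - mi
                like_u *= ui if xi else 1.0 - ui
            g.append(like_m / (like_m + like_u))
        # M-step: re-estimate the parameters from the soft assignments
        new_p = clip(sum(g) / n)
        total = new_p * n
        new_m = [clip(sum(gi * x[i] for gi, x in zip(g, vectors)) / total)
                 for i in range(k)]
        new_u = [clip(sum((1.0 - gi) * x[i] for gi, x in zip(g, vectors)) / (n - total))
                 for i in range(k)]
        # stop when no parameter moved by more than atol
        old, new = m + u + [p], new_m + new_u + [new_p]
        m, u, p = new_m, new_u, new_p
        if max(abs(a - b) for a, b in zip(new, old)) < atol:
            break
    return m, u, p


# toy data: a few pairs agreeing on most fields, many agreeing on few
vectors = ([[1, 1, 1]] * 4 + [[1, 1, 0], [1, 0, 1]] + [[0, 0, 0]] * 10
           + [[1, 0, 0]] * 2 + [[0, 1, 0]] * 2 + [[0, 0, 1]] * 2)

# hypothetical starting values for m, u and p
m_hat, u_hat, p_hat = em_fellegi_sunter(vectors, [0.9] * 3, [0.1] * 3, 0.1)
print(m_hat, u_hat, p_hat)
```

No match_index is needed: the labels are replaced by the posterior probabilities computed in the E-step.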

class
recordlinkage.
KMeansClassifier
(match_cluster_center=None, nonmatch_cluster_center=None, **kwargs)¶ K-means classifier.
The K-means clustering algorithm (wikipedia) partitions candidate record pairs into matches and nonmatches. Each comparison vector belongs to the cluster with the nearest mean.
The K-means algorithm is an unsupervised learning algorithm. The algorithm doesn’t need training data for fitting. The algorithm is calibrated for two clusters: a match cluster and a nonmatch cluster. The centers of these clusters can be given as arguments or set automatically.
The KMeansClassifier uses the sklearn.cluster.KMeans clustering algorithm from scikit-learn as kernel.
Parameters:  match_cluster_center (list, numpy.array) – The center of the match cluster. The length of the list/array must equal the number of comparison variables. If None, the match cluster center is set automatically. Default None.
 nonmatch_cluster_center (list, numpy.array) – The center of the nonmatch (distinct) cluster. The length of the list/array must equal the number of comparison variables. If None, the nonmatch cluster center is set automatically. Default None.
 **kwargs – Additional arguments to pass to sklearn.cluster.KMeans.

kernel
¶ The kernel of the classifier. The kernel is sklearn.cluster.KMeans from scikit-learn.
Type: sklearn.cluster.KMeans

match_cluster_center
¶ The center of the match cluster.
Type: numpy.array

nonmatch_cluster_center
¶ The center of the nonmatch (distinct) cluster.
Type: numpy.array
Note
There are better methods for linking records than the K-means clustering algorithm. This algorithm can be useful for an (unsupervised) initial partition.

prob
(*args, **kwargs)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters:  comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
 return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

fit
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
 match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier and predict the class of the record pairs.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors.
 match_index (pandas.MultiIndex) – The true matches.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.

predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, nonmatches and possible matches. The classifier has to be trained to call this method.
Parameters:  comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).
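The two-cluster calibration of the K-means classifier can be illustrated with a small plain-Python sketch that starts from a match centre and a nonmatch centre and assigns each comparison vector to the nearest one. It does not use sklearn.cluster.KMeans. Choosing all ones and all zeros as starting centres is an assumption made for the example, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def two_means(vectors, match_center, nonmatch_center, max_iter=100):
    """Cluster vectors into nonmatches (label 0) and matches (label 1)."""
    centers = [list(nonmatch_center), list(match_center)]
    labels = None
    for _ in range(max_iter):
        # assignment step: nearest centre wins, ties go to the nonmatch cluster
        new_labels = [
            0 if squared_distance(v, centers[0]) <= squared_distance(v, centers[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres are too
        labels = new_labels
        # update step: move each centre to the mean of its members
        for cluster in (0, 1):
            members = [v for v, label in zip(vectors, labels) if label == cluster]
            if members:
                centers[cluster] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


vectors = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]]
labels, centers = two_means(vectors, match_center=[1, 1, 1], nonmatch_center=[0, 0, 0])
print(labels)  # [1, 1, 0, 0, 0, 1]
```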
Adapters¶
Adapters can be used to wrap machine learning models from external packages like scikit-learn and Keras. For example, this makes it possible to classify record pairs with a neural network developed in Keras.

class
recordlinkage.adapters.
SKLearnAdapter
¶ Scikit-learn adapter for record pair classification.
Adapter for record pair classification with scikit-learn models.
# import a scikit-learn classifier
from sklearn.ensemble import RandomForestClassifier

# import BaseClassifier and the scikit-learn adapter
from recordlinkage.base import BaseClassifier
from recordlinkage.adapters import SKLearnAdapter
from recordlinkage.datasets import binary_vectors


class RandomForest(SKLearnAdapter, BaseClassifier):

    def __init__(self, *args, **kwargs):
        super(RandomForest, self).__init__()

        # set the kernel
        self.kernel = RandomForestClassifier(*args, **kwargs)


# make a sample dataset
features, links = binary_vectors(10000, 2000, return_links=True)

# initialise the random forest
cl = RandomForest(n_estimators=20)
cl.fit(features, links)

# predict the matches (pass a DataFrame with comparison vectors)
cl.predict(...)

class
recordlinkage.adapters.
KerasAdapter
¶ Keras adapter for record pair classification.
Keras adapter for record pair classification with Keras models.
Example of a Keras model used for classification.
import tensorflow as tf
from tensorflow.keras import layers

from recordlinkage.base import BaseClassifier
from recordlinkage.adapters import KerasAdapter


class NNClassifier(KerasAdapter, BaseClassifier):
    """Neural network classifier."""

    def __init__(self):
        super(NNClassifier, self).__init__()

        model = tf.keras.Sequential()
        model.add(layers.Dense(16, input_dim=8, activation='relu'))
        model.add(layers.Dense(8, activation='relu'))
        model.add(layers.Dense(1, activation='sigmoid'))
        model.compile(
            # tf.train.AdamOptimizer only exists in TensorFlow 1.x;
            # tf.keras.optimizers.Adam is the Keras equivalent
            optimizer=tf.keras.optimizers.Adam(0.001),
            loss='binary_crossentropy',
            metrics=['accuracy']
        )

        self.kernel = model


# initialise the model
cl = NNClassifier()

# fit the model to the data (X_train and links_true are placeholders)
cl.fit(X_train, links_true)

# predict the class of the data (X_pred is a placeholder)
cl.predict(X_pred)
User-defined algorithms¶
User-defined classifiers can build on recordlinkage.base.BaseClassifier. Scikit-learn based models may want to subclass recordlinkage.adapters.SKLearnAdapter as well.

class
recordlinkage.base.
BaseClassifier
¶ Base class for classification of record pairs.
This class contains methods for training the classifier. It distinguishes different types of training, such as supervised and unsupervised learning.

learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.

fit
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
 match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier and predict the class of the record pairs.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors.
 match_index (pandas.MultiIndex) – The true matches.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, nonmatches and possible matches. The classifier has to be trained to call this method.
Parameters:  comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

prob
(comparison_vectors, return_type=None)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters:  comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
 return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

Probabilistic models can use the Fellegi and Sunter base class. This class is used for the recordlinkage.ECMClassifier and the recordlinkage.NaiveBayesClassifier.

class
recordlinkage.classifiers.
FellegiSunter
(use_col_names=True, *args, **kwargs)¶ Fellegi and Sunter (1969) framework.
Base class for probabilistic classification algorithms. The Fellegi and Sunter class is used for the recordlinkage.NaiveBayesClassifier and recordlinkage.ECMClassifier.
Parameters: use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
References
Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.

log_p
¶ Log match probability as described in the FS framework

log_m_probs
¶ Log probability P(x_i=1|Match) as described in the FS framework

log_u_probs
¶ Log probability P(x_i=1|Nonmatch) as described in the FS framework

log_weights
¶ Log weights as described in the FS framework

p
¶ Match probability as described in the FS framework

m_probs
¶ Probability P(x_i=1|Match) as described in the FS framework

u_probs
¶ Probability P(x_i=1|Nonmatch) as described in the FS framework

weights
¶ Weights as described in the FS framework
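The relation between the attributes listed above can be sketched in plain Python. Given m = P(x_i=1|Match), u = P(x_i=1|Nonmatch) and the match probability p, the sketch computes agreement and disagreement weights and the posterior match probability of one binary comparison vector under the conditional independence assumption of the framework. The use of the natural logarithm, the example values and the function names are assumptions, and the exact definition of the weights attribute in the library may differ.

```python
import math


def fs_weights(m, u):
    """Agreement weights log(m/u) and disagreement weights log((1-m)/(1-u))."""
    agree = [math.log(mi / ui) for mi, ui in zip(m, u)]
    disagree = [math.log((1 - mi) / (1 - ui)) for mi, ui in zip(m, u)]
    return agree, disagree


def match_posterior(x, m, u, p):
    """P(Match | x) by Bayes' rule, assuming independent comparison fields."""
    like_m, like_u = p, 1 - p
    for xi, mi, ui in zip(x, m, u):
        like_m *= mi if xi else 1 - mi
        like_u *= ui if xi else 1 - ui
    return like_m / (like_m + like_u)


# hypothetical parameter values for two comparison fields
m, u, p = [0.9, 0.8], [0.1, 0.2], 0.1

agree_w, disagree_w = fs_weights(m, u)
posterior_agree = match_posterior([1, 1], m, u, p)
posterior_disagree = match_posterior([0, 0], m, u, p)
print(round(posterior_agree, 3), round(posterior_disagree, 3))  # 0.8 0.003
```

Fields that agree more often among matches than among nonmatches get a positive agreement weight and a negative disagreement weight.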

Examples¶
Unsupervised learning with the ECM algorithm. [See example on Github.](https://github.com/J535D165/recordlinkage/examples/unsupervised_learning.py)
Network¶
The Python Record Linkage Toolkit provides network/graph analysis tools for classification of record pairs into matches and distinct pairs. The toolkit provides the functionality for one-to-one linking and one-to-many linking. It is also possible to detect all connected components, which is useful in data deduplication.

class
recordlinkage.
OneToOneLinking
(method='greedy')¶ [EXPERIMENTAL] One-to-one linking
A record from dataset A can match at most one record from dataset B. For example, (a1, a2) are records from A and (b1, b2) are records from B. A linkage of (a1, b1), (a1, b2), (a2, b1), (a2, b2) is not one-to-one connected. One of the results of one-to-one linking can be (a1, b1), (a2, b2).
Parameters: method (str) – The method to solve the problem. Only ‘greedy’ is supported at the moment.
Note
This class is experimental and might change in future versions.

compute
(links)¶ Compute the one-to-one linking.
Parameters: links (pandas.MultiIndex) – The pairs to apply linking to.
Returns: pandas.MultiIndex – A one-to-one matched MultiIndex of record pairs.
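A greedy strategy for one-to-one linking can be sketched as follows: walk through the candidate pairs in order and keep a pair only if neither of its records has been used before. This is an illustration in plain Python on tuples, not the code behind OneToOneLinking, whose greedy rule may differ in detail. The function name is hypothetical.

```python
def greedy_one_to_one(pairs):
    """Keep each pair whose records from A and from B are both still unused."""
    used_a, used_b, kept = set(), set(), []
    for a, b in pairs:
        if a not in used_a and b not in used_b:
            kept.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return kept


links = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
one_to_one = greedy_one_to_one(links)
print(one_to_one)  # [('a1', 'b1'), ('a2', 'b2')]
```

The result depends on the order of the pairs, so sorting them by match quality first is a natural refinement.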


class
recordlinkage.
OneToManyLinking
(level=0, method='greedy')¶ [EXPERIMENTAL] One-to-many linking
A record from dataset A can link to multiple records from dataset B, but a record from B can link to only one record of dataset A. Use the level argument to switch A and B.
Example
Consider a MultiIndex with record pairs constructed from datasets A and B. To apply the default direction described above (level 0), use the following syntax:
>>> one_to_many = OneToManyLinking(0)
>>> one_to_many.compute(links)
To switch the roles of A and B (level 1), use:
>>> one_to_many = OneToManyLinking(1)
>>> one_to_many.compute(links)
Note
This class is experimental and might change in future versions.

compute
(links)¶ Compute the one-to-many matching.
Parameters: links (pandas.MultiIndex) – The pairs to apply linking to.
Returns: pandas.MultiIndex – A one-to-many matched MultiIndex of record pairs.
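The restriction on one side of the pairs can be sketched in plain Python. In this sketch, level names the position in each pair whose records may appear at most once, and the first pair seen for such a record is kept. Both the meaning given to level and the first-come rule are assumptions made for illustration, so the behaviour of OneToManyLinking may differ. The function name is hypothetical.

```python
def greedy_one_to_many(pairs, level=0):
    """Keep the first pair for every distinct record at position `level`.

    Assumption: `level` selects the side on which records must be unique.
    """
    seen, kept = set(), []
    for pair in pairs:
        key = pair[level]
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept


links = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
unique_first = greedy_one_to_many(links, level=0)
unique_second = greedy_one_to_many(links, level=1)
print(unique_first)   # [('a1', 'b1'), ('a2', 'b1')]
print(unique_second)  # [('a1', 'b1'), ('a1', 'b2')]
```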


class
recordlinkage.
ConnectedComponents
¶ [EXPERIMENTAL] Connected record pairs
This class identifies connected record pairs. Connected components are especially used in detecting duplicates in a single dataset.
Note
This class is experimental and might change in future versions.

compute
(links)¶ Return the connected components.
Parameters: links (pandas.MultiIndex) – The links for which to compute the connected components.
Returns: list of pandas.MultiIndex – A list with pandas.MultiIndex objects. Each MultiIndex object represents a set of connected record pairs.
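Grouping record pairs into connected components can be sketched with a union-find structure in plain Python. The sketch works on tuples of record identifiers from a single dataset and returns one list of pairs per component. It illustrates the concept only and is not the implementation behind ConnectedComponents. The function name is hypothetical.

```python
def connected_components(pairs):
    """Group record pairs that are connected through shared records."""
    parent = {}

    def find(x):
        # follow parent pointers to the root, halving the path on the way
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # union step: merge the components of both records of every pair
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # collect the pairs per component root, in order of first appearance
    groups = {}
    for pair in pairs:
        groups.setdefault(find(pair[0]), []).append(pair)
    return list(groups.values())


links = [(1, 2), (2, 3), (4, 5), (6, 7), (5, 7)]
components = connected_components(links)
print(components)  # [[(1, 2), (2, 3)], [(4, 5), (6, 7), (5, 7)]]
```

In a deduplication setting, each component corresponds to one group of records that are presumed to describe the same entity.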
