3. Classification
Classifiers
Classification is the step in the record linkage process where record pairs are classified into matches, non-matches and possible matches [Christen2012]. Classification algorithms can be supervised or unsupervised (with or without training data).
See also
Christen, Peter. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media.
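The three-way split into matches, non-matches and possible matches can be illustrated with a minimal sketch. It assumes each record pair already has a similarity score between 0 and 1; the function name classify_pair and the two threshold values are hypothetical and are not part of the toolkit.

```python
def classify_pair(score, lower=0.3, upper=0.8):
    """Assign a similarity score in [0, 1] to one of three classes.

    The thresholds are illustrative assumptions, not toolkit defaults.
    """
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"


# Hypothetical scores for three candidate record pairs.
scores = {("a1", "b1"): 0.95, ("a1", "b2"): 0.10, ("a2", "b2"): 0.55}
labels = {pair: classify_pair(score) for pair, score in scores.items()}
```

Pairs that land between the two thresholds are the ones usually set aside for manual review.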
Supervised
- class recordlinkage.LogisticRegressionClassifier(coefficients=None, intercept=None, **kwargs)
Logistic Regression Classifier.
This classifier is an application of the logistic regression model (wikipedia). The classifier partitions candidate record pairs into matches and non-matches.
This algorithm is also known as Deterministic Record Linkage.
The LogisticRegressionClassifier classifier uses the sklearn.linear_model.LogisticRegression classification algorithm from SciKit-learn as kernel.
- Parameters:
coefficients (list, numpy.array) – The coefficients of the logistic regression.
intercept (float) – The intercept value.
**kwargs – Additional arguments to pass to sklearn.linear_model.LogisticRegression.
- kernel
The kernel of the classifier. The kernel is sklearn.linear_model.LogisticRegression from SciKit-learn.
- fit(comparison_vectors, match_index=None)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs are from the lower triangular part of the dataset/matrix.
- fit_predict(comparison_vectors, match_index=None)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
- learn(*args, **kwargs)
[DEPRECATED] Use ‘fit_predict’.
- predict(comparison_vectors)
Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.
- Parameters:
comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
- prob(comparison_vectors, return_type=None)
Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
- Parameters:
comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)
- Returns:
pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
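To make the role of the coefficients and intercept parameters concrete, the sketch below evaluates the standard logistic model by hand for a single comparison vector. It is a plain-Python illustration and not the toolkit's code; the function name match_probability and the coefficient values are assumptions chosen for the example.

```python
import math


def match_probability(vector, coefficients, intercept):
    """Logistic model: sigmoid(intercept + sum_i coefficient_i * x_i)."""
    z = intercept + sum(c * x for c, x in zip(coefficients, vector))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for three comparison features.
coefficients = [2.0, 1.5, 3.0]
intercept = -4.0

# Full agreement gives z = 2.5, no agreement gives z = -4.0.
p_full = match_probability([1, 1, 1], coefficients, intercept)
p_none = match_probability([0, 0, 0], coefficients, intercept)
```

A pair is then labelled a match when its probability exceeds 0.5, which is equivalent to z being positive.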
- class recordlinkage.NaiveBayesClassifier(binarize=None, alpha=0.0001, use_col_names=True, **kwargs)
Naive Bayes Classifier.
The Naive Bayes classifier (wikipedia) partitions candidate record pairs into matches and non-matches. The classifier is based on probabilistic principles. The Naive Bayes classification method has a close mathematical connection with the Fellegi and Sunter model.
Note
The NaiveBayesClassifier classifier differs from the Naive Bayes models in SciKit-learn. With binary input vectors, the NaiveBayesClassifier behaves like sklearn.naive_bayes.BernoulliNB.
- Parameters:
binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to consist of multilevel vectors.
alpha (float) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing). Default 1e-4.
use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
- fit(X, *args, **kwargs)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs are from the lower triangular part of the dataset/matrix.
- fit_predict(comparison_vectors, match_index=None)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
- learn(*args, **kwargs)
[DEPRECATED] Use ‘fit_predict’.
- property log_m_probs
Log probability P(x_i=1|Match) as described in the FS framework
- property log_p
Log match probability as described in the FS framework
- property log_u_probs
Log probability P(x_i=1|Non-match) as described in the FS framework
- property log_weights
Log weights as described in the FS framework
- property m_probs
Probability P(x_i=1|Match) as described in the FS framework
- property p
Match probability as described in the FS framework
- predict(comparison_vectors)
Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.
- Parameters:
comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
- prob(comparison_vectors, return_type=None)
Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
- Parameters:
comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)
- Returns:
pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
- property u_probs
Probability P(x_i=1|Non-match) as described in the FS framework
- property weights
Weights as described in the FS framework
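The following sketch shows, under simplifying assumptions, how a Bernoulli Naive Bayes model for binary comparison vectors can be fitted with additive smoothing and then used to compute a match probability. It is not the toolkit's implementation: the helper names are hypothetical, and the smoothing formula is the textbook Laplace/Lidstone form for two outcomes.

```python
def estimate_probs(vectors, alpha=1e-4):
    """Smoothed estimate of P(x_i = 1) per feature within one class."""
    n = len(vectors)
    n_features = len(vectors[0])
    return [
        (sum(v[i] for v in vectors) + alpha) / (n + 2 * alpha)
        for i in range(n_features)
    ]


def likelihood(vector, probs):
    """P(vector | class), assuming conditionally independent features."""
    result = 1.0
    for x, q in zip(vector, probs):
        result *= q if x == 1 else (1.0 - q)
    return result


def posterior_match(vector, m_probs, u_probs, p):
    """Bayes' rule with match prior p."""
    num = p * likelihood(vector, m_probs)
    den = num + (1.0 - p) * likelihood(vector, u_probs)
    return num / den


# Hypothetical labelled training vectors.
matches = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
non_matches = [[0, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]]

m_probs = estimate_probs(matches)
u_probs = estimate_probs(non_matches)
p = len(matches) / (len(matches) + len(non_matches))

high = posterior_match([1, 1, 1], m_probs, u_probs, p)
low = posterior_match([0, 0, 0], m_probs, u_probs, p)
```

Without smoothing, the first feature would get an estimated probability of exactly 1 in the match class, and any match candidate disagreeing on it would receive probability 0; a small alpha avoids that hard cut-off.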
- class recordlinkage.SVMClassifier(*args, **kwargs)
Support Vector Machines Classifier
The Support Vector Machine classifier (wikipedia) partitions candidate record pairs into matches and non-matches. This implementation is a non-probabilistic binary linear classifier. Support vector machines are supervised learning models. Therefore, SVM classifiers need training data.
The SVMClassifier classifier uses the sklearn.svm.LinearSVC classification algorithm from SciKit-learn as kernel.
- Parameters:
**kwargs – Arguments to pass to sklearn.svm.LinearSVC.
- kernel
The kernel of the classifier. The kernel is sklearn.svm.LinearSVC from SciKit-learn.
- fit(comparison_vectors, match_index=None)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs are from the lower triangular part of the dataset/matrix.
- fit_predict(comparison_vectors, match_index=None)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
- learn(*args, **kwargs)
[DEPRECATED] Use ‘fit_predict’.
- predict(comparison_vectors)
Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.
- Parameters:
comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
- prob(*args, **kwargs)
Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
- Parameters:
comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)
- Returns:
pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
Unsupervised
- class recordlinkage.ECMClassifier(init='jaro', binarize=None, max_iter=100, atol=0.0001, use_col_names=True, **kwargs)
Expectation/Conditional Maximisation classifier (Unsupervised).
Expectation/Conditional Maximisation algorithm used to classify record pairs. This probabilistic record linkage algorithm is used in combination with the Fellegi and Sunter model. This classifier doesn’t need training data (unsupervised).
- Parameters:
init (str) – Initialisation method for the algorithm. Options are: ‘jaro’ and ‘random’. Default ‘jaro’.
max_iter (int) – The maximum number of iterations of the EM algorithm. Default 100.
binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
atol (float) – The tolerance on the parameter change between consecutive iterations. If the difference between the parameters of two consecutive iterations is smaller than this value, the algorithm is considered to have converged. Default 1e-4.
use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
References
Herzog, Thomas N, Fritz J Scheuren and William E Winkler. 2007. Data quality and record linkage techniques. Vol. 1 Springer.
Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.
Collins, M. “The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm”. http://www.cs.columbia.edu/~mcollins/em.pdf
- fit_predict(comparison_vectors, match_index=None)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
- learn(*args, **kwargs)
[DEPRECATED] Use ‘fit_predict’.
- property log_m_probs
Log probability P(x_i=1|Match) as described in the FS framework
- property log_p
Log match probability as described in the FS framework
- property log_u_probs
Log probability P(x_i=1|Non-match) as described in the FS framework
- property log_weights
Log weights as described in the FS framework
- property m_probs
Probability P(x_i=1|Match) as described in the FS framework
- property p
Match probability as described in the FS framework
- predict(comparison_vectors)
Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.
- Parameters:
comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
- prob(comparison_vectors, return_type=None)
Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
- Parameters:
comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)
- Returns:
pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
- property u_probs
Probability P(x_i=1|Non-match) as described in the FS framework
- property weights
Weights as described in the FS framework
- fit(X, *args, **kwargs)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs are from the lower triangular part of the dataset/matrix.
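As a rough illustration of how an expectation-maximisation fit can work without labels, the sketch below runs plain EM on a two-class mixture of independent Bernoulli features. It is a simplified stand-in, not the toolkit's ECM implementation: the fixed starting values replace the 'jaro' and 'random' initialisations, the clipping constant is an assumption, and all function names are hypothetical. Only max_iter and atol mirror the parameters documented above.

```python
def _clip(q, eps=1e-6):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(q, eps), 1.0 - eps)


def responsibilities(vectors, p, m, u):
    """E-step: posterior match probability of every comparison vector."""
    out = []
    for v in vectors:
        pm, pu = p, 1.0 - p
        for x, mi, ui in zip(v, m, u):
            pm *= mi if x else 1.0 - mi
            pu *= ui if x else 1.0 - ui
        out.append(pm / (pm + pu))
    return out


def em_two_class(vectors, max_iter=100, atol=1e-4):
    """Estimate match prior p and per-feature m and u probabilities."""
    n, k = len(vectors), len(vectors[0])
    p, m, u = 0.1, [0.9] * k, [0.1] * k  # assumed starting values
    for _ in range(max_iter):
        g = responsibilities(vectors, p, m, u)
        total = sum(g)
        # M-step: weighted frequencies of agreement in each class.
        new_p = _clip(total / n)
        new_m = [_clip(sum(gi * v[i] for gi, v in zip(g, vectors)) / total)
                 for i in range(k)]
        new_u = [_clip(sum((1 - gi) * v[i] for gi, v in zip(g, vectors)) / (n - total))
                 for i in range(k)]
        change = max(abs(a - b)
                     for a, b in zip([new_p] + new_m + new_u, [p] + m + u))
        p, m, u = new_p, new_m, new_u
        if change < atol:  # converged
            break
    return p, m, u


# Hypothetical unlabelled binary comparison vectors.
vectors = ([[1, 1, 1]] * 15 + [[1, 1, 0]] * 3 + [[1, 0, 1]] * 2
           + [[0, 0, 0]] * 60 + [[0, 1, 0]] * 10 + [[1, 0, 0]] * 5 + [[0, 0, 1]] * 5)

p, m, u = em_two_class(vectors)
posterior = responsibilities(vectors, p, m, u)
```

The fitted m and u values play the same role as the m_probs and u_probs properties listed for this class, although the exact numbers the toolkit produces may differ.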
- class recordlinkage.KMeansClassifier(match_cluster_center=None, nonmatch_cluster_center=None, **kwargs)
KMeans classifier.
The K-means clustering algorithm (wikipedia) partitions candidate record pairs into matches and non-matches. Each comparison vector belongs to the cluster with the nearest mean.
The K-means algorithm is an unsupervised learning algorithm. The algorithm doesn’t need training data for fitting. The algorithm is calibrated for two clusters: a match cluster and a non-match cluster. The centers of these clusters can be given as arguments or set automatically.
The KMeansClassifier classifier uses the sklearn.cluster.KMeans clustering algorithm from SciKit-learn as kernel.
- Parameters:
match_cluster_center (list, numpy.array) – The center of the match cluster. The length of the list/array must equal the number of comparison variables. If None, the match cluster center is set automatically. Default None.
nonmatch_cluster_center (list, numpy.array) – The center of the nonmatch (distinct) cluster. The length of the list/array must equal the number of comparison variables. If None, the non-match cluster center is set automatically. Default None.
**kwargs – Additional arguments to pass to sklearn.cluster.KMeans.
- kernel
The kernel of the classifier. The kernel is sklearn.cluster.KMeans from SciKit-learn.
- match_cluster_center
The center of the match cluster.
- Type:
numpy.array
- nonmatch_cluster_center
The center of the nonmatch (distinct) cluster.
- Type:
numpy.array
Note
There are better methods for linking records than the k-means clustering algorithm. This algorithm can be useful for an (unsupervised) initial partition.
- prob(*args, **kwargs)
Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
- Parameters:
comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)
- Returns:
pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
- fit(comparison_vectors, match_index=None)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs are from the lower triangular part of the dataset/matrix.
- fit_predict(comparison_vectors, match_index=None)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
- learn(*args, **kwargs)
[DEPRECATED] Use ‘fit_predict’.
- predict(comparison_vectors)
Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.
- Parameters:
comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
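The idea of clustering comparison vectors around a match center and a non-match center can be sketched in plain Python as follows. This is not the toolkit's code, which delegates to sklearn.cluster.KMeans; the function name is hypothetical, and starting from an all-ones and an all-zeros center is an assumption about what a sensible initialisation looks like.

```python
def kmeans_two_clusters(vectors, match_center, nonmatch_center, max_iter=100):
    """Two-cluster k-means; label 1 is the match cluster, 0 the non-match cluster."""
    centers = [list(nonmatch_center), list(match_center)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = []
        for v in vectors:
            dists = [sum((x - c) ** 2 for x, c in zip(v, center))
                     for center in centers]
            new_labels.append(1 if dists[1] < dists[0] else 0)
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical comparison vectors with three similarity features.
vectors = [[1, 1, 1], [1, 1, 0], [0.9, 1, 1],
           [0, 0, 0], [0, 1, 0], [0, 0, 0.1], [0, 0, 0]]

labels, centers = kmeans_two_clusters(
    vectors, match_center=[1, 1, 1], nonmatch_center=[0, 0, 0])
```

Fixing which starting center is the match cluster is what lets the result be read as match or non-match, since k-means itself has no notion of which cluster is which.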
Adapters
Adapters can be used to wrap machine learning models from external packages like SciKit-learn and Keras. For example, this makes it possible to classify record pairs with a neural network developed in Keras.
- class recordlinkage.adapters.SKLearnAdapter
SciKit-learn adapter for record pair classification.
SciKit-learn adapter for record pair classification with SciKit-learn models.
# import a SciKit-learn classifier
from sklearn.ensemble import RandomForestClassifier

# import BaseClassifier and the SciKit-learn adapter
from recordlinkage.base import BaseClassifier
from recordlinkage.adapters import SKLearnAdapter
from recordlinkage.datasets import binary_vectors


class RandomForest(SKLearnAdapter, BaseClassifier):

    def __init__(self, *args, **kwargs):
        super(RandomForest, self).__init__()

        # set the kernel
        self.kernel = RandomForestClassifier(*args, **kwargs)


# make a sample dataset
features, links = binary_vectors(10000, 2000, return_links=True)

# initialise the random forest
cl = RandomForest(n_estimators=20)
cl.fit(features, links)

# predict the matches
cl.predict(...)
- class recordlinkage.adapters.KerasAdapter
Keras adapter for record pair classification.
Keras adapter for record pair classification with Keras models.
Example of a Keras model used for classification.
import tensorflow as tf
from tensorflow.keras import layers
from recordlinkage.base import BaseClassifier
from recordlinkage.adapters import KerasAdapter


class NNClassifier(KerasAdapter, BaseClassifier):
    """Neural network classifier."""

    def __init__(self):
        super(NNClassifier, self).__init__()

        # build a small feed-forward network for 8 comparison features
        model = tf.keras.Sequential()
        model.add(layers.Dense(16, input_dim=8, activation='relu'))
        model.add(layers.Dense(8, activation='relu'))
        model.add(layers.Dense(1, activation='sigmoid'))
        model.compile(
            optimizer=tf.keras.optimizers.Adam(0.001),
            loss='binary_crossentropy',
            metrics=['accuracy']
        )

        # set the kernel
        self.kernel = model


# initialise the model
cl = NNClassifier()

# fit the model to the data (X_train and links_true are assumed to exist)
cl.fit(X_train, links_true)

# predict the class of the data (X_pred is assumed to exist)
cl.predict(X_pred)
User-defined algorithms
User-defined classification algorithms can subclass recordlinkage.base.BaseClassifier. SciKit-learn based models may want to subclass recordlinkage.adapters.SKLearnAdapter as well.
- class recordlinkage.base.BaseClassifier
Base class for classification of record pairs.
This class contains methods for training the classifier. It distinguishes between different types of training, such as supervised and unsupervised learning.
- learn(*args, **kwargs)
[DEPRECATED] Use ‘fit_predict’.
- fit(comparison_vectors, match_index=None)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
When finding links within a single dataset (for example, duplicate detection), ensure that the training record pairs are from the lower triangular part of the dataset/matrix.
- fit_predict(comparison_vectors, match_index=None)
Train the classifier.
- Parameters:
comparison_vectors (pandas.DataFrame) – The comparison vectors.
match_index (pandas.MultiIndex) – The true matches.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
- predict(comparison_vectors)
Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier must be trained before this method is called.
- Parameters:
comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
return_type (str) – Deprecated. Use the option recordlinkage.set_option('classification.return_type', 'index') instead.
- Returns:
pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
- prob(comparison_vectors, return_type=None)
Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
- Parameters:
comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
return_type (str) – Deprecated. (default ‘series’)
- Returns:
pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
Probabilistic models can use the Fellegi and Sunter base class. This class is used for the recordlinkage.ECMClassifier and the recordlinkage.NaiveBayesClassifier.
- class recordlinkage.classifiers.FellegiSunter(use_col_names=True, *args, **kwargs)
Fellegi and Sunter (1969) framework.
Meta class for probabilistic classification algorithms. The Fellegi and Sunter class is used for the recordlinkage.NaiveBayesClassifier and recordlinkage.ECMClassifier.
- Parameters:
use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
References
Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.
- property log_p
Log match probability as described in the FS framework
- property log_m_probs
Log probability P(x_i=1|Match) as described in the FS framework
- property log_u_probs
Log probability P(x_i=1|Non-match) as described in the FS framework
- property log_weights
Log weights as described in the FS framework
- property p
Match probability as described in the FS framework
- property m_probs
Probability P(x_i=1|Match) as described in the FS framework
- property u_probs
Probability P(x_i=1|Non-match) as described in the FS framework
- property weights
Weights as described in the FS framework
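For orientation, the sketch below computes agreement and disagreement weights from m and u probabilities using the textbook Fellegi and Sunter formulation with natural logarithms. It illustrates the framework rather than the toolkit's properties: the function names and probability values are hypothetical, and the exact structure returned by the weights and log_weights properties may differ.

```python
import math


def fs_log_weights(m_probs, u_probs):
    """Per-feature log weights for agreement and for disagreement."""
    agree = [math.log(m / u) for m, u in zip(m_probs, u_probs)]
    disagree = [math.log((1 - m) / (1 - u)) for m, u in zip(m_probs, u_probs)]
    return agree, disagree


def total_weight(vector, agree, disagree):
    """Sum the weight matching each feature's agreement status."""
    return sum(a if x == 1 else d for x, a, d in zip(vector, agree, disagree))


# Hypothetical m and u probabilities for three comparison features.
m_probs = [0.95, 0.9, 0.8]
u_probs = [0.05, 0.1, 0.2]

agree, disagree = fs_log_weights(m_probs, u_probs)
w_all_agree = total_weight([1, 1, 1], agree, disagree)
w_none_agree = total_weight([0, 0, 0], agree, disagree)
```

A feature that agrees far more often among matches than among non-matches gets a large positive agreement weight, so the summed weight ranks candidate pairs from most to least likely match.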
Examples
Unsupervised learning with the ECM algorithm. [See example on Github.](https://github.com/J535D165/recordlinkage/examples/unsupervised_learning.py)
Network
The Python Record Linkage Toolkit provides network/graph analysis tools for classification of record pairs into matches and distinct pairs. The toolkit provides the functionality for one-to-one linking and one-to-many linking. It is also possible to detect all connected components which is useful in data deduplication.
- class recordlinkage.OneToOneLinking(method='greedy')
[EXPERIMENTAL] One-to-one linking
A record from dataset A can match at most one record from dataset B. For example, (a1, a2) are records from A and (b1, b2) are records from B. A linkage of (a1, b1), (a1, b2), (a2, b1), (a2, b2) is not one-to-one connected. One of the results of one-to-one linking can be (a1, b1), (a2, b2).
- Parameters:
method (str) – The method to solve the problem. Only ‘greedy’ is supported at the moment.
Note
This class is experimental and might change in future versions.
- compute(links)
Compute the one-to-one linking.
- Parameters:
links (pandas.MultiIndex) – The pairs to apply linking to.
- Returns:
pandas.MultiIndex – A one-to-one matched MultiIndex of record pairs.
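One simple way to read "greedy" one-to-one linking is to walk through the links in order and keep a pair only when neither record has been used yet. The sketch below does this on plain tuples; it is an assumption about the general technique and not a copy of the toolkit's implementation, and the function name is hypothetical.

```python
def greedy_one_to_one(links):
    """Keep each pair whose records are both still unlinked, in input order."""
    used_a, used_b = set(), set()
    kept = []
    for a, b in links:
        if a not in used_a and b not in used_b:
            kept.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return kept


# The example from the class description above.
links = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
one_to_one = greedy_one_to_one(links)
```

Because the result depends on the order of the input, sorting the links by descending match score first is a common way to make a greedy pass favour the strongest candidates.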
- class recordlinkage.OneToManyLinking(level=0, method='greedy')
[EXPERIMENTAL] One-to-many linking
A record from dataset A can link to multiple records from dataset B, but a record from B can link to only one record of dataset A. Use the level argument to switch A and B.
- Parameters:
Example
Consider a MultiIndex with record pairs constructed from datasets A and B. To link a record from B to at most one record of A, use the following syntax:
> one_to_many = OneToManyLinking(0)
> one_to_many.compute(links)
To link a record from A to at most one record of B, use:
> one_to_many = OneToManyLinking(1)
> one_to_many.compute(links)
Note
This class is experimental and might change in future versions.
- compute(links)
Compute the one-to-many matching.
- Parameters:
links (pandas.MultiIndex) – The pairs to apply linking to.
- Returns:
pandas.MultiIndex – A one-to-many matched MultiIndex of record pairs.
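A one-to-many constraint can be sketched as a greedy pass that keeps only the first link seen for each record on the side that must stay unique. The example below works on plain tuples and uses a hypothetical parameter named unique_level; it makes no claim about how the toolkit's level argument maps onto the two sides.

```python
def greedy_one_to_many(links, unique_level=1):
    """Keep the first pair for every record at position `unique_level`."""
    seen = set()
    kept = []
    for pair in links:
        key = pair[unique_level]
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept


links = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]

# Each B record appears at most once; an A record may appear many times.
unique_b = greedy_one_to_many(links, unique_level=1)

# Each A record appears at most once; a B record may appear many times.
unique_a = greedy_one_to_many(links, unique_level=0)
```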
- class recordlinkage.ConnectedComponents
[EXPERIMENTAL] Connected record pairs
This class identifies connected record pairs. Connected components are especially used in detecting duplicates in a single dataset.
Note
This class is experimental and might change in future versions.
- compute(links)
Return the connected components.
- Parameters:
links (pandas.MultiIndex) – The links in which to find connected components.
- Returns:
list of pandas.MultiIndex – A list with pandas.MultiIndex objects. Each MultiIndex object represents a set of connected record pairs.
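Grouping links into connected components can be sketched with a union-find structure, as below. This is an illustration of the general technique on plain tuples, not the toolkit's implementation; the function name is hypothetical, and it assumes both positions of a pair refer to records in the same dataset, as in deduplication.

```python
def connected_components(links):
    """Group record pairs so that pairs sharing any record end up together."""
    parent = {}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the two records of every link.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the pairs of each component under its root.
    groups = {}
    for pair in links:
        groups.setdefault(find(pair[0]), []).append(pair)
    return list(groups.values())


# Hypothetical duplicate links within one dataset.
links = [(1, 2), (2, 3), (4, 5), (6, 7), (5, 6)]
components = connected_components(links)
```

Here records 1, 2 and 3 form one group of duplicates, and records 4 to 7 form another, even though 4 and 7 are never linked directly.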