3. Classification¶
Classifiers¶
Classification is the step in the record linkage process where record pairs are classified into matches, nonmatches and possible matches [Christen2012]. Classification algorithms can be supervised or unsupervised (with or without training data).
See also
[Christen2012]  Christen, Peter. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media. 
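The three-way split into matches, nonmatches and possible matches can be illustrated with two thresholds on a match score. This is a minimal sketch and not part of the toolkit: the function name `classify_score` and both threshold values are assumptions chosen for illustration.

```python
def classify_score(score, lower=0.3, upper=0.8):
    """Map a match score in [0, 1] to one of three classes."""
    # The thresholds are hypothetical; in practice they are tuned per dataset.
    if score >= upper:
        return "match"
    if score <= lower:
        return "nonmatch"
    return "possible match"


scores = [0.95, 0.10, 0.55]
labels = [classify_score(s) for s in scores]
print(labels)
```

Pairs that fall between the two thresholds are the ones typically left for manual review.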
Supervised¶

class
recordlinkage.
LogisticRegressionClassifier
(coefficients=None, intercept=None, **kwargs)¶ Logistic Regression Classifier.
This classifier is an application of the logistic regression model (wikipedia). The classifier partitions candidate record pairs into matches and nonmatches.
This algorithm is also known as Deterministic Record Linkage.
The LogisticRegressionClassifier classifier uses the
sklearn.linear_model.LogisticRegression
classification algorithm from scikit-learn as kernel.
Parameters:  coefficients (list, numpy.array) – The coefficients of the logistic regression.
 intercept (float) – The intercept value.
 **kwargs – Additional arguments to pass to
sklearn.linear_model.LogisticRegression
.

kernel
¶ sklearn.linear_model.LogisticRegression – The kernel of the classifier. The kernel is
sklearn.linear_model.LogisticRegression
from scikit-learn.

coefficients
¶ list – The coefficients of the logistic regression.

intercept
¶ float – The intercept value.
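How fixed coefficients and an intercept turn a comparison vector into a decision can be sketched in plain Python. This is only an illustration of the logistic model, not the toolkit's code: the helper `match_probability`, the coefficient values and the 0.5 cut-off are all assumptions.

```python
import math


def match_probability(vector, coefficients, intercept):
    """Logistic model: sigmoid(intercept + sum(c_i * x_i))."""
    z = intercept + sum(c * x for c, x in zip(coefficients, vector))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values for three comparison features.
coefficients = [2.0, 3.0, 1.5]
intercept = -4.0

agree_all = match_probability([1, 1, 1], coefficients, intercept)
agree_one = match_probability([1, 0, 0], coefficients, intercept)

# Label a pair as a match (1) when the probability reaches 0.5.
labels = [int(p >= 0.5) for p in (agree_all, agree_one)]
```

With these values, agreement on all three features gives a positive linear score and therefore a match, while agreement on the first feature alone does not outweigh the negative intercept.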

fit
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
 match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
When finding links within a single dataset (for example, deduplication), ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.
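The relation between the candidate pairs and a match index that holds only the true matches can be sketched without pandas. The snippet below is an illustration under two assumptions: pairs are plain tuples of record identifiers, and "lower triangular" means the first identifier is larger than the second. The helper name `to_lower_triangular` is hypothetical.

```python
def to_lower_triangular(pair):
    """Order a pair of record identifiers so the larger one comes first."""
    i, j = pair
    return (i, j) if i > j else (j, i)


# Candidate pairs from a deduplication run, already in lower triangular form.
candidate_pairs = [(1, 0), (2, 0), (2, 1), (3, 1)]

# True matches may arrive in either order, so normalise them first.
true_matches = {to_lower_triangular(p) for p in [(0, 1), (3, 1)]}

# Training labels: 1 for pairs listed as true matches, 0 otherwise.
labels = [int(pair in true_matches) for pair in candidate_pairs]
```

Without the normalisation step, the true match (0, 1) would not be found among the candidate pairs and would silently be treated as a nonmatch.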

fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier and predict the class of the record pairs.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors.
 match_index (pandas.MultiIndex) – The true matches.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.

predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, nonmatches and possible matches. The classifier must be trained before this method is called.
Parameters:  comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

prob
(comparison_vectors, return_type=None)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters:  comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
 return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

class
recordlinkage.
NaiveBayesClassifier
(binarize=None, alpha=0.0001, use_col_names=True, **kwargs)¶ Naive Bayes Classifier.
The Naive Bayes classifier (wikipedia) partitions candidate record pairs into matches and nonmatches. The classifier is based on probabilistic principles. The Naive Bayes classification method has a close mathematical connection with the Fellegi and Sunter model.
Note
The NaiveBayesClassifier classifier differs from the Naive Bayes models in scikit-learn. With binary input vectors, the NaiveBayesClassifier behaves like
sklearn.naive_bayes.BernoulliNB
.
Parameters:  binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to consist of multilevel vectors.
 alpha (float) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing). Default 1e-4.
 use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
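The role of the smoothing parameter alpha can be illustrated with a small estimator for binary comparison vectors. This is a sketch of Bernoulli-style additive smoothing, not the toolkit's implementation; the function `estimate_probs` and the toy data are assumptions.

```python
def estimate_probs(vectors, labels, alpha=1e-4):
    """Return (p, m_probs, u_probs) from binary vectors and 0/1 labels."""
    n_features = len(vectors[0])
    matches = [v for v, y in zip(vectors, labels) if y == 1]
    nonmatches = [v for v, y in zip(vectors, labels) if y == 0]

    def smoothed(rows):
        # Per feature: (agreements + alpha) / (number of rows + 2 * alpha).
        return [
            (sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(n_features)
        ]

    p = len(matches) / len(vectors)
    return p, smoothed(matches), smoothed(nonmatches)


# Toy training data: two matches followed by three nonmatches.
vectors = [[1, 1], [1, 0], [0, 0], [0, 1], [0, 0]]
labels = [1, 1, 0, 0, 0]
p, m_probs, u_probs = estimate_probs(vectors, labels)
```

Because alpha is positive, no estimated probability is exactly 0 or 1, so the logarithms of these values stay finite even for features that never agree among the nonmatches.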

kernel
¶ sklearn.naive_bayes.BernoulliNB – The kernel of the classifier. The kernel is
sklearn.naive_bayes.BernoulliNB
from scikit-learn.

log_p
¶ float – Log match probability as described in the FS framework.

log_m_probs
¶ np.ndarray – Log probability P(x_i==1|Match) as described in the FS framework.

log_u_probs
¶ np.ndarray – Log probability P(x_i==1|Nonmatch) as described in the FS framework.

log_weights
¶ np.ndarray – Log weights as described in the FS framework.

p
¶ float – Match probability as described in the FS framework.

m_probs
¶ np.ndarray – Probability P(x_i==1|Match) as described in the FS framework.

u_probs
¶ np.ndarray – Probability P(x_i==1|Nonmatch) as described in the FS framework.

weights
¶ np.ndarray – Weights as described in the FS framework.
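The quantities listed above fit together as follows in the Fellegi and Sunter (FS) framework. The snippet is a textbook-style sketch that assumes conditionally independent binary features; the helper names and the numbers are hypothetical, and the toolkit may define or store its weights differently.

```python
import math


def fs_weights(m_probs, u_probs):
    """Agreement and disagreement log weights per feature."""
    agree = [math.log(m / u) for m, u in zip(m_probs, u_probs)]
    disagree = [math.log((1 - m) / (1 - u)) for m, u in zip(m_probs, u_probs)]
    return agree, disagree


def match_posterior(vector, p, m_probs, u_probs):
    """P(Match | x), assuming the features are conditionally independent."""
    like_m, like_u = p, 1 - p
    for x, m, u in zip(vector, m_probs, u_probs):
        like_m *= m if x else 1 - m
        like_u *= u if x else 1 - u
    return like_m / (like_m + like_u)


# Hypothetical parameters for two comparison features.
p, m_probs, u_probs = 0.1, [0.9, 0.8], [0.1, 0.2]
agree, disagree = fs_weights(m_probs, u_probs)
posterior = match_posterior([1, 1], p, m_probs, u_probs)
```

Agreement on a feature that is common among matches and rare among nonmatches gives a positive weight, while disagreement on the same feature gives a negative one.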

fit
(X, *args, **kwargs)¶ Train the classifier.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
 match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
When finding links within a single dataset (for example, deduplication), ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier and predict the class of the record pairs.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors.
 match_index (pandas.MultiIndex) – The true matches.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.

predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, nonmatches and possible matches. The classifier must be trained before this method is called.
Parameters:  comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

prob
(comparison_vectors, return_type=None)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters:  comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
 return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

class
recordlinkage.
SVMClassifier
(*args, **kwargs)¶ Support Vector Machines Classifier
The Support Vector Machine classifier (wikipedia) partitions candidate record pairs into matches and nonmatches. This implementation is a nonprobabilistic binary linear classifier. Support vector machines are supervised learning models. Therefore, SVM classifiers need training data.
The SVMClassifier classifier uses the
sklearn.svm.LinearSVC
classification algorithm from scikit-learn as kernel.
Parameters: **kwargs – Arguments to pass to sklearn.svm.LinearSVC
.
kernel
¶ sklearn.svm.LinearSVC – The kernel of the classifier. The kernel is
sklearn.svm.LinearSVC
from scikit-learn.

fit
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
 match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
When finding links within a single dataset (for example, deduplication), ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier and predict the class of the record pairs.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors.
 match_index (pandas.MultiIndex) – The true matches.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.

predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, nonmatches and possible matches. The classifier must be trained before this method is called.
Parameters:  comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

prob
(*args, **kwargs)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters:  comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
 return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

Unsupervised¶

class
recordlinkage.
ECMClassifier
(init='jaro', binarize=None, max_iter=100, atol=0.0001, use_col_names=True, *args, **kwargs)¶ Expectation/Conditional Maximisation classifier (Unsupervised).
Expectation/Conditional Maximisation algorithm used to classify record pairs. This probabilistic record linkage algorithm is used in combination with the Fellegi and Sunter model. This classifier doesn’t need training data (unsupervised).
Parameters:  init (str) – Initialisation method for the algorithm. Options are: ‘jaro’ and ‘random’. Default ‘jaro’.
 max_iter (int) – The maximum number of iterations of the EM algorithm. Default 100.
 binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
 atol (float) – The tolerance on the change in parameters between iterations. If the difference between the parameters of consecutive iterations is smaller than this value, the algorithm is considered to have converged. Default 1e-4.
 use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
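The interplay of max_iter and atol can be illustrated with a plain EM loop for a two-class model over binary comparison vectors. This is a simplified sketch and not the toolkit's ECM implementation: it uses ordinary EM, fixed starting values instead of the 'jaro' or 'random' initialisation, and hypothetical names (`em_fellegi_sunter`, `clip`).

```python
def clip(q, eps=1e-6):
    """Keep a probability away from exactly 0 or 1."""
    return min(max(q, eps), 1 - eps)


def likelihood(x, probs):
    """Probability of binary vector x given per-feature agreement probabilities."""
    out = 1.0
    for xi, q in zip(x, probs):
        out *= q if xi else 1 - q
    return out


def em_fellegi_sunter(vectors, max_iter=100, atol=1e-4):
    """Plain EM for a two-class Bernoulli mixture; returns (p, m, u, g)."""
    n, k = len(vectors), len(vectors[0])
    # Assumed starting values: matches agree often, nonmatches rarely.
    p, m, u = 0.1, [0.9] * k, [0.1] * k
    g = []
    for _ in range(max_iter):
        # E-step: posterior match probability of every record pair.
        g = []
        for x in vectors:
            a = p * likelihood(x, m)
            b = (1 - p) * likelihood(x, u)
            g.append(a / (a + b))
        # M-step: re-estimate p, m and u from the soft assignments.
        total = sum(g)
        new_p = total / n
        new_m = [clip(sum(gi * x[i] for gi, x in zip(g, vectors)) / total)
                 for i in range(k)]
        new_u = [clip(sum((1 - gi) * x[i] for gi, x in zip(g, vectors)) / (n - total))
                 for i in range(k)]
        # Largest absolute parameter change since the previous iteration.
        change = max(abs(a - b)
                     for a, b in zip([new_p] + new_m + new_u, [p] + m + u))
        p, m, u = new_p, new_m, new_u
        if change < atol:
            break
    return p, m, u, g


vectors = [[1, 1, 1]] * 3 + [[1, 1, 0]] + [[0, 0, 0]] * 8 + [[1, 0, 0]] * 2
p, m, u, g = em_fellegi_sunter(vectors)
labels = [int(gi >= 0.5) for gi in g]
```

No labels are passed in: the loop alternates between soft assignments and parameter updates until the largest parameter change drops below atol or max_iter is reached.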

kernel
¶ recordlinkage.algorithms.em_sklearn.ECM – The kernel of the classifier.

log_p
¶ float – Log match probability as described in the FS framework.

log_m_probs
¶ np.ndarray – Log probability P(x_i==1|Match) as described in the FS framework.

log_u_probs
¶ np.ndarray – Log probability P(x_i==1|Nonmatch) as described in the FS framework.

log_weights
¶ np.ndarray – Log weights as described in the FS framework.

p
¶ float – Match probability as described in the FS framework.

m_probs
¶ np.ndarray – Probability P(x_i==1|Match) as described in the FS framework.

u_probs
¶ np.ndarray – Probability P(x_i==1|Nonmatch) as described in the FS framework.

weights
¶ np.ndarray – Weights as described in the FS framework.
References
Herzog, Thomas N, Fritz J Scheuren and William E Winkler. 2007. Data quality and record linkage techniques. Vol. 1 Springer.
Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.
Collins, M. “The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm”. http://www.cs.columbia.edu/~mcollins/em.pdf

fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier and predict the class of the record pairs.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors.
 match_index (pandas.MultiIndex) – The true matches.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.

predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, nonmatches and possible matches. The classifier must be trained before this method is called.
Parameters:  comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

prob
(comparison_vectors, return_type=None)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters:  comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
 return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

fit
(X, *args, **kwargs)¶ Train the classifier.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
 match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
When finding links within a single dataset (for example, deduplication), ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

class
recordlinkage.
KMeansClassifier
(match_cluster_center=None, nonmatch_cluster_center=None, **kwargs)¶ K-means classifier.
The K-means clustering algorithm (wikipedia) partitions candidate record pairs into matches and nonmatches. Each comparison vector belongs to the cluster with the nearest mean.
The K-means algorithm is an unsupervised learning algorithm. The algorithm doesn’t need training data for fitting. The algorithm is calibrated for two clusters: a match cluster and a nonmatch cluster. The centers of these clusters can be given as arguments or set automatically.
The KMeansClassifier classifier uses the
sklearn.cluster.KMeans
clustering algorithm from scikit-learn as kernel.
Parameters:  match_cluster_center (list, numpy.array) – The center of the match cluster. The length of the list/array must equal the number of comparison variables. If None, the match cluster center is set automatically. Default None.
 nonmatch_cluster_center (list, numpy.array) – The center of the nonmatch (distinct) cluster. The length of the list/array must equal the number of comparison variables. If None, the nonmatch cluster center is set automatically. Default None.
 **kwargs – Additional arguments to pass to
sklearn.cluster.KMeans
.

kernel
¶ sklearn.cluster.KMeans – The kernel of the classifier. The kernel is
sklearn.cluster.KMeans
from scikit-learn.

match_cluster_center
¶ numpy.array – The center of the match cluster.

nonmatch_cluster_center
¶ numpy.array – The center of the nonmatch (distinct) cluster.
Note
There are better methods for linking records than the K-means clustering algorithm. This algorithm can be useful for an (unsupervised) initial partition.
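How two cluster centers partition comparison vectors can be sketched with a small two-cluster version of Lloyd's algorithm. This is an illustration only, not the scikit-learn kernel used by the toolkit; the function `two_means`, its arguments and the toy data are assumptions.

```python
def two_means(vectors, match_center, nonmatch_center, max_iter=100):
    """Lloyd's algorithm with two clusters; returns (labels, centers)."""
    centers = [list(nonmatch_center), list(match_center)]  # index 1 = match

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = None
    for _ in range(max_iter):
        # Assign every vector to the nearest center.
        new_labels = [
            int(sq_dist(v, centers[1]) < sq_dist(v, centers[0])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Move each center to the mean of its members.
        for c in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


vectors = [[1.0, 0.9], [0.9, 1.0], [0.1, 0.0], [0.0, 0.2], [0.2, 0.1]]
labels, centers = two_means(
    vectors, match_center=[1.0, 1.0], nonmatch_center=[0.0, 0.0]
)
```

Starting the match center near full agreement and the nonmatch center near full disagreement is what ties cluster 1 to the matches; with arbitrary starting points the cluster meaning would have to be decided afterwards.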

prob
(*args, **kwargs)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters:  comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
 return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.

fit
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
 match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
When finding links within a single dataset (for example, deduplication), ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.

fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier and predict the class of the record pairs.
Parameters:  comparison_vectors (pandas.DataFrame) – The comparison vectors.
 match_index (pandas.MultiIndex) – The true matches.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).

learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.

predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, nonmatches and possible matches. The classifier must be trained before this method is called.
Parameters:  comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
 return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the nonmatches).
User-defined algorithms¶
User-defined classifiers can subclass
recordlinkage.base.BaseClassifier
. Models based on scikit-learn may also want to subclass
recordlinkage.adapters.SKLearnAdapter
.

class
recordlinkage.base.
BaseClassifier
¶ Base class for classification of records pairs.
This class contains methods for training the classifier. Distinguish different types of training, such as supervised and unsupervised learning.
Probabilistic models can use the Fellegi and Sunter base class. This class is
used for the recordlinkage.ECMClassifier
and the
recordlinkage.NaiveBayesClassifier
.

class
recordlinkage.classifiers.
FellegiSunter
(use_col_names=True, *args, **kwargs)¶ Fellegi and Sunter (1969) framework.
Meta class for probabilistic classification algorithms. The Fellegi and Sunter class is used for the
recordlinkage.NaiveBayesClassifier
and recordlinkage.ECMClassifier
.
Parameters: use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
log_p
¶ float – Log match probability as described in the FS framework.

log_m_probs
¶ np.ndarray – Log probability P(x_i==1|Match) as described in the FS framework.

log_u_probs
¶ np.ndarray – Log probability P(x_i==1|Nonmatch) as described in the FS framework.

log_weights
¶ np.ndarray – Log weights as described in the FS framework.

p
¶ float – Match probability as described in the FS framework.

m_probs
¶ np.ndarray – Probability P(x_i==1|Match) as described in the FS framework.

u_probs
¶ np.ndarray – Probability P(x_i==1|Nonmatch) as described in the FS framework.

weights
¶ np.ndarray – Weights as described in the FS framework.
References
Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.

Examples¶
Unsupervised learning with the ECM algorithm. [See example on Github.](https://github.com/J535D165/recordlinkage/examples/unsupervised_learning.py)
Network¶
The Python Record Linkage Toolkit provides network/graph analysis tools for classification of record pairs into matches and distinct pairs. The toolkit provides the functionality for one-to-one linking and one-to-many linking. It is also possible to detect all connected components, which is useful in data deduplication.

class
recordlinkage.
OneToOneLinking
(method='greedy')¶ [EXPERIMENTAL] One-to-one linking
A record from dataset A can match at most one record from dataset B. For example, (a1, a2) are records from A and (b1, b2) are records from B. A linkage of (a1, b1), (a1, b2), (a2, b1), (a2, b2) is not onetoone connected. One of the results of onetoone linking can be (a1, b1), (a2, b2).
Parameters: method (str) – The method used to solve the problem. Only ‘greedy’ is supported at the moment.
Note
This class is experimental and might change in future versions.
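A greedy approach to the one-to-one example above can be sketched as follows. This is an illustration of the general idea, not the toolkit's implementation: the function `greedy_one_to_one` is hypothetical and simply processes the pairs in the order given, whereas a real implementation might first rank them by a similarity score.

```python
def greedy_one_to_one(links):
    """Keep a pair only if neither of its records has been used yet."""
    used_a, used_b, kept = set(), set(), []
    for a, b in links:
        if a not in used_a and b not in used_b:
            kept.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return kept


links = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
result = greedy_one_to_one(links)
print(result)
```

The outcome of a greedy pass depends on the order of the pairs, which is why the text describes (a1, b1), (a2, b2) as one of the possible results.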

class
recordlinkage.
OneToManyLinking
(level=0, method='greedy')¶ [EXPERIMENTAL] One-to-many linking
A record from dataset A can link to multiple records from dataset B, but a record from B can link to only one record of dataset A. Use the level argument to switch A and B.
Parameters: Example
Consider a MultiIndex with record pairs constructed from datasets A and B. To link a record from B to at most one record of A, use the following syntax:
> one_to_many = OneToManyLinking(0)
> one_to_many.compute(links)
To link a record from A to at most one record of B, use:
> one_to_many = OneToManyLinking(1)
> one_to_many.compute(links)
Note
This class is experimental and might change in future versions.

class
recordlinkage.
ConnectedComponents
¶ [EXPERIMENTAL] Connected record pairs
This class identifies connected record pairs. Connected components are especially used in detecting duplicates in a single dataset.
Note
This class is experimental and might change in future versions.
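Why connected components help with deduplication can be shown with a small union-find sketch: records that are linked directly or through other records end up in the same group. The function `connected_components` below is a hypothetical stand-alone helper and not the toolkit's ConnectedComponents class.

```python
def connected_components(pairs):
    """Group record identifiers that are linked directly or indirectly."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the groups of the two records in every linked pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each group under its root.
    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return sorted(groups.values(), key=min)


pairs = [(1, 2), (2, 3), (5, 6), (8, 9), (9, 5)]
components = connected_components(pairs)
print(components)
```

Records 1 and 3 are never compared directly, yet they land in the same component because both are linked to record 2.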