3. Classification¶
Classifiers¶
Classification is the step in the record linkage process where record pairs are classified into matches, non-matches and possible matches [Christen2012]. Classification algorithms can be supervised or unsupervised (with or without training data).
See also
[Christen2012] Christen, Peter. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media.
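As a rough illustration of the three-way outcome described above, the sketch below splits scored record pairs into matches, possible matches and non-matches using two thresholds. The function name `classify_pairs`, the threshold values and the scores are hypothetical and are not part of the toolkit.

```python
def classify_pairs(scores, lower, upper):
    """Split scored record pairs into three classes using two thresholds."""
    result = {"matches": [], "possible_matches": [], "non_matches": []}
    for pair, score in scores.items():
        if score >= upper:
            result["matches"].append(pair)
        elif score <= lower:
            result["non_matches"].append(pair)
        else:
            # Scores between the thresholds are left for manual review.
            result["possible_matches"].append(pair)
    return result


# Hypothetical similarity scores for three candidate record pairs.
scores = {("a1", "b1"): 0.95, ("a1", "b2"): 0.10, ("a2", "b2"): 0.55}
classes = classify_pairs(scores, lower=0.3, upper=0.8)
```

The classifiers documented below return only the labels 1 and 0, so a possible-match band like this one would have to be built on top of the probabilities they report.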
Supervised¶
-
class
recordlinkage.
LogisticRegressionClassifier
(coefficients=None, intercept=None, **kwargs)¶ Logistic Regression Classifier.
This classifier is an application of the logistic regression model (wikipedia). The classifier partitions candidate record pairs into matches and non-matches.
This algorithm is also known as Deterministic Record Linkage.
The LogisticRegressionClassifier classifier uses the
sklearn.linear_model.LogisticRegression
classification algorithm from SciKit-learn as kernel.
Parameters: - coefficients (list, numpy.array) – The coefficients of the logistic regression.
- intercept (float) – The intercept value.
- **kwargs – Additional arguments to pass to
sklearn.linear_model.LogisticRegression
.
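To make the role of `coefficients` and `intercept` concrete, here is a minimal sketch of how a logistic model turns a comparison vector into a match probability. It uses only the standard library; the function `match_probability` and the numeric values are illustrative assumptions, not the toolkit's code.

```python
import math


def match_probability(vector, coefficients, intercept):
    """Logistic model: sigmoid of the weighted sum of the comparison values."""
    linear = intercept + sum(c * x for c, x in zip(coefficients, vector))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical preset parameters for three comparison variables.
coefficients = [2.0, 2.0, 1.0]
intercept = -3.0

p_agree = match_probability([1, 1, 1], coefficients, intercept)     # linear term is 2
p_disagree = match_probability([0, 0, 1], coefficients, intercept)  # linear term is -2

# A pair is labelled a match when its probability reaches 0.5.
is_match = p_agree >= 0.5
```

Passing preset values like these to the constructor would presumably let the classifier predict without a training step, which is the scenario this sketch assumes.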
-
kernel
¶ The kernel of the classifier. The kernel is
sklearn.linear_model.LogisticRegression
from SciKit-learn.
Type: sklearn.linear_model.LogisticRegression
-
fit
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
- match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.
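One way to satisfy the lower triangular requirement is sketched below. It assumes that records are identified by sortable positions and that "lower triangular" means the first element of each pair is larger than the second; both are assumptions, and the helper `to_lower_triangular` is hypothetical.

```python
def to_lower_triangular(pairs):
    """Order each pair as (larger, smaller); drop self-pairs and duplicates."""
    seen = set()
    result = []
    for a, b in pairs:
        if a == b:
            continue  # a record is never compared with itself
        pair = (max(a, b), min(a, b))
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# (1, 3) and (3, 1) describe the same pair within a single dataset.
pairs = [(1, 3), (3, 1), (2, 2), (4, 0)]
lower = to_lower_triangular(pairs)
```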
-
fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors.
- match_index (pandas.MultiIndex) – The true matches.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
-
learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.
-
predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier has to be trained to call this method.
Parameters: - comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
-
prob
(comparison_vectors, return_type=None)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters: - comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
- return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
-
class
recordlinkage.
NaiveBayesClassifier
(binarize=None, alpha=0.0001, use_col_names=True, **kwargs)¶ Naive Bayes Classifier.
The Naive Bayes classifier (wikipedia) partitions candidate record pairs into matches and non-matches. The classifier is based on probabilistic principles. The Naive Bayes classification method has a close mathematical connection with the Fellegi and Sunter model.
Note
The NaiveBayesClassifier classifier differs from the Naive Bayes models in SciKit-learn. With binary input vectors, the NaiveBayesClassifier behaves like
sklearn.naive_bayes.BernoulliNB.
Parameters: - binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to consist of multilevel vectors.
- alpha (float) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing). Default 1e-4.
- use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
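The connection with the Fellegi and Sunter model can be seen in a small sketch of the textbook Bernoulli Naive Bayes posterior. The variable names mirror the `p`, `m_probs` and `u_probs` attributes listed below, but the function and the numbers are illustrative assumptions rather than the toolkit's implementation.

```python
def naive_bayes_posterior(vector, p, m_probs, u_probs):
    """Posterior match probability for a binary comparison vector."""
    like_match, like_nonmatch = p, 1.0 - p
    for x, m, u in zip(vector, m_probs, u_probs):
        # m = P(x_i = 1 | Match), u = P(x_i = 1 | Non-match)
        like_match *= m if x else 1.0 - m
        like_nonmatch *= u if x else 1.0 - u
    return like_match / (like_match + like_nonmatch)


# Hypothetical parameters for two comparison variables.
p = 0.1
m_probs = [0.9, 0.8]
u_probs = [0.1, 0.2]

posterior_agree = naive_bayes_posterior([1, 1], p, m_probs, u_probs)
posterior_disagree = naive_bayes_posterior([0, 0], p, m_probs, u_probs)
```

The sketch treats the comparison variables as conditionally independent given the match status, which is the assumption the Naive Bayes and Fellegi and Sunter models share.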
-
fit
(X, *args, **kwargs)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
- match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.
-
fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors.
- match_index (pandas.MultiIndex) – The true matches.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
-
learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.
-
log_m_probs
¶ Log probability P(x_i=1|Match) as described in the FS framework
-
log_p
¶ Log match probability as described in the FS framework
-
log_u_probs
¶ Log probability P(x_i=1|Non-match) as described in the FS framework
-
log_weights
¶ Log weights as described in the FS framework
-
m_probs
¶ Probability P(x_i=1|Match) as described in the FS framework
-
p
¶ Match probability as described in the FS framework
-
predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier has to be trained to call this method.
Parameters: - comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
-
prob
(comparison_vectors, return_type=None)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters: - comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
- return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
-
u_probs
¶ Probability P(x_i=1|Non-match) as described in the FS framework
-
weights
¶ Weights as described in the FS framework
-
class
recordlinkage.
SVMClassifier
(*args, **kwargs)¶ Support Vector Machines Classifier
The Support Vector Machine classifier (wikipedia) partitions candidate record pairs into matches and non-matches. This implementation is a non-probabilistic binary linear classifier. Support vector machines are supervised learning models, so SVM classifiers need training data.
The SVMClassifier classifier uses the
sklearn.svm.LinearSVC
classification algorithm from SciKit-learn as kernel.
Parameters: **kwargs – Arguments to pass to sklearn.svm.LinearSVC.
-
kernel
¶ The kernel of the classifier. The kernel is
sklearn.svm.LinearSVC
from SciKit-learn.
Type: sklearn.svm.LinearSVC
-
fit
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
- match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.
-
fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors.
- match_index (pandas.MultiIndex) – The true matches.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
-
learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.
-
predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier has to be trained to call this method.
Parameters: - comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
-
prob
(*args, **kwargs)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters: - comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
- return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
-
Unsupervised¶
-
class
recordlinkage.
ECMClassifier
(init='jaro', binarize=None, max_iter=100, atol=0.0001, use_col_names=True, *args, **kwargs)¶ Expectation/Conditional Maximisation classifier (Unsupervised).
Expectation/Conditional Maximisation algorithm used to classify record pairs. This probabilistic record linkage algorithm is used in combination with the Fellegi and Sunter model. This classifier doesn’t need training data (unsupervised).
Parameters: - init (str) – Initialisation method for the algorithm. Options are: ‘jaro’ and ‘random’. Default ‘jaro’.
- max_iter (int) – The maximum number of iterations of the EM algorithm. Default 100.
- binarize (float or None, optional (default=None)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
- atol (float) – The tolerance on the change in parameters between iterations. If the difference between the parameters of two consecutive iterations is smaller than this value, the algorithm is considered to have converged. Default 1e-4.
- use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
References
Herzog, Thomas N, Fritz J Scheuren and William E Winkler. 2007. Data quality and record linkage techniques. Vol. 1 Springer.
Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.
Collins, M. “The Naive Bayes Model, Maximum-Likelihood Estimation, and the EM Algorithm”. http://www.cs.columbia.edu/~mcollins/em.pdf
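The estimation idea can be sketched with a plain EM loop for two classes and binary comparison vectors. This is a simplified illustration under the conditional independence assumption, not the toolkit's ECM implementation. The function `em_two_class`, its starting values and the toy data are assumptions; `max_iter` and `atol` only mirror the parameter names above.

```python
def em_two_class(vectors, p=0.1, m_init=0.9, u_init=0.1, max_iter=100, atol=1e-4):
    """Estimate p, m and u for binary comparison vectors with plain EM."""
    eps = 1e-6  # keep probabilities away from 0 and 1

    def clip(value):
        return min(max(value, eps), 1.0 - eps)

    n = len(vectors)
    n_features = len(vectors[0])
    m = [m_init] * n_features
    u = [u_init] * n_features

    for _ in range(max_iter):
        # E-step: posterior match probability g for every comparison vector.
        g = []
        for vec in vectors:
            like_m, like_u = p, 1.0 - p
            for x, mi, ui in zip(vec, m, u):
                like_m *= mi if x else 1.0 - mi
                like_u *= ui if x else 1.0 - ui
            g.append(like_m / (like_m + like_u))

        # M-step: re-estimate the parameters from the soft assignments.
        total = sum(g)
        new_p = clip(total / n)
        new_m = [clip(sum(gi * vec[k] for gi, vec in zip(g, vectors)) / total)
                 for k in range(n_features)]
        new_u = [clip(sum((1.0 - gi) * vec[k] for gi, vec in zip(g, vectors)) / (n - total))
                 for k in range(n_features)]

        # Stop when no parameter moved by more than atol.
        old, new = [p] + m + u, [new_p] + new_m + new_u
        p, m, u = new_p, new_m, new_u
        if max(abs(a - b) for a, b in zip(old, new)) < atol:
            break
    return p, m, u


# Toy data: 10 vectors that look like matches, 20 that look like non-matches.
vectors = [[1, 1, 1]] * 8 + [[1, 1, 0]] * 2 + [[0, 0, 0]] * 16 + [[0, 1, 0]] * 4
p_hat, m_hat, u_hat = em_two_class(vectors)
```

No labels are passed in: the match and non-match groups emerge from the starting values and the data alone, which is what makes the approach unsupervised.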
-
fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors.
- match_index (pandas.MultiIndex) – The true matches.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
-
learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.
-
log_m_probs
¶ Log probability P(x_i=1|Match) as described in the FS framework
-
log_p
¶ Log match probability as described in the FS framework
-
log_u_probs
¶ Log probability P(x_i=1|Non-match) as described in the FS framework
-
log_weights
¶ Log weights as described in the FS framework
-
m_probs
¶ Probability P(x_i=1|Match) as described in the FS framework
-
p
¶ Match probability as described in the FS framework
-
predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier has to be trained to call this method.
Parameters: - comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
-
prob
(comparison_vectors, return_type=None)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters: - comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
- return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
-
u_probs
¶ Probability P(x_i=1|Non-match) as described in the FS framework
-
weights
¶ Weights as described in the FS framework
-
fit
(X, *args, **kwargs)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
- match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.
-
class
recordlinkage.
KMeansClassifier
(match_cluster_center=None, nonmatch_cluster_center=None, **kwargs)¶ KMeans classifier.
The K-means clustering algorithm (wikipedia) partitions candidate record pairs into matches and non-matches. Each comparison vector belongs to the cluster with the nearest mean.
The K-means algorithm is an unsupervised learning algorithm. The algorithm doesn’t need training data for fitting. The algorithm is calibrated for two clusters: a match cluster and a non-match cluster. The centers of these clusters can be given as arguments or set automatically.
The KMeansClassifier classifier uses the
sklearn.cluster.KMeans
clustering algorithm from SciKit-learn as kernel.
Parameters: - match_cluster_center (list, numpy.array) – The center of the match cluster. The length of the list/array must equal the number of comparison variables. If None, the match cluster center is set automatically. Default None.
- nonmatch_cluster_center (list, numpy.array) – The center of the nonmatch (distinct) cluster. The length of the list/array must equal the number of comparison variables. If None, the non-match cluster center is set automatically. Default None.
- **kwargs – Additional arguments to pass to
sklearn.cluster.KMeans
.
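The two-cluster idea can be sketched in a few lines of plain Python. The function `two_means` and the toy vectors are hypothetical; the sketch starts from an explicit match center and non-match center, as the parameters above allow, and is not the SciKit-learn kernel itself.

```python
def two_means(vectors, match_center, nonmatch_center, max_iter=10):
    """K-means with two clusters: label 1 is the match cluster, 0 the non-match cluster."""
    centers = [list(nonmatch_center), list(match_center)]
    labels = []
    for _ in range(max_iter):
        # Assign every comparison vector to the nearest center (squared distance).
        labels = [
            min((0, 1), key=lambda c: sum((x - y) ** 2 for x, y in zip(vec, centers[c])))
            for vec in vectors
        ]
        # Move each center to the mean of its members.
        new_centers = []
        for c in (0, 1):
            members = [vec for vec, lab in zip(vectors, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


vectors = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0]]
labels, centers = two_means(vectors, match_center=[1, 1, 1], nonmatch_center=[0, 0, 0])
```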
-
kernel
¶ The kernel of the classifier. The kernel is
sklearn.cluster.KMeans
from SciKit-learn.
Type: sklearn.cluster.KMeans
-
match_cluster_center
¶ The center of the match cluster.
Type: numpy.array
-
nonmatch_cluster_center
¶ The center of the nonmatch (distinct) cluster.
Type: numpy.array
Note
There are better methods for linking records than the k-means clustering algorithm. This algorithm can be useful for an (unsupervised) initial partition.
-
prob
(*args, **kwargs)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters: - comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
- return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
-
fit
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
- match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.
-
fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors.
- match_index (pandas.MultiIndex) – The true matches.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
-
learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.
-
predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier has to be trained to call this method.
Parameters: - comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
Adapters¶
Adapters can be used to wrap machine learning models from external packages like SciKit-learn and Keras. For example, this makes it possible to classify record pairs with a neural network developed in Keras.
-
class
recordlinkage.adapters.
SKLearnAdapter
¶ SciKit-learn adapter for record pair classification.
SciKit-learn adapter for record pair classification with SciKit-learn models.
# import a SciKit-learn classifier
from sklearn.ensemble import RandomForestClassifier
# import BaseClassifier from recordlinkage.base
from recordlinkage.base import BaseClassifier
from recordlinkage.adapters import SKLearnAdapter
from recordlinkage.datasets import binary_vectors

class RandomForest(SKLearnAdapter, BaseClassifier):

    def __init__(self, *args, **kwargs):
        super(RandomForest, self).__init__()
        # set the kernel
        self.kernel = RandomForestClassifier(*args, **kwargs)

# make a sample dataset
features, links = binary_vectors(10000, 2000, return_links=True)
# initialise the random forest
cl = RandomForest(n_estimators=20)
cl.fit(features, links)
# predict the matches
cl.predict(...)
-
class
recordlinkage.adapters.
KerasAdapter
¶ Keras adapter for record pair classification.
Keras adapter for record pair classification with Keras models.
Example of a Keras model used for classification.
import tensorflow as tf
from tensorflow.keras import layers
from recordlinkage.base import BaseClassifier
from recordlinkage.adapters import KerasAdapter

class NNClassifier(KerasAdapter, BaseClassifier):
    """Neural network classifier."""

    def __init__(self):
        super(NNClassifier, self).__init__()
        # build a small feed-forward network for 8 comparison variables
        model = tf.keras.Sequential()
        model.add(layers.Dense(16, input_dim=8, activation='relu'))
        model.add(layers.Dense(8, activation='relu'))
        model.add(layers.Dense(1, activation='sigmoid'))
        model.compile(
            optimizer=tf.keras.optimizers.Adam(0.001),
            loss='binary_crossentropy',
            metrics=['accuracy']
        )
        self.kernel = model

# initialise the model
cl = NNClassifier()
# fit the model to the data
cl.fit(X_train, links_true)
# predict the class of the data
cl.predict(X_pred)
User-defined algorithms¶
User-defined classification algorithms can make use of the recordlinkage.base.BaseClassifier
as base class. SciKit-learn based models may want to use
recordlinkage.adapters.SKLearnAdapter
as a base class as well.
-
class
recordlinkage.base.
BaseClassifier
¶ Base class for classification of records pairs.
This class contains methods for training the classifier. Distinguish different types of training, such as supervised and unsupervised learning.
-
learn
(*args, **kwargs)¶ [DEPRECATED] Use ‘fit_predict’.
-
fit
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors (or features) to train the model with.
- match_index (pandas.MultiIndex) – A pandas.MultiIndex object with the true matches. The MultiIndex contains only the true matches. Default None.
Note
A note in case of finding links within a single dataset (for example duplicate detection). Ensure that the training record pairs are from the lower triangular part of the dataset/matrix. See detailed information here: link.
-
fit_predict
(comparison_vectors, match_index=None)¶ Train the classifier.
Parameters: - comparison_vectors (pandas.DataFrame) – The comparison vectors.
- match_index (pandas.MultiIndex) – The true matches.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
-
predict
(comparison_vectors)¶ Predict the class of the record pairs.
Classify a set of record pairs based on their comparison vectors into matches, non-matches and possible matches. The classifier has to be trained to call this method.
Parameters: - comparison_vectors (pandas.DataFrame) – Dataframe with comparison vectors.
- return_type (str) – Deprecated. Use recordlinkage.set_option('classification.return_type', 'index') instead.
Returns: pandas.Series – A pandas Series with the labels 1 (for the matches) and 0 (for the non-matches).
-
prob
(comparison_vectors, return_type=None)¶ Compute the probabilities for each record pair.
For each pair of records, estimate the probability of being a match.
Parameters: - comparison_vectors (pandas.DataFrame) – The dataframe with comparison vectors.
- return_type (str) – Deprecated. (default ‘series’)
Returns: pandas.Series or numpy.ndarray – The probability of being a match for each record pair.
-
Probabilistic models can use the Fellegi and Sunter base class. This class is
used for the recordlinkage.ECMClassifier
and the
recordlinkage.NaiveBayesClassifier
.
-
class
recordlinkage.classifiers.
FellegiSunter
(use_col_names=True, *args, **kwargs)¶ Fellegi and Sunter (1969) framework.
Meta class for probabilistic classification algorithms. The Fellegi and Sunter class is used for the
recordlinkage.NaiveBayesClassifier
and recordlinkage.ECMClassifier.
Parameters: use_col_names (bool) – Use the column names of the pandas.DataFrame to identify the parameters. If False, the column index of the feature is used. Default True.
References
Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.
-
log_p
¶ Log match probability as described in the FS framework
-
log_m_probs
¶ Log probability P(x_i=1|Match) as described in the FS framework
-
log_u_probs
¶ Log probability P(x_i=1|Non-match) as described in the FS framework
-
log_weights
¶ Log weights as described in the FS framework
-
p
¶ Match probability as described in the FS framework
-
m_probs
¶ Probability P(x_i=1|Match) as described in the FS framework
-
u_probs
¶ Probability P(x_i=1|Non-match) as described in the FS framework
-
weights
¶ Weights as described in the FS framework
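For orientation, the agreement and disagreement weights of the Fellegi and Sunter framework are commonly defined as log(m/u) and log((1-m)/(1-u)). The sketch below computes them with natural logarithms; the helper names and the probabilities are hypothetical, and the exact contents of the `weights` and `log_weights` attributes above may be organised differently.

```python
import math


def fs_weights(m_probs, u_probs):
    """Agreement and disagreement weights per comparison variable."""
    agree = [math.log(m / u) for m, u in zip(m_probs, u_probs)]
    disagree = [math.log((1 - m) / (1 - u)) for m, u in zip(m_probs, u_probs)]
    return agree, disagree


def total_weight(vector, agree, disagree):
    """Sum the weight that applies to each element of a binary comparison vector."""
    return sum(a if x else d for x, a, d in zip(vector, agree, disagree))


# Hypothetical m- and u-probabilities for two comparison variables.
m_probs = [0.9, 0.8]
u_probs = [0.1, 0.2]
agree, disagree = fs_weights(m_probs, u_probs)

score_all_agree = total_weight([1, 1], agree, disagree)
score_all_disagree = total_weight([0, 0], agree, disagree)
```

A high total weight points towards a match and a low one towards a non-match.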
-
Examples¶
Unsupervised learning with the ECM algorithm. [See example on Github.](https://github.com/J535D165/recordlinkage/examples/unsupervised_learning.py)
Network¶
The Python Record Linkage Toolkit provides network/graph analysis tools for classification of record pairs into matches and distinct pairs. The toolkit provides the functionality for one-to-one linking and one-to-many linking. It is also possible to detect all connected components which is useful in data deduplication.
-
class
recordlinkage.
OneToOneLinking
(method='greedy')¶ [EXPERIMENTAL] One-to-one linking
A record from dataset A can match at most one record from dataset B. For example, (a1, a2) are records from A and (b1, b2) are records from B. A linkage of (a1, b1), (a1, b2), (a2, b1), (a2, b2) is not one-to-one connected. One of the results of one-to-one linking can be (a1, b1), (a2, b2).
Parameters: method (str) – The method to solve the problem. Only ‘greedy’ is supported at the moment.
Note
This class is experimental and might change in future versions.
-
compute
(links)¶ Compute the one-to-one linking.
Parameters: links (pandas.MultiIndex) – The pairs to apply linking to. Returns: pandas.MultiIndex – A one-to-one matched MultiIndex of record pairs.
-
-
class
recordlinkage.
OneToManyLinking
(level=0, method='greedy')¶ [EXPERIMENTAL] One-to-many linking
A record from dataset A can link to multiple records from dataset B, but a record from B can link to only one record of dataset A. Use the level argument to switch A and B.
Example
Consider a MultiIndex with record pairs constructed from datasets A and B. To apply one-to-many linking with the default orientation, use the following syntax:
> one_to_many = OneToManyLinking(0)
> one_to_many.compute(links)
To switch the roles of A and B, use:
> one_to_many = OneToManyLinking(1)
> one_to_many.compute(links)
Note
This class is experimental and might change in future versions.
-
compute
(links)¶ Compute the one-to-many matching.
Parameters: links (pandas.MultiIndex) – The pairs to apply linking to. Returns: pandas.MultiIndex – A one-to-many matched MultiIndex of record pairs.
-
-
class
recordlinkage.
ConnectedComponents
¶ [EXPERIMENTAL] Connected record pairs
This class identifies connected record pairs. Connected components are especially used in detecting duplicates in a single dataset.
Note
This class is experimental and might change in future versions.
-
compute
(links)¶ Return the connected components.
Parameters: links (pandas.MultiIndex) – The record pairs in which to find connected components. Returns: list of pandas.MultiIndex – A list with pandas.MultiIndex objects. Each MultiIndex object represents a set of connected record pairs.
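The grouping can be illustrated with a small union-find sketch on plain tuples. It assumes that all record identifiers come from one dataset, as in deduplication. The function `connected_components` is hypothetical and returns lists of pairs rather than pandas.MultiIndex objects.

```python
def connected_components(links):
    """Group record pairs whose records are directly or indirectly linked."""
    parent = {}

    def find(x):
        # Find the representative of x, shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of the two records in every pair.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the pairs per representative record.
    groups = {}
    for pair in links:
        groups.setdefault(find(pair[0]), []).append(pair)
    return list(groups.values())


# Records 1, 2 and 3 are duplicates of each other; so are 4 and 5.
links = [(1, 2), (2, 3), (4, 5)]
components = connected_components(links)
```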
-