Classification algorithms

In the context of record linkage, classification refers to the process of dividing record pairs into matches and non-matches (distinct pairs). There are dozens of classification algorithms for record linkage. Roughly speaking, classification algorithms fall into two groups:

  • supervised learning algorithms - These algorithms make use of training data. If training data is available, supervised learning algorithms are usually the better choice, as most of them offer good accuracy and reliability. Examples of supervised learning algorithms in the Python Record Linkage Toolkit are Logistic Regression, Naive Bayes and Support Vector Machines.

  • unsupervised learning algorithms - These algorithms do not need training data. The Python Record Linkage Toolkit supports K-means clustering and an Expectation/Conditional Maximisation classifier.

First things first

The examples below make use of the Krebs register (German for cancer registry) dataset. The Krebs register dataset contains comparison vectors for a large set of record pairs. For each record pair, it is known whether the records represent the same person (match) or not (non-match); this ground truth was established with a massive clerical review. First, import the recordlinkage module and load the Krebs register data. The dataset contains 5749132 compared record pairs and has the following variables: first name, last name, sex, birthday, birth month, birth year and zip code. The Krebs register contains len(krebs_true_links) == 20931 matching record pairs.

In [1]: import pandas
   ...: import recordlinkage as rl
   ...: from recordlinkage.datasets import load_krebsregister
   ...: 
In [2]: krebs_X, krebs_true_links = load_krebsregister(missing_values=0)
   ...: krebs_X
   ...: 
Downloading data to /home/docs/rl_data/krebsregister.
Data download succesfull.
Out[2]: 
             cmp_firstname1  cmp_firstname2  ...  cmp_birthyear  cmp_zipcode
id1   id2                                    ...                            
22161 38467        1.000000             0.0  ...            0.0          0.0
38713 75352        0.000000             0.0  ...            0.0          0.0
13699 32825        0.166667             0.0  ...            1.0          0.0
22709 37682        0.285714             0.0  ...            0.0          0.0
2342  69060        0.250000             0.0  ...            1.0          0.0
...                     ...             ...  ...            ...          ...
52124 53629        1.000000             0.0  ...            1.0          0.0
30007 76846        0.750000             0.0  ...            0.0          0.0
50546 59461        0.750000             0.0  ...            0.0          0.0
43175 62151        1.000000             0.0  ...            0.0          0.0
11651 57925        1.000000             0.0  ...            1.0          0.0

[5749132 rows x 9 columns]

Most classifiers cannot handle comparison vectors with missing values. To prevent issues with the classification algorithms, the missing values are converted into disagreeing comparisons (with the argument missing_values=0). This approach to handling missing values is widely used in record linkage applications.
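
The same effect can be achieved with plain pandas by filling the missing values with 0 after loading. A minimal, self-contained sketch of the idea (the toy column values are only for illustration):

import pandas as pd

# Hypothetical comparison vectors with a missing similarity value
df = pd.DataFrame({"cmp_firstname1": [1.0, None], "cmp_zipcode": [0.0, 1.0]})

# Treat missing comparisons as disagreements (similarity 0)
df = df.fillna(0.0)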

In [3]: krebs_X.describe()
Out[3]: 
       cmp_firstname1  cmp_firstname2  ...  cmp_birthyear   cmp_zipcode
count    5.749132e+06    5.749132e+06  ...   5.749132e+06  5.749132e+06
mean     7.127776e-01    1.623376e-02  ...   2.227178e-01  5.516311e-03
std      3.888388e-01    1.251994e-01  ...   4.160704e-01  7.406674e-02
min      0.000000e+00    0.000000e+00  ...   0.000000e+00  0.000000e+00
25%      2.857143e-01    0.000000e+00  ...   0.000000e+00  0.000000e+00
50%      1.000000e+00    0.000000e+00  ...   0.000000e+00  0.000000e+00
75%      1.000000e+00    0.000000e+00  ...   0.000000e+00  0.000000e+00
max      1.000000e+00    1.000000e+00  ...   1.000000e+00  1.000000e+00

[8 rows x 9 columns]

Supervised learning

As described before, supervised learning algorithms require training data, i.e. data for which the true match status of each comparison vector is known. In the example in this section, we assume that the true match status of the first 5000 record pairs of the Krebs register data is known.

In [4]: golden_pairs = krebs_X[0:5000]
   ...: # 24 matching pairs
   ...: golden_matches_index = golden_pairs.index.intersection(krebs_true_links)
   ...: golden_matches_index
   ...: 
Out[4]: 
MultiIndex([(89874, 89876),
            (79126, 84983),
            (40350, 83715),
            (75394, 92002),
            (23323, 27823),
            (31059, 72216),
            (28464, 69899),
            (33613, 64971),
            (23546, 27978),
            (29922, 46075),
            (22436, 23281),
            (34064, 43424),
            (14811, 14882),
            (34287, 81544),
            (28539, 34715),
            (17937, 63083),
            (49588, 71543),
            (88108, 94380),
            (34171, 46602),
            (35967, 71229),
            (69924, 74737),
            (46933, 47037),
            (32487, 61025),
            (20713, 48727)],
           names=['id1', 'id2'])

Logistic regression

The recordlinkage.LogisticRegressionClassifier is an application of the logistic regression model. This supervised learning method is one of the oldest classification algorithms used in record linkage. With enough training data, the algorithm gives relatively good results.

In [5]: # Initialize the classifier
   ...: logreg = rl.LogisticRegressionClassifier()
   ...: 

In [6]: # Train the classifier
   ...: logreg.fit(golden_pairs, golden_matches_index)
   ...: print ("Intercept: ", logreg.intercept)
   ...: print ("Coefficients: ", logreg.coefficients)
   ...: 
Intercept:  -11.759937013139629
Coefficients:  [ 1.62052151e+00  7.85074832e-02  2.93697610e+00 -5.22801106e-04
  4.29397318e-01  1.96144730e+00  1.43218255e+00  1.98293940e+00
  3.17172106e+00]

Predict the match status for all record pairs.

In [7]: result_logreg = logreg.predict(krebs_X)
   ...: len(result_logreg)
   ...: 
Out[7]: 19871
In [8]: rl.confusion_matrix(krebs_true_links, result_logreg, len(krebs_X))
Out[8]: 
array([[  19867,    1064],
       [      4, 5728197]])

The confusion matrix has the form [[TP, FN], [FP, TN]]: 19867 of the 20931 true matches are found, and only 4 non-matches are classified as matches. The F-score for this prediction is

In [9]: rl.fscore(krebs_true_links, result_logreg)
Out[9]: 0.9738248125091908

The predicted number of matches (19871) is close to the number of true matches (20931). This result was achieved with a small training set of only 5000 record pairs.
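
Besides the F-score, precision and recall can be computed with the toolkit's measures functions. A short sketch:

# Precision: the fraction of predicted matches that are true matches
print("Precision: ", rl.precision(krebs_true_links, result_logreg))

# Recall: the fraction of true matches that are found
print("Recall: ", rl.recall(krebs_true_links, result_logreg))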

In (older) literature, record linkage procedures are often divided into deterministic record linkage and probabilistic record linkage. The logistic regression classifier belongs to the deterministic record linkage methods. Each feature/variable has a certain importance (called a weight). Each weight is multiplied by the corresponding value in the comparison/similarity vector, and if the total sum exceeds a certain threshold, the pair is considered a match.
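
Written out by hand, the decision rule looks as follows. This is a sketch with NumPy, not the toolkit's internal code; it uses rounded values from the fit above, and the threshold of 0 corresponds to a predicted match probability of 0.5.

import numpy as np

# Rounded intercept and coefficients from the fitted model above
intercept = -11.76
coefficients = np.array(
    [1.62, 0.08, 2.94, -0.0005, 0.43, 1.96, 1.43, 1.98, 3.17])

# A comparison vector that agrees on everything except cmp_firstname2
comparison_vector = np.array([1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

score = intercept + comparison_vector @ coefficients
print(score > 0)  # True: the score (about 1.77) exceeds the threshold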

In [10]: intercept = -9
   ....: coefficients = [2.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0]
   ....: 

In [11]: logreg = rl.LogisticRegressionClassifier(coefficients, intercept)
   ....: result_logreg_pretrained = logreg.predict(krebs_X)
   ....: len(result_logreg_pretrained)
   ....: 
Out[11]: 21303
In [12]: rl.confusion_matrix(krebs_true_links, result_logreg_pretrained, len(krebs_X))
Out[12]: 
array([[  20857,      74],
       [    446, 5727755]])

The F-score for this classification is

In [13]: rl.fscore(krebs_true_links, result_logreg_pretrained)
Out[13]: 0.987687645025335

For the given coefficients, the F-score is better than for the classifier trained on the small training set. Surprising? Not really: with more training data, the trained classifier will improve as well.

Naive Bayes

In contrast to the logistic regression classifier, the Naive Bayes classifier is a probabilistic classifier. The probabilistic record linkage framework by Fellegi and Sunter (1969) is the most well-known probabilistic classification method for record linkage. It was later shown that the Fellegi and Sunter method is mathematically equivalent to the Naive Bayes method when independence between the comparison variables is assumed.
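
To see the connection, recall that in the Fellegi and Sunter framework each attribute contributes an agreement weight log(m/u) and a disagreement weight log((1-m)/(1-u)), where m and u are the probabilities that the attribute agrees for matches and non-matches respectively. A minimal sketch with made-up m- and u-probabilities (base-2 logarithms are a common choice):

from math import log2

# Hypothetical m- and u-probabilities for a single attribute
m, u = 0.95, 0.02

w_agree = log2(m / u)  # weight contributed when the attribute agrees
w_disagree = log2((1 - m) / (1 - u))  # weight when it disagrees

print(w_agree, w_disagree)  # about 5.57 and -4.29

In the example below, the binarize argument converts the continuous comparison values into binary values (0/1) at the given threshold before fitting.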

In [14]: # Train the classifier
   ....: nb = rl.NaiveBayesClassifier(binarize=0.3)
   ....: nb.fit(golden_pairs, golden_matches_index)
   ....: 
In [15]: # Predict the match status for all record pairs
   ....: result_nb = nb.predict(krebs_X)
   ....: len(result_nb)
   ....: 
Out[15]: 19837
In [16]: rl.confusion_matrix(krebs_true_links, result_nb, len(krebs_X))
Out[16]: 
array([[  19825,    1106],
       [     12, 5728189]])

The F-score for this classification is

In [17]: rl.fscore(krebs_true_links, result_nb)
Out[17]: 0.9725765306122448

Support Vector Machines

Support Vector Machines (SVMs) have become increasingly popular in record linkage. The algorithm performs well when only a small amount of training data is available. The SVM implementation in the Python Record Linkage Toolkit is a linear SVM algorithm.

In [18]: # Train the classifier
   ....: svm = rl.SVMClassifier()
   ....: svm.fit(golden_pairs, golden_matches_index)
   ....: 
In [19]: # Predict the match status for all record pairs
   ....: result_svm = svm.predict(krebs_X)
   ....: len(result_svm)
   ....: 
Out[19]: 20839
In [20]: rl.confusion_matrix(krebs_true_links, result_svm, len(krebs_X))
Out[20]: 
array([[  20825,     106],
       [     14, 5728187]])

The F-score for this classification is

In [21]: rl.fscore(krebs_true_links, result_svm)
Out[21]: 0.997127124730668

Unsupervised learning

In situations without training data, unsupervised learning can be a solution for record linkage problems. This section discusses two unsupervised learning methods: K-means clustering and an implementation of the Expectation/Conditional Maximisation algorithm. Unsupervised learning algorithms usually take more computational time because of their iterative structure.

K-means clustering

The K-means clustering algorithm is well known and widely used in big data analysis. The K-means classifier in the Python Record Linkage Toolkit is configured such that it can be used for linking records: the comparison vectors are partitioned into two clusters, one for matches and one for non-matches. For more information about K-means clustering, see Wikipedia.

In [22]: kmeans = rl.KMeansClassifier()
   ....: result_kmeans = kmeans.fit_predict(krebs_X)
   ....: len(result_kmeans)
   ....: 
Out[22]: 371525

The classifier is now trained and the comparison vectors are classified.

In [23]: rl.confusion_matrix(krebs_true_links, result_kmeans, len(krebs_X))
Out[23]: 
array([[  20797,     134],
       [ 350728, 5377473]])
In [24]: rl.fscore(krebs_true_links, result_kmeans)
Out[24]: 0.10598385551501316
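
The confusion matrix explains the low F-score: the classifier finds most of the 20931 true matches, but it also classifies 350728 non-matches as matches, which ruins the precision.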

Expectation/Conditional Maximisation Algorithm

The ECM algorithm is an Expectation-Maximisation algorithm with some additional constraints. It is closely related to the Naive Bayes algorithm and to estimating the parameters of the Fellegi and Sunter (1969) framework. Like Naive Bayes, the ECM algorithm assumes that the comparison attributes are independent of each other.

In [25]: # Train the classifier
   ....: ecm = rl.ECMClassifier(binarize=0.8)
   ....: result_ecm = ecm.fit_predict(krebs_X)
   ....: len(result_ecm)
   ....: 
Out[25]: 19834
In [26]: rl.confusion_matrix(krebs_true_links, result_ecm, len(krebs_X))
Out[26]: 
array([[  19830,    1101],
       [      4, 5728197]])

The F-score for this classification is

In [27]: rl.fscore(krebs_true_links, result_ecm)
Out[27]: 0.9728934134674353
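
The parameters estimated by the ECM algorithm can be inspected afterwards. Recent versions of the toolkit expose them as attributes; treat the exact attribute names below as version dependent and check the ECMClassifier API reference if they are missing.

# Assumption: these attributes exist in your toolkit version
print("P(match): ", ecm.p)  # estimated prior probability of a match
print("m-probabilities: ", ecm.m_probs)  # P(agreement | match)
print("u-probabilities: ", ecm.u_probs)  # P(agreement | non-match)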