Note
This page was generated from docs/guides/link_two_dataframes.ipynb.
Link two datasets
Introduction
This example shows how two datasets with data about persons can be linked. We will try to link the data based on attributes such as first name, surname, sex, date of birth, place and address. The data used in this example is part of Febrl and is fictitious.
First, start by importing the recordlinkage module. The submodule recordlinkage.datasets contains several datasets that can be used for testing. For this example, we use the Febrl datasets 4A and 4B, which can be loaded with the function load_febrl4.
[1]:
import recordlinkage
from recordlinkage.datasets import load_febrl4
The datasets are loaded with the following code. The returned datasets are of type pandas.DataFrame, which makes it easy to manipulate the data if desired. For details about data manipulation with pandas, see their comprehensive documentation at http://pandas.pydata.org/.
[2]:
dfA, dfB = load_febrl4()
dfA
[2]:
rec_id | given_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | soc_sec_id
---|---|---|---|---|---|---|---|---|---|---
rec-1070-org | michaela | neumann | 8 | stanley street | miami | winston hills | 4223 | nsw | 19151111 | 5304218 |
rec-1016-org | courtney | painter | 12 | pinkerton circuit | bega flats | richlands | 4560 | vic | 19161214 | 4066625 |
rec-4405-org | charles | green | 38 | salkauskas crescent | kela | dapto | 4566 | nsw | 19480930 | 4365168 |
rec-1288-org | vanessa | parr | 905 | macquoid place | broadbridge manor | south grafton | 2135 | sa | 19951119 | 9239102 |
rec-3585-org | mikayla | malloney | 37 | randwick road | avalind | hoppers crossing | 4552 | vic | 19860208 | 7207688 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
rec-2153-org | annabel | grierson | 97 | mclachlan crescent | lantana lodge | broome | 2480 | nsw | 19840224 | 7676186 |
rec-1604-org | sienna | musolino | 22 | smeaton circuit | pangani | mckinnon | 2700 | nsw | 19890525 | 4971506 |
rec-1003-org | bradley | matthews | 2 | jondol place | horseshoe ck | jacobs well | 7018 | sa | 19481122 | 8927667 |
rec-4883-org | brodee | egan | 88 | axon street | greenslopes | wamberal | 2067 | qld | 19121113 | 6039042 |
rec-66-org | koula | houweling | 3 | mileham street | old airdmillan road | williamstown | 2350 | nsw | 19440718 | 6375537 |
5000 rows × 10 columns
Make record pairs
It is very intuitive to compare each record in DataFrame dfA with all records of DataFrame dfB. In fact, we want to make record pairs: each record pair should contain one record of dfA and one record of dfB. This process of making record pairs is also called "indexing". With the recordlinkage module, indexing is easy. First, create a recordlinkage.Index object and call its .full() method. The object then generates a full index on an .index(...) call. In case of deduplication of a single dataframe, one dataframe is sufficient as argument.
[3]:
indexer = recordlinkage.Index()
indexer.full()
pairs = indexer.index(dfA, dfB)
WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.
With the method index, all possible (and unique) record pairs are made. The method returns a pandas.MultiIndex. The number of pairs is equal to the number of records in dfA times the number of records in dfB.
[4]:
print(len(dfA), len(dfB), len(pairs))
5000 5000 25000000
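Conceptually, a full index is just the Cartesian product of the two record indices. A minimal pandas-only sketch with made-up record identifiers (not part of the Febrl data, and not the library's implementation):

```python
import pandas as pd

# Toy record identifiers standing in for dfA.index and dfB.index
index_a = pd.Index(["a1", "a2", "a3"], name="rec_id_1")
index_b = pd.Index(["b1", "b2"], name="rec_id_2")

# A full index pairs every record in A with every record in B:
# the Cartesian product of the two indices
pairs = pd.MultiIndex.from_product([index_a, index_b])

print(len(pairs))  # 3 * 2 = 6 candidate pairs
```

With 5000 records in each dataframe, the same product yields the 25 000 000 pairs shown above.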
Many of these record pairs do not belong to the same person. In case of one-to-one matching, the number of matches should be no more than the number of records in the smallest dataframe: with full indexing, min(len(dfA), len(dfB)) is much smaller than len(pairs). The recordlinkage module has some more advanced indexing methods to reduce the number of record pairs by leaving obvious non-matches out of the index. Note that if a matching record pair is not included in the index, it cannot be matched anymore.
One of the most well-known indexing methods is blocking. This method includes only record pairs that are identical on one or more stored attributes of the person (or entity in general). The blocking method is available in the recordlinkage module.
[5]:
indexer = recordlinkage.Index()
indexer.block("given_name")
candidate_links = indexer.index(dfA, dfB)
len(candidate_links)
[5]:
77249
The argument "given_name" is the blocking variable. This variable has to be the name of a column in both dfA and dfB. It is also possible to pass a list of column names to block on multiple variables; blocking on multiple variables reduces the number of record pairs even further.
Another implemented indexing method is Sorted Neighbourhood Indexing (recordlinkage.index.SortedNeighbourhood). This method is very useful when there are many misspellings in the strings used for indexing. In fact, sorted neighbourhood indexing is a generalisation of blocking. See the documentation for details about sorted neighbourhood indexing.
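The core idea of sorted neighbourhood indexing can be sketched in a few lines of plain Python: sort the records by the indexing key, then pair up records that fall within a sliding window of the sorted order, so near-identical spellings end up adjacent even when they are not exactly equal. This is a simplified single-list illustration with toy data, not the library's implementation:

```python
# Records keyed by a possibly misspelled surname (toy data)
records = {"r1": "smith", "r2": "smyth", "r3": "jones", "r4": "johns", "r5": "brown"}

window = 3  # pair each record with the next window-1 records in sorted order

# Sort record ids by their key, then pair ids that are close in that order
ordered = sorted(records, key=records.get)
pairs = [
    (ordered[i], ordered[j])
    for i in range(len(ordered))
    for j in range(i + 1, min(i + window, len(ordered)))
]
print(pairs)
# Note that (r1, r2) -- "smith"/"smyth" -- is kept even though
# exact blocking on surname would have discarded it
```

A window of 1 degenerates to blocking on the sorted key, which is the sense in which sorted neighbourhood indexing generalises blocking.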
Compare records
Each record pair is a candidate match. To classify the candidate record pairs into matches and non-matches, compare the records on all attributes both records have in common. The recordlinkage module has a class named Compare that is used to compare the records. The following code shows how to compare attributes.
[6]:
compare_cl = recordlinkage.Compare()
compare_cl.exact("given_name", "given_name", label="given_name")
compare_cl.string(
"surname", "surname", method="jarowinkler", threshold=0.85, label="surname"
)
compare_cl.exact("date_of_birth", "date_of_birth", label="date_of_birth")
compare_cl.exact("suburb", "suburb", label="suburb")
compare_cl.exact("state", "state", label="state")
compare_cl.string("address_1", "address_1", threshold=0.85, label="address_1")
features = compare_cl.compute(candidate_links, dfA, dfB)
The comparison of the record pairs starts when the compute method is called. All attribute comparisons are stored in a DataFrame with the features as columns and the record pairs as rows.
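For intuition, a thresholded string comparison like the surname feature above can be sketched with the standard-library difflib in place of Jaro-Winkler; the principle is the same: compute a similarity score in [0, 1] per pair, then emit 1 when the score reaches the threshold. The surnames below are toy values, and difflib's ratio is a stand-in, not the metric the library uses:

```python
import difflib

import pandas as pd

# Aligned surname values for three toy candidate pairs
left = pd.Series(["neumann", "painter", "green"])
right = pd.Series(["neumann", "paynter", "browne"])

# Score each pair, then threshold the score to a binary feature
sim = pd.Series(
    [difflib.SequenceMatcher(None, a, b).ratio() for a, b in zip(left, right)]
)
feature = (sim >= 0.85).astype(int)
print(feature.tolist())  # [1, 1, 0]
```

Exact comparisons such as compare_cl.exact are the limiting case: the value is 1 only when the two fields are identical.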
[7]:
features
[7]:
rec_id_1 | rec_id_2 | given_name | surname | date_of_birth | suburb | state | address_1
---|---|---|---|---|---|---|---
rec-1070-org | rec-3024-dup-0 | 1 | 0.0 | 0 | 0 | 1 | 0.0
rec-1070-org | rec-2371-dup-0 | 1 | 0.0 | 0 | 0 | 0 | 0.0
rec-1070-org | rec-4652-dup-0 | 1 | 0.0 | 0 | 0 | 0 | 0.0
rec-1070-org | rec-4795-dup-0 | 1 | 0.0 | 0 | 0 | 1 | 0.0
rec-1070-org | rec-1314-dup-0 | 1 | 0.0 | 0 | 0 | 1 | 0.0
... | ... | ... | ... | ... | ... | ... | ... |
rec-4528-org | rec-4528-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
rec-4887-org | rec-4887-dup-0 | 1 | 1.0 | 1 | 0 | 1 | 1.0 |
rec-4350-org | rec-4350-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
rec-4569-org | rec-4569-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 0.0 |
rec-3125-org | rec-3125-dup-0 | 1 | 1.0 | 1 | 0 | 1 | 1.0 |
77249 rows × 6 columns
[8]:
features.describe()
[8]:
  | given_name | surname | date_of_birth | suburb | state | address_1
---|---|---|---|---|---|---
count | 77249.0 | 77249.000000 | 77249.000000 | 77249.000000 | 77249.000000 | 77249.000000 |
mean | 1.0 | 0.044428 | 0.037929 | 0.032259 | 0.248767 | 0.036700 |
std | 0.0 | 0.206045 | 0.191027 | 0.176689 | 0.432301 | 0.188024 |
min | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 1.0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
The last step is to decide which records belong to the same person. In this example, we keep it simple:
[9]:
features.sum(axis=1).value_counts().sort_index(ascending=False)
[9]:
6.0 1566
5.0 1332
4.0 343
3.0 146
2.0 16427
1.0 57435
dtype: int64
[10]:
features[features.sum(axis=1) > 3]
[10]:
rec_id_1 | rec_id_2 | given_name | surname | date_of_birth | suburb | state | address_1
---|---|---|---|---|---|---|---
rec-2371-org | rec-2371-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
rec-3024-org | rec-3024-dup-0 | 1 | 1.0 | 1 | 0 | 1 | 0.0 |
rec-4652-org | rec-4652-dup-0 | 1 | 1.0 | 1 | 0 | 1 | 1.0 |
rec-4795-org | rec-4795-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
rec-1016-org | rec-1016-dup-0 | 1 | 1.0 | 1 | 1 | 0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... |
rec-4528-org | rec-4528-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
rec-4887-org | rec-4887-dup-0 | 1 | 1.0 | 1 | 0 | 1 | 1.0 |
rec-4350-org | rec-4350-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 1.0 |
rec-4569-org | rec-4569-dup-0 | 1 | 1.0 | 1 | 1 | 1 | 0.0 |
rec-3125-org | rec-3125-dup-0 | 1 | 1.0 | 1 | 0 | 1 | 1.0 |
3241 rows × 6 columns
Full code
[11]:
import recordlinkage
from recordlinkage.datasets import load_febrl4
dfA, dfB = load_febrl4()
# Indexation step
indexer = recordlinkage.Index()
indexer.block("given_name")
candidate_links = indexer.index(dfA, dfB)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.exact("given_name", "given_name", label="given_name")
compare_cl.string(
"surname", "surname", method="jarowinkler", threshold=0.85, label="surname"
)
compare_cl.exact("date_of_birth", "date_of_birth", label="date_of_birth")
compare_cl.exact("suburb", "suburb", label="suburb")
compare_cl.exact("state", "state", label="state")
compare_cl.string("address_1", "address_1", threshold=0.85, label="address_1")
features = compare_cl.compute(candidate_links, dfA, dfB)
# Classification step
matches = features[features.sum(axis=1) > 3]
print(len(matches))
3241