Datasets

The Python Record Linkage Toolkit contains several public datasets. Four of them were generated by the developers of Febrl. Tools for generating your own datasets are under development.

recordlinkage.datasets.load_krebsregister(block=None, missing_values=None, shuffle=True)

Load the Krebsregister dataset.

This dataset of comparison patterns was obtained in an epidemiological cancer study in Germany. The comparison patterns were created by the Institute for Medical Biostatistics, Epidemiology and Informatics (IMBEI) and the University Medical Center of Johannes Gutenberg University (Mainz, Germany). The dataset is available online for research use.

“The records represent individual data including first and family name, sex, date of birth and postal code, which were collected through iterative insertions in the course of several years. The comparison patterns in this data set are based on a sample of 100.000 records dating from 2005 to 2008. Data pairs were classified as “match” or “non-match” during an extensive manual review where several documentarists were involved. The resulting classification formed the basis for assessing the quality of the registry’s own record linkage procedure.

In order to limit the amount of patterns a blocking procedure was applied, which selects only record pairs that meet specific agreement conditions. The results of the following six blocking iterations were merged together:

  • Phonetic equality of first name and family name, equality of date of birth.

  • Phonetic equality of first name, equality of day of birth.

  • Phonetic equality of first name, equality of month of birth.

  • Phonetic equality of first name, equality of year of birth.

  • Equality of complete date of birth.

  • Phonetic equality of family name, equality of sex.

This procedure resulted in 5.749.132 record pairs, of which 20.931 are matches. The data set is split into 10 blocks of (approximately) equal size and ratio of matches to non-matches.”
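The merging step can be pictured as a set union over the candidate pairs produced by each blocking pass, so that a pair found by several passes is kept only once. The sketch below is an illustration under stated assumptions: the toy records, the `block_pairs` helper, and the use of exact string equality in place of phonetic equality are all hypothetical and are not part of the registry's procedure or of this toolkit.

```python
from itertools import combinations

# Toy records (hypothetical): id -> (first name, family name, birth date, sex)
records = {
    1: ("anna", "meier", "1970-03-12", "f"),
    2: ("anna", "meyer", "1970-03-12", "f"),
    3: ("hanna", "meier", "1965-11-02", "f"),
    4: ("anna", "schmidt", "1982-07-30", "f"),
    5: ("anne", "meier", "1970-03-12", "f"),
}


def block_pairs(records, key):
    """Return all record pairs whose blocking key is equal."""
    groups = {}
    for rec_id, rec in records.items():
        groups.setdefault(key(rec), []).append(rec_id)
    pairs = set()
    for ids in groups.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Two illustrative passes; exact equality stands in for phonetic equality.
passes = [
    lambda r: r[2],          # equality of complete date of birth
    lambda r: (r[1], r[3]),  # equality of family name and sex
]

# Merge the passes: the union keeps a pair once even if several passes find it.
candidate_pairs = set()
for key in passes:
    candidate_pairs |= block_pairs(records, key)

print(sorted(candidate_pairs))
# [(1, 2), (1, 3), (1, 5), (2, 5), (3, 5)]
```

Pair (1, 5) is produced by both passes but appears once in the merged result.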

Parameters:
  • block (int, list) – An integer or a list of integers between 1 and 10, selecting which of the 10 blocks from the description to load. By default, all blocks (1 to 10) are loaded.

  • missing_values (object, int, float) – The value used to represent missing values. Default NaN.

  • shuffle (bool) – Shuffle the record pairs. Default True.

Returns:

(pandas.DataFrame, pandas.MultiIndex) – A pandas.DataFrame with comparison vectors and a pandas.MultiIndex with the indices of the matches.
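The counts quoted in the description imply a strongly imbalanced classification problem, which is worth keeping in mind when selecting blocks or evaluating a classifier on this data. The following back-of-the-envelope calculation uses only the figures from the description; the variable names are illustrative.

```python
# Figures quoted in the dataset description
n_pairs = 5_749_132
n_matches = 20_931
n_blocks = 10

match_rate = n_matches / n_pairs
nonmatches_per_match = (n_pairs - n_matches) / n_matches
pairs_per_block = n_pairs / n_blocks

print(f"match rate: {match_rate:.3%}")
print(f"non-matches per match: {nonmatches_per_match:.1f}")
print(f"pairs per block (approx.): {pairs_per_block:,.0f}")
```

Fewer than 1 in 250 record pairs is a match, so accuracy alone says little about classifier quality on this dataset.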

recordlinkage.datasets.load_febrl1(return_links=False)

Load the FEBRL 1 dataset.

The Freely Extensible Biomedical Record Linkage (Febrl) package is distributed with a dataset generator and four datasets generated with the generator. This function returns the first Febrl dataset as a pandas.DataFrame.

“This data set contains 1000 records (500 original and 500 duplicates, with exactly one duplicate per original record).”

Parameters:

return_links (bool) – When True, the function also returns the true links.

Returns:

pandas.DataFrame – A pandas.DataFrame containing Febrl dataset1.csv. When return_links is True, the function also returns the true links. The true links are all links in the lower triangular part of the matrix, so each pair of matching records is listed once.
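To make the "lower triangular" remark concrete: if all records are laid out along the rows and columns of a square matrix, only pairs whose row position is greater than their column position are kept, so a link is never listed as both (a, b) and (b, a). The sketch below rebuilds such links in plain Python. It assumes record identifiers of the form rec-<n>-org and rec-<n>-dup-<k>, which is an assumption about the index of the returned DataFrame, and the `true_links` helper is hypothetical and not part of the toolkit.

```python
import re
from itertools import combinations

# Hypothetical record ids, in the order they appear in the DataFrame index
record_ids = ["rec-2-org", "rec-0-dup-0", "rec-2-dup-0", "rec-0-org", "rec-1-org"]


def true_links(record_ids):
    """Pair up records that share an entity number, listing each pair once."""
    groups = {}
    for position, rec_id in enumerate(record_ids):
        entity = re.match(r"rec-(\d+)", rec_id).group(1)
        groups.setdefault(entity, []).append(position)

    links = []
    for positions in groups.values():
        for lower, higher in combinations(positions, 2):
            # Lower triangle: the record with the larger position comes first.
            links.append((record_ids[higher], record_ids[lower]))
    return links


print(true_links(record_ids))
# [('rec-2-dup-0', 'rec-2-org'), ('rec-0-org', 'rec-0-dup-0')]
```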

recordlinkage.datasets.load_febrl2(return_links=False)

Load the FEBRL 2 dataset.

The Freely Extensible Biomedical Record Linkage (Febrl) package is distributed with a dataset generator and four datasets generated with the generator. This function returns the second Febrl dataset as a pandas.DataFrame.

“This data set contains 5000 records (4000 originals and 1000 duplicates), with a maximum of 5 duplicates based on one original record (and a poisson distribution of duplicate records). Distribution of duplicates:

  • 19 originals records have 5 duplicate records

  • 47 originals records have 4 duplicate records

  • 107 originals records have 3 duplicate records

  • 141 originals records have 2 duplicate records

  • 114 originals records have 1 duplicate record

  • 572 originals records have no duplicate record”

Parameters:

return_links (bool) – When True, the function also returns the true links.

Returns:

pandas.DataFrame – A pandas.DataFrame containing Febrl dataset2.csv. When return_links is True, the function also returns the true links. The true links are all links in the lower triangular part of the matrix, so each pair of matching records is listed once.
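The duplicate distribution quoted in the description can be checked against the stated total of 1000 duplicates, and it also gives a rough idea of how many true links to expect. The sketch below assumes that every pair of records within a cluster (an original plus its duplicates) counts as one true link; that is an assumption about how links are defined, not something the description states.

```python
from math import comb

# Quoted distribution: duplicates per original -> number of such originals
distribution = {5: 19, 4: 47, 3: 107, 2: 141, 1: 114}

# Total number of duplicate records implied by the distribution
n_duplicates = sum(k * count for k, count in distribution.items())

# An original with k duplicates forms a cluster of k + 1 records;
# every unordered pair inside a cluster is counted as one true link.
n_true_links = sum(comb(k + 1, 2) * count for k, count in distribution.items())

print(n_duplicates)  # 1000
print(n_true_links)  # 1934
```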

recordlinkage.datasets.load_febrl3(return_links=False)

Load the FEBRL 3 dataset.

The Freely Extensible Biomedical Record Linkage (Febrl) package is distributed with a dataset generator and four datasets generated with the generator. This function returns the third Febrl dataset as a pandas.DataFrame.

“This data set contains 5000 records (2000 originals and 3000 duplicates), with a maximum of 5 duplicates based on one original record (and a Zipf distribution of duplicate records). Distribution of duplicates:

  • 168 originals records have 5 duplicate records

  • 161 originals records have 4 duplicate records

  • 212 originals records have 3 duplicate records

  • 256 originals records have 2 duplicate records

  • 368 originals records have 1 duplicate record

  • 1835 originals records have no duplicate record”

Parameters:

return_links (bool) – When True, the function also returns the true links.

Returns:

pandas.DataFrame – A pandas.DataFrame containing Febrl dataset3.csv. When return_links is True, the function also returns the true links. The true links are all links in the lower triangular part of the matrix, so each pair of matching records is listed once.

recordlinkage.datasets.load_febrl4(return_links=False)

Load the FEBRL 4 datasets.

The Freely Extensible Biomedical Record Linkage (Febrl) package is distributed with a dataset generator and four datasets generated with the generator. This function returns the fourth Febrl dataset as two pandas.DataFrame objects, one for the original records and one for the duplicates.

“Generated as one data set with 10000 records (5000 originals and 5000 duplicates, with one duplicate per original), the originals have been split from the duplicates, into dataset4a.csv (containing the 5000 original records) and dataset4b.csv (containing the 5000 duplicate records). These two data sets can be used for testing linkage procedures.”

Parameters:

return_links (bool) – When True, the function also returns the true links.

Returns:

(pandas.DataFrame, pandas.DataFrame) – A pandas.DataFrame containing Febrl dataset4a.csv and a pandas.DataFrame containing Febrl dataset4b.csv. When return_links is True, the function also returns the true links.

recordlinkage.datasets.binary_vectors(n, n_match, m=[0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9], u=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], random_state=None, return_links=False, dtype=numpy.int8)

Generate random binary comparison vectors.

This function is used to generate random comparison vectors. The result of each comparison is a binary value (0 or 1).

Parameters:
  • n (int) – The total number of comparison vectors.

  • n_match (int) – The number of matching record pairs.

  • m (list, default [0.9] * 8, optional) – A list of m probabilities, one for each partially identifying variable. The m probability is the probability that an identifier agrees in matching record pairs.

  • u (list, default [0.1] * 8, optional) – A list of u probabilities, one for each partially identifying variable. The u probability is the probability that an identifier agrees in non-matching record pairs.

  • random_state (int or numpy.random.RandomState, optional) – Seed for the random number generator, given as an integer or a numpy RandomState object.

  • return_links (bool) – When True, the function also returns the true links.

  • dtype (numpy.dtype) – The dtype of each column in the returned DataFrame.

Returns:

pandas.DataFrame – A pandas.DataFrame with comparison vectors. When return_links is True, the function also returns the true links.
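The role of the m and u probabilities can be illustrated with a small plain-Python sampler: each component of a matching pair is 1 with probability m[i], and each component of a non-matching pair is 1 with probability u[i]. This is a minimal sketch of the sampling idea only, not the library's implementation; the `sample_binary_vectors` function and its `seed` argument are hypothetical, and it returns nested lists rather than a DataFrame.

```python
import random


def sample_binary_vectors(n, n_match, m, u, seed=None):
    """Draw n binary comparison vectors; the first n_match rows are matches."""
    if len(m) != len(u):
        raise ValueError("m and u must have the same length")
    if not 0 <= n_match <= n:
        raise ValueError("n_match must lie between 0 and n")

    rng = random.Random(seed)
    vectors = []
    for row in range(n):
        # Matches agree with probability m[i], non-matches with u[i].
        probs = m if row < n_match else u
        vectors.append([int(rng.random() < p) for p in probs])
    return vectors


vectors = sample_binary_vectors(n=1000, n_match=200, m=[0.9] * 4, u=[0.1] * 4, seed=42)

# Observed agreement rates should be close to the m and u probabilities.
match_rate = sum(map(sum, vectors[:200])) / (200 * 4)
nonmatch_rate = sum(map(sum, vectors[200:])) / (800 * 4)
print(match_rate, nonmatch_rate)
```

Unlike this sketch, which keeps matches in the first rows for easy inspection, generated comparison vectors would normally be shuffled before use.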