blocking: An R Package for Blocking of Records for Record Linkage and Deduplication

article
preprint
record linkage
Author

Beręsewicz & Struzik

Abstract

Entity resolution (probabilistic record linkage, deduplication) is essential for estimation based on multiple sources. It aims to link records that refer to the same entity (e.g., a person or a company) when no common identifier is available. Without identifiers, researchers must specify which pairs of records to compare, both to compute match probabilities and to keep the computational cost manageable. Traditional deterministic blocking uses common variables such as the first letters of names, sex, or date of birth, but it assumes that these variables are complete and free of errors. To address this limitation, we developed the R package blocking, which uses approximate nearest neighbor search and graph algorithms to reduce the number of comparisons. This paper presents the package design, its functionalities, and two case studies.
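
The contrast drawn in the abstract can be illustrated with a small, language-agnostic sketch. It is written in Python and does not use the blocking package or its API. Everything in it is an assumption made for illustration: the five toy records, the first-letter blocking key, the character-bigram Jaccard similarity, the brute-force neighbor search (a stand-in for a real approximate nearest neighbor index), and the 0.5 similarity threshold. Records are linked to their nearest neighbor, and the connected components of the resulting graph are treated as blocks.

```python
from collections import defaultdict
from itertools import combinations

# Toy records (hypothetical); records 0/1 and 2/3 are the same entities with typos.
records = [
    "kowalski jan",
    "cowalski jan",      # typo in the first letter
    "nowak anna",
    "nowak ana",         # typo later in the string
    "zielinski piotr",
]

def pairs_within(blocks):
    """All unordered record pairs that share a block."""
    return {p for ids in blocks for p in combinations(sorted(ids), 2)}

def deterministic_blocks(recs):
    """Deterministic blocking: the key is the first letter of the record."""
    blocks = defaultdict(list)
    for i, r in enumerate(recs):
        blocks[r[0]].append(i)
    return list(blocks.values())

def bigrams(s):
    """Set of character 2-grams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def neighbor_blocks(recs, min_sim=0.5):
    """Link each record to its most similar record, then take connected
    components (union-find) as blocks. The search is brute force here;
    min_sim is an illustrative threshold, not a package default."""
    grams = [bigrams(r) for r in recs]
    parent = list(range(len(recs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(recs)):
        sim, j = max((jaccard(grams[i], grams[k]), k)
                     for k in range(len(recs)) if k != i)
        if sim >= min_sim:
            parent[find(i)] = find(j)  # add edge i-j to the graph

    blocks = defaultdict(list)
    for i in range(len(recs)):
        blocks[find(i)].append(i)
    return list(blocks.values())

all_pairs = len(records) * (len(records) - 1) // 2
det_pairs = pairs_within(deterministic_blocks(records))
ann_pairs = pairs_within(neighbor_blocks(records))

print("all pairs:", all_pairs)
print("deterministic candidate pairs:", sorted(det_pairs))
print("neighbor-based candidate pairs:", sorted(ann_pairs))
```

On these toy records, the first-letter key places "kowalski jan" and "cowalski jan" in different blocks, so that pair is never compared, while the similarity-based blocks keep both typo pairs and still compare far fewer than all pairs.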

Citation

Beręsewicz, M. & Struzik, A. (2025). blocking: An R Package for Blocking of Records for Record Linkage and Deduplication. Submitted to The R Journal.

BibTeX

@misc{beresewicz2025blockingr,
      title={blocking: An R Package for Blocking of Records for Record Linkage and Deduplication}, 
      author={Beręsewicz, Maciej  and Struzik, Adam},
      year={2025},
      url={https://github.com/ncn-foreigners/paper-blocking-rpkg/blob/main/paper/paper-blocking.pdf}, 
}