Estimate errors due to blocking in record linkage — est_block

Function computes estimators for false positive rate (FPR) and false negative rate (FNR) due to blocking in record linkage, as proposed by Dasylva and Goussanou (2021). Assumes duplicate-free data sources, complete coverage of the reference data set and blocking decisions based solely on record pairs.

Usage

est_block_error(
  x = NULL,
  y = NULL,
  blocking_result = NULL,
  n = NULL,
  N = NULL,
  G,
  alpha = NULL,
  p = NULL,
  lambda = NULL,
  tol = 10^(-4),
  maxiter = 100,
  sample_size = NULL
)

Arguments

x: Reference data (required if n and N are not provided).
y: Query data (required if n is not provided).
blocking_result: data.frame or data.table containing blocking results (required if n is not provided).
n: Integer vector of numbers of accepted pairs formed by each record in the query data set with records in the reference data set, based on blocking criteria (if NULL, derived from blocking_result).
N: Total number of records in the reference data set (if NULL, derived as length(x)).
G: Number of classes in the finite mixture model.
alpha: Numeric vector of initial class proportions (length G; if NULL, initialized as rep(1/G, G)).
p: Numeric vector of initial matching probabilities in each class of the mixture model (length G; if NULL, randomly initialized from runif(G, 0.5, 1)).
lambda: Numeric vector of initial Poisson distribution parameters for non-matching records in each class of the mixture model (length G; if NULL, randomly initialized from runif(G, 0.1, 2)).
tol: Convergence tolerance for the EM algorithm (default 10^(-6)).
maxiter: Maximum number of iterations for the EM algorithm (default 1000).
sample_size: Bootstrap sample (from n) size used for calculations (if NULL, uses all data).

Value

Returns a list containing:

FPR – estimated false positive rate,
FNR – estimated false negative rate,
iter – number of the EM algorithm iterations performed,
convergence – logical, indicating whether the EM algorithm converged within maxiter iterations.

Details

Consider a large finite population that comprises of $N$ individuals, and two duplicate-free data sources: a register and a file. Assume that the register has no undercoverage, i.e. each record from the file corresponds to exactly one record from the same individual in the register. Let $n_i$ denote the number of register records which form an accepted (by the blocking criteria) pair with record $i$ on the file. Assume that:

two matched records are neighbours with a probability that is bounded away from $0$ regardless of $N$,
two unmatched records are accidental neighbours with a probability of $O(\frac{1}{N})$.

The finite mixture model $n_i \sim \sum_{g=1}^G \alpha_g(\text{Bernoulli}(p_g) \ast \text{Poisson}(\lambda_g))$ is assumed. When $G$ is fixed, the unknown model parameters are given by the vector $\psi = [(\alpha_g, p_g, \lambda_g)]_{1 \leq g \leq G}$ that may be estimated with the Expectation-Maximization (EM) procedure.

Let $n_i = n_{i|M} + n_{i|U}$, where $n_{i|M}$ is the number of matched neighbours and $n_{i|U}$ is the number of unmatched neighbours, and let $c_{ig}$ denote the indicator that record $i$ is from class $g$. For the E-step of the EM procedure, the equations are as follows $$ \begin{aligned} P(n_i | c_{ig} = 1) &= I(n_i = 0)(1-p_g)e^{-\lambda_g}+I(n_i > 0)\Bigl(p_g+(1-p_g)\frac{\lambda_g}{n_i}\Bigr)\frac{e^{-\lambda_g}\lambda_g^{n_i-1}}{(n_i-1)!}, \\ P(c_{ig} = 1 | n_i) &= \frac{\alpha_gP(n_i | c_{ig} = 1)}{\sum_{g'=1}^G\alpha_{g'}P(n_i | c_{ig'} = 1)}, \\ P(n_{i|M} = 1 | n_i,c_{ig} = 1) &= \frac{p_gn_i}{p_gn_i + (1-p_g)\lambda_g}, \\ P(n_{i|U} = n_i | n_i,c_{ig} = 1) &= I(n_i = 0) + I(n_i > 0)\frac{(1-p_g)\lambda_g}{p_gn_i + (1-p_g)\lambda_g}, \\ P(n_{i|U} = n_i-1 | n_i,c_{ig} = 1) &= \frac{p_gn_i}{p_gn_i + (1-p_g)\lambda_g}, \\ E[c_{ig}n_{i|M} | n_i] &= P(c_{ig} = 1 | n_i)P(n_{i|M} = 1 | n_i,c_{ig} = 1), \\ E[n_{i|U} | n_i,c_{ig} = 1] &= \Bigl(\frac{p_g(n_i-1) + (1-p_g)\lambda_g}{p_gn_i + (1-p_g)\lambda_g}\Bigr)n_i, \\ E[c_{ig}n_{i|U} | n_i] &= P(c_{ig} = 1 | n_i)E[n_{i|U} | n_i,c_{ig} = 1]. \end{aligned} $$ The M-step is given by following equations $$ \begin{aligned} \hat{p}_g &= \frac{\sum_{i=1}^mE[c_{ig}n_{i|M} | n_i;\psi]}{\sum_{i=1}^mE[c_{ig} | n_i; \psi]}, \\ \hat{\lambda}_g &= \frac{\sum_{i=1}^mE[c_{ig}n_{i|U} | n_i; \psi]}{\sum_{i=1}^mE[c_{ig} | n_i; \psi]}, \\ \hat{\alpha}_g &= \frac{1}{m}\sum_{i=1}^mE[c_{ig} | n_i; \psi]. \end{aligned} $$ As $N \to \infty$, the error rates and the model parameters are related as follows $$ \begin{aligned} \text{FNR} &\xrightarrow{p} 1 - E[p(v_i)], \\ (N-1)\text{FPR} &\xrightarrow{p} E[\lambda(v_i)], \end{aligned} $$ where $E[p(v_i)] = \sum_{g=1}^G\alpha_gp_g$ and $E[\lambda(v_i)] = \sum_{g=1}^G\alpha_g\lambda_g$.

References

Dasylva, A., Goussanou, A. (2021). Estimating the false negatives due to blocking in record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 47, No. 2.

Dasylva, A., Goussanou, A. (2022). On the consistent estimation of linkage errors without training data. Jpn J Stat Data Sci 5, 181–216. doi:10.1007/s42081-022-00153-3

Examples

## an example proposed by Dasylva and Goussanou (2021)
## we obtain results very close to those reported in the paper

set.seed(111)

neighbors <- rep(0:5, c(1659, 53951, 6875, 603, 62, 5))

errors <- est_block_error(n = neighbors,
                          N = 63155,
                          G = 2,
                          tol = 10^(-3),
                          maxiter = 50)

errors
#> FPR:  2.137919e-06 
#> FNR:  0.03006997 
#> ========================================================
#> EM algorithm converged successfully within 9 iterations.