Function computes estimators for false positive rate (FPR) and false negative rate (FNR) due to blocking in record linkage, as proposed by Dasylva and Goussanou (2021). Assumes duplicate-free data sources, complete coverage of the reference data set and blocking decisions based solely on record pairs.
Usage
est_block_error(
x = NULL,
y = NULL,
blocking_result = NULL,
n = NULL,
N = NULL,
G,
alpha = NULL,
p = NULL,
lambda = NULL,
tol = 10^(-4),
maxiter = 100,
sample_size = NULL
)Arguments
- x
Reference data (required if
nandNare not provided).- y
Query data (required if
nis not provided).- blocking_result
data.frameordata.tablecontaining blocking results (required ifnis not provided).- n
Integer vector of numbers of accepted pairs formed by each record in the query data set with records in the reference data set, based on blocking criteria (if
NULL, derived fromblocking_result).- N
Total number of records in the reference data set (if
NULL, derived aslength(x)).- G
Number of classes in the finite mixture model.
- alpha
Numeric vector of initial class proportions (length
G; ifNULL, initialized asrep(1/G, G)).- p
Numeric vector of initial matching probabilities in each class of the mixture model (length
G; ifNULL, randomly initialized fromrunif(G, 0.5, 1)).- lambda
Numeric vector of initial Poisson distribution parameters for non-matching records in each class of the mixture model (length
G; ifNULL, randomly initialized fromrunif(G, 0.1, 2)).- tol
Convergence tolerance for the EM algorithm (default
10^(-6)).- maxiter
Maximum number of iterations for the EM algorithm (default
1000).- sample_size
Bootstrap sample (from
n) size used for calculations (ifNULL, uses all data).
Value
Returns a list containing:
FPR– estimated false positive rate,FNR– estimated false negative rate,iter– number of the EM algorithm iterations performed,convergence– logical, indicating whether the EM algorithm converged withinmaxiteriterations.
Details
Consider a large finite population that comprises of \(N\) individuals, and two duplicate-free data sources: a register and a file.
Assume that the register has no undercoverage,
i.e. each record from the file corresponds to exactly one record from the same individual in the register.
Let \(n_i\) denote the number of register records which form an accepted (by the blocking criteria) pair with
record \(i\) on the file. Assume that:
two matched records are neighbours with a probability that is bounded away from \(0\) regardless of \(N\),
two unmatched records are accidental neighbours with a probability of \(O(\frac{1}{N})\).
The finite mixture model \(n_i \sim \sum_{g=1}^G \alpha_g(\text{Bernoulli}(p_g) \ast \text{Poisson}(\lambda_g))\) is assumed. When \(G\) is fixed, the unknown model parameters are given by the vector \(\psi = [(\alpha_g, p_g, \lambda_g)]_{1 \leq g \leq G}\) that may be estimated with the Expectation-Maximization (EM) procedure.
Let \(n_i = n_{i|M} + n_{i|U}\), where \(n_{i|M}\) is the number of matched neighbours and \(n_{i|U}\) is the number of unmatched neighbours, and let \(c_{ig}\) denote the indicator that record \(i\) is from class \(g\). For the E-step of the EM procedure, the equations are as follows $$ \begin{aligned} P(n_i | c_{ig} = 1) &= I(n_i = 0)(1-p_g)e^{-\lambda_g}+I(n_i > 0)\Bigl(p_g+(1-p_g)\frac{\lambda_g}{n_i}\Bigr)\frac{e^{-\lambda_g}\lambda_g^{n_i-1}}{(n_i-1)!}, \\ P(c_{ig} = 1 | n_i) &= \frac{\alpha_gP(n_i | c_{ig} = 1)}{\sum_{g'=1}^G\alpha_{g'}P(n_i | c_{ig'} = 1)}, \\ P(n_{i|M} = 1 | n_i,c_{ig} = 1) &= \frac{p_gn_i}{p_gn_i + (1-p_g)\lambda_g}, \\ P(n_{i|U} = n_i | n_i,c_{ig} = 1) &= I(n_i = 0) + I(n_i > 0)\frac{(1-p_g)\lambda_g}{p_gn_i + (1-p_g)\lambda_g}, \\ P(n_{i|U} = n_i-1 | n_i,c_{ig} = 1) &= \frac{p_gn_i}{p_gn_i + (1-p_g)\lambda_g}, \\ E[c_{ig}n_{i|M} | n_i] &= P(c_{ig} = 1 | n_i)P(n_{i|M} = 1 | n_i,c_{ig} = 1), \\ E[n_{i|U} | n_i,c_{ig} = 1] &= \Bigl(\frac{p_g(n_i-1) + (1-p_g)\lambda_g}{p_gn_i + (1-p_g)\lambda_g}\Bigr)n_i, \\ E[c_{ig}n_{i|U} | n_i] &= P(c_{ig} = 1 | n_i)E[n_{i|U} | n_i,c_{ig} = 1]. \end{aligned} $$ The M-step is given by following equations $$ \begin{aligned} \hat{p}_g &= \frac{\sum_{i=1}^mE[c_{ig}n_{i|M} | n_i;\psi]}{\sum_{i=1}^mE[c_{ig} | n_i; \psi]}, \\ \hat{\lambda}_g &= \frac{\sum_{i=1}^mE[c_{ig}n_{i|U} | n_i; \psi]}{\sum_{i=1}^mE[c_{ig} | n_i; \psi]}, \\ \hat{\alpha}_g &= \frac{1}{m}\sum_{i=1}^mE[c_{ig} | n_i; \psi]. \end{aligned} $$ As \(N \to \infty\), the error rates and the model parameters are related as follows $$ \begin{aligned} \text{FNR} &\xrightarrow{p} 1 - E[p(v_i)], \\ (N-1)\text{FPR} &\xrightarrow{p} E[\lambda(v_i)], \end{aligned} $$ where \(E[p(v_i)] = \sum_{g=1}^G\alpha_gp_g\) and \(E[\lambda(v_i)] = \sum_{g=1}^G\alpha_g\lambda_g\).
References
Dasylva, A., Goussanou, A. (2021). Estimating the false negatives due to blocking in record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 47, No. 2.
Dasylva, A., Goussanou, A. (2022). On the consistent estimation of linkage errors without training data. Jpn J Stat Data Sci 5, 181–216. doi:10.1007/s42081-022-00153-3
Examples
## an example proposed by Dasylva and Goussanou (2021)
## we obtain results very close to those reported in the paper
set.seed(111)
neighbors <- rep(0:5, c(1659, 53951, 6875, 603, 62, 5))
errors <- est_block_error(n = neighbors,
N = 63155,
G = 2,
tol = 10^(-3),
maxiter = 50)
errors
#> FPR: 2.137919e-06
#> FNR: 0.03006997
#> ========================================================
#> EM algorithm converged successfully within 9 iterations.