Function computes estimators for false positive rate (FPR) and false negative rate (FNR) due to blocking in record linkage, as proposed by Dasylva and Goussanou (2021). Assumes duplicate-free data sources, complete coverage of the reference data set and blocking decisions based solely on record pairs.
Usage
est_block_error(
x = NULL,
y = NULL,
blocking_result = NULL,
n = NULL,
N = NULL,
G,
alpha = NULL,
p = NULL,
lambda = NULL,
tol = 10^(-4),
maxiter = 100,
sample_size = NULL
)
Arguments
- x
Reference data (required if
n
andN
are not provided).- y
Query data (required if
n
is not provided).- blocking_result
data.frame
ordata.table
containing blocking results (required ifn
is not provided).- n
Integer vector of numbers of accepted pairs formed by each record in the query data set with records in the reference data set, based on blocking criteria (if
NULL
, derived fromblocking_result
).- N
Total number of records in the reference data set (if
NULL
, derived aslength(x)
).- G
Number of classes in the finite mixture model.
- alpha
Numeric vector of initial class proportions (length
G
; ifNULL
, initialized asrep(1/G, G)
).- p
Numeric vector of initial matching probabilities in each class of the mixture model (length
G
; ifNULL
, randomly initialized fromrunif(G, 0.5, 1)
).- lambda
Numeric vector of initial Poisson distribution parameters for non-matching records in each class of the mixture model (length
G
; ifNULL
, randomly initialized fromrunif(G, 0.1, 2)
).- tol
Convergence tolerance for the EM algorithm (default
10^(-6)
).- maxiter
Maximum number of iterations for the EM algorithm (default
1000
).- sample_size
Bootstrap sample (from
n
) size used for calculations (ifNULL
, uses all data).
Value
Returns a list containing:
FPR
– estimated false positive rate,FNR
– estimated false negative rate,iter
– number of the EM algorithm iterations performed,convergence
– logical, indicating whether the EM algorithm converged withinmaxiter
iterations.
Details
Consider a large finite population that comprises of \(N\) individuals, and two duplicate-free data sources: a register and a file.
Assume that the register has no undercoverage,
i.e. each record from the file corresponds to exactly one record from the same individual in the register.
Let \(n_i\) denote the number of register records which form an accepted (by the blocking criteria) pair with
record \(i\) on the file. Assume that:
two matched records are neighbours with a probability that is bounded away from \(0\) regardless of \(N\),
two unmatched records are accidental neighbours with a probability of \(O(\frac{1}{N})\).
The finite mixture model \(n_i \sim \sum_{g=1}^G \alpha_g(\text{Bernoulli}(p_g) \ast \text{Poisson}(\lambda_g))\) is assumed. When \(G\) is fixed, the unknown model parameters are given by the vector \(\psi = [(\alpha_g, p_g, \lambda_g)]_{1 \leq g \leq G}\) that may be estimated with the Expectation-Maximization (EM) procedure.
Let \(n_i = n_{i|M} + n_{i|U}\), where \(n_{i|M}\) is the number of matched neighbours and \(n_{i|U}\) is the number of unmatched neighbours, and let \(c_{ig}\) denote the indicator that record \(i\) is from class \(g\). For the E-step of the EM procedure, the equations are as follows $$ \begin{aligned} P(n_i | c_{ig} = 1) &= I(n_i = 0)(1-p_g)e^{-\lambda_g}+I(n_i > 0)\Bigl(p_g+(1-p_g)\frac{\lambda_g}{n_i}\Bigr)\frac{e^{-\lambda_g}\lambda_g^{n_i-1}}{(n_i-1)!}, \\ P(c_{ig} = 1 | n_i) &= \frac{\alpha_gP(n_i | c_{ig} = 1)}{\sum_{g'=1}^G\alpha_{g'}P(n_i | c_{ig'} = 1)}, \\ P(n_{i|M} = 1 | n_i,c_{ig} = 1) &= \frac{p_gn_i}{p_gn_i + (1-p_g)\lambda_g}, \\ P(n_{i|U} = n_i | n_i,c_{ig} = 1) &= I(n_i = 0) + I(n_i > 0)\frac{(1-p_g)\lambda_g}{p_gn_i + (1-p_g)\lambda_g}, \\ P(n_{i|U} = n_i-1 | n_i,c_{ig} = 1) &= \frac{p_gn_i}{p_gn_i + (1-p_g)\lambda_g}, \\ E[c_{ig}n_{i|M} | n_i] &= P(c_{ig} = 1 | n_i)P(n_{i|M} = 1 | n_i,c_{ig} = 1), \\ E[n_{i|U} | n_i,c_{ig} = 1] &= \Bigl(\frac{p_g(n_i-1) + (1-p_g)\lambda_g}{p_gn_i + (1-p_g)\lambda_g}\Bigr)n_i, \\ E[c_{ig}n_{i|U} | n_i] &= P(c_{ig} = 1 | n_i)E[n_{i|U} | n_i,c_{ig} = 1]. \end{aligned} $$ The M-step is given by following equations $$ \begin{aligned} \hat{p}_g &= \frac{\sum_{i=1}^mE[c_{ig}n_{i|M} | n_i;\psi]}{\sum_{i=1}^mE[c_{ig} | n_i; \psi]}, \\ \hat{\lambda}_g &= \frac{\sum_{i=1}^mE[c_{ig}n_{i|U} | n_i; \psi]}{\sum_{i=1}^mE[c_{ig} | n_i; \psi]}, \\ \hat{\alpha}_g &= \frac{1}{m}\sum_{i=1}^mE[c_{ig} | n_i; \psi]. \end{aligned} $$ As \(N \to \infty\), the error rates and the model parameters are related as follows $$ \begin{aligned} \text{FNR} &\xrightarrow{p} 1 - E[p(v_i)], \\ (N-1)\text{FPR} &\xrightarrow{p} E[\lambda(v_i)], \end{aligned} $$ where \(E[p(v_i)] = \sum_{g=1}^G\alpha_gp_g\) and \(E[\lambda(v_i)] = \sum_{g=1}^G\alpha_g\lambda_g\).
References
Dasylva, A., Goussanou, A. (2021). Estimating the false negatives due to blocking in record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 47, No. 2.
Dasylva, A., Goussanou, A. (2022). On the consistent estimation of linkage errors without training data. Jpn J Stat Data Sci 5, 181–216. doi:10.1007/s42081-022-00153-3
Examples
## an example proposed by Dasylva and Goussanou (2021)
## we obtain results very close to those reported in the paper
set.seed(111)
neighbors <- rep(0:5, c(1659, 53951, 6875, 603, 62, 5))
errors <- est_block_error(n = neighbors,
N = 63155,
G = 2,
tol = 10^(-3),
maxiter = 50)
errors
#> FPR: 2.137919e-06
#> FNR: 0.03006997
#> ========================================================
#> EM algorithm converged successfully within 9 iterations.