Unsupervised Maximum Entropy Classifier for Record Linkage
mec.Rd
Implements an unsupervised maximum entropy classification algorithm for record linkage. The algorithm iteratively estimates probability or density ratios and uses them to classify record pairs into matches and non-matches based on their comparison vectors.
Usage
mec(
  A,
  B,
  variables,
  comparators = NULL,
  methods = NULL,
  start_params = NULL,
  nonpar_hurdle = FALSE,
  set_construction = NULL,
  target_flr = 0.03,
  max_iter_bisection = 100,
  tol = 0.005,
  delta = 0.5,
  eps = 0.05,
  controls_nleqslv = list(),
  controls_kliep = control_kliep(),
  true_matches = NULL
)
Arguments
- A
A duplicate-free data.frame or data.table.
- B
A duplicate-free data.frame or data.table.
- variables
A character vector of key variables used to create comparison vectors.
- comparators
A named list of functions for comparing pairs of records.
- methods
A named list of methods used for estimation ("binary", "continuous_parametric" or "continuous_nonparametric").
- start_params
Start parameters for the "binary" and "continuous_parametric" methods.
- nonpar_hurdle
Logical indicating whether to use a hurdle model (used only if the "continuous_nonparametric" method has been chosen for at least one variable).
- set_construction
A method for constructing the predicted set of matches ("size" or "flr").
- target_flr
The target false link rate (FLR) (used only if set_construction == "flr").
- max_iter_bisection
The maximum number of iterations for the bisection procedure (used only if set_construction == "flr").
- tol
Error tolerance in the bisection procedure (used only if set_construction == "flr").
- delta
A numeric value specifying the tolerance for the change in the estimated number of matches between iterations.
- eps
A numeric value specifying the tolerance for the change in model parameters between iterations.
- controls_nleqslv
Controls passed to the nleqslv function (only if the "continuous_parametric" method has been chosen for at least one variable).
- controls_kliep
Controls passed to the kliep function (only if the "continuous_nonparametric" method has been chosen for at least one variable).
- true_matches
A data.frame or data.table indicating known matches.
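The arguments target_flr, max_iter_bisection and tol all refer to a bisection procedure. The following base R sketch illustrates the general idea only and is not the package's implementation: it assumes a hypothetical vector g of match probabilities for candidate pairs, estimates the FLR of a selected set as the mean of 1 - g, and treats tol as the width of the threshold interval. The helper names estimated_flr and bisect_threshold are made up for this illustration.

```r
# Hypothetical match probabilities for candidate record pairs (illustrative only).
g <- c(0.99, 0.98, 0.97, 0.90, 0.60, 0.40, 0.10, 0.05)

# Estimated FLR of the set of pairs whose probability is at least `threshold`.
# Assumption: FLR is estimated as the average of (1 - g) over the selected pairs.
estimated_flr <- function(threshold, g) {
  selected <- g[g >= threshold]
  if (length(selected) == 0) return(0)
  mean(1 - selected)
}

# Bisection on the threshold. The estimated FLR does not increase as the
# threshold grows, so the interval [lower, upper] can be halved until it is
# narrower than `tol` or `max_iter` iterations have been used.
bisect_threshold <- function(g, target_flr = 0.03, tol = 0.005, max_iter = 100) {
  lower <- 0  # the FLR may exceed the target at this end
  upper <- 1  # the FLR is at most the target at this end
  iter <- 0
  while (upper - lower > tol && iter < max_iter) {
    mid <- (lower + upper) / 2
    if (estimated_flr(mid, g) > target_flr) {
      lower <- mid
    } else {
      upper <- mid
    }
    iter <- iter + 1
  }
  list(threshold = upper, flr = estimated_flr(upper, g), iter = iter)
}

res <- bisect_threshold(g)
print(res)
```

In this sketch a smaller tol gives a threshold closer to the boundary at the cost of more iterations, and max_iter only acts as a safeguard.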
Value
Returns a list containing:
- M_est – a data.table with predicted matches,
- n_M_est – estimated classification set size,
- flr_est – estimated false link rate (FLR),
- mmr_est – estimated missing match rate (MMR),
- iter_bisection – the number of iterations in the bisection procedure,
- b_vars – a character vector of variables used for the "binary" method (with the prefix "gamma_"),
- cpar_vars – a character vector of variables used for the "continuous_parametric" method (with the prefix "gamma_"),
- cnonpar_vars – a character vector of variables used for the "continuous_nonparametric" method (with the prefix "gamma_"),
- b_params – parameters estimated using the "binary" method,
- cpar_params – parameters estimated using the "continuous_parametric" method,
- ratio_kliep – a result of the kliep function,
- variables – a character vector of key variables used for comparison,
- set_construction – a method for constructing the predicted set of matches,
- eval_metrics – metrics for quality assessment (if true_matches is provided),
- confusion – confusion matrix (if true_matches is provided).
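As an illustration of what eval_metrics and confusion summarise, the base R sketch below computes a confusion table and the usual metrics from a set of predicted pairs and a set of known matches. It assumes the standard textbook definitions, which may differ in detail from what the package computes. The objects pred, truth, n_A and n_B are hypothetical toy inputs, and each pair is assumed to appear at most once.

```r
# Hypothetical predicted pairs and known matches (row indices in A and B).
pred  <- data.frame(a = c(1, 2, 3, 5), b = c(1, 2, 4, 5))
truth <- data.frame(a = 1:4, b = 1:4)
n_A <- 5  # assumed number of records in A
n_B <- 6  # assumed number of records in B

# Encode each pair as a single key so that the two sets can be compared.
pred_keys  <- paste(pred$a, pred$b)
truth_keys <- paste(truth$a, truth$b)

tp <- sum(pred_keys %in% truth_keys)       # predicted and true
fp <- length(pred_keys) - tp               # predicted but not true
fn <- sum(!(truth_keys %in% pred_keys))    # true but not predicted
tn <- n_A * n_B - tp - fp - fn             # all remaining pairs

confusion <- matrix(c(tp, fp, fn, tn), nrow = 2,
                    dimnames = list(actual = c("match", "non-match"),
                                    predicted = c("match", "non-match")))

# Metrics in percentages, using the standard definitions.
recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)
metrics <- 100 * c(
  recall      = recall,
  precision   = precision,
  fpr         = fp / (fp + tn),
  fnr         = fn / (fn + tp),
  accuracy    = (tp + tn) / (n_A * n_B),
  specificity = tn / (tn + fp),
  f1_score    = 2 * precision * recall / (precision + recall)
)

print(confusion)
print(round(metrics, 2))
```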
References
Lee, D., Zhang, L.-C. and Kim, J. K. (2022). Maximum entropy classification for record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 48, No. 1.
Vo, T. H., Chauvet, G., Happe, A., Oger, E., Paquelet, S., and Garès, V. (2023). Extending the Fellegi-Sunter record linkage model for mixed-type data with application to the French national health data system. Computational Statistics & Data Analysis, 179, 107656.
Sugiyama, M., Suzuki, T., Nakajima, S. et al. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699–746. doi:10.1007/s10463-008-0197-x
Examples
df_1 <- data.frame(
  name = c("Emma", "Liam", "Olivia", "Noah", "Ava",
           "Ethan", "Sophia", "Mason", "Isabella", "James"),
  surname = c("Smith", "Johnson", "Williams", "Brown", "Jones",
              "Garcia", "Miller", "Davis", "Rodriguez", "Wilson"),
  city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix",
           "Philadelphia", "San Antonio", "San Diego", "Dallas", "San Jose")
)
df_2 <- data.frame(
  name = c(
    "Emma", "Liam", "Olivia", "Noah",
    "Ava", "Ehtan", "Sopia", "Mson",
    "Charlotte", "Benjamin", "Amelia", "Lucas"
  ),
  surname = c(
    "Smith", "Johnson", "Williams", "Brown",
    "Jnes", "Garca", "Miler", "Dvis",
    "Martinez", "Lee", "Hernandez", "Clark"
  ),
  city = c(
    "New York", "Los Angeles", "Chicago", "Houston",
    "Phonix", "Philadelpia", "San Antnio", "San Dieg",
    "Seattle", "Miami", "Boston", "Denver"
  )
)
true_matches <- data.frame(
  "a" = 1:8,
  "b" = 1:8
)
variables <- c("name", "surname", "city")
comparators <- list(
  "name" = jarowinkler_complement(),
  "surname" = jarowinkler_complement(),
  "city" = jarowinkler_complement()
)
methods <- list(
  "name" = "continuous_parametric",
  "surname" = "continuous_parametric",
  "city" = "continuous_parametric"
)
set.seed(1)
result <- mec(A = df_1, B = df_2,
              variables = variables,
              comparators = comparators,
              methods = methods,
              true_matches = true_matches)
result
#> Record linkage based on the following variables: name, surname, city.
#> ========================================================
#> The algorithm predicted 8 matches.
#> The first 6 predicted matches are:
#>        a     b ratio / 1000
#>    <num> <num>        <num>
#> 1:     6     6 1.433031e+08
#> 2:     8     8 3.198692e+07
#> 3:     7     7 9.673745e+05
#> 4:     5     5 2.813745e+04
#> 5:     1     1 3.375000e+00
#> 6:     2     2 3.375000e+00
#> ========================================================
#> The construction of the classification set was based on estimates of its size.
#> Estimated false link rate (FLR): 0.2066 %.
#> Estimated missing match rate (MMR): 0.0000 %.
#> ========================================================
#> Variables selected for the continuous parametric method: name, surname, city.
#> Estimated parameters for the continuous parametric method:
#>         variable p_0_M    alpha_M   beta_M      p_0_U  alpha_U    beta_U
#>           <char> <num>      <num>    <num>      <num>    <num>     <num>
#> 1:    gamma_name 0.625 138.462279 2199.107 0.04166667 6.516736 11.173089
#> 2: gamma_surname 0.500 120.665706 1974.530 0.03333333 4.622775  7.167261
#> 3:    gamma_city 0.500   6.512723  135.163 0.03333333 5.233194  9.313035
#> ========================================================
#> Evaluation metrics (presented in percentages):
#>      recall   precision         fpr         fnr    accuracy specificity
#>         100         100           0           0         100         100
#>    f1_score
#>         100
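To make the notion of a comparison vector more concrete, the following base R sketch builds one for every pair of records in two tiny tables. It is a simplified illustration and not the package's code: instead of jarowinkler_complement() it uses a hypothetical comparator norm_lev, a normalised Levenshtein distance based on utils::adist, where 0 means identical strings. The "gamma_" prefix follows the naming described in the Value section.

```r
# Two tiny hypothetical tables (a subset of the values used in the example above).
a <- data.frame(name = c("Ethan", "Sophia"),
                city = c("Philadelphia", "San Antonio"),
                stringsAsFactors = FALSE)
b <- data.frame(name = c("Ehtan", "Sopia", "Lucas"),
                city = c("Philadelpia", "San Antnio", "Denver"),
                stringsAsFactors = FALSE)

# Hypothetical comparator: Levenshtein distance divided by the longer string
# length, computed element by element (0 = identical, 1 = entirely different).
norm_lev <- function(x, y) {
  d <- mapply(function(s, t) utils::adist(s, t)[1, 1], x, y, USE.NAMES = FALSE)
  d / pmax(nchar(x), nchar(y))
}

# All pairs of row indices from a and b.
pairs <- expand.grid(a = seq_len(nrow(a)), b = seq_len(nrow(b)))

# One comparison column per key variable, named with the "gamma_" prefix.
for (v in c("name", "city")) {
  pairs[[paste0("gamma_", v)]] <- norm_lev(a[[v]][pairs$a], b[[v]][pairs$b])
}

print(pairs)
```

Pairs that refer to the same entity, such as "Ethan" and "Ehtan", end up with small values in every column, which is the kind of signal a classifier of comparison vectors can exploit.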