nonprobsvy 0.3.0
- changed the default of
num_bootincontrol_inf()from 500 to 100 andnlambdaincontrol_out()from 100 to 50 to align with the documentation - vignette based on the JSS paper added
- documentation notation adjusted to match the JSS paper
- we thank the authors of StatsClaw.ai, the tool that allowed us to identify bugs and improve the code
- documented and stabilised the doubly robust analytic variance for non-logit propensity links: the plug-in doubly robust variance (Chen, Li & Wu 2020, Theorem 2, eq. 14) is derived under the logistic model, where the probability-sample (design) variance factor
piis bounded by 1; forprobit(inverse-Mills factorphi / (1 - pi)) andcloglog(factorlog(1 - pi)==-exp(eta)) it is conservative (over-estimates the standard error) and previously could overflow to a huge or non-finite SE when a fitted propensity approached 0 or 1 (in a small-propensity simulation the probit analytic SE reached ~1e27). The non-logit doubly robust variance terms (t_vec,var_nonprob, and theb-vector weightspsd / pi^2and(1 - pi) / pi^2 * exp(eta)) now floor the fitted propensity away from{0, 1}at1 / sqrt(N)so they stay finite, a one-time note recommendscontrol_inf(var_method = "bootstrap")for probit/cloglog doubly robust inference, and the documentation states the conservativeness. The logistic path is left untouched (its standard error is unchanged). The previously commented-out probit/cloglog doubly robust tests are re-enabled (point estimates pinned, analytic SE asserted finite), validated by a coverage simulation in which logit stays at nominal coverage (~0.95, SE/SD ~1) and the probit/cloglog SE no longer explodes - added input validation at the boundary: a factor or character outcome/target now raises a clear error (instead of silently returning
NA), and the not-yet-implemented overlap controlscontrol_sel(dependence = TRUE)/key = ...now error with a clear message instead of being silently accepted - fixed a crash in IPW/DR variable selection on the population-totals-only path (no
svydesign): once cross-validation dropped a covariate, the response was re-extracted by routing the still-full outcome formula through the population-totals model frame, whose name check then failed with"Selection and population totals have different names."; the response is now read directly from the data, independent of the reduced totals. A warning is also emitted flagging this path as experimental, because the penalty (lambda) selection is unreliable here and can collapse to an over-sparse (even intercept-only) model that biases the estimate – supplying a unit-levelsvydesignor a fixedcontrol_sel(lambda = ...)is recommended instead - fixed the cross-validation loss for IPW variable selection in the population-totals-only path: the known population totals were normalised by
sum(weights)(which equals the non-probability sample size when no probability sample is present) instead ofsum(1/pi_hat)(the HT estimate ofN), mis-scaling the calibration target by ~N/nand mis-directing the penalty (lambda) selection; the totals term now uses the sameN_nonsnormalisation as the estimating equation - stabilised the nonparametric (
npar) mass-imputation analytic variance: the loess inclusion-propensity proxy can return improper fitted values (<= 0or near zero), which made the1/pi_hat^2weight spurious or non-finite; the propensity is now floored at1/sqrt(N)(so each unit’s inverse-propensity weight is capped atN) before inverting - corrected the documented general
c-vector formula in the GLM mass-imputation analytic variance (method_glm()): the equation had been transcribed from Kim et al. (2021, eq. 9) in that paper’s labelling, where sampleBis the non-probability sample andAthe probability sample – the opposite of this package’s convention (A= non-probability). The sample roles in the general formula are now swapped to match the package convention, the linear special case, and the (already correct) implementation; this is a documentation-only fix with no change to computed variances - documentation polishing: corrected the documented
epsilondefaults incontrol_out()(1e-8) andcontrol_sel()(1e-4) to match the code, fixed avar_tot->var_totalreturn-value name in themethod_nn()documentation, removed a stray character in themethod_npar()family_outcomeparameter doc, and dropped a redundantsum()in the printed IPW-weights total - validated that a supplied
pop_totalsis a named vector whose first element is(Intercept), erroring early with a clear message instead of silently producing anNApopulation size andNaNstandard errors - fixed
confint()to honor the requestedlevelregardless of the fit’scontrol_inf(alpha); previouslyconfint(level = 0.95)returned the stored interval built with the fit’salpha, so a model fitted with e.g.alpha = 0.1had its 90% interval mislabelled as 95% - enabled the Rcpp variable-selection non-convergence warning (previously dead code) and fixed an off-by-one so the solver runs the full
maxititerations; the warning now fires underverbose = TRUE - corrected the displayed mass-imputation GLM
V_1variance formula in the documentation (squared residual and squared bracket); the implementation was already correct - fixed the doubly robust analytic standard error under
control_inf(bias_correction = TRUE): the non-probability variance component (Yang-Kim-Song 2020, eq. 25) used the frequencycase_weightsinstead of the inverse-probability weights1/pi, which made the standard error anticonservative (95% confidence-interval coverage ~0.87 instead of 0.95 in simulation); it now uses the bias-corrected inverse-probability weights and is computed per outcome so an outcome’s SE no longer depends on the other outcomes fitted alongside it - fixed the probit propensity-score analytic variance under
control_sel(est_method = "gee"): a scalarifelse()truncated the propensity-score derivative vectors (ps_nons_der,est_ps_rand_der) to length 1, which corrupted the probability-sample variance component and inflated the standard error; replaced it with a scalarif/elseand added a probit-GEE analytic-SE regression test - warm-started the single-core IPW
pop_totalsGEE bootstrap to match the multicore path (closes #120) - warm-started the DR
bias_correctionbootstrap solver from the original-data fit, speeding it up with identical estimates (closes #119) - added
control_out(pmm_k_max = ...)to cap thepmm_k_choice = "min_var"search grid and documented its cost (closes #116) - added a regression test for finite analytic SE with
pop_totalsand a non-gaussianfamily_outcome(closes #71) - fixed parallel bootstrap (cores > 1) so
set.seed()makes results reproducible by switching the MI, IPW, and DR multicore loops fromforeach::%dopar%todoRNG::%dorng%; this also seeds NN/PMM tie-breaking inside workers (closes #115) - fixed an indefinite hang in the MI bootstrap variance path for
method_outcome %in% c("nn", "pmm"):boot_mi()previously hand-mutated a subset of thesvyrep.designslots, which left$selfrepat the original length and madesurvey::svytotal()insidemethod_nn()/method_pmm()fail deterministically with"(subscript) logical subscript too long"; a silenttryCatcharound the per-replicate loop swallowed the error without advancing the counter, so the call ran forever – the subset now goes through[.svyrep.designand a bounded retry surfaces any deterministic failure as a proper error - implemented the Kim & Haziza (2014) low-dimensional joint estimator for
control_inf(bias_correction = TRUE)without variable selection, replacing the previousobject 'bias_corr_ys_rand_pred' not foundcrash; the DR analytic and bootstrap variance paths now route through the same joint-estimation helper used by the Yang-Kim-Song (2020) high-dimensional path, andbias_correction = TRUEwithoutsvydesignis rejected up front (closes #114) – thanks to Tommy Nyberg for reporting. - shortened CRAN-facing examples, added spelling wordlist entries, and made
R CMD checkrun a fast tinytest subset by default while keeping the full suite available on GitHub viaNONPROBSVY_FULL_TESTS=true - corrected IPW-GEE analytic variance components for
h(x)=x/pi, the GEEb-vector sign, and probit/cloglog probability-sample scaling forh(x)=x, and added plain-R/Rcpp GEE formula checks - aligned MLE IPW analytic variance components for logit, probit, and cloglog propensity models with the shared
pi_dotformula and nonprobability-sample IPW plug-in scale, including the known-Nprobit adjustment - allowed outcome variable-selection models to use a user-specified
control_out(lambda = ...)value instead of always running cross-validation (closes #66) - returned bootstrap IPW weights in
boot_ipw_weightswhen bootstrap variance estimation is used and bootstrap results are kept (closes #76) - improved the Rcpp variable-selection cross-validation code, fixed SCAD penalty tapering, switched fold sampling to R’s RNG for
set.seed()reproducibility, and added benchmark evidence for the speedup (closes #103) - added DR regression tests for one-outcome versus multi-outcome analytic uncertainty-component consistency and multi-outcome bootstrap output shapes (addresses part of #101)
- added MI regression tests for one-outcome versus multi-outcome output, confidence-interval, and uncertainty-component consistency across GLM, NN, PMM, and NPAR backends (addresses part of #101)
- added IPW regression tests for one-outcome versus multi-outcome analytic variance, HT versus Hajek denominator metadata, and probit/cloglog variable-selection smoke coverage (addresses part of #101)
- made
make_outcomes()return names explicit so callers no longer rely on partial$matching (closes #100) - stabilized extreme-propensity IPW computations by clamping fitted probabilities, using tail-stable log-likelihood formulas, and guarding Rcpp variable-selection weights (closes #102)
- clarified supported data structures, estimator-by-outcome scope, README examples, and
verboseargument documentation (closes #95, #97, #98) - tightened validation for
pop_means,case_weights, and logical inference-control flags so unsupported inputs fail early with clearer messages (closes #96) - fixed multicore IPW bootstrap for population-totals-only runs and kept bootstrap replicate output shapes consistent (closes #94)
- aligned IPW point-estimator denominators with Horvitz-Thompson vs Hajek estimator behavior, added estimator-family metadata and print output, and documented when each estimator is used (closes #89)
- fixed incorrect analytic uncertainty for multi-outcome IPW and DR fits by aligning outcome-specific variance and confidence-interval indexing (closes #87, #88)
- fixed the NN mass-imputation
k = 1path by avoiding dimension drop errors and using leave-one-out matching for the non-probability variance proxy (closes #92) - fixed
nn_exact_se = TRUEso the mini-bootstrap uses bootstrap-specific nearest-neighbor matches instead of reusing the original donor matches (closes #91) - fixed PMM
pmm_k_choice = "min_var"so the bestkfound so far is returned instead of the first non-improvingk(closes #93) - fixed
nonprob_mi()model-frame construction to passcase_weightsinstead of an undefined internalweightssymbol; this is a plumbing fix and does not otherwise change current case-weight semantics (closes #99) - reused nearest-neighbor search results across outcomes for multi-outcome NN mass-imputation fits (closes #104)
- randomized equal-distance nearest-neighbor tie handling before donor aggregation, including hidden cutoff ties beyond the neighbours initially returned by
RANN::nn2()(closes #105) - changed PMM
pmm_k_choice = "min_var"to perform a full search over the candidatekgrid rather than stopping at the first non-improving value (closes #106) - aligned NN and PMM point estimates and probability-side analytic variances with known-
NHorvitz-Thompson means whenpop_sizeis supplied (closes #107) - updated NN/PMM function documentation for known-
Nmeans, leave-one-out NN variance proxy, weighted mini-bootstrap, random tie handling, and full-grid PMMkselection (closes #108) - added regression tests for multi-outcome analytic variance, NN
khandling, bootstrap-based NN exact SE, and PMMkchoice
nonprobsvy 0.2.3
CRAN release: 2025-08-20
- changes to the documentation to meet the JSS requirements
- documentation polishing
nonprobsvy 0.2.2
CRAN release: 2025-05-24
- new hex sticker by Oliwia Awuku
- minor changes to the code, e.g.
control_out(eps=1e-8) - fixing a bug in the bootstrap variance estimator the
method_nnandmethod_pmm - fixing bootstrap for doubly robust estimators
- more unit-tests for doubly robust estimators and other methods
- more informative vignette for
method_glm
nonprobsvy 0.2.1
CRAN release: 2025-04-22
- titles corrected
- new S3 method
extractadded which allows to extract results from thenonprobobject - new S3 method
coefadded which allows to obtain the coefficients of underlying models (if possible) - fixed CRAN notes (unit tests for the IPW estimator
cloglog) - removed
samplingpackage from suggested package - added simple
plotmethod - improvements in the linear algebra
- corrected the
check_balanceerror (closes #75) - code cleaning
nonprobsvy 0.2.0
CRAN release: 2025-03-27
Breaking changes
- functions
pop.size,controlSel,controlOutandcontrolInfwere renamed topop_size,control_sel,control_outandcontrol_infrespectively. - function
genSimDataremoved completely as it is not used anywhere in the package. - argument
maxLik_methodrenamed tomaxlik_methodin thecontrol_selfunction. -
control_outfunction:-
predictive_matchrenamed topmm_match_typeto align with the PMM (Predictive Mean Matching) estimator naming convention, where all related parameters start withpmm_
-
-
control_selfunction:- argument
methodremoved as it was not used - argument
est_method_selrenamed toest_method - argument
hrenamed togee_h_funto make this more readable to the user -
start_typenow accepts onlyzeroandmle(forgeemodels only).
- argument
-
control_inffunction:-
bias_infrenamed tovars_combineand type changed tological.TRUEif variables (its levels) should be combined after variable selection algorithm for the doubly robust approach. -
pi_ij– argument removed as it is not used.
-
-
nonprobsvyclass renamed tononproband all related method adjusted to this change - functions
logit_model_nonprobsvy,probit_model_nonprobsvyandcloglog_model_nonprobsvyremoved in the favour of more readablemethod_psfunction that specifies the propensity score model - new option
control_inference=control_inf(vars_combine=TRUE)which allows doubly robust estimator to combine variables prior estimation i.e. ifselection=~x1+x2andy~x1+x3then the following models are fittedselection=~x1+x2+x3andy~x1+x2+x3. By default we setcontrol_inference=control_inf(vars_combine=FALSE). Note that this behaviour is assumed independently from variable selection. - argument
nonprob(weights=NULL)replaced tononprob(case_weights=NULL)to stress that this refer to case weights not sampling or other weights in non-probability sample
Features
- two additional datasets have been included:
jvs(Job Vacancy Survey; a probability sample survey) andadmin(Central Job Offers Database; a non-probability sample survey). The units and auxiliary variables have been aligned in a way that allows the data to be integrated using the methods implemented in this package. - a
check_balancefunction was added to check the balance in the totals of the variables based on the weighted weights between the non-probability and probability samples. - citation file added.
- argument
na_actionwith defaultna.omit - new generic methods added:
-
weights– returns IPW weights -
update– allows to update thenonprobclass object
-
- new functions added and exported:
-
method_ps– for modelling propensity score -
method_glm– for modelling y usingglmfunction -
method_nn– for the NN method -
method_pmm– for the PMM method -
method_npar– for the non-parametric method
-
- new
print.nonprob,summary.nonprobandprint.nonprob_summarymethods
> result_mi
A nonprob object
- estimator type: mass imputation
- method: glm (gaussian)
- auxiliary variables source: survey
- vars selection: false
- variance estimator: analytic
- population size fixed: false
- naive (uncorrected) estimators:
- variable y1: 3.1817
- variable y2: 1.8087
- selected estimators:
- variable y1: 2.9498 (se=0.0420, ci=(2.8674, 3.0322))
- variable y2: 1.5760 (se=0.0326, ci=(1.5122, 1.6399))number of digits can be changed using print(x, digits) as shown below
> print(result_mi,2)
A nonprob object
- estimator type: mass imputation
- method: glm (gaussian)
- auxiliary variables source: survey
- vars selection: false
- variance estimator: analytic
- population size fixed: false
- naive (uncorrected) estimators:
- variable y1: 3.18
- variable y2: 1.81
- selected estimators:
- variable y1: 2.95 (se=0.04, ci=(2.87, 3.03))
- variable y2: 1.58 (se=0.03, ci=(1.51, 1.64))> summary(result_mi) |> print(digits=2)
A nonprob_summary object
- call: nonprob(data = subset(population, flag_bd1 == 1), outcome = y1 +
y2 ~ x1 + x2, svydesign = sample_prob)
- estimator type: mass imputation
- nonprob sample size: 693011 (69.3%)
- prob sample size: 1000 (0.1%)
- population size: 1000000 (fixed: false)
- detailed information about models are stored in list element(s): "outcome"
----------------------------------------------------------------
- distribution of outcome residuals:
- y1: min: -4.79; mean: 0.00; median: 0.00; max: 4.54
- y2: min: -4.96; mean: -0.00; median: -0.07; max: 12.25
- distribution of outcome predictions (nonprob sample):
- y1: min: -2.72; mean: 3.18; median: 3.04; max: 16.28
- y2: min: -1.55; mean: 1.81; median: 1.58; max: 13.92
- distribution of outcome predictions (prob sample):
- y1: min: -0.46; mean: 2.95; median: 2.84; max: 10.31
- y2: min: -0.58; mean: 1.58; median: 1.39; max: 7.87
----------------------------------------------------------------Bugfixes
- basic methods and functions related to variance estimation, weights and probability linking methods have been rewritten in a more optimal and readable way.
Other
- more informative error messages added.
- documentation improved.
- switching completely to snake_case.
- extensive cleaning of the code.
- more unit-tests added.
- new dependencies:
formula.tools
Replication materials
- to verify the quality of the software please refer to the replication materials available here: https://github.com/ncn-foreigners/software-tutorials
nonprobsvy 0.1.1
CRAN release: 2024-11-14
Bugfixes
- bug Fix occurring when estimation was based on auxiliary variable, which led to compression of the data from the frame to the vector.
- bug Fix related to not passing
maxitargument fromcontrolSelfunction to internally usednleqslvfunction - bug Fix related to storing
vectorinmodel_framewhen predictingy_hatin mass imputationglmmodel when X is based in one auxiliary variable only - fix provided converting it todata.frameobject.
Features
- added information to
summaryabout quality of estimation basing on difference between estimated and known total values of auxiliary variables - added estimation of exact standard error for k-nearest neighbor estimator.
- added breaking change to
controlOutfunction by switching values forpredictive_matchargument. From now on, thepredictive_match = 1means \hat{y}-\hat{y} in predictive mean matching imputation andpredictive_match = 2corresponds to \hat{y}-y matching. - implemented
divoption when variable selection (more in documentation) for doubly robust estimation. - added more insights to
nonproboutput such as gradient, hessian and jacobian derived from IPW estimation formleandgeemethods whenIPWorDRmodel executed. - added estimated inclusion probabilities and its derivatives for probability and non-probability samples to
nonproboutput whenIPWorDRmodel executed. - added
model_framematrix data from probability sample used for mass imputation tononprobwhenMIorDRmodel executed.
nonprobsvy 0.1.0
CRAN release: 2024-04-04
Features
- implemented population mean estimation using doubly robust, inverse probability weighting and mass imputation methods
- implemented inverse probability weighting models with Maximum Likelihood Estimation and Generalized Estimating Equations methods with
logit,complementary log-logandprobitlink functions. - implemented
generalized linear models,nearest neighboursandpredictive mean matchingmethods for Mass Imputation - implemented bias correction estimators for doubly-robust approach
- implemented estimation methods when vector of population means/totals is available
- implemented variables selection with
SCAD,LASSOandMCPpenalization equations - implemented
analyticandbootstrap(with parallel computation -doSNOWpackage) variance for described estimators - added control parameters for models
- added S3 methods for object of
nonprobclass such as-
nobsfor samples size -
pop.sizefor population size estimation -
residualsfor residuals of the inverse probability weighting model -
cooks.distancefor identifying influential observations that have a significant impact on the parameter estimates -
hatvaluesfor measuring the leverage of individual observations -
logLikfor computing the log-likelihood of the model, -
AIC(Akaike Information Criterion) for evaluating the model based on the trade-off between goodness of fit and complexity, helping in model selection -
BIC(Bayesian Information Criterion) for a similar purpose as AIC but with a stronger penalty for model complexity -
confintfor calculating confidence intervals around parameter estimates -
vcovfor obtaining the variance-covariance matrix of the parameter estimates -
deviancefor assessing the goodness of fit of the model
-
