Skip to contents

nonprobsvy (development version)

nonprobsvy 0.3.0

  • changed the default of num_boot in control_inf() from 500 to 100 and nlambda in control_out() from 100 to 50 to align with the documentation
  • vignette based on the JSS paper added
  • documentation notation adjusted to match the JSS paper
  • we thank the authors of StatsClaw.ai, the tool that allowed us to identify bugs and improve the code
  • documented and stabilised the doubly robust analytic variance for non-logit propensity links: the plug-in doubly robust variance (Chen, Li & Wu 2020, Theorem 2, eq. 14) is derived under the logistic model, where the probability-sample (design) variance factor pi is bounded by 1; for probit (inverse-Mills factor phi / (1 - pi)) and cloglog (factor log(1 - pi) == -exp(eta)) it is conservative (over-estimates the standard error) and previously could overflow to a huge or non-finite SE when a fitted propensity approached 0 or 1 (in a small-propensity simulation the probit analytic SE reached ~1e27). The non-logit doubly robust variance terms (t_vec, var_nonprob, and the b-vector weights psd / pi^2 and (1 - pi) / pi^2 * exp(eta)) now floor the fitted propensity away from {0, 1} at 1 / sqrt(N) so they stay finite, a one-time note recommends control_inf(var_method = "bootstrap") for probit/cloglog doubly robust inference, and the documentation states the conservativeness. The logistic path is left untouched (its standard error is unchanged). The previously commented-out probit/cloglog doubly robust tests are re-enabled (point estimates pinned, analytic SE asserted finite), validated by a coverage simulation in which logit stays at nominal coverage (~0.95, SE/SD ~1) and the probit/cloglog SE no longer explodes
  • added input validation at the boundary: a factor or character outcome/target now raises a clear error (instead of silently returning NA), and the not-yet-implemented overlap controls control_sel(dependence = TRUE) / key = ... now error with a clear message instead of being silently accepted
  • fixed a crash in IPW/DR variable selection on the population-totals-only path (no svydesign): once cross-validation dropped a covariate, the response was re-extracted by routing the still-full outcome formula through the population-totals model frame, whose name check then failed with "Selection and population totals have different names."; the response is now read directly from the data, independent of the reduced totals. A warning is also emitted flagging this path as experimental, because the penalty (lambda) selection is unreliable here and can collapse to an over-sparse (even intercept-only) model that biases the estimate – supplying a unit-level svydesign or a fixed control_sel(lambda = ...) is recommended instead
  • fixed the cross-validation loss for IPW variable selection in the population-totals-only path: the known population totals were normalised by sum(weights) (which equals the non-probability sample size when no probability sample is present) instead of sum(1/pi_hat) (the HT estimate of N), mis-scaling the calibration target by ~N/n and mis-directing the penalty (lambda) selection; the totals term now uses the same N_nons normalisation as the estimating equation
  • stabilised the nonparametric (npar) mass-imputation analytic variance: the loess inclusion-propensity proxy can return improper fitted values (<= 0 or near zero), which made the 1/pi_hat^2 weight spurious or non-finite; the propensity is now floored at 1/sqrt(N) (so each unit’s inverse-propensity weight is capped at N) before inverting
  • corrected the documented general c-vector formula in the GLM mass-imputation analytic variance (method_glm()): the equation had been transcribed from Kim et al. (2021, eq. 9) in that paper’s labelling, where sample B is the non-probability sample and A the probability sample – the opposite of this package’s convention (A = non-probability). The sample roles in the general formula are now swapped to match the package convention, the linear special case, and the (already correct) implementation; this is a documentation-only fix with no change to computed variances
  • documentation polishing: corrected the documented epsilon defaults in control_out() (1e-8) and control_sel() (1e-4) to match the code, fixed a var_tot -> var_total return-value name in the method_nn() documentation, removed a stray character in the method_npar() family_outcome parameter doc, and dropped a redundant sum() in the printed IPW-weights total
  • validated that a supplied pop_totals is a named vector whose first element is (Intercept), erroring early with a clear message instead of silently producing an NA population size and NaN standard errors
  • fixed confint() to honor the requested level regardless of the fit’s control_inf(alpha); previously confint(level = 0.95) returned the stored interval built with the fit’s alpha, so a model fitted with e.g. alpha = 0.1 had its 90% interval mislabelled as 95%
  • enabled the Rcpp variable-selection non-convergence warning (previously dead code) and fixed an off-by-one so the solver runs the full maxit iterations; the warning now fires under verbose = TRUE
  • corrected the displayed mass-imputation GLM V_1 variance formula in the documentation (squared residual and squared bracket); the implementation was already correct
  • fixed the doubly robust analytic standard error under control_inf(bias_correction = TRUE): the non-probability variance component (Yang-Kim-Song 2020, eq. 25) used the frequency case_weights instead of the inverse-probability weights 1/pi, which made the standard error anticonservative (95% confidence-interval coverage ~0.87 instead of 0.95 in simulation); it now uses the bias-corrected inverse-probability weights and is computed per outcome so an outcome’s SE no longer depends on the other outcomes fitted alongside it
  • fixed the probit propensity-score analytic variance under control_sel(est_method = "gee"): a scalar ifelse() truncated the propensity-score derivative vectors (ps_nons_der, est_ps_rand_der) to length 1, which corrupted the probability-sample variance component and inflated the standard error; replaced it with a scalar if/else and added a probit-GEE analytic-SE regression test
  • warm-started the single-core IPW pop_totals GEE bootstrap to match the multicore path (closes #120)
  • warm-started the DR bias_correction bootstrap solver from the original-data fit, speeding it up with identical estimates (closes #119)
  • added control_out(pmm_k_max = ...) to cap the pmm_k_choice = "min_var" search grid and documented its cost (closes #116)
  • added a regression test for finite analytic SE with pop_totals and a non-gaussian family_outcome (closes #71)
  • fixed parallel bootstrap (cores > 1) so set.seed() makes results reproducible by switching the MI, IPW, and DR multicore loops from foreach::%dopar% to doRNG::%dorng%; this also seeds NN/PMM tie-breaking inside workers (closes #115)
  • fixed an indefinite hang in the MI bootstrap variance path for method_outcome %in% c("nn", "pmm"): boot_mi() previously hand-mutated a subset of the svyrep.design slots, which left $selfrep at the original length and made survey::svytotal() inside method_nn()/method_pmm() fail deterministically with "(subscript) logical subscript too long"; a silent tryCatch around the per-replicate loop swallowed the error without advancing the counter, so the call ran forever – the subset now goes through [.svyrep.design and a bounded retry surfaces any deterministic failure as a proper error
  • implemented the Kim & Haziza (2014) low-dimensional joint estimator for control_inf(bias_correction = TRUE) without variable selection, replacing the previous object 'bias_corr_ys_rand_pred' not found crash; the DR analytic and bootstrap variance paths now route through the same joint-estimation helper used by the Yang-Kim-Song (2020) high-dimensional path, and bias_correction = TRUE without svydesign is rejected up front (closes #114) – thanks to Tommy Nyberg for reporting.
  • shortened CRAN-facing examples, added spelling wordlist entries, and made R CMD check run a fast tinytest subset by default while keeping the full suite available on GitHub via NONPROBSVY_FULL_TESTS=true
  • corrected IPW-GEE analytic variance components for h(x)=x/pi, the GEE b-vector sign, and probit/cloglog probability-sample scaling for h(x)=x, and added plain-R/Rcpp GEE formula checks
  • aligned MLE IPW analytic variance components for logit, probit, and cloglog propensity models with the shared pi_dot formula and nonprobability-sample IPW plug-in scale, including the known-N probit adjustment
  • allowed outcome variable-selection models to use a user-specified control_out(lambda = ...) value instead of always running cross-validation (closes #66)
  • returned bootstrap IPW weights in boot_ipw_weights when bootstrap variance estimation is used and bootstrap results are kept (closes #76)
  • improved the Rcpp variable-selection cross-validation code, fixed SCAD penalty tapering, switched fold sampling to R’s RNG for set.seed() reproducibility, and added benchmark evidence for the speedup (closes #103)
  • added DR regression tests for one-outcome versus multi-outcome analytic uncertainty-component consistency and multi-outcome bootstrap output shapes (addresses part of #101)
  • added MI regression tests for one-outcome versus multi-outcome output, confidence-interval, and uncertainty-component consistency across GLM, NN, PMM, and NPAR backends (addresses part of #101)
  • added IPW regression tests for one-outcome versus multi-outcome analytic variance, HT versus Hajek denominator metadata, and probit/cloglog variable-selection smoke coverage (addresses part of #101)
  • made make_outcomes() return names explicit so callers no longer rely on partial $ matching (closes #100)
  • stabilized extreme-propensity IPW computations by clamping fitted probabilities, using tail-stable log-likelihood formulas, and guarding Rcpp variable-selection weights (closes #102)
  • clarified supported data structures, estimator-by-outcome scope, README examples, and verbose argument documentation (closes #95, #97, #98)
  • tightened validation for pop_means, case_weights, and logical inference-control flags so unsupported inputs fail early with clearer messages (closes #96)
  • fixed multicore IPW bootstrap for population-totals-only runs and kept bootstrap replicate output shapes consistent (closes #94)
  • aligned IPW point-estimator denominators with Horvitz-Thompson vs Hajek estimator behavior, added estimator-family metadata and print output, and documented when each estimator is used (closes #89)
  • fixed incorrect analytic uncertainty for multi-outcome IPW and DR fits by aligning outcome-specific variance and confidence-interval indexing (closes #87, #88)
  • fixed the NN mass-imputation k = 1 path by avoiding dimension drop errors and using leave-one-out matching for the non-probability variance proxy (closes #92)
  • fixed nn_exact_se = TRUE so the mini-bootstrap uses bootstrap-specific nearest-neighbor matches instead of reusing the original donor matches (closes #91)
  • fixed PMM pmm_k_choice = "min_var" so the best k found so far is returned instead of the first non-improving k (closes #93)
  • fixed nonprob_mi() model-frame construction to pass case_weights instead of an undefined internal weights symbol; this is a plumbing fix and does not otherwise change current case-weight semantics (closes #99)
  • reused nearest-neighbor search results across outcomes for multi-outcome NN mass-imputation fits (closes #104)
  • randomized equal-distance nearest-neighbor tie handling before donor aggregation, including hidden cutoff ties beyond the neighbours initially returned by RANN::nn2() (closes #105)
  • changed PMM pmm_k_choice = "min_var" to perform a full search over the candidate k grid rather than stopping at the first non-improving value (closes #106)
  • aligned NN and PMM point estimates and probability-side analytic variances with known-N Horvitz-Thompson means when pop_size is supplied (closes #107)
  • updated NN/PMM function documentation for known-N means, leave-one-out NN variance proxy, weighted mini-bootstrap, random tie handling, and full-grid PMM k selection (closes #108)
  • added regression tests for multi-outcome analytic variance, NN k handling, bootstrap-based NN exact SE, and PMM k choice

nonprobsvy 0.2.3

CRAN release: 2025-08-20

  • changes to the documentation to meet the JSS requirements
  • documentation polishing

nonprobsvy 0.2.2

CRAN release: 2025-05-24

  • new hex sticker by Oliwia Awuku
  • minor changes to the code, e.g. control_out(eps=1e-8)
  • fixing a bug in the bootstrap variance estimator the method_nn and method_pmm
  • fixing bootstrap for doubly robust estimators
  • more unit-tests for doubly robust estimators and other methods
  • more informative vignette for method_glm

nonprobsvy 0.2.1

CRAN release: 2025-04-22


  • titles corrected
  • new S3 method extract added which allows to extract results from the nonprob object
  • new S3 method coef added which allows to obtain the coefficients of underlying models (if possible)
  • fixed CRAN notes (unit tests for the IPW estimator cloglog)
  • removed sampling package from suggested package
  • added simple plot method
  • improvements in the linear algebra
  • corrected the check_balance error (closes #75)
  • code cleaning

nonprobsvy 0.2.0

CRAN release: 2025-03-27


Breaking changes

  • functions pop.size, controlSel, controlOut and controlInf were renamed to pop_size, control_sel, control_out and control_inf respectively.
  • function genSimData removed completely as it is not used anywhere in the package.
  • argument maxLik_method renamed to maxlik_method in the control_sel function.
  • control_out function:
    • predictive_match renamed to pmm_match_type to align with the PMM (Predictive Mean Matching) estimator naming convention, where all related parameters start with pmm_
  • control_sel function:
    • argument method removed as it was not used
    • argument est_method_sel renamed to est_method
    • argument h renamed to gee_h_fun to make this more readable to the user
    • start_type now accepts only zero and mle (for gee models only).
  • control_inf function:
    • bias_inf renamed to vars_combine and type changed to logical. TRUE if variables (its levels) should be combined after variable selection algorithm for the doubly robust approach.
    • pi_ij – argument removed as it is not used.
  • nonprobsvy class renamed to nonprob and all related method adjusted to this change
  • functions logit_model_nonprobsvy, probit_model_nonprobsvy and cloglog_model_nonprobsvy removed in the favour of more readable method_ps function that specifies the propensity score model
  • new option control_inference=control_inf(vars_combine=TRUE) which allows doubly robust estimator to combine variables prior estimation i.e. if selection=~x1+x2 and y~x1+x3 then the following models are fitted selection=~x1+x2+x3 and y~x1+x2+x3. By default we set control_inference=control_inf(vars_combine=FALSE). Note that this behaviour is assumed independently from variable selection.
  • argument nonprob(weights=NULL) replaced to nonprob(case_weights=NULL) to stress that this refer to case weights not sampling or other weights in non-probability sample

Features

  • two additional datasets have been included: jvs (Job Vacancy Survey; a probability sample survey) and admin (Central Job Offers Database; a non-probability sample survey). The units and auxiliary variables have been aligned in a way that allows the data to be integrated using the methods implemented in this package.
  • a check_balance function was added to check the balance in the totals of the variables based on the weighted weights between the non-probability and probability samples.
  • citation file added.
  • argument na_action with default na.omit
  • new generic methods added:
    • weights – returns IPW weights
    • update – allows to update the nonprob class object
  • new functions added and exported:
    • method_ps – for modelling propensity score
    • method_glm – for modelling y using glm function
    • method_nn – for the NN method
    • method_pmm – for the PMM method
    • method_npar – for the non-parametric method
  • new print.nonprob, summary.nonprob and print.nonprob_summary methods
> result_mi
A nonprob object
 - estimator type: mass imputation
 - method: glm (gaussian)
 - auxiliary variables source: survey
 - vars selection: false
 - variance estimator: analytic
 - population size fixed: false
 - naive (uncorrected) estimators:
   - variable y1: 3.1817
   - variable y2: 1.8087
 - selected estimators:
   - variable y1: 2.9498 (se=0.0420, ci=(2.8674, 3.0322))
   - variable y2: 1.5760 (se=0.0326, ci=(1.5122, 1.6399))

number of digits can be changed using print(x, digits) as shown below

> print(result_mi,2)
A nonprob object
 - estimator type: mass imputation
 - method: glm (gaussian)
 - auxiliary variables source: survey
 - vars selection: false
 - variance estimator: analytic
 - population size fixed: false
 - naive (uncorrected) estimators:
   - variable y1: 3.18
   - variable y2: 1.81
 - selected estimators:
   - variable y1: 2.95 (se=0.04, ci=(2.87, 3.03))
   - variable y2: 1.58 (se=0.03, ci=(1.51, 1.64))
> summary(result_mi) |> print(digits=2)
A nonprob_summary object
 - call: nonprob(data = subset(population, flag_bd1 == 1), outcome = y1 + 
    y2 ~ x1 + x2, svydesign = sample_prob)
 - estimator type: mass imputation
 - nonprob sample size: 693011 (69.3%)
 - prob sample size: 1000 (0.1%)
 - population size: 1000000 (fixed: false)
 - detailed information about models are stored in list element(s): "outcome"
----------------------------------------------------------------
 - distribution of outcome residuals:
   - y1: min: -4.79; mean: 0.00; median: 0.00; max: 4.54
   - y2: min: -4.96; mean: -0.00; median: -0.07; max: 12.25
 - distribution of outcome predictions (nonprob sample):
   - y1: min: -2.72; mean: 3.18; median: 3.04; max: 16.28
   - y2: min: -1.55; mean: 1.81; median: 1.58; max: 13.92
 - distribution of outcome predictions (prob sample):
   - y1: min: -0.46; mean: 2.95; median: 2.84; max: 10.31
   - y2: min: -0.58; mean: 1.58; median: 1.39; max: 7.87
----------------------------------------------------------------

Bugfixes

  • basic methods and functions related to variance estimation, weights and probability linking methods have been rewritten in a more optimal and readable way.

Other

  • more informative error messages added.
  • documentation improved.
  • switching completely to snake_case.
  • extensive cleaning of the code.
  • more unit-tests added.
  • new dependencies: formula.tools

Documentation

  • annotation has been added that argument strata is not supported for the time being.

Replication materials

nonprobsvy 0.1.1

CRAN release: 2024-11-14


Bugfixes

  • bug Fix occurring when estimation was based on auxiliary variable, which led to compression of the data from the frame to the vector.
  • bug Fix related to not passing maxit argument from controlSel function to internally used nleqslv function
  • bug Fix related to storing vector in model_frame when predicting y_hat in mass imputation glm model when X is based in one auxiliary variable only - fix provided converting it to data.frame object.

Features

  • added information to summary about quality of estimation basing on difference between estimated and known total values of auxiliary variables
  • added estimation of exact standard error for k-nearest neighbor estimator.
  • added breaking change to controlOut function by switching values for predictive_match argument. From now on, the predictive_match = 1 means \hat{y}-\hat{y} in predictive mean matching imputation and predictive_match = 2 corresponds to \hat{y}-y matching.
  • implemented div option when variable selection (more in documentation) for doubly robust estimation.
  • added more insights to nonprob output such as gradient, hessian and jacobian derived from IPW estimation for mle and gee methods when IPW or DR model executed.
  • added estimated inclusion probabilities and its derivatives for probability and non-probability samples to nonprob output when IPW or DR model executed.
  • added model_frame matrix data from probability sample used for mass imputation to nonprob when MI or DR model executed.

Unit tests

  • added unit tests for variable selection models and mi estimation with vector of population totals available

nonprobsvy 0.1.0

CRAN release: 2024-04-04


Features

  • implemented population mean estimation using doubly robust, inverse probability weighting and mass imputation methods
  • implemented inverse probability weighting models with Maximum Likelihood Estimation and Generalized Estimating Equations methods with logit, complementary log-log and probit link functions.
  • implemented generalized linear models, nearest neighbours and predictive mean matching methods for Mass Imputation
  • implemented bias correction estimators for doubly-robust approach
  • implemented estimation methods when vector of population means/totals is available
  • implemented variables selection with SCAD, LASSO and MCP penalization equations
  • implemented analytic and bootstrap (with parallel computation - doSNOW package) variance for described estimators
  • added control parameters for models
  • added S3 methods for object of nonprob class such as
    • nobs for samples size
    • pop.size for population size estimation
    • residuals for residuals of the inverse probability weighting model
    • cooks.distance for identifying influential observations that have a significant impact on the parameter estimates
    • hatvalues for measuring the leverage of individual observations
    • logLik for computing the log-likelihood of the model,
    • AIC (Akaike Information Criterion) for evaluating the model based on the trade-off between goodness of fit and complexity, helping in model selection
    • BIC (Bayesian Information Criterion) for a similar purpose as AIC but with a stronger penalty for model complexity
    • confint for calculating confidence intervals around parameter estimates
    • vcov for obtaining the variance-covariance matrix of the parameter estimates
    • deviance for assessing the goodness of fit of the model

Unit tests

  • added unit tests for IPW estimators.

Github repository

  • added automated R-cmd check

Documentation

  • added documentation for nonprob function.