Basic information
The goal of this package is to provide R users access to modern methods for non-probability samples when auxiliary information from the population or probability sample is available:
- inverse probability weighting estimators with possible calibration constraints (Y. Chen, Li, and Wu 2020),
- mass imputation estimators based on nearest neighbours (Yang, Kim, and Hwang 2021), predictive mean matching (Chlebicki, Chrostowski, and Beręsewicz 2025), non-parametric (S. Chen, Yang, and Kim 2022) and regression imputation (Kim et al. 2021),
- doubly robust estimators (Y. Chen, Li, and Wu 2020) with bias minimization (Yang, Kim, and Song 2020).
The package allows for:
- variable section in high-dimensional space using SCAD (Yang, Kim, and Song 2020), Lasso and MCP penalty (via the
ncvreg,Rcpp,RcppArmadillopackages), - estimation of variance using analytical and bootstrap approach (see Wu (2023)),
- integration with the
surveyandsrvyrpackages when probability sample is available (Lumley 2004, 2023; Freedman Ellis and Schneider 2024), - different links for selection (
logit,probitandcloglog) and outcome (gaussian,binomialandpoisson) variables.
Details on the use of the package can be found:
- see the working paper Chrostowski, Ł., Chlebicki, P., & Beręsewicz, M. (2025). nonprobsvy–An R package for modern methods for non-probability surveys. arXiv preprint arXiv:2504.04255.
- in the draft (and not proofread) version of the book Modern inference methods for non-probability samples with R,
- in example codes that reproduce papers available on github in the repository software tutorials.
Installation
You can install the recent version of nonprobsvy package from main branch Github with:
remotes::install_github("ncn-foreigners/nonprobsvy")or install the stable version from CRAN
install.packages("nonprobsvy")or development version from the dev branch
remotes::install_github("ncn-foreigners/nonprobsvy@dev")Basic idea
Consider the following setting where two samples are available: non-probability (denoted as S_A) and probability (denoted as S_B) where set of auxiliary variables (denoted as \boldsymbol{X}) is available for both sources while Y and \boldsymbol{d} (or \boldsymbol{w}) is present only in probability sample.
| Sample | Auxiliary variables \boldsymbol{X} | Target variable Y | Design (\boldsymbol{d}) or calibrated (\boldsymbol{w}) weights | |
|---|---|---|---|---|
| S_A (non-probability) | 1 | \checkmark | \checkmark | ? |
| … | \checkmark | \checkmark | ? | |
| n_A | \checkmark | \checkmark | ? | |
| S_B (probability) | n_A+1 | \checkmark | ? | \checkmark |
| … | \checkmark | ? | \checkmark | |
| n_A+n_B | \checkmark | ? | \checkmark |
Basic functionalities
Suppose Y is the target variable, \boldsymbol{X} is a matrix of auxiliary variables, R is the inclusion indicator. Then, if we are interested in estimating the mean \bar{\tau}_Y or the sum \tau_Y of the of the target variable given the observed data set (y_k, \boldsymbol{x}_k, R_k), we can approach this problem with the possible scenarios:
- unit-level data is available for the non-probability sample S_{A}, i.e. (y_k,\boldsymbol{x}_k) is available for all units k \in S_{A}, and population-level data is available for \boldsymbol{x}_1,\ldots,\boldsymbol{x}_p, denoted as \tau_{x_{1}},\tau_{x_{2}},\ldots,\tau_{x_{p}} and population size N is known. We can also consider situations where population data are estimated (e.g. on the basis of a survey to which we do not have access),
- unit-level data is available for the non-probability sample S_A and the probability sample S_B, i.e. (y_k,\boldsymbol{x}_k,R_k) is determined by the data. is determined by the data: R_k=1 if k \in S_A otherwise R_k=0, y_k is observed only for sample S_A and \boldsymbol{x}_k is observed in both in both S_A and S_B,
When unit-level data is available for non-probability survey only
| Estimator | Example code |
|---|---|
| Mass imputation based on regression imputation | |
| Inverse probability weighting | |
| Inverse probability weighting with calibration constraint |
|
| Doubly robust estimator | |
When unit-level data are available for both surveys
| Estimator | Example code |
|---|---|
| Mass imputation based on regression imputation |
|
| Mass imputation based on nearest neighbour imputation |
|
| Mass imputation based on predictive mean matching |
|
| Mass imputation based on regression imputation with variable selection (LASSO) |
|
| Inverse probability weighting |
|
| Inverse probability weighting with calibration constraint |
|
| Inverse probability weighting with calibration constraint with variable selection (SCAD) |
|
| Doubly robust estimator |
|
| Doubly robust estimator with variable selection (SCAD) and bias minimization |
|
Examples
Simulate example data from the following paper: Kim, Jae Kwang, and Zhonglei Wang. “Sampling techniques for big data analysis.” International Statistical Review 87 (2019): S177-S191 [section 5.2]
library(survey)
library(nonprobsvy)
set.seed(1234567890)
N <- 1e6 ## 1000000
n <- 1000
x1 <- rnorm(n = N, mean = 1, sd = 1)
x2 <- rexp(n = N, rate = 1)
epsilon <- rnorm(n = N) # rnorm(N)
y1 <- 1 + x1 + x2 + epsilon
y2 <- 0.5*(x1 - 0.5)^2 + x2 + epsilon
p1 <- exp(x2)/(1+exp(x2))
p2 <- exp(-0.5+0.5*(x2-2)^2)/(1+exp(-0.5+0.5*(x2-2)^2))
flag_bd1 <- rbinom(n = N, size = 1, prob = p1)
flag_srs <- as.numeric(1:N %in% sample(1:N, size = n))
base_w_srs <- N/n
population <- data.frame(x1,x2,y1,y2,p1,p2,base_w_srs, flag_bd1, flag_srs, pop_size = N)
base_w_bd <- N/sum(population$flag_bd1)Declare svydesign object with survey package
sample_prob <- svydesign(ids= ~1, weights = ~ base_w_srs,
data = subset(population, flag_srs == 1),
fpc = ~ pop_size)
sample_prob
#> Independent Sampling design
#> svydesign(ids = ~1, weights = ~base_w_srs, data = subset(population,
#> flag_srs == 1), fpc = ~pop_size)or with the srvyr package
sample_prob <- srvyr::as_survey_design(.data = subset(population, flag_srs == 1),
weights = base_w_srs)
sample_probIndependent Sampling design (with replacement)
Called via srvyr
Sampling variables:
Data variables:
- x1 (dbl), x2 (dbl), y1 (dbl), y2 (dbl), p1 (dbl), p2 (dbl), base_w_srs (dbl), flag_bd1 (int), flag_srs (dbl)Estimate population mean of y1 based on doubly robust estimator using IPW with calibration constraints and we specify that auxiliary variables should not be combined for the inference.
result_dr <- nonprob(
selection = ~ x2,
outcome = y1 + y2 ~ x1 + x2,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob
)Results
result_dr
#> A nonprob object
#> - estimator type: doubly robust
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9500 (se=0.0414, ci=(2.8689, 3.0312))
#> - variable y2: 1.5762 (se=0.0498, ci=(1.4786, 1.6739))Mass imputation estimator
result_mi <- nonprob(
outcome = y1 + y2 ~ x1 + x2,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob
)Results
result_mi
#> A nonprob object
#> - estimator type: mass imputation
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9498 (se=0.0420, ci=(2.8675, 3.0321))
#> - variable y2: 1.5760 (se=0.0326, ci=(1.5122, 1.6398))Inverse probability weighting estimator
result_ipw <- nonprob(
selection = ~ x2,
target = ~y1+y2,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob)Results
result_ipw
#> A nonprob object
#> - estimator type: inverse probability weighting
#> - method: logit (mle)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9981 (se=0.0137, ci=(2.9713, 3.0249))
#> - variable y2: 1.5906 (se=0.0137, ci=(1.5639, 1.6174))Funding
Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941.
