
Nonparametric Exponential Tilting Theory
Overview
This vignette explains the matrix-vectorized implementation of the fully nonparametric exponential tilting (EXPTILT) estimator, as described in Appendix 2 of Riddles et al. (2016). This method is designed for fully categorical data, where the outcomes ($Y$), the response-model covariates ($X_1$), and the instrumental-variable covariates ($X_2$) are all discrete.
Unlike the parametric method, which estimates the parameters of a response probability function, the nonparametric approach directly estimates the nonresponse odds for each stratum. An Expectation-Maximization (EM) algorithm is used to find the maximum likelihood estimates of these odds.
The implementation (exptilt_nonparam.data.frame) assumes the input data is an aggregated data frame, where each row represents a unique stratum $x^*$ and contains the counts of respondents for each outcome $y^*$ and the total count of nonrespondents.
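To make the expected input shape concrete, here is a minimal sketch of such an aggregated data frame. The data, the column names (x1, x2, Yes, No, Refusal), and the object name toy_data are all hypothetical and chosen only for illustration; they are not taken from the package.

```r
# Hypothetical aggregated input: one row per unique (x1, x2) stratum.
toy_data <- data.frame(
  x1      = c("a", "a", "b", "b"),  # response-model covariate
  x2      = c("u", "v", "u", "v"),  # instrumental variable
  Yes     = c(30, 20, 10, 25),      # respondents with outcome "Yes"
  No      = c(10, 20, 30, 15),      # respondents with outcome "No"
  Refusal = c(20, 10, 15, 5),       # nonrespondents (outcome unobserved)
  stringsAsFactors = FALSE
)

outcome_cols <- c("Yes", "No")  # assumed names of the outcome count columns
refusal_col  <- "Refusal"       # assumed name of the nonrespondent count column

# Each stratum should appear exactly once in an aggregated table.
stopifnot(!anyDuplicated(toy_data[, c("x1", "x2")]))
```

Each row carries the full respondent breakdown for one stratum plus a single nonrespondent total, which is the only information available about the nonrespondents in that stratum.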
Notation and Main Objects
The implementation maps directly to the notation in Appendix 2. Let $x_1$ be the covariates for the response model and $x_2$ be the nonresponse instrumental variable. A full stratum is $x^* = (x_1^*, x_2^*)$, and a response-model-only stratum is $x_1^*$.
The algorithm is built from a set of fixed matrices (computed once) and one parameter matrix that is iteratively updated, together with two expected-count matrices that are recomputed from it at each iteration.
Fixed Objects (Pre-computed)
- Respondent Counts $N_{y^* x^*}$ (code: n_y_x_matrix)
  - Source: data[, outcome_cols]
  - Dimensions: (number of rows) × (number of outcome categories).
  - Definition: The observed (weighted) count of respondents for stratum $x^*$ and outcome $y^*$. (Note: The implementation assumes unit weights unless the input data is pre-weighted.)
- Nonrespondent Counts $M_{x^*}$ (code: m_x_vec)
  - Source: data[, refusal_col]
  - Dimensions: (number of rows) × 1.
  - Definition: The observed (weighted) total count of nonrespondents for stratum $x^*$.
- Respondent Proportions $\hat{p}(y^* \mid x^*)$ (code: p_hat_matrix)
  - Source: n_y_x_matrix / rowSums(n_y_x_matrix)
  - Dimensions: (number of rows) × (number of outcome categories).
  - Definition: The conditional probability (proportion) of observing outcome $y^*$ given stratum $x^*$, among respondents. This is a fixed, observed quantity used in the E-Step.
- Aggregated Respondent Counts $N_{y^* x_1^*}$ (code: n_y_x1_matrix)
  - Source: aggregate(n_y_x_matrix ~ data$x1_key, ...)
  - Dimensions: (number of unique $x_1$ strata) × (number of outcome categories).
  - Definition: The observed (weighted) count of respondents for outcome $y^*$ in stratum $x_1^*$, summed over the instrument $x_2$. This is the denominator of the M-Step.
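The fixed objects listed above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's code: the toy data and column names are hypothetical, and rowsum() is used here as a compact stand-in for the aggregate() call named above.

```r
# Hypothetical aggregated input (one row per (x1, x2) stratum).
toy_data <- data.frame(
  x1 = c("a", "a", "b", "b"), x2 = c("u", "v", "u", "v"),
  Yes = c(30, 20, 10, 25), No = c(10, 20, 30, 15),
  Refusal = c(20, 10, 15, 5), stringsAsFactors = FALSE
)
outcome_cols <- c("Yes", "No")
refusal_col  <- "Refusal"

# Respondent counts: rows = strata, columns = outcome categories.
n_y_x_matrix <- as.matrix(toy_data[, outcome_cols])

# Nonrespondent totals: one value per stratum.
m_x_vec <- toy_data[[refusal_col]]

# Respondent proportions within each stratum (each row sums to 1).
p_hat_matrix <- n_y_x_matrix / rowSums(n_y_x_matrix)

# Respondent counts summed over the instrument x2, one row per x1 value.
# rowsum() is swapped in here for the aggregate() call mentioned in the text.
x1_key <- as.character(toy_data$x1)
n_y_x1_matrix <- rowsum(n_y_x_matrix, group = x1_key)
```

rowsum() returns a matrix whose row names are the sorted unique keys, which makes it convenient to look odds up by x1 key later.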
Iterative Objects (The EM Algorithm)
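- Odds Matrix $O(x_1^*, y^*)$ (code: odds_matrix)
  - Dimensions: (number of unique $x_1$ strata) × (number of outcome categories).
  - Definition: The parameter being estimated. It represents the odds of nonresponse for a given stratum $x_1^*$ and outcome $y^*$. This is updated in the M-Step.
  - Initialization (Step 0): a starting value $O^{(0)}(x_1^*, y^*)$ is assigned for all $(x_1^*, y^*)$.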
- Expected Nonrespondent Counts $M_{y^* x^*}^{(t)}$ (code: m_y_x_matrix)
  - Dimensions: (number of rows) × (number of outcome categories).
  - Definition: The expected count of nonrespondents for stratum $x^*$ and outcome $y^*$, given the current odds $O^{(t)}$. This is computed in the E-Step.
- Aggregated Expected Nonrespondent Counts $M_{y^* x_1^*}^{(t)}$ (code: m_y_x1_matrix)
  - Dimensions: (number of unique $x_1$ strata) × (number of outcome categories).
  - Definition: The expected count of nonrespondents for outcome $y^*$ in stratum $x_1^*$, summed over the instrument $x_2$. This is the numerator of the M-Step.
The Expectation-Maximization (EM) Algorithm
The function exptilt_nonparam.data.frame is a direct implementation of the EM algorithm in Appendix 2. The goal is to find the odds $O(x_1^*, y^*)$ that maximize the observed-data likelihood, which is done by iterating two steps.
Step 1.1 (E-Step): Compute Expected Nonrespondent Counts
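The E-Step computes the expected breakdown of nonrespondents into outcome categories. It answers: “Given our current odds_matrix $O^{(t)}$, what is the expected count $M_{y^* x^*}^{(t)}$ for each full stratum $x^*$?”
- Formula: $M_{y^* x^*}^{(t)} = M_{x^*} \, \dfrac{\hat{p}(y^* \mid x^*) \, O^{(t)}(x_1^*, y^*)}{\sum_{y} \hat{p}(y \mid x^*) \, O^{(t)}(x_1^*, y)}$, where $M_{x^*}$ is the observed nonrespondent count (m_x_vec), $\hat{p}(y^* \mid x^*)$ is the respondent proportion (p_hat_matrix), and $x_1^*$ is the response-model part of $x^*$.
- Implementation (E-STEP in code): This is vectorized. The denominator is computed via rowSums(p_hat_matrix * odds_joined_matrix). The full calculation is: m_y_x_matrix <- m_x_vec * (p_hat_matrix * odds_joined_matrix) / denominator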
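A single E-Step can be sketched as follows. The numbers are hypothetical, and the way odds_joined_matrix is built here (indexing odds_matrix by each row's x1 key) is an assumption about what that object contains, inferred from its name rather than from the package source.

```r
# Hypothetical fixed inputs for four strata and two outcome categories.
p_hat_matrix <- matrix(
  c(0.75, 0.25,  0.50, 0.50,  0.25, 0.75,  0.625, 0.375),
  ncol = 2, byrow = TRUE, dimnames = list(NULL, c("Yes", "No"))
)
m_x_vec <- c(20, 10, 15, 5)        # nonrespondents per stratum
x1_key  <- c("a", "a", "b", "b")   # x1 value of each row

# Hypothetical current odds, one row per unique x1 value.
odds_matrix <- matrix(
  c(0.5, 1.0,  0.4, 0.2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("a", "b"), c("Yes", "No"))
)

# Assumption: expand the odds to one row per stratum by looking up the x1 key.
odds_joined_matrix <- odds_matrix[x1_key, , drop = FALSE]

# E-Step: split each stratum's nonrespondents across outcomes in proportion
# to p_hat * odds, normalised within the row.
denominator  <- rowSums(p_hat_matrix * odds_joined_matrix)
m_y_x_matrix <- m_x_vec * (p_hat_matrix * odds_joined_matrix) / denominator
```

Because m_x_vec and denominator both have one element per row, R's column-wise recycling applies them row by row, so each row of m_y_x_matrix sums back to that stratum's nonrespondent count.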
Step 1.2 (M-Step): Update Odds Matrix
The M-Step updates the nonresponse odds using the expected counts from the E-Step.
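- Aggregate Expected Counts: First, the expected nonrespondent counts $M_{y^* x^*}^{(t)}$ are aggregated over the instrument $x_2$ to get the total expected nonrespondents $M_{y^* x_1^*}^{(t)}$ for each $(x_1^*, y^*)$ cell.
  - Formula: $M_{y^* x_1^*}^{(t)} = \sum_{x_2^*} M_{y^* (x_1^*, x_2^*)}^{(t)}$
  - Implementation: m_y_x1_matrix <- aggregate(m_y_x_matrix ~ data$x1_key, ...)
- Update Odds: The new odds $O^{(t+1)}(x_1^*, y^*)$ is the simple ratio of the total expected nonrespondents to the total observed respondents for each $(x_1^*, y^*)$ cell.
  - Formula: $O^{(t+1)}(x_1^*, y^*) = \dfrac{M_{y^* x_1^*}^{(t)}}{N_{y^* x_1^*}}$, where $N_{y^* x_1^*}$ is the aggregated respondent count (n_y_x1_matrix).
  - Implementation: odds_matrix <- m_y_x1_matrix / n_safe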
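Putting the E-Step and M-Step together gives a loop like the sketch below. Several details are assumptions made for illustration and are not confirmed by the text above: the starting odds of 1, the tolerance and iteration cap, the use of rowsum() in place of aggregate(), and the definition of n_safe as a zero-guarded copy of n_y_x1_matrix.

```r
# Hypothetical aggregated inputs (four strata, two outcome categories).
n_y_x_matrix <- matrix(
  c(30, 10,  20, 20,  10, 30,  25, 15),
  ncol = 2, byrow = TRUE, dimnames = list(NULL, c("Yes", "No"))
)
m_x_vec <- c(20, 10, 15, 5)
x1_key  <- c("a", "a", "b", "b")

# Fixed objects.
p_hat_matrix  <- n_y_x_matrix / rowSums(n_y_x_matrix)
n_y_x1_matrix <- rowsum(n_y_x_matrix, group = x1_key)
n_safe        <- pmax(n_y_x1_matrix, 1e-12)  # assumed guard against empty cells

# Assumed starting value: all odds equal to 1.
odds_matrix <- matrix(1, nrow(n_y_x1_matrix), ncol(n_y_x1_matrix),
                      dimnames = dimnames(n_y_x1_matrix))

for (iter in seq_len(500)) {                  # assumed iteration cap
  # E-Step: expected nonrespondent counts per stratum and outcome.
  odds_joined_matrix <- odds_matrix[x1_key, , drop = FALSE]
  denominator  <- rowSums(p_hat_matrix * odds_joined_matrix)
  m_y_x_matrix <- m_x_vec * (p_hat_matrix * odds_joined_matrix) / denominator

  # M-Step: aggregate over the instrument, then take the ratio.
  m_y_x1_matrix <- rowsum(m_y_x_matrix, group = x1_key)
  new_odds      <- m_y_x1_matrix / n_safe

  converged   <- max(abs(new_odds - odds_matrix)) < 1e-8  # assumed tolerance
  odds_matrix <- new_odds
  if (converged) break
}
```

One useful sanity check follows from the update rule itself: within each x1 stratum, the odds times the respondent counts, summed over outcomes, reproduce the total number of nonrespondents in that stratum at every iteration.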
Final Estimates and Survey Weights
Survey Weights
The exptilt_nonparam.data.frame function assumes the input data is already aggregated. If a survey.design object is provided to exptilt_nonparam.survey.design, an adapter (not shown here) must first be used to create this aggregated table from the microdata. In that case, the respondent counts $N_{y^* x^*}$ and nonrespondent counts $M_{x^*}$ represent weighted sums of counts (sums of the survey weights over the units in each cell), not simple counts. The EM algorithm and all derived matrices (p_hat_matrix, odds_matrix) are then computed from these weighted inputs, following the logic of the paper.
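The kind of aggregation such an adapter has to perform can be sketched with base R alone. This is not the package's adapter: the microdata, its column names (x1, x2, y, w), and the convention that a missing y marks a nonrespondent are all assumptions made for this illustration.

```r
# Hypothetical microdata: one row per sampled unit, w = survey weight,
# y = NA for nonrespondents (assumed convention).
micro <- data.frame(
  x1 = c("a", "a", "a", "b", "b", "b"),
  x2 = c("u", "u", "v", "u", "v", "v"),
  y  = c("Yes", NA, "No", "Yes", NA, "No"),
  w  = c(1.5, 2.0, 1.0, 2.5, 1.0, 3.0),
  stringsAsFactors = FALSE
)

# Full stratum key and an outcome category that includes "Refusal".
micro$stratum <- interaction(micro$x1, micro$x2, drop = TRUE)
micro$y_cat <- factor(
  ifelse(is.na(micro$y), "Refusal", as.character(micro$y)),
  levels = c("Yes", "No", "Refusal")
)

# Weighted cross-tabulation: each cell is the sum of weights, not a head count.
weighted_table <- xtabs(w ~ stratum + y_cat, data = micro)

# Wide table: one row per stratum, one column per outcome plus "Refusal".
aggregated <- as.data.frame.matrix(weighted_table)
```

Placing the weight on the left-hand side of the xtabs() formula is what turns simple counts into weighted counts; the resulting table has the same layout the aggregated-input function expects, apart from the stratum covariates being encoded in the row names.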
Final Adjusted Counts
The final output data_to_return is the object of primary interest for analysis. It is constructed by:
1. Calculating the final expected nonrespondent counts $M_{y^* x^*}^{(\text{final})}$ using the converged odds_matrix.
2. Adding these expected counts to the original observed respondent counts $N_{y^* x^*}$.
- Definition: $\text{data\_to\_return}[y^*, x^*] = N_{y^* x^*} + M_{y^* x^*}^{(\text{final})}$
- Implementation: data_to_return[, outcome_cols] <- n_y_x_matrix_ordered + m_y_x_matrix_ordered
This final adjusted table represents the completed dataset, where the “Refusal” counts have been redistributed across the outcome columns according to the NMAR model.
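The add-back step can be illustrated on a small table. The data and the "final" expected counts below are hypothetical values chosen by hand, not the output of a converged run, and the Refusal column is simply left untouched here because the text does not say what the implementation does with it.

```r
# Hypothetical aggregated input.
toy_data <- data.frame(
  x1 = c("a", "a", "b", "b"), x2 = c("u", "v", "u", "v"),
  Yes = c(30, 20, 10, 25), No = c(10, 20, 30, 15),
  Refusal = c(20, 10, 15, 5), stringsAsFactors = FALSE
)
outcome_cols <- c("Yes", "No")
n_y_x_matrix <- as.matrix(toy_data[, outcome_cols])

# Hypothetical final expected nonrespondent counts; each row sums to the
# stratum's Refusal count, as the E-Step guarantees.
m_y_x_matrix <- matrix(
  c(12, 8,  5, 5,  6, 9,  3, 2),
  ncol = 2, byrow = TRUE, dimnames = list(NULL, outcome_cols)
)

# Completed table: observed respondents plus redistributed nonrespondents.
data_to_return <- toy_data
data_to_return[, outcome_cols] <- n_y_x_matrix + m_y_x_matrix
```

The grand total of the adjusted outcome columns equals respondents plus nonrespondents, so no units are created or lost by the redistribution.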
Final Proportion
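The final population proportion for an outcome $y^*$, $\hat{P}(Y = y^*)$, is the total (weighted) adjusted count for outcome $y^*$ divided by the total (weighted) population. This is calculated from the data_to_return object.
- Formula (Unweighted): $\hat{P}(Y = y^*) = \dfrac{\sum_{x^*} \left( N_{y^* x^*} + M_{y^* x^*}^{(\text{final})} \right)}{\sum_{y} \sum_{x^*} \left( N_{y x^*} + M_{y x^*}^{(\text{final})} \right)}$
- Implementation: This is what the “Adjusted Proportions” calculation in Chunk 5 of the accompanying notebook performs on the data_to_return object.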
This differs from the paper’s Step 3 formula, which is an IPW (Inverse Probability Weighting) estimator. However, both methods are asymptotically equivalent, and the “add-and-sum” method (based on data_to_return) is a more direct and intuitive application of the EM algorithm’s goal.
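The add-and-sum proportion can be computed directly from the completed table. The data_to_return values below are hypothetical adjusted counts used only to show the arithmetic, and the names adjusted_totals and adjusted_props are introduced here for illustration.

```r
# Hypothetical completed table (observed plus expected counts per stratum).
data_to_return <- data.frame(
  x1 = c("a", "a", "b", "b"), x2 = c("u", "v", "u", "v"),
  Yes = c(42, 25, 16, 28), No = c(18, 25, 39, 17),
  stringsAsFactors = FALSE
)
outcome_cols <- c("Yes", "No")

# Total adjusted count per outcome, summed over all strata.
adjusted_totals <- colSums(data_to_return[, outcome_cols])

# Final proportion: each outcome's adjusted total over the grand total.
adjusted_props <- adjusted_totals / sum(adjusted_totals)
```

With these toy numbers the adjusted totals are 111 for "Yes" and 99 for "No" out of 210, so the proportions always sum to one by construction.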