Overview

This vignette explains the matrix-vectorized implementation of the exponential tilting (EXPTILT) estimator used in this package. The exposition focuses on concepts and linear-algebraic operations used in the implementation (functions such as generate_conditional_density_matrix, generate_C_matrix, generate_Odds, and the vectorized s_function), and mirrors the statistical equations that underlie the algorithm (see Riddles et al. (2016) for the theoretical background).

The method operates on a dataset split into respondents and non-respondents. Let $n = n_0 + n_1$, where $n_1$ is the number of respondents (observed $Y$) and $n_0$ is the number of non-respondents (missing $Y$). The algorithm always treats these two groups separately: the conditional-density model is fit on respondents and used to impute support for non-respondents; the missingness model is fit using covariates that include the candidate $y$ values.

We emphasize two disjoint sets of covariates throughout and require that $\mathcal{X}_{\text{outcome}} \cap \mathcal{X}_{\text{missingness}} = \varnothing$:

  • covariates_for_outcome (denote $X_{\text{out}}$): variables used to model the conditional density $f_1(y \mid X_{\text{out}})$ on respondents;
  • covariates_for_missingness (denote $X_{\text{miss}}$): variables used in the response probability $\pi(X_{\text{miss}}, y; \phi)$ (this set includes the candidate $y$ values when evaluating the missing-data expectation).

Distinguishing these sets clearly in notation and code is crucial: the conditional density model is fit using covariates_for_outcome on respondents only, while the missingness model uses covariates_for_missingness and (importantly) takes $y$ as an additional predictor when forming $\pi(\cdot; \phi)$.

Notation and main objects

Let:

  • $n$ be the total number of units; split them into respondents (observed $Y$) and non-respondents (missing $Y$). Denote indices with $i$ for non-respondents and $k$ (or $j$) for respondent outcomes used as support.
  • $\{y_j\}_{j=1}^{n_1}$ be the vector of observed outcome values (respondents).
  • $X^{\text{out}}$ denote the matrix of covariates_for_outcome for respondents.
  • $X^{\text{un}}$ denote the matrix of covariates_for_outcome for non-respondents.
  • $X^{\text{miss,obs}}$ and $X^{\text{miss,un}}$ denote the covariates_for_missingness for respondents and non-respondents, respectively (shortened to $X_{\text{miss}}^{\text{obs}}$ and $X_{\text{miss}}^{\text{un}}$ when needed).
  • $\pi(x_{\text{miss}}, y; \phi)$ be the response probability (no explicit link notation here): for a given covariate row used in the missingness model and a candidate $y$, $\pi(\cdot; \phi)$ denotes the probability of response under parameter $\phi$.

We build three central matrices:

  1. Conditional-density matrix $F$ (denoted in code as f_matrix_nieobs):

    • Size: $n_0 \times n_1$, where $n_0$ is the number of non-respondents and $n_1$ is the number of distinct respondent $y$ values used as support.
    • Entries: $F_{ij} = f_1(y_j \mid x^{\text{un}}_i; \hat\gamma)$, where $\hat\gamma$ is the estimated parameter vector of the conditional-density model fit on respondents. This corresponds to the empirical approximation in equation (12): $\hat f_1(y_j) \propto \sum_{k:\,\delta_k=1} f_1(y_j \mid x_k; \hat\gamma)$.
  2. Column-normalizer vector $C$ (denoted in code as C_matrix_nieobs):

    • Size: $n_1 \times 1$ (a column vector).
    • Entries: the column sums of the conditional densities evaluated at respondent covariates: $C_j = C(y_j; \hat\gamma) = \sum_{k:\,\delta_k=1} f_1(y_j \mid x^{\text{obs}}_k; \hat\gamma)$. Conceptually this is the denominator that appears when fractional weights are formed (see below).
  3. Odds matrix $O(\phi)$ (constructed by generate_Odds):

    • Size: $n_0 \times n_1$.
    • Entries (for non-respondent $i$ and candidate $y_j$): $O_{ij}(\phi) = \dfrac{1 - \pi(x^{\text{miss,un}}_i, y_j; \phi)}{\pi(x^{\text{miss,un}}_i, y_j; \phi)}$. The implementation exploits the separability of the linear predictor in the parameters, $\eta_{ij} = \beta_0 + X^{\text{miss,un}}_i \beta_{\text{miss}} + \beta_y y_j$, and uses outer() to form the $n_0 \times n_1$ matrix efficiently (a minimal sketch follows this list).
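To make the separability concrete, here is a minimal sketch of the odds construction for a logistic link, where $(1 - \pi)/\pi = \exp(-\eta)$. The function and argument names (generate_odds_sketch, beta0, beta_miss, beta_y, X_miss_un, y_obs) are illustrative, not the package's internals:

```r
## Minimal sketch: odds matrix for a logistic link, (1 - pi)/pi = exp(-eta)
generate_odds_sketch <- function(beta0, beta_miss, beta_y, X_miss_un, y_obs) {
  # Row part of the linear predictor: one value per non-respondent
  eta_row <- beta0 + as.vector(X_miss_un %*% beta_miss)   # length n0
  # Column part: one value per candidate y_j
  eta_col <- beta_y * y_obs                               # length n1
  # Separability: eta_ij = eta_row[i] + eta_col[j], built with outer()
  eta <- outer(eta_row, eta_col, "+")                     # n0 x n1
  exp(-eta)                                               # O_ij = (1 - pi_ij)/pi_ij
}
```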

Using these objects we form a non-normalized weight matrix $U_{ij}(\phi) = O_{ij}(\phi)\, F_{ij}$ and then a normalized fractional-weight matrix
$$W_{ij}(\phi) = \frac{U_{ij}(\phi)}{C_j} = \frac{O_{ij}(\phi)\, f_1(y_j \mid x^{\text{un}}_i; \hat\gamma)}{\sum_{k:\,\delta_k=1} f_1(y_j \mid x^{\text{obs}}_k; \hat\gamma)}.$$
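In R the column normalization is a one-liner; a minimal sketch, assuming O_mat, F_mat, and C_vec hold $O(\phi)$, $F$, and $C$ (F is avoided as a variable name because it aliases FALSE in R):

```r
# Divide each column j of the elementwise product by the scalar C_j
W <- sweep(O_mat * F_mat, MARGIN = 2, STATS = C_vec, FUN = "/")  # n0 x n1 weights
```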

These $W_{ij}$ match the weights appearing in the theoretical mean-score approximation (equation (15) in the notes): they are the fractional contribution of each imputed $(i,j)$ pair to the conditional expectation for unit $i$.

The vectorized score S_2 and matrix algebra

Recall the mean-score approximation that leads to the estimating equation used to solve for $\phi$ (cf. equation (13) in the notes):
$$S_2(\phi; \hat\phi^{(t)}, \hat\gamma) = \sum_{r=1}^n \Big[ \delta_r\, s(\phi; \delta_r, x^{\text{miss}}_r, x^{\text{out}}_r, y_r) + (1 - \delta_r)\, \widetilde{E}_0\big\{ s(\phi; \delta, x^{\text{miss}}_r, x^{\text{out}}_r, Y) \mid x^{\text{out}}_r; \hat\phi^{(t)}, \hat\gamma \big\} \Big] = 0.$$

It is convenient to decompose this as the sum of observed and unobserved contributions:
$$S_2(\phi) = S_{\text{obs}}(\phi) + S_{\text{un}}(\phi),$$
where
$$S_{\text{obs}}(\phi) = \sum_{r:\,\delta_r=1} s(\phi; \delta=1, x^{\text{miss}}_r, x^{\text{out}}_r, y_r)$$
is the score contribution from respondents, and the missing-unit contribution is approximated by the discrete-support expectation
$$S_{\text{un}}(\phi) \approx \sum_{i:\,\delta_i=0} \sum_{j=1}^{n_1} W_{ij}(\phi)\, s(\phi; \delta=0, x^{\text{miss}}_i, x^{\text{out}}_i, y_j).$$
The remainder of this section explains how the double sum in $S_{\text{un}}(\phi)$ is computed via matrix operations.

The crucial observation for vectorization is that the inner conditional expectation for a non-respondent $i$ is approximated by a weighted finite sum over the respondent support $\{y_j\}$:
$$\widetilde{E}_0\big\{ s(\phi; \delta, x^{\text{miss}}_i, x^{\text{out}}_i, Y) \mid x^{\text{out}}_i \big\} \approx \sum_{j=1}^{n_1} W_{ij}(\phi)\, s(\phi; \delta=0, x^{\text{miss}}_i, x^{\text{out}}_i, y_j).$$

Stacking the non-respondent expectations across all non-respondents gives a single matrix operation:

  • For each pair $(i,j)$ evaluate the vector-valued score $s$ at $(\delta=0, x^{\text{miss}}_i, x^{\text{out}}_i, y_j)$. Collect these efficiently by exploiting the algebraic factorization of the score (see the s_function implementation): many parameter-specific components are separable in $i$ and $j$, which allows creation of low-memory representations.

  • Denote by $S^{(0)}_{ij}$ the $p$-dimensional score vector at $(i,j)$. Organize these so that for each parameter index $m$ we have an $n_0 \times n_1$ matrix of values $[S^{(0)}]_m$ (parameter-wise maps). Then the non-respondent contribution to the overall score vector is
$$\sum_{i=1}^{n_0} \sum_{j=1}^{n_1} W_{ij}(\phi)\, S^{(0)}_{ij} = \Big[\, \mathrm{vec}\big( W \circ [S^{(0)}]_1 \big),\ \ldots,\ \mathrm{vec}\big( W \circ [S^{(0)}]_p \big) \,\Big]^\top,$$
where $\circ$ denotes elementwise multiplication and vec followed by an appropriate collapse (row-sum or column-sum) implements the inner summation, depending on the parameter's factorization. In concrete computational terms:

  • For parameter components that multiply per-row (i.e. depend only on $x_i$ times a factor that is a function of $y_j$) we compute elementwise products between $W$ and a factor matrix, row-sum across $j$ to get an $n_0 \times 1$ contribution per non-respondent, and then sum across $i$.

  • For intercept-like or column-wise components, a column-sum followed by weighted multiplication suffices.

In the implementation this reduces to a sequence of dense matrix operations and row/column sums rather than explicit loops over the expanded index set of size $n_0 \times n_1$. This yields large speed and memory benefits for real datasets.
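As an illustration of these reductions, consider one parameter component $m$. A minimal sketch under assumed names (S0_m for the matrix $[S^{(0)}]_m$; a_vec, b_vec for a hypothetical row/column factorization):

```r
# Full reduction: sum_i sum_j W_ij * (S0_m)_ij collapses to one elementwise product
s_un_m <- sum(W * S0_m)

# If the component factorizes as a_i * b_j, the matrix S0_m is never formed:
# sum_ij W_ij * a_i * b_j = a' (W b), a single matrix-vector product
s_un_m_sep <- as.numeric(crossprod(a_vec, W %*% b_vec))
```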

Concise vectorized S_2 recipe (conceptual):

  1. Build $F$ (size $n_0 \times n_1$) via generate_conditional_density_matrix(model).
  2. Build $C$ (length $n_1$) via generate_C_matrix(model) by summing the conditional densities over respondents.
  3. For a candidate $\phi$ compute $O(\phi)$ (size $n_0 \times n_1$) via generate_Odds(model, $\phi$).
  4. Form $W(\phi) = \dfrac{O(\phi) \circ F}{\mathbf{1}_{n_0} C^\top}$ (i.e. divide each column of $O \circ F$ by the corresponding scalar $C_j$).
  5. Compute the observed-score sum $S_{\text{obs}}(\phi) = \sum_{r:\,\delta_r=1} s(\phi; \delta=1, x^{\text{miss}}_r, x^{\text{out}}_r, y_r)$ (this is small: one score vector per respondent).
  6. Compute the non-respondent expected score $S_{\text{un}}(\phi) = \sum_{i=1}^{n_0} \sum_{j=1}^{n_1} W_{ij}(\phi)\, s(\phi; \delta=0, x^{\text{miss}}_i, x^{\text{out}}_i, y_j)$, implemented via matrix multiplications and row/column sums.
  7. Return the total score $S_2(\phi) = S_{\text{obs}}(\phi) + S_{\text{un}}(\phi)$.

The root-finder then searches for $\phi$ such that $S_2(\phi) = 0$. In the package this is implemented by repeatedly forming $O(\phi)$ for the current candidate $\phi$, computing $W(\phi)$ and $S_2(\phi)$, and letting nleqslv perform the iteration.
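Putting the recipe together, a conceptual sketch of the score function (not the package's exact code). Here generate_odds_sketch is the illustrative helper from above, and s_obs_fun and s_factors_fun are assumed callbacks returning, at the current $\phi$, the summed respondent score vector and the list of parameter-wise $n_0 \times n_1$ score matrices:

```r
s2_score <- function(phi, F_mat, C_vec, X_miss_un, y_obs, s_obs_fun, s_factors_fun) {
  p <- length(phi)
  O_mat <- generate_odds_sketch(phi[1], phi[2:(p - 1)], phi[p],
                                X_miss_un, y_obs)                       # step 3
  W <- sweep(O_mat * F_mat, 2, C_vec, "/")                              # step 4
  S0 <- s_factors_fun(phi)            # list of p matrices [S^(0)]_m
  s_un <- vapply(S0, function(S0_m) sum(W * S0_m), numeric(1))          # step 6
  s_obs_fun(phi) + s_un                                                 # steps 5 + 7
}

# Root-finding: nleqslv() forwards the extra named arguments to s2_score
# sol <- nleqslv::nleqslv(phi_start, s2_score, F_mat = F_mat, C_vec = C_vec,
#                         X_miss_un = X_miss_un, y_obs = y_obs,
#                         s_obs_fun = s_obs_fun, s_factors_fun = s_factors_fun)
```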

Mean estimation and survey weights

After the response-model parameters $\hat\phi$ are obtained the package reports a mean estimate of the target outcome using an inverse-probability reweighting form. Let respondents be indexed by $j = 1, \dots, n_1$; denote by $\pi_j = \pi(x^{\text{miss}}_j, y_j; \hat\phi)$ the fitted response probabilities evaluated at each respondent (here we write $x^{\text{miss}}_j$ for the covariates used in the missingness model evaluated at respondent $j$). With design weights $w_j$ (by default $w_j \equiv 1$ for non-survey data) the point estimator computed in the implementation is
$$\hat\mu = \frac{\sum_{j=1}^{n_1} w_j\, y_j / \pi_j}{\sum_{j=1}^{n_1} w_j / \pi_j}. \tag{mean-est}$$
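In code this ratio is two lines; a minimal sketch with respondent-level vectors y_obs, pi_hat, and w (names illustrative):

```r
w      <- rep(1, length(y_obs))                      # design weights, default 1
mu_hat <- sum(w * y_obs / pi_hat) / sum(w / pi_hat)  # (mean-est)
```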

Notes on survey weights and $f_1$ fitting:

  • The conditional-density fit used to build $F$ can incorporate respondent sampling weights when provided (the implementation passes respondent weights to the density-fitting routine). Thus $f_1(\cdot \mid x; \gamma)$ may be a weighted fit when design_weights or a survey design is supplied.
  • The mean estimate (mean-est) also uses design weights $w_j$ in both numerator and denominator as shown above. In code these are provided via model$design_weights and default to 1 for non-survey use.

In short: survey weights enter in two places: (i) the conditional-density estimation (if requested) and (ii) the final mean calculation through the weighted IPW-type ratio in (mean-est).

Arguments passed to exptilt (summary)

The exptilt / exptilt.data.frame user-facing function accepts a number of arguments that control model specification and computation; the most important are:

  • data: a data.frame with outcome and covariates.
  • formula: a partitioned formula of the form y ~ aux1 + aux2 | miss1 + miss2 where the left part (before |) lists covariates_for_outcome and the right part lists covariates_for_missingness (the package helper splits these automatically).
  • auxiliary_means: (optional) population or target means for auxiliary variables used for scaling.
  • standardize: logical, whether to standardize features before fitting.
  • prob_model_type: character, either "logit" or "probit" to select the response probability family.
  • y_dens: choice of conditional-density family; "auto", "normal", "lognormal", or "exponential".
  • variance_method: "delta" or "bootstrap" for variance estimation.
  • bootstrap_reps: number of bootstrap replications when bootstrap is used.
  • control: list of control parameters forwarded to the nonlinear solver (nleqslv).
  • stopping_threshold: numeric; the sup-norm threshold for early stopping of the score (see the stopping criterion below).
  • on_failure: behavior on failure ("return" or "error").
  • supress_warnings: logical to silence certain warnings.
  • design_weights: optional vector of respondent design weights (or full-sample weights which are subset internally).
  • survey_design: optional survey.design object; when provided some internal logic uses the survey path.
  • trace_level: integer controlling verbosity.
  • sample_size: integer for stratified subsampling used when data are large.
  • outcome_label, user_formula: utility arguments used for presentation and bookkeeping.

These arguments appear in the exptilt.data.frame function signature and control how the matrices $F$, $C$, and $O$ are built and how the solver is run.
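For orientation, a hypothetical call wiring these arguments together (the data and variable names are invented for exposition; only the argument names and the partitioned-formula syntax come from the summary above, and the two covariate sets are kept disjoint as required):

```r
fit <- exptilt(
  data            = df,
  formula         = income ~ age + educ | region + urban,  # outcome | missingness
  prob_model_type = "logit",
  y_dens          = "normal",
  variance_method = "bootstrap",
  bootstrap_reps  = 500
)
```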

Connection to the EM viewpoint

The EM-like update displayed in the theoretical notes (equation (14)),
$$\hat\phi^{(t+1)} \leftarrow \text{solve } S_2(\phi \mid \hat\phi^{(t)}, \hat\gamma) = 0,$$
is exactly what the implementation achieves: for a fixed conditional-density estimate $\hat\gamma$ and a current $\hat\phi^{(t)}$, the expectations for the missing units are approximated by the discrete support on observed $y_j$ and the resulting equation for $\phi$ is solved (via root-finding). The heavy lifting is performed by the matrix calculus described earlier: constructing $F$, $C$, $O$ and then computing $W$ and multiplying by the parameter-wise score factors.

Stopping criterion (maximum-norm)

Practical optimization requires a convergence criterion. The implementation uses the maximum absolute component of the vector-valued score as a stopping rule. Concretely, if the solver produces a score vector $S_2(\phi)$ with $\max_{m=1,\dots,p} |S_{2,m}(\phi)| < \varepsilon$ for a user-specified stopping_threshold $\varepsilon$, the algorithm treats this as converged. In the code this is used as an early exit inside the target function passed to the nonlinear solver: when the score's sup-norm is below the threshold, a zero vector is returned to signal convergence and avoid further unnecessary computation.

This choice matches the intuition that the root-finder should stop when the estimating equations are (componentwise) negligibly small.
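A sketch of how such an early exit can be wired into the target function handed to the solver (conceptual; s2_score is the illustrative function from earlier):

```r
target <- function(phi, stopping_threshold, ...) {
  score <- s2_score(phi, ...)
  if (max(abs(score)) < stopping_threshold) {
    return(rep(0, length(score)))  # sup-norm negligible: signal convergence
  }
  score
}
```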

Practical tutorial: from raw data to matrix operations (conceptual steps)

  1. Fit the conditional density $f_1(y \mid x; \gamma)$ using respondents and covariates_for_outcome. This gives a function that evaluates $f_1(y \mid x)$ for any $y$ and any covariate row $x$ (used for both respondents and non-respondents).

  2. Evaluate the conditional density at the Cartesian product of non-respondent covariates and observed respondent $y$ values to form $F$ (done by generate_conditional_density_matrix). This is the empirical imputation support: think of rows as target non-respondents and columns as candidate respondent outcomes.

  3. Evaluate the same conditional density at respondent covariates to form the column-normalizer $C$ (done by generate_C_matrix): this is simply the column sum of densities over respondents. A minimal sketch of steps 1–3 appears after this list.

  4. For each trial value $\phi$ of the response-model parameters, compute the odds matrix $O(\phi)$ using the separable linear predictor and link function (implemented efficiently via outer() in code). Combine $O$ with $F$ and normalize columns by $C$ to obtain $W(\phi)$.

  5. Use the vectorized s_function to obtain parameter-specific factor matrices for the non-respondent-imputed scores; multiply (elementwise) by $W(\phi)$ and reduce (row/column sums) to compute the non-respondent contribution to $S_2(\phi)$.

  6. Add the observed-respondent score and use a root-finder (e.g. nleqslv) to find $\phi$ with $S_2(\phi) = 0$. The solver may use the maximum-norm stopping threshold described above to exit early.
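The promised sketch of steps 1–3, for a normal conditional density fit by least squares on respondents. This is illustrative only: resp and nonresp are assumed data frames of respondents and non-respondents, covariates_for_outcome an assumed character vector of column names, and the package's actual fitting routine may differ:

```r
# Step 1: fit f1(y | x) on respondents as a normal linear model
fit   <- lm(y ~ ., data = resp[, c("y", covariates_for_outcome)])
sigma <- summary(fit)$sigma

# Step 2: F, rows = non-respondents, columns = observed y values (support)
mu_un <- predict(fit, newdata = nonresp)                   # length n0
F_mat <- outer(mu_un, resp$y,
               function(m, y) dnorm(y, mean = m, sd = sigma))

# Step 3: C, column sums of the same densities at respondent covariates
mu_ob <- predict(fit, newdata = resp)                      # length n1
C_vec <- colSums(outer(mu_ob, resp$y,
                       function(m, y) dnorm(y, mean = m, sd = sigma)))
```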

Why the vectorization matters (practical remarks)

  • Memory: the naive expansion to an explicit dataset of size $n_0 \times n_1$ would store duplicated covariate rows and blow up memory. The implementation exploits separability (intercept $+\, X^{\text{miss}}_i \beta_{\text{miss}} + \beta_y y_j$) and vectorized R primitives (outer, matrix multiplications, column/row sums) to avoid large temporary allocations.
  • Speed: elementwise operations on dense matrices plus BLAS-accelerated matrix multiplication are much faster than interpreted loops in R for typical dataset sizes.
  • Clarity: organizing the logic as the three matrices $F$, $C$, and $O$, followed by elementwise combination and reductions, makes the relationship between the statistical approximation and the implementation transparent and easier to reason about.

Closing notes and references

This vignette mapped the implementation functions to the math in the theoretical notes and showed how the EM-like mean-score step reduces to a small set of matrix operations. The implementation follows the ideas described in Riddles et al. (2016) for exponential tilting under NMAR: fit the conditional density on respondents, approximate the missing-data expectation by a finite sum over observed $y$ values, and solve the resulting estimating equations for the missingness-model parameters.