---
title: "Choice-based sampling and WESML weights"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Choice-based sampling and WESML weights}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(digits = 4)
```

Choice data are often sampled by outcome. A transport researcher running an
on-site survey interviews travellers at the terminal of the mode they actually
chose; a hospital-choice study may oversample patients of rare hospitals; a
marketing team may recruit equal numbers of buyers of each brand. In each case
the unit is drawn *conditional on the alternative it chose*, so the sample
choice shares are not the population choice shares. Treating such a sample as
random changes the likelihood target and, in general, biases the estimates.

Weighted exogenous sample maximum likelihood (WESML) fixes that sampling
problem; it does not fix every econometric problem. The weighted likelihood
still relies on the maintained utility specification and on whatever exogeneity
assumptions justify interpreting the covariates, especially prices, as demand
shifters rather than equilibrium outcomes.

Manski and Lerman's (1977) WESML correction weights each choice situation by

$$
w_i = \frac{Q_{j(i)}}{H_{j(i)}},
$$

where $j(i)$ is the alternative chosen in situation $i$, $Q_j$ is the
population share choosing alternative $j$, and $H_j$ is the corresponding
sample share. Maximizing the weighted log-likelihood $\sum_i w_i \log P_i$
yields consistent estimates of the population parameters.

choicer provides two helpers:

- `sample_by_choice()` draws a choice-based sample from a population frame and
  attaches WESML weights.
- `wesml_weights()` computes the same weights when you already have a sample
  and know the population shares `Q`.

Both helpers normalize the weights to mean 1 by default.
Normalization, and indeed any rescaling of the weights by a common factor,
leaves the point estimates and the robust (sandwich) variance unchanged, so the
attached `.wesml_weight` need not equal $Q/H$ literally; only the *relative*
weights across strata matter.

```{r setup}
library(choicer)
library(data.table)
set_num_threads(2)
```

## Build a population

For exposition, start from a simulated population in which tastes are
heterogeneous (random coefficients on `w1` and `w2`), so a mixed logit is the
natural estimator. We turn off the outside option and fix the choice set so
that every situation has exactly one chosen alternative and the strata are
clean. In empirical work the population shares `Q` usually come from
administrative totals, market shares, or survey weights external to the
choice-based estimation sample.

```{r population}
sim <- simulate_mxl_data(
  N = 3000, J = 4,
  Sigma = diag(c(1.0, 1.5)),   # two uncorrelated random coefficients
  seed = 11,
  outside_option = FALSE,
  vary_choice_set = FALSE
)
pop <- as.data.table(sim$data)

Q <- prop.table(table(pop[choice == 1, alt]))
round(Q, 3)
```

## Draw a choice-based sample

Now sample the same number of choice situations from each chosen alternative.
Sampling keeps whole choice situations together: if an id is sampled, all of
its alternative rows are retained.

```{r sample}
cb <- sample_by_choice(
  pop,
  id_col = "id",
  alt_col = "alt",
  choice_col = "choice",
  n_per_alt = 300L,
  seed = 12L
)

strata <- sort(names(attr(cb, "Q")))
rbind(
  population = attr(cb, "Q")[strata],
  sample = attr(cb, "H")[strata]
) |> round(3)

cb[choice == 1, .(id, chosen_alt = alt, .wesml_weight)][1:8]
```

The sample choice shares are deliberately equalized, but the attached weights
restore the population shares in the weighted likelihood. The weight is
constant within an id and repeated across that id's alternative rows, which is
exactly the row-level layout `run_mxlogit()` expects through `weights_col`.

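
As a minimal sketch of the arithmetic behind the attached weights, the chunk
below computes $Q_{j(i)}/H_{j(i)}$ by hand in base R and checks that the
weighted sample shares return to the population shares, with or without a
common rescaling. The three-alternative shares and every `_demo` name are
made up for illustration; this shows the formula only and is not a description
of how `sample_by_choice()` is implemented.

```{r weights-by-hand}
# Hypothetical inputs, made up for illustration (not taken from `pop` or `cb`).
Q_demo <- c(a = 0.6, b = 0.3, c = 0.1)           # population shares
chosen_demo <- rep(c("a", "b", "c"), each = 4)   # chosen alternative, equalized sample
H_demo <- prop.table(table(chosen_demo))         # sample shares, 1/3 each

# w_i = Q_j(i) / H_j(i): look up both shares by the chosen alternative.
w_demo <- as.numeric(Q_demo[chosen_demo]) / as.numeric(H_demo[chosen_demo])

# Weighted share of each chosen alternative; a common factor in w cancels out.
weighted_share <- function(w) {
  totals <- vapply(split(w, chosen_demo), sum, numeric(1))
  totals / sum(w)
}

weighted_demo <- weighted_share(w_demo)
rescaled_demo <- weighted_share(10 * w_demo)
rbind(population = Q_demo, weighted = weighted_demo, rescaled = rescaled_demo)
```

The three rows agree, which is the sense in which only the relative weights
across strata matter.
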

## Weighted estimation and inference

We fit two mixed logits on the choice-based sample: an ordinary (unweighted)
fit that ignores the sampling design, and a WESML fit that passes the weight
column and requests the robust sandwich covariance. Passing `weights_col` by
name keeps the estimation target visible in the script, which is the
recommended style even when the data already carry a `choice_sampling`
attribute from `sample_by_choice()`.

```{r fit}
common <- list(
  data = cb,
  id_col = "id",
  alt_col = "alt",
  choice_col = "choice",
  covariate_cols = c("x1", "x2"),    # fixed coefficients
  random_var_cols = c("w1", "w2"),   # random coefficients
  S = 100L,
  draws = "generate",
  seed = 7L,
  scale_vars = "sd"
)

fit_unweighted <- do.call(run_mxlogit, c(common, list(se_method = "bhhh")))
fit_wesml <- do.call(run_mxlogit, c(common, list(
  weights_col = ".wesml_weight",
  se_method = "sandwich"
)))
```

> **Tip.** As in the [mixed logit vignette](mxl.html), raise the number of draws
> `S` until the estimates are stable and warm-start a stubborn solver with
> `theta_init`. `S = 100` here keeps the package build quick.

The unweighted estimator treats the equalized sample shares as if they were the
population shares; WESML reweights the sampled situations back to the
population. With alternative-specific constants in the model the correction is
most visible in the constants and, through them, in the fitted shares:

```{r coef}
round(cbind(
  unweighted = coef(fit_unweighted),
  wesml = coef(fit_wesml)
), 3)
```

```{r shares}
share_compare <- rbind(
  population = as.numeric(Q),
  wesml = drop(predict(fit_wesml, type = "shares")),
  unweighted = drop(predict(fit_unweighted, type = "shares"))
)
colnames(share_compare) <- names(Q)
round(share_compare, 3)
```

The WESML fit reproduces the population shares `Q`, while the unweighted fit
reproduces the equalized *sample* shares, a direct picture of the bias the
correction removes.
In a single finite sample the WESML estimates need not be closer to the truth,
parameter by parameter, but they target the population likelihood under the
choice-based sampling design.

For inference, the point of `se_method = "sandwich"` is that under non-uniform
weights the inverse weighted Hessian and the ordinary BHHH variance are *not*
valid covariance estimators. The sandwich uses the weighted Hessian as bread,
$A = \sum_i w_i(-H_i)$, where $H_i$ here is the Hessian of situation $i$'s
log-likelihood contribution rather than a sample share, and the weight-squared
outer product of the per-situation scores as meat,
$B = \sum_i w_i^2 s_i s_i'$, giving $V = A^{-1} B A^{-1}$. Because $A$ scales
linearly and $B$ quadratically in the weights, $V$ is invariant to any common
rescaling of them, consistent with the mean-1 normalization above.

```{r se}
summary(fit_wesml)
```

The same robust variance is available post hoc on any fitted mixed logit via
`wesml_vcov()`, so you can obtain choice-based-sampling standard errors even
from a fit estimated with `se_method = "hessian"` without refitting.

> **A note on the multinomial logit.** choicer implements WESML weighting and
> the robust sandwich for the *mixed* logit (which nests the plain logit as the
> degenerate, zero-variance case). For the plain multinomial logit there is a
> classical and convenient result (Manski and Lerman, 1977): when the model
> includes a full set of alternative-specific constants, choice-based sampling
> leaves the slope coefficients consistently estimated *even without weighting*;
> only the ASCs are inconsistent. Each constant is shifted by
> $\ln\!\big(H_j / Q_j\big)$ and can be corrected by subtracting that term. So
> for an MNL with ASCs the substantive marginal-utility parameters are unaffected
> by the sampling scheme; only the constants (and the predicted shares they
> drive) need correcting.

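
The constant correction in the note can be sketched in base R for the simplest
case, a constants-only multinomial logit, where the unweighted fit reproduces
the sample shares exactly. The shares and the `_asc` names below are
hypothetical, the fitted constants are written down analytically rather than
estimated, and alternative `a` is assumed to be the reference whose constant is
normalized to zero, so the correction is re-normalized after subtracting
$\ln(H_j / Q_j)$.

```{r asc-correction}
# Hypothetical shares, made up for illustration.
Q_asc <- c(a = 0.6, b = 0.3, c = 0.1)   # population shares
H_asc <- c(a = 1/3, b = 1/3, c = 1/3)   # equalized sample shares

# In a constants-only MNL the unweighted ML fit reproduces the sample shares,
# so with "a" as the reference the fitted constants are log(H_j / H_a).
asc_unweighted <- log(H_asc / H_asc["a"])

# Subtract log(H_j / Q_j), then re-impose the zero constant on the reference.
asc_corrected <- asc_unweighted - log(H_asc / Q_asc)
asc_corrected <- asc_corrected - asc_corrected["a"]

# Shares implied by the corrected constants.
shares_corrected <- exp(asc_corrected) / sum(exp(asc_corrected))
round(rbind(population = Q_asc, corrected = shares_corrected), 3)
```

With covariates in the model the same subtraction applies to the estimated
constants, while the slopes are left as estimated.
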

## Starting from an existing sample

When the choice-based sample already exists, provide the population shares `Q`
directly:

```{r attach}
cb2 <- copy(cb)
cb2[, .wesml_weight := NULL]

cb2 <- wesml_weights(
  cb2,
  id_col = "id",
  alt_col = "alt",
  choice_col = "choice",
  Q = attr(cb, "Q"),
  attach = TRUE
)

attr(cb2, "choice_sampling")
```

The names of `Q` must match the chosen-alternative strata exactly after
coercion to character. This strict matching is intentional: silently dropping a
realized stratum would change the target population.

## References

Manski, C. F. and Lerman, S. R. (1977). The estimation of choice probabilities
from choice based samples. *Econometrica*, 45(8), 1977-1988.

Train, K. E. (2009). *Discrete Choice Methods with Simulation* (2nd ed.).
Cambridge University Press, Section 3.7.