1.3 Estimation Using Multiple Proxy Variables
Within the context of combining register and survey data, we consider here multisource estimation methods that make use of two or more proxy variables. Deficiencies in coverage, relevance, or timeliness are often the reason that register-based estimation is not viable. When the lack of coverage can be limited to specific domains or variables, the problem can be remedied by the collection of supplementary survey data using the split-population or split-data approach. Because the data then supplement each other, there is only one value for each variable of interest. Different multisource estimation approaches are needed when there are multiple proxy variables.
We shall classify the various scenarios using two conditions, summarized in Table 1.1: (i) whether one treats one of the proxy variables as the target measure and the others as associated with relevance bias – to be referred to as the asymmetric setting; the setting is symmetric otherwise, where either none of the proxy variables is considered to be the ideal measure, or all are correct measures which nevertheless do not have perfect population coverage; (ii) whether it is necessary to have linked data at the individual or cell level – to be referred to as the linked setting; the setting is unlinked otherwise. Each of the approaches listed in Table 1.1 covers a variety of methods with an extensive body of literature. The following elaboration aims merely to provide a brief accessible overview, and the references given serve only as points of departure for further exploration.
Table 1.1 Indirect estimation using register and survey proxy variables.
Linked data | One target measure and relevance bias in the others? | |
---|---|---|
 | Yes (asymmetric) | No (symmetric) |
Yes (linked) | Survey weighting; prediction modeling | Capture–recapture methods; structural equation modeling |
No (unlinked) | Benchmark adjustment | Constrained optimization |
1.3.1 Asymmetric Setting
The two most common approaches under the asymmetric-linked setting are survey weighting and prediction modeling, where the register proxy variable is used as an auxiliary variable or a covariate. See e.g. Särndal, Swensson, and Wretman (1992) for the design-based approach to survey weighting that makes use of auxiliary variables; Valliant, Dorfman, and Royall (2000) and Chambers and Clark (2012) for the model-based approach to finite population prediction; and Rao and Molina (2015) for relevant methods of small area estimation. We make two observations. Firstly, when the overlapping survey variable is deemed necessary despite the presence of a register proxy, the latter is typically the most powerful among all the auxiliary variables when it comes to weighting adjustment and regression modeling. See e.g. Djerf (1997) and Thomsen and Zhang (2001) for the use of register economic activity status in the LFS, and the effects on reducing sampling and nonresponse errors. Secondly, applications to remedy Representation errors are much less common. See, however, survey weighting under dependent sampling for the estimation of coverage errors (Nirel and Glickman 2009), mixed-effects models for assessing register coverage errors (Mancini and Toti 2014), and different misclassification models for register NACE (Van Delden et al. 2016) and register household (Zhang 2011).
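As a minimal sketch of how a register proxy can serve as an auxiliary variable in survey weighting, the following Python code computes a regression-type (GREG) estimate of a population total under the asymmetric-linked setting. It is an illustration under stated assumptions rather than the method of any reference cited above: the function name `greg_total`, its arguments, and the toy data are all hypothetical, and a single auxiliary variable with an intercept is assumed.

```python
import numpy as np

def greg_total(y_s, x_s, d_s, x_total, pop_size):
    """Regression (GREG-type) estimate of the population total of y.

    y_s      : survey target values for the sampled units
    x_s      : register proxy values for the same (linked) units
    d_s      : design weights of the sampled units
    x_total  : register total of the proxy over the whole population
    pop_size : population size N
    All names are illustrative assumptions, not taken from the text.
    """
    y_s = np.asarray(y_s, dtype=float)
    x_s = np.asarray(x_s, dtype=float)
    d_s = np.asarray(d_s, dtype=float)
    # Auxiliary vector (1, x): intercept plus the register proxy.
    A = np.column_stack([np.ones_like(x_s), x_s])
    # Design-weighted least squares coefficients.
    beta = np.linalg.solve(A.T @ (d_s[:, None] * A), A.T @ (d_s * y_s))
    # Horvitz-Thompson estimates of the y total and of the auxiliary totals.
    ht_y = np.sum(d_s * y_s)
    ht_aux = A.T @ d_s
    known_aux = np.array([pop_size, x_total], dtype=float)
    # Correct the HT estimate by the known-minus-estimated auxiliary totals.
    return ht_y + (known_aux - ht_aux) @ beta

# Hypothetical data: 6 sampled units out of N = 60 (equal design weights of 10).
y_sample = [1, 1, 0, 1, 0, 0]   # survey employment status (target measure)
x_sample = [1, 1, 0, 0, 0, 0]   # register employment status (proxy)
weights = [10.0] * 6
estimate = greg_total(y_sample, x_sample, weights, x_total=25, pop_size=60)
print(estimate)
```

The stronger the association between the register proxy and the survey measure, the smaller the residuals and hence the sampling variance of such an estimator, which is one way to read the first observation above.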
The nature of a proxy variable implies a special use that is beyond what is feasible with a non-proxy auxiliary variable, no matter how good an auxiliary it is: under suitable conditions, it is possible to substitute the proxy value for the target measure. However, substitution would only be acceptable for a subset of the units but not all since, had it been acceptable for all the units, one would have had “direct tabulation” instead.
It follows that adjustment, or imputation in the case of a rejected value, will be necessary. Macro-level survey estimates can be imposed as benchmarks to achieve statistical relevance at the corresponding level. Linked datasets are typically not necessary here – recall the Norwegian register-based employment status described earlier. This yields many methods under what may be referred to as the benchmarked adjustment approach for combining register and survey proxy variables under the asymmetric-unlinked setting.
Repeated weighting and constrained (mass) imputation are two common approaches of benchmarked adjustment; see e.g. de Waal (2016) for a discussion. Repeated weighting is a technique initially presented for sample reweighting in the presence of overlapping survey estimates (Renssen and Nieuwenbroek 1997). It has been used for the reconciliation of Dutch virtual census output tables (Houbiers 2004). But it can equally be applied to adjust register datasets so that afterward, e.g. the weighted register proxy total agrees with the valid target totals imposed. This does not require linking the register datasets and the external datasets from which the benchmark totals are obtained. An inconvenience arises in cases where there are multiple proxy variables to be benchmarked and the variables are available for different subsets of units. This may be the case due to partially missing data in a single register file or when merging multiple register files. Some imputation will then be necessary if one would like to have a single set of weights for the whole dataset.
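The idea of weighting a register dataset so that its proxy totals agree with imposed benchmarks can be sketched as follows. This is a deliberately simple, single-step ratio adjustment by domain, not the repeated weighting procedure of Renssen and Nieuwenbroek (1997); the function name `benchmark_weights`, the domain labels, and the benchmark values are hypothetical. Note that only macro-level benchmark totals enter the computation, so no record linkage is involved.

```python
import numpy as np

def benchmark_weights(proxy, domain, benchmarks):
    """Return unit weights such that the weighted register proxy total in each
    domain equals the benchmark total imposed for that domain.

    proxy      : register proxy values, one per register unit
    domain     : domain label of each register unit
    benchmarks : dict mapping domain label -> benchmark total (e.g. a survey estimate)
    All names are illustrative assumptions, not taken from the text.
    """
    proxy = np.asarray(proxy, dtype=float)
    domain = np.asarray(domain)
    weights = np.ones(len(proxy))  # units in domains without a benchmark keep weight 1
    for label, target in benchmarks.items():
        mask = domain == label
        register_total = proxy[mask].sum()
        if register_total == 0:
            raise ValueError(f"register proxy total is zero in domain {label!r}")
        # Ratio adjustment: scale all units in the domain by the same factor.
        weights[mask] = target / register_total
    return weights

# Hypothetical register with a binary proxy and two domains.
proxy = [1, 1, 0, 1, 1, 0]
domain = ["a", "a", "a", "b", "b", "b"]
benchmarks = {"a": 3.0, "b": 1.0}  # assumed macro-level survey estimates
w = benchmark_weights(proxy, domain, benchmarks)
weighted_totals = {
    d: float(np.sum(w[np.asarray(domain) == d] * np.asarray(proxy)[np.asarray(domain) == d]))
    for d in benchmarks
}
```

With several proxy variables observed on different subsets of units, one such adjustment per variable would generally yield different weights for the same unit, which illustrates the inconvenience noted above.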
The one-number census imputation provides an example of the alternative imputation-based benchmarked adjustment methods (Brown et al. 1999). In the case of multiple proxy variables observed on different subsets of units, imputation is applied not only to the units with partially missing data, but also to the units with no observed variables at all, or possibly the units with completely observed data. The result is a complete dataset that guarantees numerical consistency for any tabulation across the variables and population domains. Constrained imputation for population datasets is discussed, e.g., by Shlomo, de Waal, and Pannekoek (2009) and Zhang (2009a). Methods that incorporate micro-data edit constraints are studied, e.g., in Coutinho, de Waal, and Shlomo (2013), Pannekoek, Shlomo, and de Waal (2013), and Pannekoek and Zhang (2015). Chambers and Ren (2004) consider a method of benchmarked outlier robust imputation. Obviously, it may be difficult to generate a single population dataset that is fit for all possible statistical uses. de Waal (2016) discusses the use of “repeated imputation.” Notice that there are many relevant works on the generation of benchmarked synthetic populations in Spatial Demography, Econometrics, and Sociology.
The distinction between weighting and imputation can be somewhat blurred when it comes to the adjustment of cross-classified proxy contingency tables, because an adjusted cell count is just the number of individuals with the corresponding cross-classification one would have in an imputed dataset. Take, e.g. a two-way table, where the rows represent population domains at some detailed level, say, by local area and sex-age group, and the columns a composition of interest, say, income class. Let X denote the table based on combining population and tax register data. Let Y(r) denote the known vector of population domain sizes, and let Ŷ(c) denote the survey-based estimates of population totals by income class, which are the row and column benchmarks of the target table Y, respectively. Starting with X and by means of iterative proportional fitting (IPF) until convergence, one