Cross-sectional Unbiasedness of the Distorted Data
The distribution of the infused noise is symmetric, and allocation of the noise factors is random. The data distribution resulting from the noise infusion should thus be unbiased. We compute the bias ΔX in each cell kt, expressed in percentage terms:
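Assuming X*_kt denotes the distorted value and X_kt the undistorted value in cell kt (notation introduced here for illustration only), a percentage bias of this kind can be written as:

```latex
\Delta X_{kt} = \frac{X^{*}_{kt} - X_{kt}}{X_{kt}} \times 100
```

Under this reading, a positive ΔX means the distorted value overstates the undistorted one, and an unbiased mechanism yields a distribution of ΔX centered on zero.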
Table 2.1 Distribution of errors Δr in first-order serial correlation, QWI.
Variable | Median | Semi-interquartile range |
---|---|---|
Accessions | −0.000 542 | 0.026 314 |
Beginning-of-quarter employment | 0.000 230 | 0.021 775 |
Full-quarter employment | 0.000 279 | 0.018 830 |
Net job flows | −0.000 025 | 0.002 288 |
Separations | 0.000 797 | 0.025 539 |
Evidence of unbiasedness is provided by Figure 2.2, which shows the distribution of the bias for X = B (beginning-of-quarter employment).11 The distribution of ΔB has most of its mass around the mode at 0%. As expected, secondary spikes are also present around ±c, the inner bound of the noise distribution.
Box 2.2 Sidebox: Do-It-Yourself Noise Infusion
The interested user might consult a simple example (with fake data) at https://github.com/labordynamicsinstitute/rampnoise (Vilhuber 2017) that illustrates this mechanism.
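The mechanism can also be sketched in a few lines of Python. This is an independent, minimal illustration and not the code from the repository above, nor the production method of any statistical agency. It assumes a symmetric "ramp" noise distribution whose density peaks at an inner bound c and falls linearly to zero at an outer bound d. The values c = 0.10 and d = 0.25, the function names, and the fake counts are all hypothetical, and each value is treated as if it came from a single establishment.

```python
import random
import statistics


def draw_noise_factor(rng, c, d):
    """Draw one multiplicative factor from a symmetric ramp distribution.

    The distortion u lies in [c, d], with density falling linearly from its
    peak at the inner bound c to zero at the outer bound d. The factor is
    1 + u or 1 - u with equal probability, so it never falls in (1 - c, 1 + c).
    """
    # Inverse-CDF draw: F(u) = 1 - ((d - u) / (d - c))**2 on [c, d].
    u = d - (d - c) * (rng.random() ** 0.5)
    sign = 1 if rng.random() < 0.5 else -1
    return 1.0 + sign * u


def infuse_noise(values, c=0.10, d=0.25, seed=12345):
    """Multiply each value by its own randomly allocated noise factor."""
    rng = random.Random(seed)
    factors = [draw_noise_factor(rng, c, d) for _ in values]
    distorted = [f * x for f, x in zip(factors, values)]
    return factors, distorted


def percent_bias(true_values, distorted_values):
    """Bias of each distorted value relative to the true value, in percent."""
    return [100.0 * (xs - x) / x for x, xs in zip(true_values, distorted_values)]


if __name__ == "__main__":
    fake_rng = random.Random(1)
    fake_counts = [fake_rng.randint(50, 5000) for _ in range(20000)]  # fake data
    _, noisy = infuse_noise(fake_counts)
    bias = percent_bias(fake_counts, noisy)
    print("mean % bias:", round(statistics.mean(bias), 3))
```

Because the sign of the distortion is allocated at random and the distribution is symmetric, the mean percentage bias across many values should be close to zero, even though every individual value is distorted by at least 100·c percent. In this single-value sketch the bias is concentrated just outside ±100·c percent; aggregating several distorted values into one cell is what would move mass toward zero.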
2.4 Physical and Legal Protections
The provision of very detailed micro-tabulations or public-use microdata may not be sufficient to inform certain types of research questions. In particular, for business data the thresholds that trigger SDL suppression methods are met far more often than for individuals or households. In those cases, the research community needs controlled access to confidential microdata. Three key reasons why access to microdata may be beneficial are:
(i) microdata permit policy makers to pose and analyze complex questions. In economics, for example, analysis of aggregate statistics does not give a sufficiently accurate view of the functioning of the economy to allow analysis of the components of productivity growth;
(ii) access to microdata permits analysts to calculate marginal rather than just average effects. For example, microdata enable analysts to do multivariate regressions whereby the marginal impact of specific variables can be isolated;
(iii) broadly speaking, widely available access to microdata enables replication of important research (United Nations 2007, p. 4).
As outlined above, confidentiality concerns have in many cases led to the removal of public-use microdata versions of linked files, or prevented their creation altogether. This increases the need for alternate access to the confidential microdata.
NSOs and survey organizations usually provide access to confidential linked data within restricted-access data centers. In the United States, this means either using one of 30 secure sites managed by the Census Bureau as part of the Federal Statistical Research Data Center System (FSRDC),12 or going to the headquarters of the statistical agency. Similarly, in other countries, access is usually restricted to the headquarters of NSOs. Secure enclaves managed by NSOs used to be rare. In the 1990s and early 2000s, several countries expanded existing networks and created new, alternate methods of accessing data housed in secure enclaves. Access may be through physical travel, remote submission, or remote processing. However, all methods rely on two fundamental elements. First, the researchers accessing the data are mostly free to choose their own modeling strategy, and are not restricted to the tables or queries that the data curator has used for published statistics. Second, the output from such models is then analyzed to avoid unauthorized disclosure, and subsequently released to the researcher for publication.
Several methods are currently used by NSOs and other data collecting agencies to provide access to confidential data. Sections 2.4.1–2.4.5 will describe each of them in turn.13
2.4.1 Statistical Data Enclaves
Statistical data enclaves, or Research Data Centers (RDCs), are secure computing facilities that provide researchers with access to confidential microdata, while putting restrictions on the content that can be removed from the facility. The advisory committees of the two largest professional associations (ASA and the American Economic Association, AEA) pushed for easier and broader access for researchers as far back as the 1960s, though the emphasis then was on avoiding the cost of making special tabulations. The AEA suggested creating Census data centers at selected universities (Kraus 2013). In the 1990s and early 2000s, similar networks started in other countries. In Canada, the Canada Foundation for Innovation (CFI) awarded a number of grants to open research data centers, with the first opening at McMaster University (Hamilton, Ontario) in 2000.14 The creation of the RDCs was specifically motivated by the inability to ensure confidentiality while providing usability of longitudinally linked survey data (Currie and Fortin 2015).
In the United States, a 2004 grant by the National Science Foundation laid the groundwork for subsequent expansion of the (then Census) Research Data Center network from 8 locations, open since the mid-1990s, to over 30 locations