Administrative Records for Survey Methodology
Publisher: John Wiley & Sons Limited
ISBN: 9781119272069
are not equivalent. In a regression discontinuity design, for example, there will now be a window around the break point in the running variable that reflects the uncertainty associated with the noise infusion. If the effect is not large enough, it will be swamped by noise even though all the inputs to the analysis are unbiased, or nearly so. Once again, using the unmodified confidential data via a restricted-access agreement does not completely solve the problem: once the noisy data have been published, the agency must consider the consequences of allowing publication of a clean regression discontinuity estimate, since the plot of the unprotected outcomes against the running variable could be compared to the corresponding plot produced from the public noisy data.

      The basic problem for empirical social scientists is that agencies must have a general purpose data publication strategy in order to provide the public good that is the reason for incurring the cost of data collection in the first place. But this publication strategy inherently advantages certain analyses over others. Statisticians and computer scientists have developed two related ways to address this problem: synthetic data combined with validation servers and privacy-protected query systems. Statisticians define “synthetic data” as samples from the joint probability distribution of the confidential data that are released for analysis. After the researcher analyzes the synthetic data, the validation server is used to repeat some or all of the analyses on the underlying confidential data. Conventional SDL methods are used to protect the statistics released from the validation server.
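The synthetic-data-plus-validation workflow can be sketched in a few lines. In this illustration (the variables, the parametric model, and all numeric values are invented assumptions, not any agency's actual procedure), the agency fits a simple joint model to the confidential microdata and releases samples from it; the researcher analyzes the synthetic file, then requests a validated answer computed on the confidential data with conventional SDL (here, crude rounding) applied to the released statistic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confidential microdata: income and years of schooling.
confidential = rng.multivariate_normal(
    mean=[50_000, 13], cov=[[1.2e8, 9_000], [9_000, 6]], size=5_000
)

# Agency side: fit a parametric model to the joint distribution and
# release draws from it as the synthetic file.
mu_hat = confidential.mean(axis=0)
cov_hat = np.cov(confidential, rowvar=False)
synthetic = rng.multivariate_normal(mu_hat, cov_hat, size=5_000)

def slope(data):
    """Researcher's analysis: regression slope of income on schooling."""
    x, y = data[:, 1], data[:, 0]
    return np.polyfit(x, y, 1)[0]

# Researcher estimates the model on the synthetic data first...
synthetic_estimate = slope(synthetic)

# ...then the "validation server" repeats it on the confidential data,
# applying conventional SDL (rounding to tens) before release.
validated_estimate = round(slope(confidential), -1)
```

Because the synthetic file was drawn from a model of the joint distribution, analyses that the model captures well (here, a linear slope) validate closely, while analyses that depend on features absent from the model may not.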

      2.2.2 Formal Privacy Models

      All formal privacy models define a cumulative, global privacy loss associated with all of the publications released from a given confidential database. This is called the total privacy-loss budget. The budget can then be allocated to each of the released queries. Once the budget is exhausted, no more analysis can be conducted. The researcher must decide how much of the privacy-loss budget to spend on each query – producing noisy answers to many queries or sharp answers to a few. The agency must decide the total privacy-loss budget for all queries and how to allocate it among competing potential users.
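The budget accounting described above can be sketched as follows, assuming pure ε-differential privacy with Laplace noise, where sequential composition simply adds the ε values spent; the query values and the budget split are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

class PrivacyBudget:
    """Tracks cumulative privacy loss under pure epsilon-DP,
    where sequential composition adds the epsilons spent."""

    def __init__(self, total):
        self.total = total   # total privacy-loss budget for the database
        self.spent = 0.0

    def noisy_count(self, true_count, epsilon):
        # A counting query has sensitivity 1, so Laplace noise with
        # scale 1/epsilon satisfies epsilon-differential privacy.
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy-loss budget exhausted")
        self.spent += epsilon
        return true_count + rng.laplace(scale=1.0 / epsilon)

budget = PrivacyBudget(total=1.0)
sharp = budget.noisy_count(1_234, epsilon=0.8)  # noise scale 1.25
noisy = budget.noisy_count(567, epsilon=0.2)    # noise scale 5.0
# Any further query would now raise: the budget is exhausted.
```

The trade-off in the text is visible directly: spending ε = 0.8 on the first query buys a noise scale of 1.25, while the remaining ε = 0.2 yields a much noisier answer with scale 5.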

      An increasing number of modern SDL and formal privacy procedures replace methods like deterministic suppression and targeted random swapping with some form of noisy query system. Over the last decade these approaches have moved to the forefront because they provide the agency with a formal method of quantifying the global disclosure risk in the output and of evaluating the data quality along dimensions that are broadly relevant.

      Relatively recently, formal privacy models have emerged from the literature on database security and cryptography. In formal privacy models, the data are distorted by a randomized mechanism prior to publication. The goal is to explicitly characterize, given a particular mechanism, how much private information is leaked to data users.

      Differential privacy is a particularly prominent and useful approach to characterizing formal privacy guarantees. Briefly, a formal privacy mechanism that grants ε-differential privacy places an upper bound, parameterized by ε, on the ability of a user to infer from the published output whether any specific data item, or response, was in the original, confidential data (see Dwork and Roth 2014 for an in-depth discussion).
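To make the ε bound concrete, the following sketch compares the output densities of the standard Laplace mechanism on two neighboring databases (differing in one record) for a counting query; the counts and the value of ε are illustrative:

```python
import numpy as np

epsilon = 0.5
scale = 1.0 / epsilon          # a count has sensitivity 1

def laplace_density(x, center):
    """Density of the Laplace mechanism's output centered at the true count."""
    return np.exp(-np.abs(x - center) / scale) / (2 * scale)

count_with, count_without = 100, 99   # neighboring databases
xs = np.linspace(80, 120, 401)
ratio = laplace_density(xs, count_with) / laplace_density(xs, count_without)

# The likelihood ratio never exceeds exp(epsilon): observing the noisy
# output reveals only a bounded amount about any one record's presence.
assert ratio.max() <= np.exp(epsilon) + 1e-9
```

The bounded likelihood ratio is exactly the inference limit in the definition: no output value lets a user update their odds about one record's presence by more than a factor of e^ε.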

      Formal privacy models are very intriguing because they solve two key challenges for disclosure limitation. First, formal privacy models by definition provide provable guarantees on how much privacy is lost, in a probabilistic sense, in any given data publication. Second, the privacy guarantee does not require that the implementation details, specifically the parameter ε, be kept secret. This allows researchers using data published under formal privacy models to conduct fully SDL-aware analysis. This is not the case with many traditional disclosure limitation methods which require that key parameters, such as the swap rate, suppression rate, or variance of noise, not be made available to data users (Abowd and Schmutte 2015).
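A small illustration of what SDL-aware analysis can mean in practice: because ε, and hence the noise distribution, is public, a researcher can correct moments computed from noise-infused data. All values below are invented, and the correction shown (subtracting the known Laplace noise variance 2b² from a naive variance estimate) is just one simple instance of the idea:

```python
import numpy as np

rng = np.random.default_rng(2)

epsilon = 0.5
b = 1.0 / epsilon                      # publicly known Laplace scale

# Hypothetical confidential values and their noise-infused release.
true_values = rng.normal(10.0, 3.0, size=100_000)
published = true_values + rng.laplace(scale=b, size=true_values.size)

naive_var = published.var()            # inflated by the infused noise
aware_var = naive_var - 2 * b**2       # Laplace(b) has variance 2*b^2
```

With a traditional method whose noise variance is secret, the researcher would have no way to perform this correction, which is the point of the Abowd and Schmutte (2015) argument cited above.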

To illustrate the application of new disclosure avoidance techniques, we describe three examples of linked data and the means by which confidentiality protection is applied to each. First, the Health and Retirement Study (HRS) links extensive survey information to respondents’ administrative data from the Social Security Administration (SSA) and the Centers for Medicare & Medicaid Services (CMS). To protect confidentiality in the linked HRS–SSA data, its data custodians use a combination of restrictive licensing agreements, physical security, and restrictions on model output. Our second example is the Census Bureau’s Survey of Income and Program Participation (SIPP), which has also been linked to earnings data from the Internal Revenue Service (IRS) and benefit data from the SSA. Census makes the linked data available to researchers as the SIPP Synthetic Beta File (SSB). Researchers can directly access synthetic data via a restricted server and, once their analysis is ready, request output based on the original harmonized confidential data via a validation server. Finally, the Longitudinal Employer-Household Dynamics (LEHD) Program at the Census Bureau links data provided by 51 state administrations to data from federal agencies and to surveys and censuses on businesses, households, and people conducted by the Census Bureau. Tabular summaries of LEHD are published with greater detail than most business and demographic data. The LEHD is accessible in restricted enclaves, but there are also restrictions on the output researchers can release. There are many other linked data sources. These three are each innovative in some fashion, and allow us to illustrate the issues faced when devising disclosure avoidance methods for linked data.

      2.3.1 HRS–SSA

      2.3.1.1 Data Description

The HRS is conducted by the Institute for Social Research at the University of Michigan. The study was launched in 1992 and has reinterviewed the original sample of respondents every two years since then. New cohorts and sample refreshment have made the HRS one of the largest representative longitudinal samples of Americans over 50, with over 26 000 respondents in a given wave (Sonnega and Weir 2014). In 2006, the HRS started collecting