Administrative Records for Survey Methodology

Author: Multiple authors
Publisher: John Wiley & Sons Limited
Genre: Mathematics
ISBN: 9781119272069
linking data from different agencies. More recently, the 2016 Australian Census elicited substantial controversy when the Australian Bureau of Statistics (ABS) decided to keep identifiable data collected through the census for a substantially longer time period, with the explicit goal of enabling linkages between the census and administrative data, as well as linkages across historical censuses (Australian Bureau of Statistics 2015; Karp 2016).

      There are no methods for disclosure limitation and confidentiality protection specifically designed for linked data. Protecting data constructed by linking administrative records, survey responses, and “found” transaction records relies on the same methods as might be applied to each source individually. It is the richness inherent in the linkages, and in the administrative information available to some potential intruders, that poses novel challenges.

      Statistical confidentiality can be viewed as “a body of principles, concepts, and procedures that permit confidentiality to be afforded to data, while still permitting its use for statistical purposes” (Duncan, Elliot, and Salazar-González 2011, p. 2). In order to protect the confidentiality of the data they collect, NSOs and survey organizations (henceforth referred to generically as data custodians) employ many methods. Very often, data are released to the public as tabular summaries. Many of the protection mechanisms in use today evolved to protect published tables against disclosure. Generically, the idea is to limit the publication of cells with “too few” respondents, where the notion of “too few” is assessed heuristically.
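Such a threshold rule can be sketched in a few lines. The cutoff of five respondents, the data, and the "D" suppression flag below are illustrative assumptions, not any agency's actual rule:

```python
from collections import Counter

MIN_CELL_COUNT = 5  # heuristic "too few" cutoff (assumed value)

def suppress_small_cells(records, key):
    """Tabulate counts by `key`, replacing small cells with a suppression flag."""
    counts = Counter(r[key] for r in records)
    return {cell: (n if n >= MIN_CELL_COUNT else "D")  # "D" marks a suppressed cell
            for cell, n in counts.items()}

records = [{"industry": "retail"}] * 7 + [{"industry": "mining"}] * 2
print(suppress_small_cells(records, "industry"))
# {'retail': 7, 'mining': 'D'}
```

In practice this primary suppression must be followed by complementary suppression of other cells, since a suppressed cell could otherwise be recovered from published marginal totals.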

      We will not provide a detailed history or taxonomy of statistical disclosure limitation (SDL) and formal privacy models; instead, we refer the reader to other publications on the topic (Duncan, Elliot, and Salazar-González 2011; Dwork and Roth 2014; FCSM 2005). We do need to set up the problem, which we will do by reviewing suppression, coarsening, swapping, and noise infusion (input and output). These techniques are widely used, and the main issues that arise in applications to linked data can be understood with reference to them.

      Researchers, however, are not indifferent to these strategies. A researcher who needs detailed geographic variation will benefit from data in which the complementary suppressions are based on removing detailed industries. A researcher who needs detailed industry variation will prefer data with complementary suppression based on geography. Ultimately, the committee that chooses the complementary suppression strategy will determine which research uses are possible and which are ruled out.

      But the problem is deeper than this: suppression is a very ineffective SDL technique. Researchers working with the cooperation of the BLS have shown that the suppression strategy used in major BLS business data publications provides almost no protection if it is applied, as is currently the case, to each data release separately (Holan et al. 2010). Some agencies may use cumulative suppression strategies in their sequential data releases. In this case, once an item has been designated for either primary or complementary suppression, it would disappear from the release tables until the entire product is redesigned.

      Many social scientists believe that suppression can be complemented by restricted access agreements that allow the researcher to use all of the confidential data but limit what can be published from the analysis. Such a strategy is not a complete solution because SDL must still be applied to the output of the analysis, which quickly brings the problem of which output to suppress back to the forefront.

      Custom tabulations and data enclaves. Another traditional response by data custodians to the demand by researchers for more extensive and detailed summaries of confidential data was to create a custom tabulation: a table not previously published, but generated by data custodian staff with access rights to the confidential data, and typically subject to the same suppression rules. As these requests increased, the tabulation and analysis work was offloaded onto researchers by providing them with access to protected microdata. This approach has expanded rapidly in the last two decades and is widely used around the world. We discuss it in detail later in this chapter.

      Coarsening is a method for protecting data that involves mapping confidential values into broader categories. The simplest method is a histogram, which maps values into (fixed) intervals. Intuitively, the broader the interval, the more protection is provided.

      2.2.1 Input Noise Infusion

      Protection mechanisms for microdata are often similar in spirit, though not in their details, to the methods employed for tabular data. Consider coarsening, in which the more detailed response to a question (say, about income) is classified into a much smaller set of bins (for instance, income categories such as “[10 000; 25 000]”). In fact, many tables can be viewed as a coarsening of the underlying microdata, with a subsequent count of the coarsened cases.
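A minimal sketch of coarsening an income variable, assuming hypothetical bin edges (the half-open intervals and their labels are illustrative, not an agency's actual scheme):

```python
import bisect

# Assumed bin edges; each income maps to the half-open interval containing it.
EDGES = [10_000, 25_000, 50_000, 100_000]
LABELS = ["<10 000", "[10 000; 25 000)", "[25 000; 50 000)",
          "[50 000; 100 000)", ">=100 000"]

def coarsen(income):
    """Return the coarsened category for an exact income value."""
    return LABELS[bisect.bisect_right(EDGES, income)]

print(coarsen(17_500))   # '[10 000; 25 000)'
print(coarsen(150_000))  # '>=100 000'
```

The broader the intervals, the more exact values map to the same category, and hence the more protection the coarsening affords, at the cost of analytical detail.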

      Many microdata methods are based on input noise infusion: distorting the value of some or all of the inputs before any publication data are built. The Census Bureau uses this technique before building publication tables for many of its business establishment products and in the American Community Survey (ACS) publications, and we will discuss it in more detail for one of those data products later in this chapter. The noise infusion parameters can be set such that all of the published statistics are formally unbiased – the expected value of the published statistic equals the value of the confidential statistic with respect to the probability distribution of the infused noise – or nearly so. Hence, the disclosure risk and data quality can be conveniently summarized by two parameters: one measuring the absolute distortion in the data inputs and the other measuring the mean squared error of publication statistics (either overall for censuses or relative to the undistorted survey estimates).
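In the spirit of the multiplicative noise infusion described above, the following sketch scales each confidential value by a random factor drawn symmetrically around 1, so the published value is unbiased in expectation while every individual value is distorted by at least a minimum amount. The distortion parameter and the two-sided uniform draw are assumptions for illustration, not the Census Bureau's actual noise distribution:

```python
import random

DELTA = 0.10  # assumed minimum absolute distortion (10%)

def noisy(value, rng):
    """Scale by a factor drawn uniformly from [1-2d, 1-d] or [1+d, 1+2d]."""
    factor = rng.uniform(1 + DELTA, 1 + 2 * DELTA)
    if rng.random() < 0.5:   # mirror below 1 with equal probability,
        factor = 2 - factor  # so E[factor] = 1 and the output is unbiased
    return value * factor

rng = random.Random(0)
published = [noisy(100.0, rng) for _ in range(100_000)]
print(sum(published) / len(published))  # close to 100: unbiased on average
```

Every published value differs from the confidential value by 10–20% in absolute terms, yet the mean over many draws recovers the confidential statistic, illustrating how the distortion parameter and the resulting mean squared error summarize the risk–quality trade-off.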

      From the viewpoint of empirical social sciences, however, all input distortion systems with the