At the time of release, the Census Bureau Disclosure Review Board used two standards for disclosure avoidance in partially synthetic data. First, using the best available matching technology, the percentage of true matches relative to the size of the files should not be excessively large. Second, the ratio of true matches to the total number of matches (true and false) should be close to one-half.
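As a minimal sketch, the two criteria amount to two ratios computed from the outcome of a matching experiment; the function and the counts below are hypothetical and only illustrate the arithmetic.

```python
def drb_criteria(n_true: int, n_false: int, n_records: int):
    """Two disclosure-avoidance summaries from a matching experiment."""
    true_match_rate = n_true / n_records       # should not be excessively large
    true_share = n_true / (n_true + n_false)   # should be close to one-half
    return true_match_rate, true_share

# Hypothetical counts: 150 true and 160 false matches against 10,000 records.
print(drb_criteria(150, 160, 10_000))  # (0.015, ~0.48)
```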
The disclosure avoidance analysis (Abowd, Stinson, and Benedetto 2006) uses the principle that a potential intruder would first try to reidentify the source record for a given synthetic data observation in the existing SIPP public use files. Two distinct matching exercises – one probabilistic (Fellegi and Sunter 1969), one distance-based (Torra, Abowd, and Domingo-Ferrer 2006) – between the synthetic data and the harmonized confidential data were conducted.4 The harmonized confidential data – actual values of the data items as released in the original SIPP public use files – are the equivalent of the best available information for an intruder attempting to reidentify a record in the synthetic data. Successful matches between the harmonized confidential data and the synthetic data represent potential disclosure risks. In practice, the intruder would also need to make another successful link to exogenous data files that contain direct identifiers such as names, addresses, telephone numbers, etc. The results from the experiments are therefore conservative estimates of reidentification risk. For the probabilistic matching, the synthetic and confidential files were blocked by exact agreement on the unsynthesized variables of gender and marital status, and the success of the matching exercise was assessed using a person identifier (personid) that is not, in fact, available in the released version of the synthetic data. Without the personid, an intruder would have to compare many more record pairs to find true matches, would not find any more true matches (the true match is guaranteed to be in the blocks being compared), and would almost certainly find more false matches. In fact, the records that can be reidentified represent only a very small proportion (less than 3%) of candidate records, and correct reidentifications are swamped by a sea of false reidentifications (Abowd, Stinson, and Benedetto 2006, p. 6).
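The probabilistic exercise can be illustrated with a simplified sketch in the spirit of Fellegi and Sunter (1969): agreement on a comparison field contributes log2(m/u) to a pair's score and disagreement contributes log2((1-m)/(1-u)). The comparison fields, the m- and u-probabilities, the threshold, and the column names below are assumptions for illustration, not the values used in the cited assessment; only the blocking variables (gender and marital status) and the personid-based evaluation come from the text.

```python
import math
import pandas as pd

# Illustrative m- and u-probabilities (agreement rates among true matches and
# among non-matches) for each comparison field; assumed values for this sketch.
MU = {"birth_year": (0.95, 0.05), "earnings_category": (0.90, 0.10)}

def agreement_weight(field: str, agree: bool) -> float:
    m, u = MU[field]
    return math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))

def probabilistic_match(confidential: pd.DataFrame, synthetic: pd.DataFrame,
                        threshold: float = 2.0):
    """Block on the unsynthesized variables, score all within-block pairs,
    and evaluate declared matches against personid (analyst-only)."""
    blocks = ["gender", "marital_status"]
    pairs = confidential.merge(synthetic, on=blocks, suffixes=("_c", "_s"))
    score = 0.0
    for field in MU:
        agree = pairs[f"{field}_c"] == pairs[f"{field}_s"]
        score = score + agree.map(lambda a: agreement_weight(field, a))
    declared = pairs[score >= threshold]
    true_matches = int((declared["personid_c"] == declared["personid_s"]).sum())
    return true_matches, len(declared) - true_matches
```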
In the distance-based matching, records in the harmonized confidential and synthetic data are blocked in a similar way, and distances (or similarity scores) are computed between a given confidential record and every synthetic record within its block. The three closest records are declared matches, and the personid is again checked to verify how often a true match is obtained. A putative intruder who treated the closest record as a match would correctly link about 1% of all synthetic records, and less than 3% in the worst-case subgroup (Abowd, Stinson, and Benedetto 2006, p. 8).
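A minimal sketch of the distance-based experiment follows. The numeric comparison fields, the use of Euclidean distance, and the column names are illustrative assumptions; the cited analysis may use different fields and distance functions. As above, the personid check is available only to the analyst, not to an intruder working with the released file.

```python
import numpy as np
import pandas as pd

def distance_reid_rate(confidential: pd.DataFrame, synthetic: pd.DataFrame,
                       blocks=("gender", "marital_status"),
                       fields=("birth_year", "total_earnings"), k=3) -> float:
    """Share of confidential records whose true counterpart (same personid)
    is among the k closest synthetic records in the same block."""
    syn_blocks = {key: grp for key, grp in synthetic.groupby(list(blocks))}
    hits = 0
    for key, conf_block in confidential.groupby(list(blocks)):
        syn_block = syn_blocks.get(key)
        if syn_block is None:
            continue
        X = syn_block[list(fields)].to_numpy(dtype=float)
        for _, rec in conf_block.iterrows():
            d = np.linalg.norm(X - rec[list(fields)].to_numpy(dtype=float), axis=1)
            nearest = syn_block.iloc[np.argsort(d)[:k]]   # declared matches
            hits += int(rec["personid"] in set(nearest["personid"]))
    return hits / len(confidential)
```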
Figure 2.1 Probability density function of the ramp distribution used in LEHD disclosure avoidance system.
2.3.2.4 Analytical Validity Assessment
Although synthetic data are designed to solve a confidentiality protection problem, the success of this solution is measured by both the degree of protection provided and the user’s ability to reliably estimate scientifically interesting quantities. The latter property of the synthetic data is known as analytical (or statistical) validity. Analytical validity exists when, at a minimum, estimands can be estimated without bias and their confidence intervals (or the nominal level of significance for hypothesis tests) can be stated accurately (Rubin 1987). To verify analytical validity, the confidence intervals surrounding the point estimates obtained from confidential and synthetic data should completely overlap (Reiter, Oganian, and Karr 2009), presumably with the synthetic confidence interval being slightly larger because of the increased variation arising from the synthesis. When these results are obtained, inferences drawn about the coefficients will be consistent whether one uses synthetic or completed data. The reader interested in detailed examples that show how analytic validity is assessed in the SSB should consult Figures 2.1 and 2.2 and associated discussion in Abowd, Schmutte, and Vilhuber (2018).
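One way to operationalize the overlap criterion is to average, across the two intervals, the fraction of each interval covered by their intersection: a value of 1 indicates identical intervals and 0 indicates disjoint ones. The sketch below implements that summary; the interval endpoints are hypothetical placeholders, not estimates from the SSB.

```python
def ci_overlap(l_conf: float, u_conf: float, l_syn: float, u_syn: float) -> float:
    """Average share of the two confidence intervals covered by their overlap."""
    inter = max(0.0, min(u_conf, u_syn) - max(l_conf, l_syn))
    return 0.5 * (inter / (u_conf - l_conf) + inter / (u_syn - l_syn))

# Hypothetical coefficient: the synthetic interval is wider but covers the
# confidential one, yielding a high overlap score.
print(ci_overlap(0.10, 0.30, 0.08, 0.34))  # ~0.88
```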
Box 2.1 Sidebox: Practical Synthetic Data Use
The SIPP–SSA–IRS Synthetic Beta File has been accessible to users in its current form since 2010. Interested users can request an account by following links at https://www.vrdc.cornell.edu/sds/. Applications are judged solely on feasibility (i.e., whether the necessary variables are present on the SSB). After projects are approved by the Census Bureau, researchers are given accounts on the Synthetic Data Server. Users can submit validation requests, subject to rules outlined on the Census Bureau’s website. Deviations from the guidelines may be possible with prior approval of the Census Bureau, but are typically granted only if specialized software (other than SAS or Stata) is needed, and only if that software already exists on Census Bureau computing systems. Between 2010 and 2016, over one hundred users requested access to the server, using a succession of continuously improved datasets.
Figure 2.2 Distribution of ΔB in Maryland. For details, see text.
2.3.3 LEHD: Linked Establishment and Employee Records
2.3.3.1 Data Description
The LEHD data link employee wage records extracted from Unemployment Insurance (UI) administrative files from 51 states with establishment-level records from the Quarterly Census of Employment and Wages (QCEW, also provided by the partner states), the SSA-sourced record of applications for SSNs (“Numident”), residential addresses derived from IRS-provided individual tax filings, and data from surveys and censuses conducted by the U.S. Census Bureau (the 2000 and 2010 decennial censuses, as well as microdata from the ACS). Additional information is linked in from the Census Bureau’s Employer Business Register and its derivative files. The merged data are subject to both United States Code (U.S.C.) Title 13 and Title 26 protections. For more details, see Abowd, Haltiwanger, and Lane (2004) and Abowd et al. (2009).
From these data, multiple output products are generated. The Quarterly Workforce Indicators (QWI) provide local estimates of a variety of employment and earnings indicators, such as job creation, job destruction, new hires, separations, worker turnover, and monthly earnings, for detailed person and establishment characteristics, such as age, gender, firm age, and firm size (Abowd et al. 2009). The first QWI were released in 2003. The data are used for a variety of analyses and research, emphasizing detailed local data on demographic labor market variables (Gittings and Schmutte 2016; Abowd and Vilhuber 2012). Based on the same input data, the LEHD Origin-Destination Employment Statistics (LODES) describe the geographic distribution of jobs according to the place of employment and the place of worker residence (Center for Economic Studies 2016). New job-to-job flow statistics measure the movement of jobs and workers across industries and regional labor markets (Hyatt et al. 2014). The microdata underlying these products are heavily used in research, since they provide nearly universal coverage of U.S. workers observed at quarterly frequencies. Snapshots of the statistical production database are made available to researchers regularly (McKinney and Vilhuber 2011a, 2011b; Vilhuber and McKinney 2014).
2.3.3.2 Disclosure Avoidance Methods
We