2.3.1.2 Linkages to Other Data
The HRS team requests permission from respondents to link their survey responses to other data resources, as described below. For consenting respondents, HRS data are linked at the individual level to administrative records from Social Security and Medicare claims, thus allowing for detailed characterizations of income and wealth over time. Through these linkages, the HRS is connected to at least a dozen other datasets. See Abowd, Schmutte, and Vilhuber (2018) or http://hrsonline.isr.umich.edu/index.php?p=reslis for a more complete list of linked datasets.
The Centers for Medicare and Medicaid Services (CMS) maintains claims records for the medical services received by essentially all Americans age 65 and older and by those younger than 65 who receive Medicare benefits. These records include comprehensive information about hospital stays, outpatient services, physician services, home health care, and hospice care. When linked to the HRS interview data, this supplementary information provides far more detail on the health circumstances and medical treatments received by HRS participants than would otherwise be available.
Data from HRS interviews are also linked to information about respondents’ employers. This improves information on employer-provided benefits, including pensions. While most pension-eligible workers have some idea of the benefits available through their pension plans, they generally are not knowledgeable about detailed provisions of the plans. By linking HRS interview data with detailed information on pension plans, researchers can better understand the contribution of the pension to economic circumstances and the effects of the pension structure on work and retirement decisions.
HRS data are also linked at the individual level to administrative records from Social Security, Medicare, the Veterans Administration, and the National Death Index, and to employer-provided pension plan information (Sonnega and Weir 2014).
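The consent-gated, individual-level linkage described above can be sketched in a few lines. This is only an illustration under stated assumptions: the field names (pid, consent, medicare_claims) and the records are hypothetical, and the actual HRS linkage relies on protected identifiers and procedures that are not described here.

```python
# Hypothetical survey rows and administrative records keyed by a shared person ID.
survey = [
    {"pid": 1, "consent": True, "self_rated_health": "good"},
    {"pid": 2, "consent": False, "self_rated_health": "fair"},
    {"pid": 3, "consent": True, "self_rated_health": "poor"},
]
admin = {
    1: {"medicare_claims": 4},
    2: {"medicare_claims": 7},
    # pid 3 has no matching administrative record
}


def link_with_consent(survey_rows, admin_by_pid):
    """Attach administrative fields only for consenting respondents with a match."""
    linked = []
    for row in survey_rows:
        out = dict(row)
        # Non-consenting respondents are never looked up in the admin data.
        match = admin_by_pid.get(row["pid"]) if row["consent"] else None
        out["linked"] = match is not None
        if match is not None:
            out.update(match)
        linked.append(out)
    return linked


linked = link_with_consent(survey, admin)
```

In this sketch, respondent 2 keeps the survey responses but receives no administrative fields, even though a matching record exists, because consent was not given.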
2.3.1.3 Disclosure Avoidance Methods
To ensure privacy and confidentiality, all study participants’ names, addresses, and contact information are maintained in a secure control file (National Institute on Aging and the National Institutes of Health 2017). Anyone with access to identifying information must sign a pledge of confidentiality. The survey data are released to the research community only after undergoing a rigorous process to remove or mask any identifying information. First, a set of sensitive variables (such as state of residence or specific occupation) is suppressed or masked. Next, the remaining variables are tested for any possible identifying content. When testing is complete, the data files are subject to final review and approval by the HRS Data Release Protocol Committee. Data ready for public use are made available to qualified researchers via a secure website. All researchers must register before downloading files for analysis. In addition, use of linked data from other sources, such as Social Security or Medicare records, is strictly controlled under special agreements with specially approved researchers operating in secure computing environments that are periodically audited for compliance.
Additional protections involve distortion of the microdata prior to dissemination to researchers. Earnings and benefits variables in the HRS, such as those from SSA, are rounded or top-coded (Deang and Davies 2009). Similarly, geographic classifications are limited to broad levels of aggregation (e.g., census divisions instead of states, or states instead of counties).
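As an illustration of these distortions, the sketch below top-codes and rounds an earnings value and coarsens a state to its census division. The cap (150,000), the rounding base (1,000), and the field names are assumptions chosen for the example; they are not the thresholds actually applied to HRS files, and the state-to-division table is deliberately partial.

```python
# Partial, illustrative mapping from state to census division.
STATE_TO_DIVISION = {
    "MI": "East North Central",
    "OH": "East North Central",
    "CA": "Pacific",
}


def top_code(value, cap):
    """Replace any value above the cap with the cap itself."""
    return min(value, cap)


def round_to(value, base):
    """Round a value to the nearest multiple of base."""
    return base * round(value / base)


def mask_record(rec, cap=150_000, base=1_000):
    """Apply top-coding, rounding, and geographic coarsening to one record."""
    return {
        "earnings": round_to(top_code(rec["earnings"], cap), base),
        "division": STATE_TO_DIVISION[rec["state"]],
    }


masked_low = mask_record({"earnings": 52_340, "state": "MI"})
masked_high = mask_record({"earnings": 987_654, "state": "CA"})
```

Top-coding is applied before rounding so that no released value can exceed the cap.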
The HRS uses licensing as its primary method of giving access to restricted files. A license can be secured only after meeting a stringent set of criteria that leads to a contractual agreement between the HRS, the researcher, and the researcher’s employer. The license enables the user to receive restricted files and use them at the researcher’s own institutional facility.
2.3.2 SIPP–SSA–IRS (SSB)
2.3.2.1 Data Description
The SIPP/SSA/IRS Public Use File, known as the SIPP Synthetic Beta File or SSB, combines variables from the Census Bureau’s SIPP, the IRS individual lifetime earnings data, and the SSA individual benefit data. Aimed at a user community that was primarily interested in national retirement and disability programs, the selection of variables for the proposed SIPP/SSA/IRS-PUF focused on the critical demographic data to be supplied from the SIPP, earnings histories going back to 1937 from the IRS data maintained at SSA, and benefit data from SSA’s master beneficiary records, linked using respondents’ SSNs. After attempting to determine the feasibility of adding a limited number of variables from the SIPP directly to the linked earnings and benefit data, it was decided that the set of variables that could be added without compromising the confidentiality protection of the existing SIPP public use files was so limited that alternative methods had to be used to create a useful new file.
The technique adopted is called partially synthetic data with multiple imputation of missing items. The term “partially synthetic data” means that the person-level records are released containing some variables from the actual responses and other variables where the actual responses have been replaced by values sampled from the posterior predictive distribution (PPD) for that record, conditional on all of the confidential data. From 2003 until 2015, seven preliminary versions of the SSB were produced. In this chapter, we will focus on the protections that pertain to the linked nature of the data. The interested reader is referred to Abowd, Stinson, and Benedetto (2006) for details on data sources, imputation, and linkage. The analysis here is for the SSB version 4. Since version 4, two additional versions have been released with a slightly different structure. Subsequent versions are well illustrated by the extensive analysis described here.
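A toy sketch may help fix the idea of a partially synthetic record. It keeps one variable as observed and replaces another with a draw from a PPD. Everything specific here is an assumption made for illustration: the normal model with a noninformative prior, the conditioning on a single kept variable rather than on all of the confidential data, and the field names and values. It is not the model used to produce the SSB.

```python
import math
import random
from statistics import mean, variance


def ppd_draw(values, rng):
    """One posterior predictive draw under a normal model, noninformative prior."""
    n = len(values)  # requires n >= 2 and non-constant values
    ybar, s2 = mean(values), variance(values)
    # sigma^2 | y ~ (n - 1) s^2 / chi2_{n-1}; chi2_k is Gamma(shape=k/2, scale=2).
    sigma2 = (n - 1) * s2 / rng.gammavariate((n - 1) / 2, 2)
    # mu | sigma^2, y ~ Normal(ybar, sigma^2 / n)
    mu = rng.gauss(ybar, math.sqrt(sigma2 / n))
    # New observation given the drawn parameters.
    return rng.gauss(mu, math.sqrt(sigma2))


def partially_synthesize(records, keep, synth, rng):
    """Keep `keep` as observed; replace `synth` by PPD draws within each keep-group."""
    groups = {}
    for r in records:
        groups.setdefault(r[keep], []).append(r[synth])
    return [{keep: r[keep], synth: ppd_draw(groups[r[keep]], rng)} for r in records]


# Hypothetical confidential records: marital status is kept, log earnings synthesized.
confidential = [
    {"marital": "married", "log_earn": 10.4},
    {"marital": "married", "log_earn": 10.9},
    {"marital": "married", "log_earn": 11.1},
    {"marital": "single", "log_earn": 10.1},
    {"marital": "single", "log_earn": 10.6},
    {"marital": "single", "log_earn": 9.8},
]
synthetic = partially_synthesize(confidential, "marital", "log_earn", random.Random(1))
```

Because both the parameters and the new observation are drawn, each synthetic value reflects parameter uncertainty as well as residual variation, which is what distinguishes a PPD draw from simply adding noise around a fitted value.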
2.3.2.2 Disclosure Avoidance Methods
The existence of SIPP public use files poses a key challenge for disclosure avoidance. To protect the confidentiality of survey respondents, it was deemed necessary to prevent reidentification of a record that appears in the synthetic data against the existing SIPP public use files. Hence, all information regarding the dating of variables whose source was a SIPP response, and not administrative data, had to be made consistent across individuals regardless of the panel and wave from which the response was taken. The public use file contains several variables that were never missing and are not synthesized: gender, marital status, spouse’s gender, initial type of Social Security benefits, type of Social Security benefits in 2000, and the same benefit type variables for the spouse. All other variables in the SSB v4 were synthesized.
The model first imputes any missing data, then synthesizes the completed data (Reiter 2004). For each iteration of the missing data imputation phase and again during the synthesis phase, a joint PPD for all of the required variables is estimated according to the following protocol. At each node of the parent/child tree, a statistical model is estimated for each of the variables at the same level. The statistical model is a Bayesian bootstrap, logistic regression, or linear regression (possibly with transformed inputs). The missing data phase included nine iterations of estimation. The synthetic data phase occurred on the 10th iteration. Four missing data implicates were created. These constitute the completed data files that are the inputs to the synthesis phase. Four synthetic implicates were created for each missing data implicate, for a total of 16 synthetic implicates on the released file. Because copying the final weight to each implicate of the synthetic data would have provided an additional unsynthesized variable with 55,552 distinct values, the disclosure risk associated with the weight variable had to be addressed. A synthetic weight using a PPD based on the Multinomial/Dirichlet natural conjugate likelihood and prior was created.
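The nesting of four synthetic implicates within each of four missing data implicates can be sketched as follows. This is a deliberately simplified illustration, not the SSB production procedure: it handles a single variable, uses only the Bayesian bootstrap (Dirichlet(1, ..., 1) weights over the observed values), and omits the parent/child tree, the regression models, and the nine estimation iterations. The function names and the data are hypothetical.

```python
import random


def bayesian_bootstrap(donors, k, rng):
    """Draw k values from donors using Dirichlet(1, ..., 1) weights."""
    # Independent Gamma(1, 1) draws are proportional to a Dirichlet(1, ..., 1) draw;
    # random.choices accepts unnormalized weights.
    gammas = [rng.gammavariate(1.0, 1.0) for _ in donors]
    return rng.choices(donors, weights=gammas, k=k)


def complete(data, rng):
    """Fill missing (None) entries with Bayesian bootstrap draws from observed values."""
    observed = [x for x in data if x is not None]
    fills = iter(bayesian_bootstrap(observed, data.count(None), rng))
    return [x if x is not None else next(fills) for x in data]


def synthesize(completed, rng):
    """Replace every value with a Bayesian bootstrap draw from the completed data."""
    return bayesian_bootstrap(completed, len(completed), rng)


def make_implicates(data, n_missing=4, n_synth=4, seed=0):
    """Return synthetic implicates keyed by (missing-data index, synthesis index)."""
    rng = random.Random(seed)
    implicates = {}
    for m in range(n_missing):
        completed = complete(data, rng)  # one missing data implicate
        for r in range(n_synth):
            implicates[(m, r)] = synthesize(completed, rng)
    return implicates


# Hypothetical variable with two missing items.
raw = [31000, None, 45000, 52000, None, 27000]
implicates = make_implicates(raw)
```

Keying each implicate by the pair (m, r) preserves the nesting, which an analyst needs in order to separate variability due to missing data imputation from variability due to synthesis when combining estimates across the 16 files.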
2.3.2.3 Disclosure Avoidance Assessment
The link of administrative earnings, benefits and SIPP data adds a significant amount of information to an already very detailed survey and could pose potential disclosure risks beyond those originally managed as part of the regular SIPP public use file disclosure avoidance process. The synthesis of the earnings