Administrative Records for Survey Methodology. Various authors.

Author: Various authors
Publisher: John Wiley & Sons Limited
Genre: Mathematics
isbn: 9781119272069
biomarkers, and DNA samples. The collection of these additional sensitive attributes reinforces confidentiality concerns.

      2.3.1.2 Linkages to Other Data

      The CMS maintain claims records for the medical services received by essentially all Americans age 65 and older and by those younger than 65 who receive Medicare benefits. These records include comprehensive information about hospital stays, outpatient services, physician services, home health care, and hospice care. When linked to the HRS interview data, this supplementary information provides far more detail on the health circumstances and medical treatments received by HRS participants than would otherwise be available.

      Data from HRS interviews are also linked to information about respondents’ employers. This improves information on employer-provided benefits, including pensions. While most pension-eligible workers have some idea of the benefits available through their pension plans, they generally are not knowledgeable about detailed provisions of the plans. By linking HRS interview data with detailed information on pension plans, researchers can better understand the contribution of the pension to economic circumstances and the effects of the pension structure on work and retirement decisions.

      HRS data are also linked at the individual level to administrative records from Social Security, Medicare, the Veterans Administration, and the National Death Index, as well as to employer-provided pension plan information (Sonnega and Weir 2014).
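
      Individual-level linkage of this kind can be pictured as a join between survey rows and administrative rows on a shared identifier. The following is a minimal sketch only: the key name respondent_id, the field names, and the function link_records are hypothetical, and the text does not describe how the HRS linkage is actually implemented.

```python
def link_records(survey_rows, admin_rows, key="respondent_id"):
    """Attach administrative fields to survey rows that share the same key.

    Every survey row is kept (a left join). The added "linked" flag records
    whether a matching administrative row was found.
    The key and field names are illustrative assumptions.
    """
    admin_by_key = {row[key]: row for row in admin_rows}
    linked = []
    for row in survey_rows:
        match = admin_by_key.get(row[key])
        merged = dict(row)  # copy so the input rows are not modified
        if match is not None:
            merged.update({k: v for k, v in match.items() if k != key})
        merged["linked"] = match is not None
        linked.append(merged)
    return linked


# Hypothetical example: one respondent has a claims record, one does not.
survey = [{"respondent_id": 1, "age": 70}, {"respondent_id": 2, "age": 66}]
claims = [{"respondent_id": 1, "hospital_stays": 2}]
linked = link_records(survey, claims)
```

      Keeping unmatched survey rows, rather than dropping them, lets an analyst see how many respondents could not be linked.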

      2.3.1.3 Disclosure Avoidance Methods

      To ensure privacy and confidentiality, all study participants’ names, addresses, and contact information are maintained in a secure control file (National Institute on Aging and the National Institutes of Health 2017). Anyone with access to identifying information must sign a pledge of confidentiality. The survey data are released to the research community only after undergoing a rigorous process to remove or mask any identifying information. First, a set of sensitive variables (such as state of residence or specific occupation) is suppressed or masked. Next, the remaining variables are tested for any possible identifying content. When testing is complete, the data files are subject to final review and approval by the HRS Data Release Protocol Committee. Data ready for public use are made available to qualified researchers via a secure website. All researchers must register before downloading files for analysis. In addition, use of linked data from other sources, such as Social Security or Medicare records, is strictly controlled under special agreements with specially approved researchers operating in secure computing environments that are periodically audited for compliance.
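
      The first step described above, suppressing or masking sensitive variables, might look roughly like the sketch below. All field names are hypothetical, and the coarsening rule (keeping only the first two characters of an occupation code) is an assumption made for illustration; the text does not specify which masking rules the HRS applies.

```python
# Hypothetical field lists; the actual HRS variable lists are not given in the text.
IDENTIFIERS = {"name", "address", "phone"}   # held only in the secure control file
SUPPRESSED = {"state_of_residence"}          # dropped from the release file
COARSENED = {
    # Assumed rule: reduce a detailed occupation code to a broader group.
    "occupation": lambda code: code[:2],
}


def mask_record(record):
    """Return a release copy of one record with identifying content removed or masked."""
    released = {}
    for field, value in record.items():
        if field in IDENTIFIERS or field in SUPPRESSED:
            continue  # suppress: the field does not appear in the release
        if field in COARSENED:
            value = COARSENED[field](value)  # mask: keep a less specific value
        released[field] = value
    return released


raw = {
    "name": "A. Respondent",
    "address": "1 Example St",
    "phone": "555-0100",
    "state_of_residence": "MI",
    "occupation": "2310",
    "age": 67,
}
public = mask_record(raw)
```

      The later steps in the text, testing the remaining variables for identifying content and committee review, are not represented in this sketch.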

      The HRS uses licensing as its primary method of giving access to restricted files. A license can be secured only after meeting a stringent set of criteria that leads to a contractual agreement between the HRS, the researcher, and the researcher’s employer. The license enables the user to receive restricted files and use them at the researcher’s own institutional facility.

      2.3.2 SIPP–SSA–IRS (SSB)

      2.3.2.1 Data Description

      The SIPP/SSA/IRS Public Use File, known as the SIPP Synthetic Beta File or SSB, combines variables from the Census Bureau’s SIPP, the IRS individual lifetime earnings data, and the SSA individual benefit data. Aimed at a user community that was primarily interested in national retirement and disability programs, the selection of variables for the proposed SIPP/SSA/IRS-PUF focused on the critical demographic data to be supplied from the SIPP, earnings histories going back to 1937 from the IRS data maintained at SSA, and benefit data from SSA’s master beneficiary records, linked using respondents’ SSNs. After attempting to determine the feasibility of adding a limited number of variables from the SIPP directly to the linked earnings and benefit data, it was decided that the set of variables that could be added without compromising the confidentiality protection of the existing SIPP public use files was so limited that alternative methods had to be used to create a useful new file.

      The existence of SIPP public use files poses a key challenge for disclosure avoidance. To protect the confidentiality of survey respondents, it was deemed necessary to prevent reidentification of a record that appears in the synthetic data against the existing SIPP public use files. Hence, all information regarding the dating of variables whose source was a SIPP response, and not administrative data, had to be made consistent across individuals regardless of the panel and wave from which the response was taken. The public use file contains several variables that were never missing and are not synthesized. These variables are gender, marital status, spouse’s gender, initial type of Social Security benefits, type of Social Security benefits in 2000, and the same benefit type variables for the spouse. All other variables in the SSB v4 were synthesized.

      The model first imputes any missing data, then synthesizes the completed data (Reiter 2004). For each iteration of the missing data imputation phase and again during the synthesis phase, a joint posterior predictive distribution (PPD) for all of the required variables is estimated according to the following protocol. At each node of the parent/child tree, a statistical model is estimated for each of the variables at the same level. The statistical model is a Bayesian bootstrap, logistic regression, or linear regression (possibly with transformed inputs). The missing data phase included nine iterations of estimation. The synthetic data phase occurred on the 10th iteration. Four missing data implicates were created. These constitute the completed data files that are the inputs to the synthesis phase. Four synthetic implicates were created for each missing data implicate, for a total of 16 synthetic implicates on the released file. Because copying the final weight to each implicate of the synthetic data would have provided an additional unsynthesized variable with 55 552 distinct values, the disclosure risk associated with the weight variable had to be addressed. A synthetic weight using a PPD based on the Multinomial/Dirichlet natural conjugate likelihood and prior was created.
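
      The two-phase structure described above (four completed-data implicates, each yielding four synthetic implicates, 16 in total) can be illustrated for a single variable with a Bayesian bootstrap. This is a simplified sketch under stated assumptions, not the SSB production procedure: it handles one variable with no parent/child tree, no regression models, and no repeated estimation iterations, and all function names are hypothetical.

```python
import random


def dirichlet_weights(n, rng):
    """Draw Dirichlet(1, ..., 1) weights as normalized Exp(1) variates."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def bayesian_bootstrap_draws(observed, k, rng):
    """Return k draws from the observed values under one set of Dirichlet weights."""
    weights = dirichlet_weights(len(observed), rng)
    return rng.choices(observed, weights=weights, k=k)


def complete(values, rng):
    """Phase 1: fill each missing entry (None) with a Bayesian bootstrap draw."""
    observed = [v for v in values if v is not None]
    n_missing = len(values) - len(observed)
    fills = iter(bayesian_bootstrap_draws(observed, n_missing, rng))
    return [v if v is not None else next(fills) for v in values]


def make_implicates(values, m=4, r=4, seed=0):
    """Create m completed-data implicates and r synthetic implicates for each."""
    rng = random.Random(seed)
    implicates = {}
    for i in range(m):
        completed = complete(values, rng)
        for j in range(r):
            # Phase 2: replace every value with a draw based on the completed data.
            implicates[(i, j)] = bayesian_bootstrap_draws(completed, len(completed), rng)
    return implicates


# Hypothetical data with two missing values.
data = [1.0, None, 3.0, 4.0, None]
implicates = make_implicates(data)  # keys (i, j): 4 x 4 = 16 synthetic implicates
```

      Indexing each synthetic implicate by the completed-data implicate it came from keeps the nesting visible, which matters when variance estimates must separate missing-data uncertainty from synthesis uncertainty.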

      2.3.2.3 Disclosure Avoidance Assessment

      The link of administrative earnings, benefits and SIPP data adds a significant amount of information to an already very detailed survey and could pose potential disclosure risks beyond those originally managed as part of the regular SIPP public use file disclosure avoidance process. The synthesis of the earnings