The past few decades have shown that experimental evaluations are feasible in a wide variety of settings. The field has gotten quite good at executing experiments that aim to answer questions about average impacts of policies and programs. Over this same time period there has been increased awareness of a broad range of cause-and-effect questions that evaluation research examines and corresponding methodological innovation and creativity to meet increased demand from the field. That said, experimental evaluations have been subject to criticism, for a variety of reasons (e.g., Bell & Peck, 2015).
The main criticism that compels this book is that experimental evaluations are not suited to disaggregating program impacts in ways that connect to program implementation or practice. That is, experiments have earned a reputation for being a relatively blunt tool, where program implementation details are a “black box.” The complexity, implementation, and nuance of a program itself tend to be overlooked when an evaluation produces a single number (the “impact”) to represent the program’s effectiveness.
Box 1.1 Definition and Origins of the Term “Black Box” in Program Evaluation
In the field of program evaluation, “black box” refers to how some impact evaluations are perceived to consider the program and its implementation. It is possible to evaluate the impact of a program without knowing much at all about what the program is. In that circumstance, the program itself is considered a black box, an unknown.
Perhaps the first published reference to black box appeared in a 1993 Institute for Research on Poverty discussion paper, “Prying the Lid from the Black Box” by David Greenberg, Robert Meyer, and Michael Wiseman (although two of these authors credit Larry Orr for using the black box term before then). This paper seems to have evolved and was published in 1994 as “Multisite Employment and Training Program Evaluation: A Tale of Three Studies” by the same trio, with follow-up papers in the decade that followed (e.g., Greenberg, Meyer, Michalopoulos, & Wiseman, 2003).
In the ensuing two decades, the term—as in getting inside the black box—has become associated with the idea of understanding the details of a program’s operations. A special section of the American Journal of Evaluation (volume 36, issue 4) titled ‘Unpacking the “Black Box” of Social Programs and Policies’ was dedicated to the methods; and three chapters of the 2016 New Directions for Evaluation (issue 152) considered “Inside the Black Box” evaluation designs and analyses.
Indeed, recent years have seen policymakers and funders—in government, private, and foundation sectors—desiring to learn more from their evaluations of health, education, and social programs. Although the ability to establish a program’s causal impact is an important contribution, it may be insufficient for those who immediately want to know what explains that treatment effect: Was the program effective primarily because of its quality case management? Did its use of technology in interacting with its participants drive impacts? Or are both aspects of the program essential to its effectiveness?
Answering these types of additional research questions about the key ingredients of an intervention’s success with the same degree of rigor requires a new perspective on the use of experiments in practice. This book considers a range of impact evaluation questions, most importantly those questions that focus on the impact of specific aspects of a program. It explores how a variety of experimental evaluation design options can provide the answers to these questions and suggests opportunities for experiments to be applied in more varied settings and focused on program improvement efforts.
The State of the Field
The field of program evaluation is large and diverse. Considering the membership and organizational structure of the U.S.-based American Evaluation Association (AEA)—the field’s main professional organization—the evaluation field covers a wide variety of topical, population-related, theoretical, contextual, and methodological areas. For example, the kinds of topics that AEA members focus on—as defined by the association’s sections, or Topical Interest Groups (TIGs), as they are called—include education, health, human services, crime and justice, emergency management, the environment, and community psychology. As of this writing, there are 59 TIGs in operation. The kinds of population-related interests cover youth; feminist issues; indigenous peoples; lesbian, gay, bisexual and transgendered people; Latinos/as; and multiethnic issues. The foundational, theoretical, or epistemological perspectives that interest AEA members include theories of evaluation, democracy and governance, translational research, research on evaluation, evaluation use, organizational learning, and data visualization. The contexts within which AEA members consider their work involve nonprofits and foundations, international and cross-cultural entities and systems, teaching evaluation, business and management, arts and cultural organizations, government, internal evaluation settings, and independent consultancies. Finally, the methodologies considered among AEA members include collaborative, participatory, and empowerment; qualitative; mixed methods; quantitative; program-theory based; needs assessment; systems change; cost-benefit and effectiveness; cluster, multisite, and multilevel; network analysis; and experimental design and analytic methods, among others. Given this diversity, it is impossible to classify the entire field of program evaluation neatly into just a few boxes. 
The literature regarding any one of these topics is vast, and the intersections across dimensions of the field imply additional complexity.
What this book aims to do is focus on one particular methodology: that of experimental evaluations. Within that area, it focuses further on designs to address the more nuanced question of what about a program drives its impacts. The book describes the basic analytic approach to estimating treatment effects, leaving full analytic methods to other texts that can provide the needed deeper dive.
Across the field, alternative taxonomies exist for classifying evaluation approaches. For example, Stern et al. (2012) identify five types of impact evaluations: experimental, statistical, theory based, case based, and participatory. The focus of this book is the first. Within the subset of the evaluation field that uses randomized experiments, there are several kinds of evaluation models, which I classify here as (1) large-scale experiments, (2) nudge or opportunistic experiments, (3) rapid-cycle evaluation, and (4) meta-analysis and systematic reviews.
Large-Scale Experiments
Perhaps the most commonly thought-of experiments are what I will refer to as “large-scale” impact studies, usually government-funded evaluations. These tend to be evaluations of federal or state policies and programs. Many are demonstrations, where a new program or policy is rolled out and evaluated. For example, beginning in the 1990s, the U.S. Department of Housing and Urban Development’s Moving to Opportunity Fair Housing Demonstration (MTO) tested the effectiveness of a completely new policy: that of providing people with housing subsidies in the form of vouchers under the condition that they move to a low-poverty neighborhood (existing policy did not impose the neighborhood poverty requirement).
Alternatively, large-scale federal evaluations can be reforms of existing programs, attempts to improve incrementally upon the status quo. For instance, a slew of welfare reform efforts in the 1980s and 1990s tweaked aspects of existing policy, such as changing the tax rate on earnings and its relationship to cash transfer benefit amounts, or changing the amount in assets (such as a vehicle’s value) that a person could have while maintaining eligibility for assistance. These large-scale experiments usually consider broad and long-term implications of policy change, and, as such, take a fair amount of time to plan, implement, and generate results.
This slower process of planning and implementing a large-scale study, and affording the time needed to observe results, is also usually commensurate with the importance of the policy decisions: Even small effects of changing the tax rate on earnings for welfare recipients can result in large savings (or