Uncertainty quantification and reduction are fundamental challenges in the environmental and hydrological sciences. Uncertainties arise from data scarcity, limited observability, and our inability to fully resolve spatial variability and dynamics, or to correctly define all the physical, chemical, and biological processes involved, together with their boundary and initial conditions and forcing terms [e.g., Christakos, 1992; Rubin, 2003; Oreskes et al., 1994]. As a consequence, full validation of model concepts and perfect model calibration are almost impossible tasks [Oreskes et al., 1994].
 In the hydro(geo)logical sciences, we often use models to predict and address scientific hypotheses or challenges in engineering and management under uncertainty:
 Should we shut down a drinking water well before contamination arrives [e.g., Frind et al., 2006; Enzenhöfer et al., 2012]?
 How large is the risk emanating from a contaminated site [e.g., Troldborg et al., 2010]?
 Is natural attenuation occurring [e.g., Schwede and Cirpka, 2010; Cvetkovic, 2011]?
 Is a proposed remediation design safe [e.g., Cirpka et al., 2004; Bolster et al., 2009]?
 Is a high-level radioactive waste site safe [e.g., Andricevic and Cvetkovic, 1996]?
 Is a CO2 injection site safe [e.g., Oladyshkin et al., 2011a, 2011b]?
 Is human health risk above a critical value or not [e.g., de Barros and Rubin, 2008; de Barros et al., 2009, 2011]?
 Is a proposed model adequate to answer these questions [e.g., Neuman, 2003; Refsgaard et al., 2006]?
 All of the above examples include scientific hypotheses, binary decisions, or binary questions of compliance with environmental performance metrics such as human health risk or maximum contaminant levels [see de Barros et al., 2012]. Pappenberger and Beven provide a list of further studies in which binary decisions were taken under uncertainty. They also report that practitioners often complain about the discrepancy between soft uncertainty bounds and the binary character of decisions.
 One way of dealing with binary questions in the face of uncertainty is to formalize them as hypothesis tests. This makes it possible to systematically and rigorously test assumptions, models, predictions, or decisions. Beven summarizes strong arguments that models and theories in the environmental sciences are nothing else but hypotheses. A prominent example is the work by Luis and McLaughlin, who indeed approach model validation via formal statistical hypothesis testing. Consequently, modelers and scientists should admit the hypothesis-like character of models and their underlying theories, conceptualizations, assumptions, parameterizations, and parameter values. We propose that the same approach should be taken to support any type of decision that modelers, engineers, scientists, and managers need to take under uncertainty. One should treat model predictions and derived conclusions and decisions as hypotheses, and one should continuously try to assess and test their validity.
 The credibility of and confidence in any answer or decision under uncertainty increases with well-selected additional data that help to better test, support, and calibrate the involved assumptions, models, and parameters. However, data must be collected in a rational and goal-oriented manner because field campaigns and laboratory analysis are expensive while budgets are limited [e.g., James and Gorelick, 1994].
 This is where optimal design and geostatistical optimal design [e.g., Pukelsheim, 2006; Ucinski, 2005; Christakos, 1992] come into play. Optimal design (OD) gets the maximum gain of information from limited sampling, field campaigns, or experimentation. It optimizes the projected trade-offs between the costs spent on additional data versus the higher level of information. It can be used to optimize (1) what types of data (e.g., material parameters, state variables) to collect, (2) where to sample (e.g., the spatial layout and time schedule of observation networks), and (3) how to best excite the system to observe an informative response (e.g., designing tracer injections or hydraulic tests). Many applications in groundwater hydrology can be found in the literature [e.g., James and Gorelick, 1994; Reed et al., 2000a; Herrera and Pinder, 2005; Nowak et al., 2010; Leube et al., 2012].
 Classical OD theory is based on utility theory [e.g., Fishburn, 1970]. It optimizes the utility of sampling [e.g., Pukelsheim, 2006], which is traditionally defined as increased information [Bernardo, 1979] or reduced uncertainty (measured by variances, covariances, or entropies [e.g., Pukelsheim, 2006; Nowak, 2010; Abellan and Noetinger, 2010]). Unfortunately, these are only surrogate (approximate) measures for the actual utility, rather than ultimate measures [e.g., Loaiciga et al., 1992]. The use of surrogates may corrupt the optimality of designs for the originally intended purpose.
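As a minimal illustration of such surrogate measures (all numbers here are hypothetical), the following sketch shows, for a Gaussian prediction, how one direct noisy measurement reduces predictive variance, and how differential entropy quantifies the corresponding information gain:

```python
import math

def posterior_variance(prior_var, noise_var):
    """Bayesian update of a Gaussian prediction by one direct, noisy
    measurement: precisions (inverse variances) add."""
    return 1.0 / (1.0 / prior_var + 1.0 / noise_var)

def gaussian_entropy(var):
    """Differential entropy of a one-dimensional Gaussian (in nats)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

prior_var = 4.0   # hypothetical predictive variance before sampling
noise_var = 1.0   # hypothetical measurement-error variance

post_var = posterior_variance(prior_var, noise_var)   # reduced to 0.8
info_gain = gaussian_entropy(prior_var) - gaussian_entropy(post_var)
```

Both the variance reduction and the entropy difference measure "less uncertainty", but neither says anything yet about whether the decision at hand actually becomes more reliable, which is precisely the critique of surrogate measures above.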
 Goal-oriented approaches define the utility of sampling via ultimate measures, i.e., measures defined in the context of a given management application [e.g., Ben-Zvi et al., 1988; James and Gorelick, 1994; Feyen and Gorelick, 2005; Bhattacharjya et al., 2010; Li, 2010]. Thus, optimal sampling and field campaign strategies can adapt to the interplay between the actual information needs of the goal at hand, the available measurement and investigation techniques, and the specific composition of uncertainty [e.g., Maxwell et al., 1999; de Barros et al., 2009; Nowak et al., 2010]. For instance, de Barros et al. showed how the utility of data depends on the considered environmental performance metric (e.g., maximum concentration levels, travel times, or human health risk).
 Our work focuses, for now, on situations where the objective for field campaigns or experimentation is to support binary decision problems with maximum confidence in a one-time field effort. By proper formulation of corresponding hypothesis tests, we cast the binary decision problem into the context of Bayesian decision theory [e.g., Berger, 1985] and Bayesian hypothesis testing [e.g., Press, 2003]. The latter differs from classical testing in two respects: (1) it can absorb prior knowledge on the likelihood of hypotheses, and (2) it can assess the probabilities of all possible decision errors. Then, we optimize field campaigns in a goal-oriented manner such that the total probability of decision error is minimized. The outcome of our approach is a set of data collected rationally toward the specific hypothesis, question, or decision at hand, providing the desired confidence at minimal cost.
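A minimal numerical sketch of this error criterion (all distributions and numbers are hypothetical): when each hypothesis implies a Gaussian distribution for the test statistic, the total error probability prior-weights the two conditional error probabilities, and it shrinks as data reduce the statistic's spread.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def total_error_probability(prior_h0, mu0, mu1, sigma, threshold):
    """Prior-weighted total probability of a wrong binary decision when
    deciding D1 (accept H1) iff the observed statistic exceeds `threshold`.
    Both hypotheses are assumed to imply Gaussian statistics with common
    spread `sigma` (a toy setting, not the general method)."""
    p_alpha = 1.0 - phi((threshold - mu0) / sigma)  # P(D1 | H0 true)
    p_beta = phi((threshold - mu1) / sigma)         # P(D0 | H1 true)
    return prior_h0 * p_alpha + (1.0 - prior_h0) * p_beta

# Hypothetical numbers: equal priors, means 0 vs 2, threshold halfway.
err_coarse = total_error_probability(0.5, 0.0, 2.0, 1.0, 1.0)
err_fine = total_error_probability(0.5, 0.0, 2.0, 0.5, 1.0)  # more data
```

Here `err_fine < err_coarse`: additional data that sharpen the statistic's distribution directly lower the total decision-error probability, which is the quantity our design optimization targets.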
 There is a substantial body of work in the literature that optimizes field campaigns via Bayesian decision theory, maximizing the expected data worth [e.g., Massmann and Freeze, 1987; James and Gorelick, 1994]. The data worth concept follows classical ideas from utility theory and decision theory [e.g., Fishburn, 1970; Ben-Zvi et al., 1988; Raiffa et al., 1995]. It assigns monetary units for utility and then weighs expected benefits against the costs of field campaigns. The practical difference between our suggested approach and classical data worth studies is twofold. First, our approach encompasses arbitrary hypothesis tests or binary questions, whereas data worth studies are restricted to management tasks that provide a context for monetizing data utility. Second, we do not maximize the monetary worth of data collection but instead minimize the error probability of binary decisions or conclusions derived from the data. In contrast to classical data worth analysis, this avoids commensuration, i.e., it does not require a common (monetary) scale for different values such as sampling costs versus improved scientific confidence or reduced health risk.
 An alternative approach to avoid commensuration is multiobjective optimization (MOO) [e.g., Sawaragi et al., 1985; Marler and Arora, 2004]. MOO provides a suite of Pareto-optimal candidate solutions, i.e., solutions that cannot improve in any aspect without degrading in at least one other aspect. The final decision is found by inspecting and discussing the trade-offs in the different aspects between the suggested Pareto optima [e.g., Reed and Minsker, 2004; Kollat and Reed, 2007]. Thus, the problem of commensuration or preference articulation is postponed to after the optimization, where more information on the trade-offs and their consequences are available. A second reason to use MOO is that there may be a multitude of competing or (seemingly) incompatible objectives by different stakeholders. Such objectives may as well evolve and change over time, especially in the design of long-term groundwater monitoring networks [e.g., Loaiciga et al., 1992; Reed and Kollat, 2012].
 Looking at binary decisions leads to classical single-objective optimization, just like most of optimal design theory or data worth concepts. We restrict our current work to the single-objective context, for now looking at the case where a planned field campaign should chiefly support a single decision problem in a one-time field effort. Still, nothing restricts our approach from integration into MOO approaches in future work, e.g., if other objectives coexist, or for a detailed trade-off analysis between improved decision confidence and costs of the field campaign.
 The hypotheses or decision problems that can be supported with our approach include, e.g., model validity, model choice for geostatistical, physical, chemical, or biological model assumptions, parameterization forms or closure assumptions, compliance with environmental performance metrics or other model predictions that are related to binary decisions, and reliability questions in optimization and management. In a synthetic test case for the sake of illustration, we feature the prediction of compliance with maximum contaminant levels as an environmental performance metric, looking at contaminant transport to an ecologically sensitive location through a two-dimensional heterogeneous aquifer with uncertain covariance function and uncertain boundary conditions.
2. General Approach and Mathematical Formulation
 Assume that a modeler, scientist, or manager is asked to provide a yes/no decision on a proposed statement. Due to the inherent uncertainty of the problem at hand, the answer can only be found at a limited confidence level. The outline of our approach for such situations is as follows:
 1. To cast the corresponding yes/no question or binary decision into a hypothesis test (section 2.1).
 2. To insert the hypothesis test into the Bayesian framework, which yields the probability of making a false decision (section 2.2).
 3. To analyze the expected reduction of error probability through planned field campaigns as the criterion for optimal design (section 2.3).
 4. To minimize the error criterion by optimizing the field campaign (section 2.3 and section A3).
 The obtained sampling schemes allow the proposed hypotheses (and the final decision) to be affirmed or refuted at the desired confidence and at minimum cost for the field campaign. In the following, we use a fully generic formulation. We illustrate our methodology with one scenario at two different levels of complexity in sections 3, 4, and 5.
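The four steps above can be condensed into a toy sketch (all distributions and numbers are hypothetical, and the one-dimensional Gaussian setting stands in for the full geostatistical problem): among candidate sampling efforts, pick the smallest one whose expected decision-error probability meets the required confidence.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_error(n, mu0=0.0, mu1=1.0, sigma=1.0, prior_h0=0.5):
    """Expected decision-error probability when the test statistic is the
    mean of n i.i.d. Gaussian observations and the decision threshold sits
    halfway between the two hypothesized means (toy setting)."""
    se = sigma / math.sqrt(n)               # standard error of the mean
    thr = 0.5 * (mu0 + mu1)                 # symmetric decision threshold
    p_alpha = 1.0 - phi((thr - mu0) / se)   # P(D1 | H0 true)
    p_beta = phi((thr - mu1) / se)          # P(D0 | H1 true)
    return prior_h0 * p_alpha + (1.0 - prior_h0) * p_beta

# Step 4: choose the smallest field effort that meets the error budget.
candidates = [1, 2, 4, 8, 16, 32]           # hypothetical sampling efforts
budget = 0.05                               # required decision confidence
best = min(n for n in candidates if expected_error(n) <= budget)
```

In the full method, `expected_error` is replaced by a preposterior Monte Carlo analysis over candidate designs, but the logic is the same: the design, not the monetary worth, is optimized against the error probability.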
2.1. Classical Hypothesis Testing
 This section summarizes the key steps of classical hypothesis testing and introduces the notation, before we move on to Bayesian hypothesis testing and optimal design. The well-known individual steps of hypothesis testing are [e.g., Stone, 1996; Casella and Berger, 2002]:
 1. Identify the null hypothesis H0 and the alternative hypothesis H1.
 Null hypothesis H0: A fallback assumption H0 on some target variable q holds.
 Alternative hypothesis H1: A (desirable) assumption H1 on q is true.
 H0 is the hypothesis that is accepted for the time being, while the burden of proof is on H1, which is the hypothesis one desires to prove [e.g., Shi and Tao, 2008]. By definition, falsely accepting H1 is the more critical type of error. This calls for sufficient evidence (i.e., statistically significant data) before accepting H1 over H0, and coincides well with the precautionary principle in policy, environmental sciences, and the public health sector [e.g., Kriebel et al., 2001].
 2. Choose a level of significance α. Typically, α should be small (e.g., α = 0.05). It is the probability of falsely accepting H1 although H0 is in fact true, and so controls the worse type of test failure. The complementary quantity 1 − β, i.e., the probability of correctly accepting H1 when it is true, is often called the power of the test.
 3. Decide what type of data are appropriate to judge H0 and H1, and define the relevant test statistic T that can be computed from the data.
 4. Derive the distribution p(T) of the test statistic, while considering all statistical assumptions (independence, distribution, etc.) of the data. More accurately, p(T) should be denoted as p(T | H0), because it is the distribution of T that would be observed if H0 was in fact true. Since T is a sample statistic, p(T | H0) will depend, among other things, on the sample size N.
 5. The significance level α and the character of the test (one-sided or two-sided) partition the distribution p(T | H0) into the critical or rejection region (reject H0) and the acceptance region (accept H0 due to lacking statistical evidence against it).
 6. Evaluate the observed value Tobs of the test statistic from the data.
 7. Depending on the value of Tobs, decide on the two hypotheses:
 Decision D0: Accept H0 if Tobs lies outside the critical region of p(T | H0).
 Decision D1: Else, reject H0 in favor of H1.
 Either of the decisions D0 and D1 could be false, leading to the corresponding error events:
 α error: Decision D1 was taken (H1 accepted) although hypothesis H0 is in fact true (a false positive of H1).
 β error: Decision D0 was taken (H0 accepted) although hypothesis H1 is in fact true (a false negative of H1).
 Within classical hypothesis tests, these errors can only be quantified via their conditional probabilities:

Pr(α error) = Pr(D1 | H0 true),  Pr(β error) = Pr(D0 | H1 true).

The significance level α is the maximum acceptable conditional probability for the α error: the test is constructed such that Pr(D1 | H0 true) ≤ α.
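For illustration, steps 2–7 of the classical test can be condensed into a short sketch for a one-sided test with a Gaussian test statistic (all means, spreads, and observed values are hypothetical):

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def inv_phi(p, lo=-10.0, hi=10.0):
    """Standard normal quantile via bisection (keeps the sketch
    dependency-free; scipy.stats.norm.ppf would serve the same purpose)."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha = 0.05                      # step 2: significance level
mu0, mu1, sigma = 0.0, 2.0, 1.0   # hypothetical statistic under H0 and H1

# Step 5: one-sided critical region is (t_crit, infinity).
t_crit = mu0 + inv_phi(1.0 - alpha) * sigma

def decide(t_obs):
    """Steps 6-7: D1 (reject H0) iff t_obs falls in the critical region."""
    return "D1" if t_obs > t_crit else "D0"

# Conditional beta error of this test: Pr(D0 | H1 true).
beta = phi((t_crit - mu1) / sigma)
```

Note that the classical construction fixes only the α error; the β error (here roughly 0.36 under these assumed numbers) is whatever the critical region implies, which is one motivation for the Bayesian treatment in section 2.2.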