Hypothetical scenarios provide a popular alternative to field experiments for scholars interested in nudging behavior change, and they account for a substantial proportion of such studies in the domains of finance, transportation, and sustainability. Yet their validity as proxies for real-world contexts is unclear. To investigate, we designed four styles of hypothetical scenarios to approximate five recent field studies of nudges in distinct domains, running a total of 20 pre-registered experiments (N=16,071, n>200 per cell). This design allows a clear comparison of old field data with new hypothetical data. We find that hypothetical outcomes are consistently biased upwards: participants engage more in target behaviors, by a median factor of 3.81 compared to the original field experiment. Hypothetical estimates of treatment effects, by contrast, are unpredictable: sometimes bigger, sometimes smaller, sometimes calibrated. Further, none of our four hypothetical designs reliably reduced estimation error. Without a gold standard approach to constructing hypothetical scenarios, behavioral researchers should use caution when employing this low-cost but unreliable tool to evaluate nudge interventions.