Contextual Awareness is All You Need Post #5: Dr. Rumman Chowdhury
Throughout the history of startups and big tech innovation, metrics have served as both catalysts for breakthrough innovation and sources of organizational and societal misalignment. As AI becomes embedded in more aspects of daily life, the importance of accurate, robust measurement—often referred to as measurement science—has never been greater. This blog post explores why measurement matters, how metrics have shaped tech company behavior, and how measurement science can enhance system-level AI evaluation and societal impact assessment.
From the early days of software-as-a-service (SaaS) to the AI boom of today, metrics have played a fundamental role in how startups and tech companies are evaluated, funded, and scaled. In the SaaS era, investors relied on metrics like customer acquisition cost (CAC), lifetime value (LTV), and net revenue retention (NRR) to assess a company’s health and potential. These metrics created clear incentives for companies to follow: lower CAC, increase LTV, and keep customers engaged. When executed well, this led to sustainable growth and long-term value creation. The same measurement mindset carried over to AI models, where ‘performance’ was deemed the quality that mattered most.
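To make these definitions concrete, here is a minimal arithmetic sketch in Python. The formulas are the standard simplified versions, and every figure is hypothetical rather than drawn from any real company:

```python
# Illustrative SaaS metric arithmetic (all figures are hypothetical).

def cac(sales_marketing_spend: float, new_customers: int) -> float:
    """Customer acquisition cost: spend per customer acquired."""
    return sales_marketing_spend / new_customers

def ltv(avg_monthly_revenue: float, gross_margin: float, monthly_churn: float) -> float:
    """Lifetime value: margin-adjusted revenue over the expected customer lifetime."""
    return (avg_monthly_revenue * gross_margin) / monthly_churn

def nrr(start_arr: float, expansion: float, contraction: float, churned: float) -> float:
    """Net revenue retention: recurring revenue kept and grown within an existing cohort."""
    return (start_arr + expansion - contraction - churned) / start_arr

print(f"CAC: ${cac(500_000, 1_250):,.0f}")                    # $400 per customer
print(f"LTV: ${ltv(99.0, 0.80, 0.02):,.0f}")                  # $3,960 per customer
print(f"NRR: {nrr(1_000_000, 180_000, 40_000, 60_000):.0%}")  # 108%
```

Note how mechanical these numbers are: each one compresses a messy business reality into a single figure an investor can rank, which is exactly what makes them both useful and gameable.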
However, over-reliance on certain metrics has also created misaligned incentives. Goodhart’s Law warns that when a measure becomes a target, it ceases to be a good measure: companies fixated on user growth or revenue end up chasing the metric itself, prioritizing short-term gains over long-term product quality or user satisfaction. In the rapidly evolving AI industry, the temptation to optimize for headline metrics—such as model accuracy or inference speed—can obscure critical issues including fairness, robustness, and real-world impact. As noted in our recent paper, “Overemphasizing metrics leads to a variety of real-world harms, including manipulation, gaming, and a myopic focus on short-term qualities and inadequate consideration of long-term consequences.”[1]
As the AI revolution has accelerated, the old playbook has begun to fray. The rapid scaling of AI companies—sometimes reaching mass adoption in just two years—has forced investors and founders to rethink which metrics truly matter. AI startups today are evaluated on a mix of traditional and novel metrics: gross margin, retention, engagement, proprietary technology, and revenue growth. Implicit in this mix is the assumption that in silico technical performance automatically equates to ‘better technology’ once deployed. Yet, as the industry matures, it is becoming clear that technical benchmarks alone are insufficient for assessing real-world value. Companies that focus narrowly on these technical, computationally-centric metrics often fail to capture the broader business or societal impact of their AI systems.
Measurement science is the systematic study of how to define, collect, and interpret measurements to ensure they are valid, reliable, and meaningful. Historically, AI progress has been measured using standardized datasets and benchmarks—think MNIST for handwritten digit recognition or ImageNet for object classification. These benchmarks have played a critical role in driving technical progress, but they have limitations. They often fail to capture how AI systems perform in real-world, dynamic environments, where data distributions shift and user needs evolve. In the context of AI, measurement science offers a path forward: moving beyond static benchmarks focused on system performance and functionality toward dynamic, context-aware methods for evaluating system utility, relevant risks, and their associated impacts. By integrating qualitative expert knowledge, continuous monitoring, and adaptive evaluation frameworks, we can build a more complete picture of AI capabilities and risks.
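As one hedged illustration of what ‘reliable’ means in practice, the sketch below computes Cohen’s kappa, a standard chance-corrected agreement statistic, for two hypothetical expert annotators labeling model outputs. The labels and data are invented; the point is that if the annotators themselves do not agree, no benchmark score built on their labels should be trusted:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labeled independently at their base rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical expert labels for ten model outputs ("safe" vs "unsafe").
a = ["safe", "safe", "unsafe", "safe", "unsafe", "safe", "safe", "unsafe", "safe", "safe"]
b = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe", "safe", "safe", "safe", "safe"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # ~0.47: moderate agreement, not yet a trustworthy measure
```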
Evaluating AI at the system level—rather than just the model level—is essential for understanding its true impact. System-level evaluation considers not only the technical performance of an AI model but also how users interact with systems in deployment settings, the efficacy of organizational processes for managing the system, and the broader ecosystem in which the technology is placed. This approach is especially important when considering societal impact, where the consequences of AI errors or harmful biases can be far-reaching. To achieve these goals, effective AI measurement requires integrating perspectives from computer scientists, social scientists, domain experts, and affected communities. Each stakeholder group prioritizes different aspects—developers focus on technical performance, users care about utility, and regulators emphasize safety and compliance.
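To make the distinction concrete, here is a toy sketch of what a system-level scorecard might hold. All stakeholder names, metrics, and values are invented for illustration; the structural point is that a model-level report surfaces only the first row:

```python
# Hypothetical system-level scorecard: one deployment, scored through
# different stakeholder lenses (names and values are illustrative only).
scorecard = {
    "developer": {"benchmark_accuracy": 0.91, "latency_ms": 220},
    "user":      {"task_completed": 0.74, "reported_utility": 3.6},
    "operator":  {"incident_rate": 0.02, "override_rate": 0.11},
    "regulator": {"audit_findings": 2, "compliance_gaps": 1},
}

# A model-level evaluation reports only the "developer" row; a system-level
# evaluation must reconcile all four before declaring the system "good".
for stakeholder, metrics in scorecard.items():
    print(f"{stakeholder:>10}: " + ", ".join(f"{k}={v}" for k, v in metrics.items()))
```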
The way we measure AI performance shapes how tech companies prioritize their efforts. When investors and regulators focus on narrow technical benchmarks, startups are incentivized to optimize for those benchmarks, sometimes at the expense of broader goals like fairness, safety, or long-term sustainability. Conversely, when evaluation frameworks include a wider range of metrics—such as business outcomes, user impact, and regulatory compliance—companies are more likely to build systems that deliver real value and minimize harm. Recent research from the Stanford AI Index and McKinsey supports this view: companies that measure AI success based on long-term business outcomes are 3.8 times more likely to see significant financial returns from their AI initiatives compared to those relying solely on technical accuracy measures. This shift in focus is not just good for business; it is also essential for ensuring that AI systems are developed and deployed responsibly.
This means the future measurement landscape will expand beyond static, single-turn, in silico benchmarks to embrace field testing, red teaming, and longitudinal studies that capture how AI systems are actually used and how their effects ripple through organizations and society. Social science experimental design offers a wealth of tools for this purpose: randomized controlled trials (RCTs), quasi-experimental designs, natural experiments, and mixed-methods research can all be adapted to study AI in context. For example, RCTs can be used to assess the impact of AI-driven interventions in education or healthcare, while mixed-methods approaches can combine quantitative data on system performance with qualitative insights from user interviews, focus groups, and observational studies.
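As a minimal sketch of the RCT logic, assuming a hypothetical AI tutoring intervention with simulated outcomes, the Python below randomizes users into treatment and control and estimates the effect with a Welch-style standard error. In a real study the outcomes would come from the field, not a random number generator:

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)

# Hypothetical RCT: does an AI tutoring assistant change test scores?
# Outcomes are simulated here; a real study would measure them in the field.
n = 200
assignments = ["treatment"] * (n // 2) + ["control"] * (n // 2)
random.shuffle(assignments)  # randomization is what licenses the causal claim

y_t = [random.gauss(73.0, 10.0) for a in assignments if a == "treatment"]  # assumed +3 true effect
y_c = [random.gauss(70.0, 10.0) for a in assignments if a == "control"]

effect = mean(y_t) - mean(y_c)
se = sqrt(stdev(y_t) ** 2 / len(y_t) + stdev(y_c) ** 2 / len(y_c))  # Welch standard error
print(f"estimated effect: {effect:+.2f} points (SE {se:.2f})")
print(f"approx. 95% CI: [{effect - 1.96 * se:.2f}, {effect + 1.96 * se:.2f}]")
```

The quantitative estimate is only half the design: the mixed-methods step pairs a number like this with interviews and observation to explain why the effect appears, or fails to.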
Recent frameworks, such as value-sensitive design (VSD), provide a blueprint for integrating contextual awareness into AI evaluation. VSD’s tripartite methodology—conceptual, empirical, and technical—guides evaluators to specify the relevant context, collect and generate contextually informed data, and iteratively refine system design based on real-world feedback. This approach is especially well-suited to studying the second-order effects of AI, such as shifts in user behavior, workforce transformations, and long-term societal consequences.
A new evaluation ecosystem must also address the challenges of reproducibility, transparency, and stakeholder engagement. Advances in open science, pre-registration, and adversarial collaboration can help ensure that findings are robust and generalizable. Moreover, interdisciplinary collaboration is essential: just as the original sociotechnical systems movement brought together engineers and social scientists, the next generation of AI measurement will require close partnership between computer scientists, measurement experts, behavioral scientists, and domain specialists.
Ultimately, the goal is to build a measurement infrastructure that is as dynamic and adaptive as the AI systems it seeks to evaluate. This infrastructure will need to support continuous monitoring, feedback loops, and the integration of diverse data sources—from system logs and user surveys to third-party audits and regulatory filings. By grounding AI evaluation in both measurement science and social science experimental design, we can develop a more complete understanding of AI’s real-world effects.
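One hedged sketch of what such a feedback loop might look like: hypothetical signals from logs and surveys are pooled into a window and compared against baselines, with the source names, metrics, and tolerance all invented for illustration:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Signal:
    source: str   # e.g. "system_logs", "user_survey", "third_party_audit"
    metric: str
    value: float

def check_drift(baseline: dict[str, float], window: list[Signal],
                tolerance: float = 0.10) -> list[str]:
    """Flag any metric whose windowed average drifts beyond tolerance of its baseline."""
    alerts = []
    for name, base in baseline.items():
        values = [s.value for s in window if s.metric == name]
        if values and abs(mean(values) - base) / base > tolerance:
            alerts.append(f"{name}: {mean(values):.2f} vs baseline {base:.2f}")
    return alerts

# Hypothetical feed mixing deployment logs with survey results.
window = [
    Signal("system_logs", "task_success_rate", 0.78),
    Signal("system_logs", "task_success_rate", 0.74),
    Signal("user_survey", "satisfaction", 3.1),
]
baseline = {"task_success_rate": 0.90, "satisfaction": 4.0}
for alert in check_drift(baseline, window):
    print("DRIFT:", alert)
```

The design choice worth noting is that the loop is source-agnostic: a third-party audit finding and a server log enter the same pipeline, which is what lets the infrastructure adapt as new evidence streams come online.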
Accurate measurement is the foundation of responsible AI development and deployment. The history of tech and AI startups shows that metrics can both incentivize innovation and create harmful misalignments when overemphasized or poorly designed. Measurement science, informed by the sociotechnical perspective and social science methodologies, offers a way forward by providing the tools and frameworks needed to evaluate AI systems at the system level and assess their true impact on society. As AI continues to reshape industries and daily life, the need for robust, context-aware measurement has never been greater. By embracing measurement science and broadening the scope of AI evaluation, we can ensure that AI delivers on its promise—not just for businesses, but for society as a whole.
Sources
[1] Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AI’s Real World Effects https://arxiv.org/pdf/2505.18893.pdf