Verification, Validation, and Clinical Reality
How digital health built an evaluation framework from proven concepts
Why This Matters
Evaluation frameworks don’t start from zero
My prior post talked about how evaluation frameworks developed for radar systems during World War II still provide valuable insights for evaluating modern AI systems. The underlying principle: we don’t need to reinvent the wheel for every new technology.
This week I want to give a more detailed example of not needing to reinvent the wheel in evaluation frameworks in healthcare technology. This example comes from digital health. I’ll discuss a highly cited paper [1] whose senior author (Prof. Jessilyn Dunn at Duke) is an academic leader in digital health. The paper, which is a personal favorite of mine, is entitled “Verification, analytical validation, and clinical validation (V3): the foundation of determining fit-for-purpose for Biometric Monitoring Technologies (BioMeTs)”.
The paper proposes a framework for evaluating what the authors refer to as BioMeTs (Biometric Monitoring Technologies). I’ll discuss the digital health context for the paper, how the authors adapted established concepts from other fields, and the three-part “V3” (verification, analytical validation, and clinical validation) framework they propose.
Why write about a digital health evaluation framework? Because digital health sits at an interesting intersection: it’s newer than traditional medical devices but more established than current AI systems. The evaluation challenges the field has grappled with (how to assess tightly integrated hardware-software systems, how to validate algorithmic outputs, how to ensure real-world clinical utility) offer valuable lessons. In future posts, I’ll explore how these concepts might transfer to evaluating medical AI systems, especially those that use non-deterministic models such as LLMs.
Digital Health: Some Historical Context
Two revolutions at once
There are two ways to look at the birth of the field of digital health. One is from the pure tech perspective. The introduction of the iPhone almost 20 years ago led in relatively short order to almost everyone having an internet-connected computer in their pocket. In the same year that the iPhone was released, Fitbit was founded, which kicked off an explosion in fitness monitors and smart watches from Apple, Garmin, Google, and others. Consumer tech started to paint around the edges of healthcare.
At the same time, medicine itself was becoming more digital. The first continuous glucose monitoring (CGM) systems were introduced around the turn of the 21st century. CGM attracted heavy hitters, including Medtronic, Dexcom, and Abbott [2]. Due in part to the smartphone revolution, and in part to known limitations of in-office blood pressure measurement, there was an explosion in at-home devices for blood pressure monitoring. This explosion was so consequential that the American Heart Association now recommends that all people with hypertension monitor their blood pressure at home [3].
Medicine was also becoming more digital through the 2009 HITECH Act [4], which heavily incentivized the adoption of electronic health records (EHRs) by healthcare systems. On one hand, EHRs have numerous limitations (e.g. data siloing, usability challenges), and their attendant documentation burden has contributed to provider burnout. On the other hand, the HITECH Act prompted the deployment of essential (if incomplete) infrastructure for the emerging field of digital health.
Digital Health: Hardware + Software
When neither component stands alone
Digital health technologies represent an evolution in healthcare technology in that they are a genuine hybrid of hardware and software. Yes, medical devices have used software of one kind or another for decades. However, the traditional use of that software is sandboxed to the functioning of the device itself. Or, from a different view, the hardware represents the main value of the device. Of course, nothing is ever that black-and-white. However, digital health technologies give an instantaneous readout of hardware sensors (e.g. what is my blood glucose right now) and also use software algorithms that give insight into time-series data (e.g. recent trends in blood glucose levels). They can even start to make diagnostic-type inferences (e.g. are recent blood pressure values consistent with hypertension).
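To make the difference between an instantaneous readout and a time-series insight concrete, here is a minimal Python sketch. It is my own illustration, not something from the paper: the function names `latest_reading` and `trend_per_step` are hypothetical, the glucose values are made up, and a least-squares slope is only one simple way a device might summarize a recent trend.

```python
from statistics import mean


def latest_reading(readings):
    """Instantaneous readout: the most recent sensor value."""
    return readings[-1]


def trend_per_step(readings):
    """Least-squares slope of the readings against their sample index.

    A positive value means the signal is rising across the window.
    """
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to estimate a trend")
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(readings)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, readings))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Made-up glucose values (mg/dL), one per sampling interval.
glucose = [100, 104, 108, 112, 116]

print(latest_reading(glucose))   # the "right now" value from the sensor
print(trend_per_step(glucose))   # change per sampling interval over the window
```

The first function needs only the sensor. The second is where software starts adding value that the hardware alone does not provide.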
Tried and True Concepts: Fit-For-Purpose, Verification, and Validation
Established ideas adapted for new challenges
This close relationship between hardware and software is one reason why the authors developed a new framework for evaluation of BioMeT digital health technologies. They draw on three key established concepts: fit-for-purpose, verification, and validation.
Fit-for-Purpose: The concept of fit-for-purpose addresses the question: is this technology adequate for its specified intended use? The concept is not specific to biomedical technologies and has its origins in manufacturing and engineering. Its wide use should not be a surprise. After all, “does the thing do what it’s supposed to do” is an evergreen question. As it relates to the development of digital health technology, fit-for-purpose is important because those who develop technology must define the parameters of intended use. Early in R&D, defining fit-for-purpose may be an iterative process. But, in the end, it needs to be clearly defined in order to successfully verify and validate a new technology.
Verification and Validation: Verification and validation are intertwined concepts. Their context-specific definitions vary, as systematically highlighted by the authors. For the sake of brevity, I’ll use the definitions in the IEEE Standard for System, Software, and Hardware Verification and Validation document [5] cited in Table 1 in the manuscript. In this document, verification functions as an internal consistency check and validation as an external reality check. Verification is the “process of evaluating a system or component to determine whether the products of a given development phase satisfy the conditions imposed at the start of that phase.” Validation, on the other hand, is the “process of evaluating a system or component during or at the end of the development process to determine whether it satisfies specified requirements.” This includes ensuring that the product “satisfies intended use and user needs.”
I know this is a lot of text to go through, so here is a simpler way to think about it. Verification is asking “Can the device measure heart rate as it was designed to?” Validation is asking “Is the device’s measurement of heart rate clinically useful in the way it was anticipated?”
Bringing It All Together: The V3 Framework
V3: Verification, Analytical Validation, Clinical Validation
The distinction between internal consistency (verification) and external reality check (validation) is carried forward by the authors of the V3 framework in the context of tightly integrated hardware and software. Again, the point isn’t that software has never been part of medical hardware before. Rather, it’s that the tightly integrated hardware-software device has functionality and value that cannot be neatly attributed to either its hardware or software. The authors break down the verification and validation process into three parts: verification, analytical validation, and clinical validation.
Verification: Verification is concerned with the performance of the enabling hardware sensor of the device. The sensor could be chemical (e.g. glucose sensor), optical (e.g. smartwatch heart rate sensor), electrical (e.g. electrocardiographic sensor), and so on. The verification process focuses on sensor-level data and asks the question: does the sensor measure what it is expected to measure?
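As a rough illustration of what a sensor-level check could look like, the Python sketch below compares raw sensor samples against a reference instrument and reports whether the mean absolute error stays within a tolerance. Everything here is an assumption for illustration: the function name `verify_sensor`, the choice of mean absolute error, the tolerance, and the heart rate values are mine, not the paper’s.

```python
def verify_sensor(sensor, reference, max_mean_abs_error):
    """Compare paired sensor and reference samples.

    Returns the mean absolute error and whether it is within tolerance.
    """
    if not sensor or len(sensor) != len(reference):
        raise ValueError("sensor and reference must be non-empty and equal length")
    errors = [abs(s - r) for s, r in zip(sensor, reference)]
    mean_abs_error = sum(errors) / len(errors)
    return {
        "mean_abs_error": mean_abs_error,
        "passes": mean_abs_error <= max_mean_abs_error,
    }


# Made-up heart rate samples (beats per minute).
sensor_bpm = [71, 80, 92, 101]      # hypothetical device under test
reference_bpm = [70, 82, 90, 100]   # hypothetical reference instrument

result = verify_sensor(sensor_bpm, reference_bpm, max_mean_abs_error=2.0)
print(result)
```

Note that nothing in this check says whether the measurement is useful to a clinician. It only asks whether the sensor output matches what the specification says it should be.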
Analytical Validation: Analytical validation is concerned with the performance of the algorithms enabled by the software. Specifically, it asks the question: do the software algorithms, in conjunction with the measured sensor data, deliver the expected metrics or predictions? To use an example from the paper: given the verified ability of a bioelectrical sensor to capture electrocardiographic data, can the algorithms correctly identify the presence of arrhythmias, as judged by established performance criteria?
Clinical Validation: Clinical validation is concerned with device performance in the context in which it will be used. It asks the question: are the algorithmically derived outputs of the device meaningful in its “stated context of use”? Continuing the example of an electrocardiographic device with algorithms that can identify arrhythmias, the authors state that clinical validation would ensure that, for example, the device can “acceptably detect atrial fibrillation in adults.”
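One common way to express “established performance criteria” for a detection algorithm is sensitivity and specificity against reference annotations. The Python sketch below assumes that framing. The function name `sensitivity_specificity` and the labels are hypothetical, and the paper does not prescribe these particular metrics.

```python
def sensitivity_specificity(predicted, reference):
    """Score boolean algorithm outputs against boolean reference labels."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must be the same length")
    pairs = list(zip(predicted, reference))
    true_pos = sum(1 for p, r in pairs if p and r)
    false_neg = sum(1 for p, r in pairs if not p and r)
    true_neg = sum(1 for p, r in pairs if not p and not r)
    false_pos = sum(1 for p, r in pairs if p and not r)
    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("reference must contain both positive and negative cases")
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Made-up labels: True means "arrhythmia present" in an ECG segment.
algorithm_labels = [True, True, True, False, False, False, False, True]
reference_labels = [True, True, True, True, False, False, False, False]

sens, spec = sensitivity_specificity(algorithm_labels, reference_labels)
print(sens, spec)
```

The inputs here are algorithm outputs, not raw sensor samples, which is the shift in focus from verification to analytical validation.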
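Taken together, the three parts can be read as evidence that is gathered in order for one stated context of use. The Python sketch below encodes that reading. It is my interpretation, not a structure defined in the paper: the class `V3Evidence`, its field names, and the rule that all three stages must pass are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Stage order follows the V3 name: verification, then the two validations.
STAGES = ("verification", "analytical_validation", "clinical_validation")


@dataclass
class V3Evidence:
    """Pass/fail evidence for one device in one stated context of use."""
    context_of_use: str
    verification: bool = False
    analytical_validation: bool = False
    clinical_validation: bool = False

    def first_unmet_stage(self) -> Optional[str]:
        """Return the earliest stage without passing evidence, or None."""
        for stage in STAGES:
            if not getattr(self, stage):
                return stage
        return None

    def supports_fit_for_purpose(self) -> bool:
        """Assumed rule: all three stages must pass."""
        return self.first_unmet_stage() is None


evidence = V3Evidence(
    context_of_use="detect atrial fibrillation in adults",
    verification=True,
    analytical_validation=True,
)
print(evidence.first_unmet_stage())
print(evidence.supports_fit_for_purpose())
```

Tying the evidence to a single `context_of_use` string is deliberate. The same sensor and algorithm would need a separate clinical validation for a different population or a different use.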
Looking Ahead
From digital health to AI evaluation
I wanted to write about this paper because I think many of the concepts in it are useful in developing approaches for evaluation of AI. One of the concepts is being clear about the extent to which AI is “sharing the stage” with another technology, that is, being clear about where AI is significantly interacting with other software components or hardware. Another is the idea of first testing for internal consistency during the course of development and then performing external reality checks to ensure that the product is meeting fit-for-purpose criteria.
It is early days in developing robust frameworks for non-deterministic AI in healthcare [6, 7], and it is important to have these concepts in mind not only to leverage the lessons of prior works but also to communicate those frameworks to others using concepts that they readily understand.
[1] Goldsack, J. C., Coravos, A., Bakker, J. P., Bent, B., Dowling, A. V., Fitzer-Attas, C., Godfrey, A., Godino, J. G., Gujar, N., Izmailova, E., Manta, C., Peterson, B., Vandendriessche, B., Wood, W. A., Wang, K. W., & Dunn, J. (2020). Verification, analytical validation, and clinical validation (V3): the foundation of determining fit-for-purpose for Biometric Monitoring Technologies (BioMeTs). npj Digital Medicine, 3(1), 55. https://doi.org/10.1038/s41746-020-0260-4
[2] Hirsch, I. (2018). Introduction: History of Glucose Monitoring. ADA Clinical Compendia, 1–1. https://doi.org/10.2337/db20181-1
[3] Monitoring Your Blood Pressure at Home. (n.d.). www.heart.org. Retrieved December 21, 2025, from https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings/monitoring-your-blood-pressure-at-home
[4] Health Information Technology for Economic and Clinical Health Act. (2025). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Health_Information_Technology_for_Economic_and_Clinical_Health_Act&oldid=1308776672
[5] IEEE Standard for System, Software, and Hardware Verification and Validation. IEEE Std 1012-2016 (Revision of IEEE Std 1012-2012 / Incorporates IEEE Std 1012-2016/Cor1-2017), 1–260 (2017). https://doi.org/10.1109/IEEESTD.2017.8055462
[6] Tam, T. Y. C., Sivarajkumar, S., Kapoor, S., Stolyar, A. V., Polanska, K., McCarthy, K. R., Osterhoudt, H., Wu, X., Visweswaran, S., Fu, S., Mathur, P., Cacciamani, G. E., Sun, C., Peng, Y., & Wang, Y. (2024). A framework for human evaluation of large language models in healthcare derived from literature review. npj Digital Medicine, 7(1), 258. https://doi.org/10.1038/s41746-024-01258-7
[7] Jiang, Y., Black, K. C., Geng, G., Park, D., Zou, J., Ng, A. Y., & Chen, J. H. (2025). MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents. NEJM AI, 2(9). https://doi.org/10.1056/AIdbp2500144


