
Medical AI tools are growing, but are they being tested properly?

March 8, 2025

Artificial intelligence algorithms are being built into almost all aspects of health care. They’re integrated into breast cancer screenings, clinical note-taking, health insurance management and even phone and computer apps to create virtual nurses and transcribe doctor-patient conversations. Companies say that these tools will make medicine more efficient and reduce the burden on doctors and other health care workers. But some experts question whether the tools work as well as companies claim they do.

AI tools such as large language models, or LLMs, which are trained on vast troves of text data to generate humanlike text, are only as good as their training and testing. But the publicly available assessments of LLM capabilities in the medical domain are based on evaluations that use standardized medical exams, such as the MCAT, an admission test for medical school. In fact, a review of studies evaluating health care AI models, specifically LLMs, found that only 5 percent used real patient data. Moreover, most studies evaluated LLMs by asking questions about medical knowledge. Very few assessed LLMs’ abilities to write prescriptions, summarize conversations or have conversations with patients, which are the tasks LLMs would do in the real world.

The current benchmarks are distracting, computer scientist Deborah Raji and colleagues argue in the February New England Journal of Medicine AI. The tests can’t measure actual clinical ability; they don’t adequately account for the complexities of real-world cases that require nuanced decision-making. They also aren’t flexible in what they measure and can’t evaluate different types of clinical tasks. And because the tests are based on physicians’ knowledge, they don’t properly represent information from nurses or other medical staff.

“A lot of expectations and optimism people have for these systems were anchored to these medical exam test benchmarks,” says Raji, who studies AI auditing and evaluation at the University of California, Berkeley. “That optimism is now translating into deployments, with people trying to integrate these systems into the real world and throw them out there on real patients.” She and her colleagues argue that we need to develop evaluations of how LLMs perform when responding to complex and diverse clinical tasks.

Science News spoke with Raji about the current state of health care AI testing, concerns with it and solutions to create better evaluations. This interview has been edited for length and clarity.

SN: Why do current benchmark tests fall short?

Raji: These benchmarks are not indicative of the types of applications people are aspiring to, so the whole field should not obsess about them in the way they do and to the degree they do.

This is not a new problem or specific to health care. This is something that exists throughout machine learning, where we put together these benchmarks and we want it to represent general intelligence or general competence at this particular domain that we care about. But we just have to be really careful about the claims we make around these datasets.

The further the representation of these systems is from the situations in which they are actually deployed, the more difficult it is for us to understand the failure modes these systems hold. These systems are far from perfect. Sometimes they fail on particular populations, and sometimes, because they misrepresent the tasks, they don’t capture the complexity of the task in a way that reveals certain failures in deployment. This sort of benchmark bias issue, where we make the choice to deploy these systems based on information that doesn’t represent the deployment situation, leads to a lot of hubris.
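Raji's point that a single benchmark score can hide failures on particular populations can be illustrated with a small sketch. Everything here is an assumption made for illustration: the group labels `population_a` and `population_b`, the record format and the counts are invented, and none of it comes from the paper or from any real evaluation.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}) for graded model outputs.

    Each record is a (group, is_correct) pair. The group labels are
    hypothetical placeholders, not categories from any real study.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {g: correct[g] / totals[g] for g in totals}
    return overall, per_group


# Invented counts: the benchmark is dominated by one well-served group.
records = (
    [("population_a", True)] * 81
    + [("population_a", False)] * 9
    + [("population_b", True)] * 4
    + [("population_b", False)] * 6
)

overall, per_group = accuracy_by_group(records)
print(f"overall: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"{group}: {acc:.2f}")
```

With these made-up numbers the aggregate accuracy looks respectable while the smaller group fares far worse, which is the kind of gap a headline benchmark score would not reveal.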

SN: How do you create better evaluations for health care AI models?

Raji: One strategy is interviewing domain experts in terms of what the actual practical workflow is and collecting naturalistic datasets of pilot interactions with the model to see the types or range of different queries that people put in and the different outputs. There’s also this idea that [coauthor] Roxana Daneshjou has been doing in some of her work with “red teaming,” with actively gathering a group of people to adversarially prompt the model. Those are all different approaches to getting at a more realistic set of prompts closer to how people actually interact with the systems.

Another thing we are trying is getting information from actual hospitals as usage data — like how they are actually deploying it and workflows from them about how they are actually integrating the system — and anonymized patient information or anonymized inputs to these models that could then inform future benchmarking and evaluation practices.

There are approaches that exist from other disciplines [like psychology] about how to ground your evaluations in observations of reality to be able to assess something. The same applies here — how much of our current evaluation ecosystem is grounded in the reality of what people are observing and what people are either appreciating or struggling with in terms of the actual deployment of these systems.

SN: How specialized should model benchmark testing be?

Raji: The benchmark that is geared towards question answering and knowledge recall is very different from a benchmark to validate the model on summarizing doctors’ notes or doing questioning and answering on uploaded data. That kind of nuance in terms of the task design is something that I’m trying to get to. Not that every single person should have their own personalized benchmark, but that common task that we do share needs to be way more grounded than multiple-choice tests. Because even for real doctors, those multiple-choice questions are not indicative of their actual performance.

SN: What policies or frameworks need to be in place to create such evaluations?

Raji: This is mostly a call for researchers to invest in thinking through and constructing not just benchmarks but also evaluations, at large, that are more grounded in the reality of what our expectations are for these systems once they get deployed. Right now, evaluation is very much an afterthought. We just think that there’s a lot more attention that could be paid towards the methodology of evaluation, the methodology of benchmark design and the methodology of just assessment in this space. 

Second, we can ask for more transparency at the institutional level such as through AI inventories in hospitals, wherein hospitals should share the full list of different AI products that they make use of as part of their clinical practice. That’s the kind of practice at the institutional level, at the hospital level, that would really help us understand what people are currently using AI systems for. If [hospitals and other institutions] published information about the workflows that they sort of integrate these AI systems into, that can also help us think of better evaluations. That kind of thing at the hospital level will be super helpful.

At the vendor level too, sharing information about what their current evaluation practice is — what their current benchmarks rely on — helps us figure out the gap between what they are currently doing and something that could be more realistic or more grounded.
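One way to picture the hospital AI inventory and vendor disclosures Raji describes is as a small structured record. The sketch below is only an assumption about what such a record might contain: the field names, the `medical_exam` label and the product and vendor names are hypothetical placeholders, not an existing standard or any real product.

```python
from dataclasses import dataclass, field


@dataclass
class InventoryEntry:
    """One AI product a hospital reports using (all fields hypothetical)."""
    product: str
    vendor: str
    workflow: str  # where the tool sits in clinical practice
    evaluation_basis: list = field(default_factory=list)  # what the vendor tested on


def exam_only(entries):
    """Products whose disclosed evaluation relies solely on exam-style benchmarks."""
    return [
        e.product
        for e in entries
        if e.evaluation_basis
        and all(basis == "medical_exam" for basis in e.evaluation_basis)
    ]


def undisclosed(entries):
    """Products with no disclosed evaluation practice at all."""
    return [e.product for e in entries if not e.evaluation_basis]


# Placeholder entries, invented purely for illustration.
inventory = [
    InventoryEntry("ExampleScribe", "Vendor A", "clinical note-taking",
                   ["medical_exam"]),
    InventoryEntry("ExampleTriage", "Vendor B", "patient messaging",
                   ["medical_exam", "pilot_usage_data"]),
    InventoryEntry("ExampleSummarizer", "Vendor C", "visit summaries"),
]

print(exam_only(inventory))
print(undisclosed(inventory))
```

Even a record this simple would let researchers see which deployed tools were validated only on exam-style questions and which have no published evaluation, the gap between current practice and more grounded testing that the interview points to.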

SN: What is your advice for people working with these models?

Raji: We should, as a field, be more thoughtful about the evaluations that we focus on or that we [overly base our performance on.]

It’s really easy to pick the lowest hanging fruit — medical exams are just the most available medical tests out there. And even if they are completely unrepresentative of what people are hoping to do with these models at deployment, it’s like an easy dataset to compile and put together and upload and download and run.

But I would challenge the field to be a lot more thoughtful and to pay more attention to really constructing valid representations of what we hope the models do and our expectations for these models once they are deployed.
