This story is part of a series on current progress in regenerative medicine. This piece discusses advances in artificial intelligence technologies.
In 1999, I defined regenerative medicine as the collection of interventions that restore to normal function tissues and organs that have been damaged by disease, injured by trauma, or worn by time. I include a full spectrum of chemical, gene, and protein-based medicines, cell-based therapies, and biomechanical interventions that achieve that goal.
ChatGPT and similar chatbot-style artificial intelligence software may soon serve a critical frontline role in the healthcare industry. ChatGPT is a large language model that uses vast amounts of training data to generate predictive text responses to user queries. Released on November 30, 2022, ChatGPT, short for Chat Generative Pre-trained Transformer, has become one of the fastest-growing consumer software applications, with hundreds of millions of global users. Some may be inclined to ask ChatGPT for medical advice instead of searching the internet for answers, which prompts the question of whether chatbot artificial intelligence is accurate and reliable for answering medical questions.
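For readers curious about the mechanics, the sketch below shows roughly how a program poses a question to a chat-style model. It uses the OpenAI Python client; the model name, the example question, and the assumption that an OPENAI_API_KEY environment variable is set are illustrative choices, not details from the study discussed here.

```python
# A minimal sketch of querying a chat-style large language model.
# Assumes the `openai` package (v1 or later) is installed and the
# OPENAI_API_KEY environment variable holds a valid API key.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",  # illustrative; any available chat model would work
    messages=[
        {"role": "user",
         "content": "What are the early warning signs of melanoma?"},
    ],
)

# The model returns one or more candidate completions; print the first.
print(response.choices[0].message.content)
```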
Dr. Rachel Goodman and colleagues at Vanderbilt University investigated chatbot responses in a recent study published in JAMA. Their study tested ChatGPT-3.5 and the updated GPT-4 with 284 physician-prompted questions to determine accuracy, completeness, and consistency over time. I will analyze their findings and present the pros and cons of incorporating artificial intelligence chatbots into the healthcare industry.
Goodman and colleagues set out to conduct a thorough investigation of ChatGPT. Their 284 questions were devised by 33 physicians across 17 specialties. The questions varied in difficulty (easy, medium, and hard) and in format, spanning multiple-choice, binary, and descriptive questions.
The accuracy scale ranged from one to six (one indicating completely incorrect, six indicating completely correct), and the completeness scale ranged from one to three (one indicating an incomplete answer to the prompted question, three indicating a comprehensive answer). Each score was assigned by physicians in the field of that particular question.
In the first round of testing with GPT-3.5, the researchers tabulated a median accuracy score of 5.0 and a median completeness score of 3.0, meaning that on the first try, ChatGPT-3.5 typically produced answers that were nearly completely accurate and comprehensive.
Of the 180 questions asked of GPT-3.5, 71 (39.4%) were answered completely accurately, and another 33 (18.3%) nearly accurately. Roughly 8% of answers were completely incorrect, and most answers scoring 2.0 or lower on accuracy came in response to the most challenging questions. Most responses (53.3%) were comprehensive, whereas only 12.2% were incomplete. The researchers note that accuracy and completeness were correlated across difficulty levels and question types.
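To make the grading concrete, here is a toy sketch of how median scores and accuracy percentages of this kind can be tabulated. The score values below are invented for illustration and are not the study's data.

```python
# Toy tabulation of physician-assigned grades (values are made up).
from statistics import median

# Accuracy is graded 1-6 (6 = completely correct);
# completeness is graded 1-3 (3 = comprehensive).
accuracy = [6, 6, 5, 4, 6, 2, 5, 6, 1, 5]
completeness = [3, 3, 3, 2, 3, 1, 3, 3, 1, 2]

print(f"Median accuracy:      {median(accuracy):.1f}")
print(f"Median completeness:  {median(completeness):.1f}")

# Share of answers that were completely accurate (6) or completely
# incorrect (1), as a fraction of all graded answers.
fully_accurate = sum(s == 6 for s in accuracy) / len(accuracy)
fully_incorrect = sum(s == 1 for s in accuracy) / len(accuracy)
print(f"Completely accurate:  {fully_accurate:.1%}")
print(f"Completely incorrect: {fully_incorrect:.1%}")
```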
The 36 answers scoring 2.0 or lower on the accuracy scale were re-asked to GPT-3.5 11 days later to evaluate improvement over time. Notably, 26 of the 36 answers improved in accuracy, with the median score for the group improving from 2.0 to 4.0.
To compare the accuracy and completeness of GPT-4 with GPT-3.5, the researchers asked both systems 44 questions regarding melanoma and immunotherapy guidelines. The mean accuracy score improved from 5.2 to 5.7, while the mean completeness score improved from 2.6 to 2.8; the median scores for both systems were 6.0 for accuracy and 3.0 for completeness. These results suggest improved answer generation in GPT-4, as expected.
To further cement their findings, the researchers asked GPT-4 another 60 questions related to ten common medical conditions. Again, the resulting median accuracy was 6.0 and the median completeness was 3.0; the mean scores were 5.7 and 2.8, respectively.
Among all 284 questions asked across the two model versions, the median accuracy score was 5.5 and the median completeness score was 3.0, suggesting that the chatbot format is a potentially powerful tool for answering medical questions.
Many artificial intelligence chatbots already exist in the healthcare industry. These include OneRemission, which helps cancer patients manage symptoms and side effects, and Ada Health, which assesses symptoms and provides personalized health information, among others.
ChatGPT and similar large language models would be the next big step in incorporating artificial intelligence into the healthcare industry. With hundreds of millions of users, these tools could make it easy for people to learn how to manage their symptoms, when to contact a physician, and more.
However, we must note the drawbacks of relying on such technologies before we proceed with their incorporation.
First is the question of privacy. There are ethical considerations in giving a computer program detailed medical information that could be hacked and stolen. Any healthcare entity using a chatbot system must ensure protective measures are in place for its patients.
Secondly, there will be cases of misinformation and misdiagnosis. While a median accuracy score of 5.5 is impressive, it still falls short of a perfect score across the board. The remaining inaccuracies could be detrimental to patients who receive false information about their potential conditions.
Thirdly, while chatbot systems have the potential to create more efficient healthcare workplaces, we must be vigilant to ensure that credentialed people remain employed at these workplaces to maintain a human connection with patients. There will be a temptation to give chatbot systems a greater workload than they have proven they can handle. Licensed physicians must remain the primary decision-makers in a patient's medical journey.
Ultimately, however, the continued advances in artificial intelligence are fascinating, and it will be interesting to see how large language models such as ChatGPT are implemented into all aspects of life, including the healthcare industry, in the near future.
To read more of this series, please visit www.williamhaseltine.com