System vs GPT-4: A More Accurate and Comprehensive Research Assistant

Mehdi Jamei

November 10, 2023

Today, we publish the findings of a comparative analysis of System and OpenAI's GPT-4, specifically concerning the quality of the biomedical information generated. 

Our results show that System's synthesis, an experiment to generate research syntheses exclusively from the System Graph, surpasses GPT-4 in accuracy and comprehensiveness. GPT-4 currently offers greater clarity, which can help with quick comprehension; System's slight compromise in this area is a strategic trade-off to achieve the level of accuracy and comprehensiveness our users require to make decisions in health and life sciences. Both platforms perform equivalently on Relevance and Non-Harmfulness.

We previously demonstrated that System's synthesis is also uniquely architected to reflect the very latest research findings, whereas OpenAI's GPT has a knowledge cutoff of September 2021 [ref].

Methodology

We conducted a single-blind randomized study involving biomedical researchers and clinicians, recruiting participants via User Interviews between October 15 and 29, 2023. Each subject-matter expert was assigned a specific set of tasks aligned with their expertise and was asked to evaluate two randomly selected syntheses: one generated by System and the other by GPT-4 using OpenAI’s APIs.

For each assigned synthesis, participants rated various aspects on a scale of 1-10, with 1 indicating very poor and 10 indicating perfect. The Harmfulness rating scale was reversed.

  • Accuracy: Do the summaries contain factual errors, and do they provide accurate information on the topic?
  • Comprehensiveness: Do the summaries cover essential aspects of the topic or the question? Is there any key information missing from the summaries?
  • Relevance: Are the summaries relevant to what you expect to see for the topic?
  • Clarity: Are the summaries easy to understand, and do they present clear information?
  • Harmfulness: Do you think the summaries are harmful for someone like you? Do you think trusting the information in the summary will do medical harm?
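
As a rough illustration of what a reversed scale implies for analysis, the sketch below flips a 1-10 Harmfulness rating so that, like the other four criteria, a higher number is better. This is not the scoring code used in the study. The function names, the "Non-Harmfulness" label applied to the flipped score, and the `min + max - score` rule are all assumptions made for the example.

```python
# Hypothetical helper: put a reversed 1-10 rating on the same direction
# as the other criteria. The study's actual processing is not described.
SCALE_MIN, SCALE_MAX = 1, 10


def reverse_score(score: int) -> int:
    """Flip a rating on the 1-10 scale (1 <-> 10, 2 <-> 9, ...)."""
    if not SCALE_MIN <= score <= SCALE_MAX:
        raise ValueError(f"score must be in [{SCALE_MIN}, {SCALE_MAX}], got {score}")
    return SCALE_MIN + SCALE_MAX - score


def to_common_direction(ratings: dict[str, int]) -> dict[str, int]:
    """Return ratings where a higher value is better for every criterion.

    Assumes 'Harmfulness' is the only reversed criterion and reports it
    under the (assumed) label 'Non-Harmfulness'.
    """
    aligned = {}
    for criterion, score in ratings.items():
        if criterion == "Harmfulness":
            aligned["Non-Harmfulness"] = reverse_score(score)
        else:
            aligned[criterion] = score
    return aligned


if __name__ == "__main__":
    example = {"Accuracy": 8, "Clarity": 7, "Harmfulness": 2}
    print(to_common_direction(example))
```

Aligning the direction first makes it possible to compare or average criteria without a low Harmfulness score being misread as a poor result.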

Before commencing data collection, we conducted a statistical power analysis to estimate the required amount of survey data. The reported results are based on 207 responses from 68 unique participants, achieving a statistical power of 0.86.
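
For readers unfamiliar with power analysis, here is a minimal sketch of how such a calculation can look. It is not the analysis performed for this study: the post does not state which test, effect size, or significance level was used. The sketch assumes a two-sided, two-sample comparison of means under a normal approximation, and the effect size (Cohen's d), alpha, and target power values are placeholders.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def power_two_sample(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is Cohen's d (mean difference divided by the common
    standard deviation); n_per_group is the sample size of each arm.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    # Probability of rejecting in either tail when the true effect is d.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def required_n_per_group(effect_size: float, target_power: float = 0.8,
                         alpha: float = 0.05) -> int:
    """Smallest per-group n reaching target_power (normal approximation)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(target_power)
    return math.ceil(2 * ((z_crit + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    d = 0.5  # placeholder effect size, not a value from the study
    n = required_n_per_group(d)
    print(f"n per group: {n}, achieved power: {power_two_sample(d, n):.3f}")
```

Running the calculation before data collection gives a target number of responses. Re-running `power_two_sample` with the number actually collected gives the achieved power, which is the kind of figure reported above.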
