October 17, 2023
As we continue to improve System, we evaluate each feature thoroughly for robustness and comprehensiveness, using methodologies tailored to the specific attributes and performance of each function within the platform.
Today, we are pleased to present new results from an in-depth, randomized, and blind analysis of the performance of System's AI-assisted synthesis against other products.
You can see the detailed results of the study here. We will regularly update this page with new findings.
Here are the two main takeaways:
The pipeline that generates System's synthesis of scientific literature is a complex multi-step process. The overall task can be summarized as follows:
Given a biomedical search query, find all relevant research studies and create an overall summary of the findings of those studies.
Because this task is so specific, we cannot compare it directly with LLM industry benchmarks such as PubMedQA, MedQA, and MedMCQA. Instead, to measure the accuracy, comprehensiveness, and harmfulness of our synthesis, we conducted a study with subject-matter experts (SMEs). To maintain objectivity and reduce bias in the assessment, the study was randomized and blind.
Participants were randomly allocated one of two general tasks:
Task 1: Participants were presented with two syntheses, one generated by System and another by a competitor. They were asked to evaluate and choose between the syntheses, taking into account various dimensions including accuracy, comprehensiveness, clarity, relevance, and helpfulness.
Task 2: Participants were provided with a single random synthesis along with all the associated citations presented in the respective product interfaces. They were tasked with rating it on a scale of 1-10 on each of the following dimensions:
Data collection continued until the results reached a statistical power of at least 0.8, using a two-sample t-test for each comparison between product pairs across all evaluated dimensions. Detailed information on the number of participants and data points collected is available in the table below.
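To give a sense of what a power target of 0.8 implies for sample size, here is a minimal Python sketch. It is not the study's actual analysis code. It uses a normal approximation to the two-sided two-sample t-test with equal group sizes, and the function names, the 0.05 significance level, and the example effect size are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample t-test.

    Uses the normal approximation (an assumption of this sketch); an exact
    calculation would use the noncentral t distribution and gives slightly
    lower power for small samples.
    effect_size is Cohen's d: the mean difference divided by the pooled SD.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                # two-sided critical value
    shift = abs(effect_size) * sqrt(n_per_group / 2) # standardized mean shift
    # Probability of rejecting in either tail when the true effect is nonzero.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def min_n_per_group(effect_size: float, target_power: float = 0.8,
                    alpha: float = 0.05, max_n: int = 100_000) -> int:
    """Smallest equal group size whose approximate power meets the target."""
    for n in range(2, max_n + 1):
        if approx_power(effect_size, n, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within max_n ratings per group")


if __name__ == "__main__":
    # Hypothetical medium effect (d = 0.5); not a figure taken from the study.
    n = min_n_per_group(0.5)
    print(f"ratings needed per product: {n}")
    print(f"approximate power at that size: {approx_power(0.5, n):.3f}")
```

Under this approximation, smaller differences between two products require many more ratings to reach the same power, which is why the number of data points can vary from one product pair and dimension to another.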