
Evaluating the performance of System's research synthesis

Mehdi Jamei

10.17.2023

As part of our continuous improvement of System, we evaluate each feature for robustness and comprehensiveness, using a methodology tailored to the attributes and performance of that feature.

Today, we are pleased to present new results from an in-depth, randomized, and blind analysis of the performance of System's AI-assisted synthesis against other products.

Results

You can see the detailed results of the study here. We will regularly update this page with new findings.

Here are the two main takeaways:

70% of experts prefer System's synthesis over other options.
System's synthesis stands out for its unparalleled accuracy and thoroughness.


Performance Measurement Methodology

Survey Design

The pipeline that generates System's synthesis of scientific literature is a complex multi-step process. The overall task can be summarized as follows:

Given a biomedical search query, find all relevant research studies and create an overall summary of the findings of those studies.

Because the task is this specific, we cannot compare it directly with LLM-specific industry benchmarks such as PubMedQA, MedQA, and MedMCQA. Instead, to measure the accuracy, comprehensiveness, and harmfulness of our synthesis, we conducted a study with subject-matter experts (SMEs). To maintain objectivity and reduce bias in the assessment, the study was randomized and blind.

Participants were randomly allocated to one of two general tasks:

Task 1: Participants were presented with two syntheses, one generated by System and another by a competitor. They were asked to evaluate and choose between the syntheses, taking into account various dimensions including accuracy, comprehensiveness, clarity, relevance, and helpfulness.

Task 2: Participants were provided with a single random synthesis along with all the associated citations presented in the respective product interfaces. They were tasked with rating it on a scale of 1-10 on each of the following dimensions:

Accuracy: Do the summaries contain factual errors, and do they provide accurate information about the topic?
Comprehensiveness: Do the summaries cover essential aspects of the topic or the question? Is there any key information missing from the summaries?
Relevance: Are the summaries relevant to what you expect to see for the topic?
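
To make the randomized, blind setup concrete, here is a minimal sketch of how task allocation and blinded presentation could work. It is an illustration only, not the tooling used in the study: the function names `allocate` and `blind_pair`, the neutral labels "A" and "B", and the fixed seed are all assumptions.

```python
import random


def allocate(participants, seed=0):
    """Randomly assign each participant to Task 1 (pairwise) or Task 2 (rating).

    Hypothetical helper: the study's real allocation procedure is not described.
    """
    rng = random.Random(seed)
    return {p: rng.choice(("task1", "task2")) for p in participants}


def blind_pair(system_text, competitor_text, rng):
    """Shuffle two syntheses and present them under neutral labels.

    Returns (shown, key): `shown` is what the participant sees, while `key`
    maps each label back to its source and stays hidden until analysis.
    """
    items = [("system", system_text), ("competitor", competitor_text)]
    rng.shuffle(items)
    shown = {"A": items[0][1], "B": items[1][1]}
    key = {"A": items[0][0], "B": items[1][0]}
    return shown, key


if __name__ == "__main__":
    rng = random.Random(42)
    print(allocate(["sme-1", "sme-2", "sme-3"], seed=42))
    shown, key = blind_pair("System synthesis ...", "Competitor synthesis ...", rng)
    print(sorted(shown))  # the participant only ever sees the labels A and B
```

Keeping the label-to-source key separate from what participants see is what makes the comparison blind; the key is only joined back to the responses at analysis time.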

Statistical Power

We continued collecting data until the results reached a statistical power of at least 0.8, using a two-sample t-test for each comparison between product pairs across all evaluated dimensions. The table below lists the number of participants and data points collected.

	System Synthesis vs Commercial Product #1	System Synthesis vs Commercial Product #2
Task 1	25 users / 50 evaluations	8 users / 56 evaluations
Task 2	14 users / 56 evaluations	14 users / 56 evaluations
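
For readers who want to see what a power target of 0.8 implies, the sketch below approximates the power of a two-sided two-sample t-test. It uses a normal approximation rather than the exact noncentral t distribution. The effect size, the significance level of 0.05, and the function name `approx_power` are assumptions, since the post does not state them.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-sample t-test.

    effect_size is Cohen's d (mean difference divided by the pooled standard
    deviation). The normal approximation slightly overstates power for small
    samples compared with the exact noncentral t calculation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis.
    shift = effect_size * sqrt(n1 * n2 / (n1 + n2))
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Hypothetical effect sizes, evaluated at 56 ratings per product.
    for d in (0.3, 0.5, 0.7):
        print(f"d={d}: power ~ {approx_power(d, 56, 56):.2f}")
```

Under these assumptions, 56 evaluations per product reach a power of 0.8 only when the standardized difference is a bit over half a standard deviation. The calculation also treats evaluations as independent, which is a simplification when each participant contributes several evaluations.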



Filed under:

Tech
