Given the risks associated with AI, we consider it essential to routinely evaluate and benchmark our models and share the results. This page is regularly updated with our latest research; the methodology behind each result is described below its graph.
Scoring highest across multiple dimensions
System Pro's synthesis surpasses OpenAI's GPT-4 in accuracy and relevance, while maintaining up-to-date knowledge of scientific discoveries.
We conducted a blind, randomized study with biomedical researchers and clinicians, recruiting participants via User Interviews between October 15 and 29, 2023. Each subject-matter expert was assigned a specific set of tasks aligned with their expertise and was asked to evaluate two randomly selected syntheses: one generated by System and the other by OpenAI's GPT-4.
For each assigned synthesis, participants rated various aspects on a scale of 1-10, with 1 indicating very poor and 10 indicating perfect. The Harmfulness rating scale was reversed, so lower scores indicate a less harmful synthesis.
Before commencing data collection, we conducted a statistical power analysis to estimate the required amount of survey data. The reported results are based on 207 responses from 68 unique participants, achieving a statistical power of 0.86.
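For readers who want to reproduce this kind of calculation, here is a minimal sketch of a power analysis for a two-sample t-test design using statsmodels. The effect size and alpha below are illustrative assumptions, not values reported on this page; only the 0.86 power level comes from the study above.

```python
# A minimal sketch of a power analysis, assuming a two-sample t-test design.
# effect_size and alpha are illustrative assumptions, not reported values.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,   # assumed medium effect (Cohen's d)
    alpha=0.05,        # assumed significance level
    power=0.86,        # the power level reported for this study
    alternative="two-sided",
)
print(f"Required responses per group: {n_per_group:.0f}")
```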
Taking accuracy, completeness, relevance, helpfulness, and clarity into account, 70% of experts prefer System Pro’s synthesis over those of other AI-assisted research tools.
We conducted a randomized, single-blind study with researchers and clinicians, recruiting participants via User Interviews between October 1 and 15, 2023. Each subject-matter expert was assigned a set of tasks relevant to their domain of expertise. For each task, participants compared two randomly assigned syntheses: one generated by System and the other by another commercial product. They were then asked to choose the better synthesis, taking into account multiple dimensions (accuracy, completeness, clarity, relevance, and helpfulness), and to provide a reason for each choice. Prior to data collection, a statistical power analysis was conducted to estimate the amount of survey data needed. The presented results are based on 144 responses from 33 unique participants.
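This page does not state which test was applied to the preference counts; a binomial test against 50/50 chance is one standard choice for a two-way forced-choice comparison. The sketch below uses the reported total of 144 responses, with the number of System wins as an assumption chosen to match the ~70% preference rate.

```python
# A hedged sketch: testing whether the observed preference rate differs
# from chance. The win count is an assumption (~70% of 144), not reported data.
from scipy.stats import binomtest

n_responses = 144          # total comparisons, as reported
n_prefer_system = 101      # assumed count (~70% preference)
result = binomtest(n_prefer_system, n_responses, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.4g}")
```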
Fifty search queries made by System Pro users between June and October 2023 were used to create a dataset of syntheses from System, Commercial Product #1, and Commercial Product #2.
Syntheses (preserving the citations included in the respective UIs) were randomly assigned for evaluation to biomedical researchers and clinicians recruited on the User Interviews platform; participants were blinded to which product produced each synthesis. For each assigned synthesis, participants scored various aspects on a scale of 1-10, with 1 being very poor and 10 being perfect:
- Accuracy: Do the summaries contain factual errors, and do they provide accurate information on the topic?
- Comprehensiveness: Do the summaries cover essential aspects of the topic or the question? Is there any key information missing from the summaries?
- Relevance: Are the summaries relevant to what you expect to see for the topic?
A two-sample t-test was used to measure differences between corresponding scores across products. Data collection was stopped once the test result was significant for all metrics.
The presented results are based on 256 responses from 16 unique participants.
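As a concrete illustration of the analysis described above, the following sketch runs a two-sample t-test with SciPy. The ratings are placeholder values standing in for the study's actual 1-10 scores.

```python
# A minimal sketch of the two-sample t-test described above, using
# illustrative placeholder scores rather than the study's actual data.
from scipy.stats import ttest_ind

system_scores = [8, 9, 7, 8, 9, 8]     # hypothetical 1-10 ratings for System
product_scores = [6, 7, 5, 7, 6, 6]    # hypothetical ratings for a competitor

t_stat, p_value = ttest_ind(system_scores, product_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```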
The most citations
For the same query, System cites, on average, 6x as many studies as the compared products.
The most depth
On average, System generates much longer syntheses while maintaining an unrivaled citation count.
The most breadth
System’s syntheses cover many more biomedical topics related to the search.
A representative sample of 50 searches conducted by System Pro users between May and September 2023 was created. To compare System Pro with Commercial Product #1, we ran the same search query on each product and recorded the resulting summary and citations. Searches were conducted in September 2023.

Commercial Product #2 does not directly synthesize search results; it relies on a question to generate an answer. To make a direct comparison, we used the sections of System’s synthesis for a given search query (for example, for the user query “SLE and b-cell depletion,” System Pro generated the following sections: “Overview,” “Role of B-cells in SLE,” “B-cell depletion therapies,” and “Efficacy of B-cell depletion in SLE”). We generated a question for each section using OpenAI's GPT-4 and asked Commercial Product #2 that question (in the example above, for the section “B-cell depletion therapies,” GPT-4 generated the question: “What are the different B-cell depletion therapies used in the treatment of SLE?”). We then saved the resulting summary and articles. On average, it took 4.9 searches on Commercial Product #2 to generate a comparable summary.
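To make this multi-step procedure concrete, here is a hypothetical sketch of the comparison pipeline in Python. Every function below is a placeholder for a manual step described above, not a real API; the section titles and question wording are taken from the example in the text.

```python
# Hypothetical sketch of the Commercial Product #2 comparison pipeline.
# None of these helpers are real APIs; each stands in for a manual step.

def get_system_sections(query: str) -> list[str]:
    """Placeholder: section titles System Pro generated for the query."""
    return ["Overview", "Role of B-cells in SLE",
            "B-cell depletion therapies", "Efficacy of B-cell depletion in SLE"]

def generate_question(section_title: str) -> str:
    """Placeholder: in the study, GPT-4 turned each section into a question."""
    return f"What does the literature say about: {section_title}?"

def ask_product_2(question: str) -> dict:
    """Placeholder: submit the question to Commercial Product #2."""
    return {"summary": "...", "articles": []}

query = "SLE and b-cell depletion"
records = []
for section in get_system_sections(query):
    question = generate_question(section)
    records.append({"section": section,
                    "question": question,
                    "response": ask_product_2(question)})
# Per the study, one System Pro synthesis required ~4.9 such questions on average.
```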