June 28, 2024
Study design is an important factor in assessing the quality of scientific research. In biomedical and social science, there is an accepted hierarchy of study types, arranged by how generalizable their results are. Typically, observational studies that rely on descriptions of specific patient histories or summaries of groups of patients are thought to be less conclusive than experimental studies. Experimental studies most often take the form of a Randomized Controlled Trial (RCT), where a treatment of interest is given to a randomly assigned group and their outcomes are compared to those of a control group.
While research catalogs like PubMed or OpenAlex often “tag” articles with their study design, these tags are often incomplete, or are well documented as having high sensitivity (i.e., nearly all studies that are RCTs are tagged) but relatively low precision (only a small percentage of studies with the RCT tag actually have an RCT design). This means any filter based on these tags would include many non-experimental studies in its results. Since we also need to identify RCTs among papers without any article tags, as is the case for the thousands of new papers added daily, we set out to build a machine learning model to predict a paper’s study design from the available information.
There are a number of ML approaches appropriate for this task. Early efforts to identify RCTs with machine learning used support vector machines to classify studies based on numerical features constructed from their descriptions. More recent approaches use neural network (and deep learning) models along with natural language processing (NLP) tools for efficient tokenization and embedding. There is also the option of using “general-purpose” LLMs, which have proven adept at question-answering tasks (and this classification problem is, in effect, a very specific one).
Our choice was to take advantage of a pre-trained foundation model whose source corpus was closely related to our task and then “fine-tune” it for our specific purpose. There are a number of NLP models trained on articles indexed in PubMed (where most of the biomedical research relationships in the System Graph can be found). For our benchmark model, we added a classification layer to BioMedBERT, a BERT-based transformer, and fine-tuned it with labeled examples. For comparison, we also fine-tuned models based on PubMedBERT and BioLinkBERT, with BioMedBERT exhibiting the best results in our performance testing. This approach met some specific requirements: LLM-based solutions introduced inference latency with no detectable performance improvement and produced less auditable outputs, while fine-tuning allowed us to deploy models much more quickly than if we had trained an entire deep learning model from scratch.
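As a rough illustration of this setup, the sketch below attaches a binary classification head to a BioMedBERT checkpoint and fine-tunes it with the Hugging Face Trainer. The checkpoint name, hyperparameters, and dataset columns are illustrative assumptions, not our production configuration.

```python
# Sketch: fine-tune BioMedBERT with a classification head (assumed checkpoint
# and hyperparameters; the real pipeline may differ).
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

CHECKPOINT = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract"  # assumed base model

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT,
    num_labels=2,  # RCT vs. not RCT
)

def tokenize(batch):
    # Encode title and abstract together as a sentence pair.
    return tokenizer(batch["title"], batch["abstract"], truncation=True, max_length=512)

# `train_ds` and `eval_ds` are assumed to be datasets.Dataset objects with
# "title", "abstract", and integer "label" columns (1 = RCT, 0 = not RCT).
# train_ds = train_ds.map(tokenize, batched=True)
# eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="rct-classifier",
    num_train_epochs=12,              # mirrors the epoch budget described above
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

# trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```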
To generate these labeled examples, we constructed a dataset of research paper titles and abstracts and produced AI-assisted (and human-reviewed) labels indicating whether the research describes a randomized controlled trial design. Because of class imbalance (some estimates place the share of actual RCTs among PubMed papers labeled as RCTs at only around 7%), the training dataset oversampled observations labeled as RCTs (to approximately 25% of the sample), and the classifier models were fine-tuned on this data for 12 training epochs (or until convergence).
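The oversampling step can be sketched as below, assuming the labeled examples sit in a pandas DataFrame with an is_rct column; the function and column name are illustrative rather than our actual pipeline code.

```python
# Sketch of oversampling RCT-labeled rows to roughly 25% of the training set.
import pandas as pd

def oversample_rcts(df: pd.DataFrame, target_share: float = 0.25, seed: int = 42) -> pd.DataFrame:
    """Duplicate RCT rows (sampling with replacement) until they make up
    approximately `target_share` of the training data."""
    rct = df[df["is_rct"] == 1]
    non_rct = df[df["is_rct"] == 0]

    # Choose n_rct so that n_rct / (n_rct + len(non_rct)) ~= target_share.
    n_rct = int(target_share / (1 - target_share) * len(non_rct))
    rct_upsampled = rct.sample(n=n_rct, replace=True, random_state=seed)

    return (
        pd.concat([non_rct, rct_upsampled])
        .sample(frac=1.0, random_state=seed)  # shuffle
        .reset_index(drop=True)
    )

# Example: a labeled pool that is ~7% RCTs becomes a training set that is ~25% RCTs.
# train_df = oversample_rcts(labeled_df)
```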
After deploying the System RCT classifier into production, we were able to classify over 9,000 research studies as RCTs where PubMed’s publication type tags were unavailable. We also identified around 6,700 studies that had previously been tagged as “experimental” but did not describe their study design in enough detail to be classified as such.
Today, we are making one of our PyTorch-based classification models available on Hugging Face. This binary classifier is fine-tuned from the 100-million-parameter, abstract-based BioMedBERT pre-trained model. The model takes the title and abstract of a research paper as input and outputs a predicted label (RCT or not) along with a probability “score.” This specific model achieved an out-of-sample F1 score of 0.92 on a human-labeled test set sampled from PubMed.
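A minimal usage sketch is shown below; the repository id is a placeholder rather than the published model name, and the label-index convention (index 1 = RCT) is an assumption to verify against the model card.

```python
# Sketch: score a title/abstract pair with the released classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REPO_ID = "your-org/rct-classifier"  # placeholder; substitute the published repo id

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForSequenceClassification.from_pretrained(REPO_ID)
model.eval()

title = "Effect of drug X on blood pressure: a randomized controlled trial"
abstract = "We randomly assigned 200 participants to treatment or placebo..."

inputs = tokenizer(title, abstract, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1).squeeze()
predicted = int(torch.argmax(probs))
print(f"label={predicted}, P(RCT)={probs[1].item():.3f}")  # assumes index 1 = RCT
```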
Our work on study type classification continues: we are experimenting with more of the available domain-specific pre-trained models (and extending our testing to fine-tuning “general-purpose” LLMs), training models on relevant snippets describing the research found in article full text, and assembling ensembles of models to improve prediction performance.
Filed Under: Tech