Dissertation Title: Domain Adaptation, Transfer Learning and Knowledge Infusion for low-resource Biomedical Machine Learning

Date: 2025/05/06

Speaker: Jieli Zhou, Ph.D. candidate at UM-SJTU Joint Institute

Time: 9:30 AM - 11:00 AM, Tuesday, May 6, 2025 (Beijing Time)

Location: Room 202, East Wing of Pao Yue-Kong Library (GIFT)

Abstract

In recent years, the rapid development of advanced biomedical technologies, such as high-throughput multiomics sequencing, biomedical imaging, and electronic health records, has pushed biomedicine into a new era in which data-driven methods are increasingly important. Meanwhile, artificial intelligence (AI) has made major progress. These two trends have been converging over the past three decades, and AI has transformed biomedical research, from early expert systems that gave automatic psychiatric analysis to modern deep learning systems that achieve expert-level diagnostic accuracy in medical imaging and enable scientific discoveries such as complex protein structure prediction and new CRISPR design. Despite these promising advances, current biomedical AI remains largely data-specific, task-specific, and knowledge-limited. Biomedical research is rich and intricate, and it presents its own challenges, the most significant being the limited availability of labeled data, the intricate nature of the required tasks, and the complex foundational knowledge needed. Because of these challenges, most biomedical research problems are low-resource, and the current generation of biomedical AI systems is poorly equipped to solve them. One promising direction is Artificial General Intelligence (AGI). In this ambitious vision, AGI systems generalize across data and tasks and carry rich biomedical knowledge that they can adapt to novel situations. Such systems hold great promise for complex biomedical research problems, offering generalizable, adaptable, and knowledgeable solutions.
This thesis aims to bridge the wide gap between current specialized, narrow biomedical AI systems and the vision of AGI in biomedical research by focusing on three fundamental aspects: domain adaptation, transfer learning, and knowledge infusion. By improving model performance under data distribution shifts and task variability, and by infusing biomedical knowledge, this work paves the way for more generalist biomedical AI and, ultimately, AGI for biomedicine. This thesis makes the following contributions:

Low-Resource COVID-19 Diagnostics on Chest X-rays Using Domain Adaptation. Traditional deep learning methods have difficulty accurately diagnosing diseases from chest X-rays in low-resource settings with very limited labeled data. This was exactly the situation during the COVID-19 pandemic, when doctors and healthcare professionals were busy treating patients instead of labeling chest X-rays. To address this data scarcity in emergencies, this thesis proposes the Semi-supervised Open Set Domain Adversarial Network (SODA), a novel semi-supervised domain adaptation neural network that uses large general-domain chest X-ray collections, such as the NIH Chest X-ray dataset, and adapts to the limited COVID-19 chest X-ray dataset, whose data distribution is drastically different. SODA takes a two-level hierarchical data alignment strategy: general domain alignment to reduce distribution differences, and common subspace alignment to extract the shared features of common classes. Through thorough experiments, we demonstrate that SODA outperforms other deep learning and domain adaptation models on low-resource COVID-19 classification. SODA also localizes COVID-19 pathological features accurately, which holds great potential for chest X-ray-based disease diagnostics, especially in low-resource settings.
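
For readers unfamiliar with domain-adversarial training, the general idea behind the domain alignment described above can be written as a saddle-point objective. This is the generic formulation from the domain adaptation literature, shown only as an illustration. The announcement does not state SODA's exact loss, and the symbols (G_f, G_y, G_d, lambda) are notation assumed here.

```latex
% Generic domain-adversarial objective (illustrative; not necessarily SODA's loss).
% G_f: feature extractor, G_y: label classifier, G_d: domain discriminator.
% y_i: disease label (labeled samples only), d_i: domain label (source vs. target).
\begin{align}
E(\theta_f, \theta_y, \theta_d)
  &= \sum_{i \in \text{labeled}} L_y\bigl(G_y(G_f(x_i; \theta_f); \theta_y),\, y_i\bigr)
   - \lambda \sum_{i} L_d\bigl(G_d(G_f(x_i; \theta_f); \theta_d),\, d_i\bigr), \\
(\hat{\theta}_f, \hat{\theta}_y)
  &= \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d), \qquad
\hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d).
\end{align}
```

The discriminator learns to tell source images from target images, while the feature extractor is trained to make that distinction fail, which pushes the two domains toward a shared feature space.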

Knowledge-infused Large Language Models for Complex Biomedical Lay Summarization. Biomedical texts, such as biomedical literature and clinical reports, serve as the main gateways for biomedical experts to communicate their latest research findings and medical diagnostics. However, these texts are optimized for peer communication and are usually inaccessible to non-technical lay readers. Traditional methods, such as supervised text summarization, report generation, and text simplification algorithms, usually take a fixed set of paired texts and targets and train a supervised model to make predictions in a task-specific way. This approach usually provides syntactically optimal solutions but neglects the deep biomedical knowledge hidden in these texts. To this end, this thesis develops BioLayLLM, a knowledge-infused large language model for generating factual, relevant, and readable lay summaries of biomedical papers and clinical reports. By leveraging extensive biomedical knowledge bases, dictionaries, and background knowledge sources, BioLayLLM enriches the original text with contextually relevant background explanations, accurate term definitions, and actionable biomedical logic flows. BioLayLLM also incorporates a biomedical chain-of-thought technique that takes the rigid structure of biomedical text into account and generates structurally accessible lay summaries. Through this combination of techniques, BioLayLLM outperforms specialized supervised models and state-of-the-art LLMs, achieving first place in readability at the BioLaySumm competition at ACL BioNLP 2024. BioLayLLM also generates highly accurate, patient-friendly lay summaries of clinical reports, such as colorectal cancer pathology reports.
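
To make the idea of enriching a text with term definitions concrete, here is a minimal Python sketch. It is not BioLayLLM's pipeline: the glossary entries, the function names `find_terms` and `build_prompt`, and the prompt wording are all hypothetical, and a real system would query curated biomedical knowledge bases and use more careful term matching than this.

```python
# Hypothetical toy glossary; a real system would query biomedical knowledge bases.
GLOSSARY = {
    "adenocarcinoma": "a cancer that starts in gland-forming cells",
    "metastasis": "the spread of cancer to other parts of the body",
    "retrovirus": "a virus that copies its genetic material into the host genome",
}


def find_terms(text, glossary):
    """Return glossary terms that occur in the text (case-insensitive), sorted."""
    lowered = text.lower()
    return sorted(term for term in glossary if term in lowered)


def build_prompt(text, glossary):
    """Assemble a lay-summary prompt that prepends definitions of detected terms."""
    terms = find_terms(text, glossary)
    lines = ["Background definitions:"]
    lines += [f"- {term}: {glossary[term]}" for term in terms]
    lines += [
        "",
        "Source text:",
        text,
        "",
        "Write a summary for a reader without medical training.",
    ]
    return "\n".join(lines)


report = "Biopsy shows adenocarcinoma of the colon with no evidence of metastasis."
prompt = build_prompt(report, GLOSSARY)
print(prompt)
```

The prompt would then be passed to a language model. Only the terms that actually appear in the report are defined, which keeps the added context relevant to the text being summarized.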

Human Endogenous Retrovirus Analysis with DNA Foundation Model. Human Endogenous Retroviruses (HERVs) are remnants of ancient exogenous virus infections that entered the human genome millions of years ago, and they make up about 8% of our DNA. Once considered "junk DNA", HERVs are now known to be highly associated with numerous conditions, such as autoimmune diseases, cancers, and aging. However, sequence-based studies of HERVs remain very limited because of the lack of high-quality datasets, the under-annotation of coding HERVs, and the poor performance of traditional machine learning methods. To this end, this thesis introduces HervEVO, a three-component foundation model system for accurate classification, high-fidelity generation, and biologically informed interpretation of HERVs. Going beyond isolated simple tasks and pattern recognition, this work showcases the potential of powerful foundation models to unify distinct tasks, and it achieves state-of-the-art performance in HERV classification (AUROC > 95%, AUPRC > 93%). HervEVO also generates high-fidelity coding HERV sequences to augment scarce coding sequences and improve classification performance. To further increase interpretability, we develop HervLLM, a knowledge-infused large language model that provides structured and interpretable insights into HERV sequences. Together, these three components enable accurate classification, generation, and interpretation in low-resource HERV research, paving the way for more biological discoveries from the dark matter of our genome.
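
As a reminder of what the AUROC figure above measures, the following Python sketch computes AUROC from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are made-up toy values (1 standing for a coding HERV sequence, 0 for a non-coding one), not results from the thesis.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = coding HERV, 0 = non-coding HERV.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(auroc(labels, scores))  # 8 of 9 pairs are ranked correctly
```

This quadratic-time version is meant for clarity. On real datasets a rank-based implementation from an established library is the practical choice.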

In addition, in the outlook section, the thesis explores the potential of foundation-model-based multi-agent AI systems for end-to-end biological discovery. We envision a future in which autonomous AI systems, such as AI Biologists, solve complicated biomedical tasks end-to-end. To that end, the thesis proposes the concept of MarkerAgent, which automates data-driven bioinformatics tasks such as analyzing single-cell RNA sequencing data, performing literature reviews, generating biologically informed scientific hypotheses, and validating disease markers through machine reasoning. This ongoing work aims to streamline the entire biomedical research process instead of focusing on a single task.

Overall, the advances in this thesis form a technical chain spanning data adaptation, task unification, and knowledge integration, with the goal of making low-resource biomedical AI algorithms more generalizable. The three main works focus on different but related aspects of low-resource biomedical machine learning and outline a path for evolving current biomedical machine learning towards AGI.

Biography

Jieli Zhou is a Ph.D. candidate at Shanghai Jiao Tong University, specializing in Biomedical Artificial Intelligence. His main research focuses on developing novel data-efficient machine learning methods for biomedical data. He holds a B.S. in Mathematics and an M.S. in Computational Data Science from Carnegie Mellon University. Prior to joining SJTU, he worked as a Data Scientist at C3.ai in the San Francisco Bay Area, where he focused on data-efficient learning for industrial IoT data. He has authored over 10 publications in leading biomedical AI venues, including IEEE/ACM TCBB, Frontiers in Medicine, Asian Journal of Surgery, Clinical eHealth, as well as workshops at major AI conferences such as ACL, KDD, and AAAI.