Computer science researchers at Rice University have uncovered bias in widely used machine learning tools for immunotherapy research. Ph.D. students Anja Conev, Romanos Fasoulis, and Sarah Hall-Swan, working with faculty members Rodrigo Ferreira and Lydia Kavraki, reviewed publicly available peptide-HLA (pHLA) binding prediction data and found it skewed toward higher-income communities. Their study examines how biased input data affects the algorithmic recommendations that immunotherapy research depends on.
Understanding Peptide-HLA Binding Prediction, Machine Learning, and Immunotherapy
Human leukocyte antigen (HLA) genes, present in all individuals, encode proteins integral to the immune system. These proteins bind peptide fragments inside our cells and display them on the cell surface, marking infected cells so the immune system can recognize and respond to them. Gene variants, known as alleles, differ slightly from person to person. Ongoing immunotherapy research aims to identify peptides that bind effectively to a patient’s HLA alleles.
The ultimate goal is to develop personalized, highly effective immunotherapies, which makes accurately predicting which peptides bind to which alleles a critical step. The precision of these predictions directly affects a therapy’s potential efficacy.
However, the process of determining how well a peptide binds to an HLA allele is labor-intensive, prompting the use of machine learning tools for predictive analysis. Here lies the issue uncovered by Rice University’s team: the training data for these models seems to exhibit a geographical bias favoring higher-income regions.
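To make the coverage problem concrete, here is a minimal sketch in Python of how one might audit whether a training set’s alleles cover the alleles common in different populations. The allele names and population groupings below are hypothetical placeholders, not the study’s actual data or code.

```python
# Hypothetical sketch: audit how well a pHLA training set covers the
# HLA alleles common in different populations. All alleles and
# population groupings are illustrative placeholders, not study data.

# Alleles with measured binding data in a hypothetical training set
training_alleles = {"HLA-A*02:01", "HLA-A*01:01", "HLA-B*07:02"}

# Hypothetical "most common alleles" per population group
common_alleles = {
    "higher-income group": ["HLA-A*02:01", "HLA-A*01:01", "HLA-B*07:02"],
    "lower-income group":  ["HLA-A*02:01", "HLA-B*42:01", "HLA-C*17:01"],
}

for group, alleles in common_alleles.items():
    covered = sum(a in training_alleles for a in alleles)
    print(f"{group}: {covered}/{len(alleles)} common alleles have "
          f"training data ({100 * covered / len(alleles):.0f}%)")
```

A gap like the one this toy audit surfaces (full coverage for one group, partial coverage for another) is the kind of skew the Rice team reports in the publicly available binding data.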
The ramifications are significant. Failure to consider genetic data from lower-income populations may result in future immunotherapies being less effective for these groups.
Fasoulis emphasized, “Given that machine learning is used to identify potential peptide candidates for immunotherapies, if you basically have biased machine models, then those therapeutics won’t work equally for everyone in every population.”
Reassessing ‘Pan-Allele’ Binding Predictors
The efficacy of machine learning models hinges on the quality of the input data. Any bias in the data, even if inadvertent, can influence the algorithm’s outcomes.
Current machine learning models for pHLA binding prediction claim to extrapolate to alleles absent from their training data, billing themselves as “pan-allele” or “all-allele” predictors. Rice University’s research challenges that claim.
Conev stated, “We wanted to see if they really worked for the data that is not in the datasets, which is the data from lower-income populations.” Through their analysis of publicly available pHLA binding prediction data, Fasoulis, Conev, and their team confirmed their hypothesis of data bias leading to algorithmic bias. By shedding light on this discrepancy, they aim to spur the development of a genuinely unbiased method for predicting pHLA binding.
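One way to test a “pan-allele” claim is to score a predictor separately on alleles that were in its training set and on alleles that were not; a genuinely pan-allele model should perform comparably on both. The sketch below illustrates that comparison with synthetic labels and scores; it is not the study’s evaluation pipeline.

```python
# Hypothetical sketch: compare a "pan-allele" predictor's accuracy on
# alleles seen during training vs. alleles held out. All labels and
# scores below are synthetic placeholders.
from sklearn.metrics import roc_auc_score

# (allele, true binder label, predicted binding score) -- synthetic
records = [
    ("HLA-A*02:01", 1, 0.92), ("HLA-A*02:01", 0, 0.10),
    ("HLA-A*02:01", 1, 0.85), ("HLA-A*02:01", 0, 0.30),
    ("HLA-B*42:01", 1, 0.55), ("HLA-B*42:01", 0, 0.45),
    ("HLA-B*42:01", 1, 0.40), ("HLA-B*42:01", 0, 0.60),
]
seen_in_training = {"HLA-A*02:01"}  # hypothetical training-set allele

for name, keep_seen in [("seen alleles", True), ("unseen alleles", False)]:
    subset = [(y, s) for allele, y, s in records
              if (allele in seen_in_training) == keep_seen]
    y_true, y_score = zip(*subset)
    print(f"{name}: AUC = {roc_auc_score(y_true, y_score):.2f}")
```

In this toy example the model ranks binders and non-binders well for the seen allele but poorly for the unseen one, which is the pattern the researchers’ hypothesis predicts when training data underrepresents certain populations’ alleles.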
Ferreira, the faculty advisor and co-author, highlighted the necessity of considering data within a social context to address bias in machine learning. He emphasized the importance of recognizing the historical and economic factors influencing the populations from which the data is sourced.
Kavraki emphasized the importance of accuracy and transparency in clinical tools, particularly in the realm of personalized cancer immunotherapies. She stressed the need to acknowledge and rectify biases in these tools and to raise awareness within the research community regarding the challenges of obtaining unbiased datasets.
The team’s findings, published in the journal iScience, aim to catalyze new research endeavors that are inclusive and beneficial across diverse demographic segments.
More information:
- Anja Conev et al, HLAEquity: Examining biases in pan-allele peptide-HLA binding predictors, iScience (2023). DOI: 10.1016/j.isci.2023.108613
Source:
- Widely used machine learning models reproduce dataset bias: Study (2024, February 18). Retrieved 18 February 2024 from https://phys.org/news/2024-02-widely-machine-dataset-bias.html