XAI for understanding associations between disease-related genes

Motivation

In the biomedical field, Machine Learning and Artificial Intelligence models have shown impressive successes, but they often lack explainability, in the sense that they fail to provide human-understandable reasons for their decisions.

Discovering Disease Genes (DGs) is one particular task in the biomedical research field that requires some degree of explainability, particularly for prioritizing genes for further research.

Background

S2B is a network-based method that uses protein-protein interaction (PPI) networks to predict DGs associated with two similar diseases.

In this case, it is important to understand why certain genes may be related to two similar diseases, and adding knowledge to the process may improve gene prioritization.

Methodology

Building a new Network

Our approach consists of building a large network graph that contains both PPIs from four different sources and the Gene Ontology (GO). In this new network, a node may represent either a protein or a GO term, and an edge may represent either a PPI or a GO annotation.
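As a minimal sketch of such a mixed network, the snippet below stores node and edge types next to a plain adjacency map. The function name `build_network`, the input shapes and the toy identifiers (`P1`, `T1`, ...) are assumptions made for illustration; they are not the code or data used in this work.

```python
from collections import defaultdict

def build_network(ppis, go_annotations):
    """Build an undirected graph whose nodes are proteins or GO terms.

    ppis:           iterable of (protein_a, protein_b) pairs
    go_annotations: iterable of (protein, go_term) pairs
    """
    node_type = {}                # node -> "protein" or "go_term"
    adjacency = defaultdict(set)  # node -> set of neighbouring nodes
    edge_type = {}                # frozenset({u, v}) -> "ppi" or "annotation"

    def add_edge(u, v, kind):
        adjacency[u].add(v)
        adjacency[v].add(u)
        edge_type[frozenset((u, v))] = kind

    for a, b in ppis:
        node_type[a] = node_type[b] = "protein"
        add_edge(a, b, "ppi")
    for protein, term in go_annotations:
        node_type[protein] = "protein"
        node_type[term] = "go_term"
        add_edge(protein, term, "annotation")
    return node_type, adjacency, edge_type

# Toy, made-up input: two interacting proteins that share one GO term.
node_type, adjacency, edge_type = build_network(
    ppis=[("P1", "P2")],
    go_annotations=[("P1", "T1"), ("P2", "T1"), ("P2", "T2")],
)
```

In this toy graph, `P1` and `P2` are connected both directly (PPI) and through the shared term `T1`, which is the kind of extra path that the GO insertion adds to the network.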

Assessing the impact of the GO insertions

We tested the hypothesis that inserting semantic information enriches the PPI network and improves the performance of the S2B method. To do so, we ran S2B ten times on different kinds of networks, using 10 fixed sets of 100 random seed genes from the ALS and SMA disease modules.
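The idea of reusing the same fixed seed sets across network variants can be sketched as follows. This is only an illustration under stated assumptions: drawing `set_size` genes per disease is one reading of the setup, `make_seed_sets` and `compare_networks` are hypothetical helpers, and `run_s2b` is a placeholder callable supplied by the caller, not the actual S2B implementation.

```python
import random

def make_seed_sets(als_module, sma_module, n_sets=10, set_size=100, base_seed=0):
    """Draw n_sets reproducible (ALS seeds, SMA seeds) pairs."""
    # Sorting makes the sampling independent of set iteration order.
    als_pool, sma_pool = sorted(als_module), sorted(sma_module)
    seed_sets = []
    for i in range(n_sets):
        rng = random.Random(base_seed + i)  # one fixed generator per set
        als = rng.sample(als_pool, min(set_size, len(als_pool)))
        sma = rng.sample(sma_pool, min(set_size, len(sma_pool)))
        seed_sets.append((als, sma))
    return seed_sets

def compare_networks(networks, seed_sets, run_s2b):
    """Call run_s2b(network, als_seeds, sma_seeds) on every network
    with identical seed sets, so results are directly comparable."""
    return {
        name: [run_s2b(network, als, sma) for als, sma in seed_sets]
        for name, network in networks.items()
    }

# Toy, made-up disease modules.
als_module = {f"A{i}" for i in range(300)}
sma_module = {f"S{i}" for i in range(250)}
seed_sets = make_seed_sets(als_module, sma_module)
```

Because the seed sets are fixed, any difference between network variants can be attributed to the network itself rather than to the random choice of seeds.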

The introduction of noise

By adding a large amount of information to the network, we expected to add some noise as well. We therefore tested the impact of trimming the original PPI network so that it contains only physical interactions. We filtered the GO in a similar way, either vertically or horizontally.
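A rough sketch of this kind of trimming is shown below. It assumes each PPI record carries an interaction-type label and that each GO term's aspect is available in a lookup table; both data layouts, and the helper names `keep_physical` and `keep_aspect`, are hypothetical. The sketch filters GO terms by aspect only and does not attempt to reproduce the vertical or horizontal filters, whose exact definition is not given here.

```python
def keep_physical(typed_ppis):
    """Keep only interactions labelled as physical.

    typed_ppis: iterable of (protein_a, protein_b, kind) triples,
                where kind is a label such as "physical" or "genetic".
    """
    return [(a, b) for a, b, kind in typed_ppis if kind == "physical"]

def keep_aspect(go_annotations, term_aspect, aspect="biological_process"):
    """Keep only annotations whose GO term belongs to the given aspect.

    term_aspect: dict mapping a GO term to its aspect name.
    """
    return [(p, t) for p, t in go_annotations if term_aspect.get(t) == aspect]

# Toy, made-up data.
typed_ppis = [("P1", "P2", "physical"), ("P1", "P3", "genetic")]
annotations = [("P1", "T1"), ("P2", "T2")]
term_aspect = {"T1": "biological_process", "T2": "molecular_function"}

physical_ppis = keep_physical(typed_ppis)
bp_annotations = keep_aspect(annotations, term_aspect)
```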

Results

Our results show that, by itself, adding the GO to the network does not produce a very significant increase in the performance of the S2B method.

However, when using a network with only physical PPIs, we see a noticeable increase in performance, which shows the need to trim and filter our data.

Overall, taking into account performance and run times, the best combination seems to be a network containing only physical PPIs and only GO terms referring to Biological Processes.

What about explainability?

The genes with the highest S2B score are more likely to be in the overlap between disease modules and, hence, more likely to be associated with both diseases.

Transcription regulation, apoptotic processes and gene expression are highly involved in the pathological process of ALS and SMA.

The VCAM1 and TP53 genes encode proteins related to signal transduction and membrane adhesion.

These proteins have been highlighted as potential therapeutic candidates for ALS and SMA in experimental studies available in the literature.

Conclusions

The inclusion of the GO in the PPI network improves S2B’s performance, although only modestly on its own. Filtering and trimming bring further improvement.
The combination of only physical PPIs and only GO terms referring to biological processes produces the best results so far.
We also bring more explainability to the method, and are now able to interpret why certain genes are such strong candidates.
Future work will consist of producing a weighted network. We will assign weights to the edges between nodes based on the semantic similarity between their associated GO terms, to uncover more relevant interactions.
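To illustrate what such a weighting could look like, the sketch below weights each PPI by the Jaccard overlap of the GO terms annotated to its two proteins. Jaccard overlap is used here only as a simple stand-in; it is not necessarily the semantic similarity measure that will be adopted, and the helper names and toy data are assumptions.

```python
def jaccard(terms_a, terms_b):
    """Jaccard overlap of two GO term collections (0.0 if both are empty)."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def weight_ppis(ppis, protein_terms):
    """Map each (protein_a, protein_b) pair to a similarity-based weight.

    protein_terms: dict mapping a protein to the GO terms annotated to it.
    """
    return {
        (p, q): jaccard(protein_terms.get(p, ()), protein_terms.get(q, ()))
        for p, q in ppis
    }

# Toy, made-up annotations.
protein_terms = {"P1": {"T1", "T2"}, "P2": {"T1"}, "P3": set()}
weights = weight_ppis([("P1", "P2"), ("P1", "P3")], protein_terms)
```

Here the `P1`-`P2` edge gets weight 0.5 (one shared term out of two), while the `P1`-`P3` edge gets weight 0.0 because `P3` has no annotations.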

Authors

José Zenóglio de Oliveira

LASIGE, Faculdade de Ciências

Francisco Pinto

BioISI, Faculdade de Ciências

Catia Pesquita

LASIGE, Faculdade de Ciências

Funding

This work was supported by the Fundação para a Ciência e a Tecnologia (FCT) under LASIGE Research Unit ref. UIDB/00408/2020 and UIDP/00408/2020, and partially supported by FCT Centre grants to BioISI ref. UIDB/04046/2020 and UIDP/04046/2020.
