Explainable AI for understanding associations between disease-related genes

José Zenóglio de Oliveira1, Francisco Pinto2, Catia Pesquita1


1LASIGE, Faculdade de Ciências, Universidade de Lisboa
2BioISI, Faculdade de Ciências, Universidade de Lisboa

Motivation

In the biomedical field, Machine Learning and Artificial Intelligence models have achieved impressive results, but they often lack explainability: they do not provide a human-understandable rationale for their decisions.

Discovering Disease Genes (DGs) is one biomedical research task that requires explainability, particularly when prioritizing genes for further research.

Background

S2B is a network-based method that uses protein-protein interaction (PPI) networks to predict DGs associated with two similar diseases.

In this setting, it is important to understand why certain genes may be related to two similar diseases, and adding knowledge to the process may improve gene prioritization.

Methodology


Building a new Network

Our approach consists of building a large network that contains both PPIs from four different sources and the Gene Ontology (GO). In this new network, a node represents either a protein or a GO term, and an edge represents either a PPI or a GO annotation.
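The combined network described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function name `build_network`, the `(protein, protein)` and `(protein, GO term)` input formats, and the toy identifiers (`P1`, `GO:A`, ...) are all assumptions made for the example.

```python
from collections import defaultdict

def build_network(ppi_edges, go_annotations):
    """Merge PPIs and GO annotations into one undirected graph.

    ppi_edges:      iterable of (protein, protein) pairs (hypothetical format)
    go_annotations: iterable of (protein, go_term) pairs (hypothetical format)
    Returns (adjacency, node_kind, edge_kind).
    """
    adjacency = defaultdict(set)   # node -> set of neighbouring nodes
    node_kind = {}                 # node -> "protein" or "go_term"
    edge_kind = {}                 # frozenset({u, v}) -> "ppi" or "annotation"

    def add_edge(u, v, kind):
        adjacency[u].add(v)
        adjacency[v].add(u)
        edge_kind[frozenset((u, v))] = kind

    for a, b in ppi_edges:
        node_kind[a] = node_kind[b] = "protein"
        add_edge(a, b, "ppi")

    for protein, term in go_annotations:
        node_kind[protein] = "protein"
        node_kind[term] = "go_term"
        add_edge(protein, term, "annotation")

    return adjacency, node_kind, edge_kind

# Toy placeholder data, not real interactions or annotations.
ppi_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P2")]  # duplicate on purpose
go_annotations = [("P1", "GO:A"), ("P3", "GO:A")]
adjacency, node_kind, edge_kind = build_network(ppi_edges, go_annotations)
```

Storing neighbours in sets means an interaction reported by more than one source is kept only once, which matters when several PPI databases are merged.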


Assessing the impact of the GO insertions

We tested the hypothesis that inserting semantic information enriches the PPI network and improves the performance of the S2B method. To do so, we ran S2B ten times on each type of network, using 10 fixed sets of 100 random seed genes drawn from the amyotrophic lateral sclerosis (ALS) and spinal muscular atrophy (SMA) disease modules.
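One way to keep the seed sets fixed across network types is to draw them once from a seeded random generator and reuse them for every run. The sketch below assumes 100 seeds per disease and uses hypothetical names throughout (`fixed_seed_sets`, `run_all`, and the `run_method` callable standing in for S2B); the gene identifiers are placeholders.

```python
import random

def fixed_seed_sets(module_genes, n_sets=10, set_size=100, rng_seed=0):
    """Draw n_sets reproducible random samples from a disease module."""
    rng = random.Random(rng_seed)   # private generator, fixed seed
    genes = sorted(module_genes)    # stable order, independent of set hashing
    return [rng.sample(genes, set_size) for _ in range(n_sets)]

def run_all(networks, als_sets, sma_sets, run_method):
    """Apply the same seed sets to every network variant.

    run_method(network, als_seeds, sma_seeds) is a hypothetical stand-in
    for the prediction method being compared.
    """
    results = {}
    for name, network in networks.items():
        results[name] = [
            run_method(network, als_seeds, sma_seeds)
            for als_seeds, sma_seeds in zip(als_sets, sma_sets)
        ]
    return results

# Placeholder modules; real disease modules would be loaded from data.
als_module = {f"ALS_gene_{i}" for i in range(300)}
sma_module = {f"SMA_gene_{i}" for i in range(300)}

als_sets = fixed_seed_sets(als_module, rng_seed=0)
sma_sets = fixed_seed_sets(sma_module, rng_seed=1)
```

Because every network variant receives identical seed sets, differences in the results can be attributed to the network rather than to the sampling.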


The introduction of noise

Adding a large amount of information to the network was expected to add noise as well. We therefore tested the impact of trimming the original PPI network so that it contains only physical interactions. We filtered the GO in a similar way, either vertically or horizontally.
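The filters described above can be sketched as simple list comprehensions. The exact criteria behind the vertical and horizontal GO filters are not stated here, so this example assumes one plausible reading: one filter selects terms by GO aspect (for example Biological Process) and the other by depth in the hierarchy. All function names, field values and toy records are hypothetical.

```python
def keep_physical(ppi_records):
    """Keep only interactions labelled as physical.

    ppi_records: iterable of (protein_a, protein_b, interaction_type).
    """
    return [(a, b) for a, b, kind in ppi_records if kind == "physical"]

def filter_go_by_aspect(annotations, term_aspect, aspects):
    """Keep annotations whose GO term belongs to one of the given aspects."""
    return [(p, t) for p, t in annotations if term_aspect.get(t) in aspects]

def filter_go_by_depth(annotations, term_depth, min_depth):
    """Keep annotations to terms at or below a minimum depth (assumed criterion)."""
    return [(p, t) for p, t in annotations if term_depth.get(t, 0) >= min_depth]

# Toy placeholder records, not real data.
ppi_records = [
    ("P1", "P2", "physical"),
    ("P1", "P3", "genetic"),
    ("P2", "P3", "physical"),
]
annotations = [("P1", "GO:A"), ("P2", "GO:B"), ("P3", "GO:C")]
term_aspect = {
    "GO:A": "biological_process",
    "GO:B": "molecular_function",
    "GO:C": "biological_process",
}
term_depth = {"GO:A": 1, "GO:B": 4, "GO:C": 5}

physical_edges = keep_physical(ppi_records)
bp_annotations = filter_go_by_aspect(annotations, term_aspect, {"biological_process"})
deep_annotations = filter_go_by_depth(annotations, term_depth, min_depth=3)
```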

Results

  • Our results show that, by itself, adding the GO to the network does not produce a substantial increase in the performance of the S2B method.


  • However, when using a network with only physical PPIs, we see a noticeable increase in performance, which highlights the need to trim and filter the data.


  • Overall, taking into account both performance and run time, the best combination appears to be a network containing only physical PPIs and only GO terms referring to Biological Processes.



What about explainability?

The genes with the highest S2B score are more likely to be in the overlap between disease modules and, hence, more likely to be associated with both diseases.

  • Transcription regulation, apoptotic processes and gene expression are highly involved in the pathological process of ALS and SMA.


  • The VCAM1 and TP53 genes encode proteins related to signal transduction and membrane adhesion.


  • These proteins have been highlighted as potential therapeutic candidates for ALS and SMA in experimental studies available in the literature.

Authors

José Zenóglio de Oliveira

LASIGE, Faculdade de Ciências

Francisco Pinto

BioISI, Faculdade de Ciências

Catia Pesquita

LASIGE, Faculdade de Ciências

Funding

This work was supported by the Fundação para a Ciência e a Tecnologia (FCT) through the LASIGE Research Unit, ref. UIDB/00408/2020 and UIDP/00408/2020, and partially supported by FCT Centre grants to BioISI, ref. UIDB/04046/2020 and UIDP/04046/2020.