Improving data mining of PPI networks by combining deep learning with knowledge graphs

Laura Balbi, Catia Pesquita

LASIGE, Faculdade de Ciências, Universidade de Lisboa

Motivation

Graphs provide a data structure for knowledge representation and are useful for the description of relationships in all kinds of knowledge domains, including biological systems.

Knowledge graphs (KGs) represent knowledge through a logical description of its concepts, including biological entities. Combining biological data with the relevant biological knowledge available brings knowledge graphs into data mining tasks and creates an opportunity for ML models to learn from the enriched graph data.

Objectives

This work aims to improve representation learning-based graph mining over PPI networks by enriching the graph with meaningful knowledge pertaining to its proteins, through:

Investigation of different approaches to integrate the Gene Ontology with PPI networks and combine it with different deep learning pipelines.

Comparison and evaluation of the different combinations in benchmark data.

The Gene Ontology (GO) is one of the main resources of biological knowledge. It provides a controlled vocabulary for describing several biological aspects of proteins: the biological processes (BP) they may be involved in, the cellular components (CC) where they are located, and their molecular functions (MF).

Through the ontology’s annotations (GOA), different proteins can be linked to the description of their function.

Above is part of a Knowledge Graph combining biological data (proteins) and its related knowledge, as described by the Gene Ontology’s terms, through annotations.
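As a minimal sketch of how annotations link proteins to the ontology, the toy example below stores the graph as triples and collects every term a protein is linked to, either directly or through is_a ancestors. All protein and term identifiers are hypothetical placeholders rather than real GO accessions, and only the is_a relation is modelled.

```python
# Toy ontology: child term -> parent terms (is_a edges).
# Every identifier here is a hypothetical placeholder, not a real GO accession.
IS_A = {
    "BP:root": [],
    "BP:signaling": ["BP:root"],
    "BP:cell_death": ["BP:root"],
    "BP:apoptosis": ["BP:cell_death"],
}

# Toy annotations: protein -> directly annotated terms.
ANNOTATIONS = {
    "protein_A": ["BP:apoptosis"],
    "protein_B": ["BP:signaling", "BP:cell_death"],
}


def ancestors(term, is_a):
    """Return the term together with all terms reachable through is_a edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(is_a[current])
    return seen


def build_triples(annotations, is_a):
    """List (subject, relation, object) triples of the knowledge graph."""
    triples = [(protein, "annotated_with", term)
               for protein, terms in annotations.items()
               for term in terms]
    triples += [(child, "is_a", parent)
                for child, parents in is_a.items()
                for parent in parents]
    return triples


def implied_terms(protein, annotations, is_a):
    """All terms a protein is linked to, directly or through ancestors."""
    terms = set()
    for term in annotations[protein]:
        terms |= ancestors(term, is_a)
    return terms
```

Calling `implied_terms("protein_A", ANNOTATIONS, IS_A)` follows the is_a edges upwards, so a protein annotated with a specific term is also connected to the more general terms above it.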

Methodology


Datasets used:

Open Graph Benchmark (OGB) PPI networks (ogbn-proteins and ogbl-ppa);

Data mining tasks:

ogbn-proteins: multi-label classification, where labels represent protein functional classes;

ogbl-ppa: link prediction, i.e. predicting protein-protein interactions.

Ontology and annotations retrieved from the Gene Ontology website;

Methodology steps:

Enrich the datasets with the ontology and annotations between its terms and the dataset’s proteins;

Investigate graph/machine learning methods that are able to explore the added information;

Three deep learning pipelines used so far:

1) simple MLP; 2) MLP with node2vec node embeddings; and 3) Graph Convolutional Network (GCN).
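The enrichment step listed above could be sketched as follows. This is only one possible integration approach, assumed here for illustration: GO terms are added as extra nodes after the protein nodes, and annotation and is_a edges are appended to the PPI edge list. The function name `enrich_graph` and the edge type labels are hypothetical.

```python
def enrich_graph(num_proteins, ppi_edges, annotations, is_a_edges):
    """Merge PPI, annotation and ontology edges into one indexed edge list.

    Proteins keep indices 0..num_proteins-1; each ontology term receives the
    next free index in order of first appearance.
    """
    term_index = {}

    def index_of(term):
        # Assign a new node index the first time a term is seen.
        if term not in term_index:
            term_index[term] = num_proteins + len(term_index)
        return term_index[term]

    edges = [(u, v, "ppi") for u, v in ppi_edges]
    for protein, term in annotations:
        edges.append((protein, index_of(term), "annotated_with"))
    for child, parent in is_a_edges:
        edges.append((index_of(child), index_of(parent), "is_a"))
    return edges, term_index


# Toy input with hypothetical term names.
edges, term_index = enrich_graph(
    num_proteins=3,
    ppi_edges=[(0, 1), (1, 2)],
    annotations=[(0, "term_1"), (2, "term_2")],
    is_a_edges=[("term_1", "term_0")],
)
```

Keeping the edge type alongside each edge leaves open whether a downstream model treats the enriched graph as homogeneous or distinguishes the relation types.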


Pipelines

1) MLP (simple feed-forward neural network)

The MLP receives only raw node features as input; it does not consider the nodes' local neighborhoods.
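A minimal sketch of such a feed-forward pass for one node is given below, assuming a single hidden layer with ReLU and a sigmoid output per label (a common choice for multi-label classification). The layer sizes and weights are made up for illustration and do not reflect the configuration used in this work.

```python
import math


def dense(inputs, weights, biases):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def mlp_forward(features, hidden_w, hidden_b, out_w, out_b):
    """Map one node's raw features to independent per-label probabilities."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return sigmoid(dense(hidden, out_w, out_b))


# Hypothetical sizes: 2 input features, 2 hidden units, 3 labels.
probs = mlp_forward(
    features=[1.0, -1.0],
    hidden_w=[[1.0, 0.0], [0.0, 1.0]],
    hidden_b=[0.0, 0.0],
    out_w=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    out_b=[0.0, 0.0, 0.0],
)
```

Note that the function only ever sees the feature vector of the node itself, which is why graph structure has no influence on this baseline.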

2) MLP with node2vec embeddings

The MLP + node2vec pipeline feeds the MLP both the raw node features and node2vec embeddings, vector representations learned from random walks through the nodes’ neighborhoods.
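The random walks behind node2vec can be sketched as below. This covers only the walk sampling with the return parameter p and in-out parameter q; the subsequent skip-gram training that turns walks into embeddings is omitted. The toy graph and parameter values are assumptions for illustration.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Sample one second-order biased random walk with up to `length` nodes.

    `adj` maps each node to a list of its neighbours (undirected graph).
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbours = adj[current]
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # The first step has no previous node, so it is unbiased.
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / p)   # return to the previous node
            elif candidate in adj[previous]:
                weights.append(1.0)       # stay close to the previous node
            else:
                weights.append(1.0 / q)   # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Toy undirected graph with four nodes.
ADJ = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
walk = node2vec_walk(ADJ, start=0, length=6, p=0.5, q=2.0,
                     rng=random.Random(0))
```

A small p makes the walk more likely to backtrack and a large q keeps it near its starting region, so the resulting embeddings emphasise local neighborhood structure.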

3) GCN

The GCN takes the nodes' neighborhoods into account through message passing. Its classification is based on the loopy belief propagation method, where, for example, the unlabeled node h receives message m:

where 𝜓(𝑌𝑒,𝑌ℎ) represents the dependency between h and e, 𝜙𝑓 is the initial (prior) belief of the node’s label, ℒ is the set of all labels Y, and e is a neighboring node of h.
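For illustration, the textbook form of a loopy belief propagation message from e to h sums, over every candidate label of e, the product of the label-label potential, the prior belief of e, and the messages e received from its other neighbors. The sketch below implements that standard form; it is an assumption about the message described above, and the names `lbp_message`, `prior_e` and `incoming_to_e` are hypothetical.

```python
def lbp_message(labels, psi, prior_e, incoming_to_e):
    """Message from node e to node h: one value per candidate label of h.

    psi[(y_e, y_h)] : label-label potential (dependency between e and h)
    prior_e[y_e]    : prior belief that e has label y_e
    incoming_to_e   : messages e received from its neighbours other than h,
                      each given as a dict mapping label -> value
    """
    message = {}
    for y_h in labels:
        total = 0.0
        for y_e in labels:
            product = 1.0
            for incoming in incoming_to_e:
                product *= incoming[y_e]
            total += psi[(y_e, y_h)] * prior_e[y_e] * product
        message[y_h] = total
    return message


# Toy binary example with a homophily-style potential (values are made up).
LABELS = [0, 1]
PSI = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}
message = lbp_message(
    LABELS,
    PSI,
    prior_e={0: 0.2, 1: 0.8},
    incoming_to_e=[{0: 0.5, 1: 0.5}],
)
```

Because e is believed to carry label 1 and the potential favors matching labels, the message assigns more weight to label 1 for h.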

Preliminary Results

Results for the multi-label classification task:

Performance results for the metrics ROC-AUC and accuracy.

The results above indicate that enriching the PPI graph with knowledge about its entities increased the overall performance of the machine learning methods evaluated.
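For reference, the two reported metrics could be computed for a multi-label task as sketched below. ROC-AUC is computed per label as the probability that a positive example is scored above a negative one and then averaged over labels; accuracy is taken here as the fraction of correct (node, label) entries at a 0.5 threshold. Both definitions are assumptions, since the exact variants used are not stated, and the data is made up.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the probability that a positive outranks a negative."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs both classes to be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def mean_roc_auc(true_rows, score_rows):
    """Average the per-label ROC-AUC over all label columns."""
    num_labels = len(true_rows[0])
    aucs = [roc_auc([row[j] for row in true_rows],
                    [row[j] for row in score_rows])
            for j in range(num_labels)]
    return sum(aucs) / num_labels


def accuracy(true_rows, score_rows, threshold=0.5):
    """Fraction of individual (node, label) entries predicted correctly."""
    correct = 0
    total = 0
    for true_row, score_row in zip(true_rows, score_rows):
        for truth, score in zip(true_row, score_row):
            prediction = 1 if score >= threshold else 0
            correct += int(prediction == truth)
            total += 1
    return correct / total


# Made-up labels and scores: 4 nodes, 2 labels.
Y_TRUE = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_SCORE = [[0.9, 0.2], [0.4, 0.7], [0.6, 0.15], [0.3, 0.1]]
```

ROC-AUC depends only on how the scores rank the examples, while accuracy depends on the chosen threshold, so the two can move differently between models.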

Authors

Laura Balbi

LASIGE, Faculdade de Ciências

Catia Pesquita

LASIGE, Faculdade de Ciências

Funding

Catia Pesquita and Laura Balbi are funded by the FCT through the LASIGE Research Unit, refs. UIDB/00408/2020 and UIDP/00408/2020.