Improving Data Mining of PPI networks by combining DL with KGs

Motivation

Graphs provide a data structure for knowledge representation and are useful for the description of relationships in all kinds of knowledge domains, including biological systems.

Knowledge graphs (KGs) represent knowledge through a logical description of its concepts, including biological entities. Combining biological data with the relevant biological knowledge available brings knowledge graphs into data mining tasks and creates an opportunity for ML models to learn from the enriched graph data.

Objectives

This work aims to improve representation learning-based graph mining over PPI networks by enriching the graph with meaningful knowledge pertaining to its proteins, through:

Investigation of different approaches to integrating the Gene Ontology with PPI networks, combined with different deep learning pipelines.

Comparison and evaluation of the different combinations on benchmark data.

The Gene Ontology (GO) is one of the main resources of biological knowledge. It provides a specific vocabulary for describing several biological aspects of proteins: the biological processes (BP) they may be involved in; the cellular components (CC) in which they are located; and their molecular functions (MF).

Through the ontology’s annotations (GOA), different proteins can be linked to the description of their function.

Figure: part of a knowledge graph combining biological data (proteins) with its related knowledge, as described by the Gene Ontology’s terms, through annotations.
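As a minimal sketch of how annotations link proteins to GO terms, the snippet below parses rows in the tab-separated GAF annotation format and groups GO term IDs by protein and by aspect. The helper name `parse_gaf_lines` and the sample rows are hypothetical; the column positions assume the GAF 2.x layout (column 2: object ID, column 5: GO ID, column 9: aspect, with P, F, C standing for BP, MF, CC). Qualifiers such as NOT are ignored here, which a real pipeline would probably need to handle.

```python
from collections import defaultdict

# Aspect codes used in GAF files, mapped to the abbreviations used in this text.
ASPECTS = {"P": "BP", "F": "MF", "C": "CC"}


def parse_gaf_lines(lines):
    """Group GO term IDs by protein ID and GO aspect (hypothetical helper)."""
    annotations = defaultdict(lambda: defaultdict(set))
    for line in lines:
        if not line.strip() or line.startswith("!"):  # skip blanks and header comments
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # malformed row: the aspect column is missing
        protein_id, go_id, aspect = cols[1], cols[4], cols[8]
        if aspect in ASPECTS:
            annotations[protein_id][ASPECTS[aspect]].add(go_id)
    return annotations


# Made-up rows for illustration only; the IDs are placeholders, not real annotations.
sample = [
    "!gaf-version: 2.2",
    "DB\tPROT_A\tsymA\t\tGO:9999991\tREF:1\tIEA\t\tP\t\t\tprotein\ttaxon:0\t20200101\tDB",
    "DB\tPROT_A\tsymA\t\tGO:9999992\tREF:1\tIEA\t\tF\t\t\tprotein\ttaxon:0\t20200101\tDB",
    "DB\tPROT_B\tsymB\t\tGO:9999991\tREF:2\tIEA\t\tP\t\t\tprotein\ttaxon:0\t20200101\tDB",
]
annotations = parse_gaf_lines(sample)
```

Each protein ends up with one set of GO terms per aspect, which is the information needed to draw protein-to-term edges in the knowledge graph.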

Methodology

Datasets used:

PPI networks from the Open Graph Benchmark (OGB) project: ogbn-proteins and ogbl-ppa;

Data mining tasks:

ogbn-proteins - multi-label classification, where labels represent protein functional classes;

ogbl-ppa - PPI prediction problem (link prediction).

Ontology and annotations retrieved from the Gene Ontology website;

Methodology steps:

Enrich the datasets with the ontology and with the annotations linking its terms to the datasets’ proteins;

Investigate graph/machine learning methods that are able to exploit the added information;

Three Deep Learning pipelines used so far:

1) simple MLP; 2) MLP with node2vec node embedding; and 3) Graph Convolutional Network (GCN).
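The enrichment step listed above could be sketched as follows, assuming the PPI network is stored as an edge list over integer protein indices and that GO terms are appended as extra nodes after the proteins. The names `enrich_ppi_graph`, `protein_index`, the edge-type strings, and the toy inputs are all hypothetical, and this is only one of several possible ways to integrate the ontology.

```python
def enrich_ppi_graph(ppi_edges, num_proteins, protein_index, annotations, is_a):
    """Append GO-term nodes, annotation edges, and ontology edges to a PPI edge list.

    ppi_edges:     list of (protein_idx, protein_idx) pairs
    protein_index: dict mapping protein ID -> node index in [0, num_proteins)
    annotations:   dict mapping protein ID -> iterable of GO term IDs
    is_a:          list of (child GO ID, parent GO ID) pairs from the ontology
    """
    # Collect every GO term mentioned and give it an index after the proteins.
    terms = sorted({t for ts in annotations.values() for t in ts}
                   | {t for pair in is_a for t in pair})
    term_index = {t: num_proteins + i for i, t in enumerate(terms)}

    edges = [(u, v, "ppi") for u, v in ppi_edges]
    for protein, ts in annotations.items():
        if protein in protein_index:  # ignore annotations for proteins outside the dataset
            for t in sorted(ts):
                edges.append((protein_index[protein], term_index[t], "annotated_with"))
    for child, parent in is_a:
        edges.append((term_index[child], term_index[parent], "is_a"))
    return edges, term_index


# Toy example with made-up identifiers.
edges, term_index = enrich_ppi_graph(
    ppi_edges=[(0, 1)],
    num_proteins=2,
    protein_index={"PROT_A": 0, "PROT_B": 1},
    annotations={"PROT_A": {"GO:9999991"}, "PROT_B": {"GO:9999992"}},
    is_a=[("GO:9999991", "GO:9999993")],
)
```

Keeping the edge type as a third element leaves the choice open between treating the enriched graph as homogeneous (dropping the type) or as a typed knowledge graph.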

Pipelines

1) MLP (simple feed-forward neural network)

The MLP receives only raw node features as input; it does not consider the nodes' local neighborhoods.
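To make this concrete, here is a dependency-free sketch of an MLP forward pass for a single node. The layer sizes, the random initialisation, and the function names are assumptions chosen for illustration (a real pipeline would normally use a deep learning framework and trained weights); the sigmoid output layer reflects the multi-label setting, where each label receives its own probability.

```python
import math
import random


def dense(x, weights, bias):
    """One fully connected layer: returns weights @ x + bias for a single node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp_forward(x, params):
    """Forward pass: ReLU on hidden layers, sigmoid on the output layer (multi-label)."""
    *hidden, (w_out, b_out) = params
    for w, b in hidden:
        x = [max(0.0, v) for v in dense(x, w, b)]
    return [1.0 / (1.0 + math.exp(-v)) for v in dense(x, w_out, b_out)]


def init_params(sizes, seed=0):
    """Small random weights and zero biases for layer sizes such as [in, hidden, out]."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


# Hypothetical sizes: 4 raw features per node, 8 hidden units, 3 labels.
params = init_params([4, 8, 3])
node_features = [1.0, 0.0, 0.5, 0.2]  # features of ONE node; its neighbors are never consulted
probabilities = mlp_forward(node_features, params)
```

The prediction for a node depends only on that node's own feature vector, which is why this pipeline serves as the neighborhood-agnostic baseline.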

2) MLP with Node2Vec embeddings

The MLP + node2vec pipeline feeds the MLP both the raw node features and node2vec embeddings, which are vector representations learned from random walks through each node’s neighborhood.
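The random-walk part of node2vec can be sketched as below, for an unweighted, undirected graph stored as an adjacency dictionary. The function name, the toy graph, and the parameter values are hypothetical. The sketch covers only walk generation and the final concatenation; learning the embeddings from the walks (a skip-gram model in node2vec) is omitted, and the embedding values shown are placeholders.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """One second-order biased random walk, following the node2vec transition rule.

    adj maps each node to a list of its neighbors (undirected graph assumed).
    p is the return parameter, q the in-out parameter.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = adj[cur]
        if not neighbors:
            break  # dead end: isolated node
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # the first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)  # move further away from the previous node
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy graph and parameters, for illustration only.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(42)
walks = [node2vec_walk(adj, n, length=5, p=0.5, q=2.0, rng=rng) for n in adj]

# After embeddings are learned from the walks, the MLP input is the concatenation
# of raw features and embedding. The embedding values here are placeholders.
raw_features = {0: [1.0, 0.0]}
embedding = {0: [0.12, -0.30, 0.05]}
mlp_input = raw_features[0] + embedding[0]
```

A low p makes walks revisit nodes and stay local, while a low q pushes them outward, so the two parameters control which notion of neighborhood the embedding captures.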

3) GCN

The GCN takes each node's neighborhood into account through message passing. Its classification is based on the loopy belief propagation method, where, for example, the unlabeled node h receives message m:

where 𝜓(𝑌𝑒,𝑌ℎ) represents the dependency between h and e, ∅𝑓 is the initial (prior) belief of the node’s label, ℒ is the set of all labels Y, and e is a neighboring node of h.
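As a minimal sketch, the standard loopy belief propagation message update can be written with the symbols defined above. The sketch assumes that the prior belief is taken at the sending neighbor e and that the messages e has received from its other neighbors are multiplied in, as in the usual formulation; the function name, the normalisation step, and the toy numbers are additions for illustration.

```python
def lbp_message(psi, phi_e, incoming, labels):
    """Message from neighbor e to node h in loopy belief propagation.

    m_{e->h}(Y_h) = sum over Y_e in labels of
                    psi[Y_e][Y_h] * phi_e[Y_e] * product of m_{k->e}(Y_e)
    where k ranges over the neighbors of e other than h.

    psi:      dict of dicts, psi[y_e][y_h] = label-label dependency
    phi_e:    dict, prior belief of e for each label (assumed to sit at the sender)
    incoming: list of dicts, messages m_{k->e} from e's other neighbors
    """
    message = {}
    for y_h in labels:
        total = 0.0
        for y_e in labels:
            term = psi[y_e][y_h] * phi_e[y_e]
            for m in incoming:
                term *= m[y_e]
            total += term
        message[y_h] = total
    norm = sum(message.values())
    return {y: v / norm for y, v in message.items()}  # normalised for numerical stability


# Toy numbers, for illustration only.
labels = [0, 1]
psi = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}  # neighbors tend to share a label
phi_e = {0: 0.8, 1: 0.2}                           # e is probably label 0
incoming = [{0: 0.5, 1: 0.5}]                      # one uninformative message into e
m_e_to_h = lbp_message(psi, phi_e, incoming, labels)
```

In this toy case the message tells h that label 0 is more likely, because e leans towards label 0 and the dependency 𝜓 favours neighbors with the same label.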

Preliminary Results

Results for the multi-label classification task:

Performance results for the metrics ROC-AUC and accuracy.
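For a multi-label task, ROC-AUC is commonly computed per label and then averaged over labels. The sketch below shows one dependency-free way to do this using the rank-sum formulation; the function names are hypothetical and the toy data are not results from this work.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC of one binary label via the rank-sum (Mann-Whitney U) formulation."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # tied scores share the average of their 1-based ranks
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # undefined when only one class is present
    rank_sum = sum(r for r, t in zip(ranks, y_true) if t == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mean_roc_auc(Y_true, Y_score):
    """Average ROC-AUC over label columns, skipping labels where it is undefined.

    Assumes at least one label has both classes present.
    """
    aucs = []
    for col in range(len(Y_true[0])):
        auc = roc_auc([row[col] for row in Y_true], [row[col] for row in Y_score])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)


# Toy data: 4 nodes, 2 labels. Rows are nodes, columns are labels.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]]
score = mean_roc_auc(Y_true, Y_score)
```

Here the first label is ranked perfectly (AUC 1.0) and the second has one misordered positive-negative pair out of four (AUC 0.75), so the mean is 0.875.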

The results above indicate that enriching the PPI graph with knowledge about its entities yielded a significant overall increase in the performance of the machine learning methods evaluated.

Future Work

In the near future, these implementations will be adapted to the protein-protein interaction prediction problem.
Furthermore, the selection of machine learning pipelines evaluated will be expanded to more complex learning models, and different approaches to filtering the ontology data injected into the PPI networks will be explored.

Authors

Laura Balbi

LASIGE, Faculdade de Ciências

Catia Pesquita