Is there Data Leakage in

Protein Interactions Prediction

using Knowledge Graphs?

Rita T. Sousa, Sara Silva, Catia Pesquita

LASIGE, Faculdade de Ciências, Universidade de Lisboa

Biomedical knowledge graphs share, reuse and import data from each other routinely...

... This can be a potential source of data leakage for machine learning applications using these knowledge graphs, creating predictive models with overestimated performance values.

To investigate potential data leakage, we simulate a realistic training situation, where everything in the training set occurs before everything in the test set.

Motivation

As machine learning is increasingly being used, concerns about data leakage have been raised.

Leakage occurs when information about the target of a data mining problem that should not be legitimately available to mine from is introduced, and it can lead to overestimation of the model’s performance. In biomedical applications, such as protein-protein interaction (PPI) prediction, data leakage can also be an issue since multiple databases and resources reuse the same sources of information.

The majority of PPI prediction methods that are based on knowledge graphs (KGs) explore the Gene Ontology (GO) KG, composed of the GO and GO annotations. Since GO annotations can be based on the automated processing of other data sources, the same information that is used to support a PPI in a database to also be used to establish a GO annotation for the proteins.

We investigated potential data leakage between the GO KG and the STRING database in the task of PPI prediction, by comparing performance on unseen interactions.

Knowledge Graphs and Data Resources

PPI prediction is cast as a classification task that takes as input the GO KG and a set of protein pairs. The first step of our approach is using historical data to build the PPI datasets. Then we use the GO KG and the protein pairs to predict interactions using several machine learning algorithms.

Gene Ontology Knowledge Graph

GO describes protein function with respect to three semantic aspects: biological processes, cellular components, and molecular functions. GO and the annotations that link proteins to GO classes make up a KG. We obtained archived versions of the GO and GO annotations in 2015, 2017 and 2019 from the GO Data Archive.

Protein-Protein Interaction Datasets

The PPI datasets were obtained from the STRING Database, one of the largest available PPI databases. We built several PPI datasets using three archived versions of the STRING database (v9.1, v10, and v10.5) and the current version (v11). For the current version, we created three datasets each excluding protein pairs present in each of the older versions.

We considered the following criteria to select protein pairs from STRING:

(i) each protein must be annotated with the GO;

(ii) protein interactions must be experimentally determined or from curated databases;

(iii) interactions must have a confidence score above 950 to retain only high confidence interactions.

We employed random sampling to create negative pairs composed of the human proteins present in the positive pairs but without any STRING interactions between them, building a balanced dataset.

Protein-Protein Interaction Prediction

We follow the setup in evoKGsim+, available on GitHub, that predicts relations between KG entity pairs that are not encoded in the graph using similarity-based semantic representations.

Similarity Computation

Employ three KG-based semantic similarity measures to compute semantic similarity: (i) two taxonomic measures (ResnikMax, SimGIC) combined with ICSeco approach; (ii) one based on graph embedding methods (RDF2Vec).

Supervised Learning

Apply six well-known classes of machine learning models to train classifiers: K-Nearest Neighbor (KNN); Genetic Programming (GP); Decision Tree (DT); XGBoost (XGB); Random Forest (RF); Multi-Layer Perceptron (MLP).

Performance Evaluation

For evaluating the quality of a predicted classification, the Weighted Average F-measure (WAF) was used for stratified 10-fold cross-validation. The values ​​reported are the WAFs median.

Results

We conducted two types of experiments: (i) Same version, where we train the model with randomly chosen 10 000 protein interacting pairs from the archived STRING version and test it with the remaining pairs; (ii) Future version, where we train the model with randomly chosen 10 000 protein pairs from the archived STRING version and test it on data from the current STRING version (excluding interactions present in the archived version). The same randomly chosen 10 000 protein pairs are used in both settings.

One-Minute Video

Authors

Rita T. Sousa

LASIGE, Faculdade de Ciências

Sara Silva

LASIGE, Faculdade de Ciências

Catia Pesquita

LASIGE, Faculdade de Ciências

Funding

Catia Pesquita, Sara Silva, Rita T. Sousa are funded by the FCT through LASIGE Research Unit, ref. UIDB/00408/2020 and ref. UIDP/00408/2020. Catia Pesquita and Rita T. Sousa are funded by project SMILAX (ref. PTDC/EEI-ESS/4633/2014), Sara Silva by projects BINDER (ref. PTDC/CCI-INF/29168/2017) and PREDICT (ref. PTDC/CCI-CIF/29877/2017), and Rita T. Sousa by FCT PhD grant (ref. SFRH/BD/145377/2019). It was also partially supported by the KATY project which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101017453.