Road to ML Engineer #36 - GNN Pipelines

Last Edited: 1/23/2025

This blog post discusses how to build GNN pipelines for various graph-related tasks using GNN APIs.

ML

So far, when discussing GNNs, we have focused on creating latent node representations for downstream tasks while avoiding discussion of the downstream tasks themselves. However, the downstream task is crucial for choosing appropriate data preparation, models, and evaluations when building a GNN pipeline. Now that we have covered some basic models and the fundamental functionality of GNN APIs for building pipelines, we will finally discuss downstream tasks and the pipelines they inform.

Graph-Level Tasks

Graph-level tasks involve predicting the class or value of an entire graph by aggregating the final latent node embeddings, or generating new graphs. Examples include binary toxicity classification of molecules, chemical property prediction for materials science, and drug candidate generation for drug discovery. For graph-level tasks, we prepare multiple graphs, each with an associated label, and perform supervised learning by splitting the dataset into training, validation, and test sets, as we have done before.
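
To make the setup concrete, here is a minimal TF-GNN sketch that assembles one labeled graph as a GraphTensor; the "atoms" and "bonds" set names, the toy features, and the context "label" feature are illustrative assumptions rather than a fixed schema.

```python
import tensorflow as tf
import tensorflow_gnn as tfgnn

# One labeled molecule graph: 3 atoms, 2 bonds, and a graph-level label
# stored as a context feature (one value per graph component).
graph = tfgnn.GraphTensor.from_pieces(
    node_sets={
        "atoms": tfgnn.NodeSet.from_fields(
            sizes=tf.constant([3]),
            features={"hidden_state": tf.constant([[1.0], [2.0], [3.0]])},
        )
    },
    edge_sets={
        "bonds": tfgnn.EdgeSet.from_fields(
            sizes=tf.constant([2]),
            adjacency=tfgnn.Adjacency.from_indices(
                source=("atoms", tf.constant([0, 1])),
                target=("atoms", tf.constant([1, 2])),
            ),
        )
    },
    context=tfgnn.Context.from_fields(features={"label": tf.constant([0])}),
)
```

The PyG side is even shorter, since benchmark datasets such as MUTAG (188 labeled molecule graphs) come prepackaged; the split sizes below are arbitrary choices.

```python
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

# MUTAG: 188 small molecule graphs, each with a binary class label.
dataset = TUDataset(root="data/TUDataset", name="MUTAG").shuffle()

# Treat each graph as one data point and split by index.
train_dataset = dataset[:120]
val_dataset = dataset[120:150]
test_dataset = dataset[150:]

# DataLoader batches many small graphs into one disjoint union graph.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```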

The above demonstrates how to generate datasets for graph-level tasks using TF-GNN and PyG. The PyG implementation is almost identical to what we covered in the last article and matches the setting we are most familiar with. Since the models never see the test graphs during training, they are said to be trained in the inductive setting, and models trained in inductive settings can generalize to unseen data.
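
On the model side, a graph-level classifier typically ends with a readout that pools the final node embeddings into one vector per graph. Here is a minimal PyG sketch; the two-layer GCN, hidden size, and mean pooling are assumed choices rather than requirements.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GraphClassifier(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.lin = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index, batch):
        # Message passing produces latent node embeddings.
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        # Readout: aggregate node embeddings into one vector per graph.
        x = global_mean_pool(x, batch)
        return self.lin(x)

# For a DataLoader batch: model(batch.x, batch.edge_index, batch.batch)
# returns one prediction per graph in the batch.
```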

Node-Level & Edge-Level Tasks

Node-level and edge-level tasks involve predicting the class or value of a node or an edge. Examples of node-level tasks include fraud account detection in a financial transaction network and venue prediction for an academic paper in a citation network. Examples of edge-level tasks include rating prediction in recommender systems and future link prediction between social media accounts. These tasks typically involve a single graph rather than multiple graphs, so we treat each node or edge as one data point and use masks to split the dataset into training, validation, and test sets.
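
Here is a minimal PyG sketch of this masking pattern, using the Cora citation network (venue prediction for academic papers) with a single GCN layer standing in for any node-level model.

```python
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

# Cora: a single citation graph that ships with boolean
# train/val/test masks over its nodes.
dataset = Planetoid(root="data/Planetoid", name="Cora")
data = dataset[0]

model = GCNConv(dataset.num_node_features, dataset.num_classes)

# The model sees the entire graph, but the loss (and each evaluation
# metric) is computed only on the nodes selected by the relevant mask.
out = model(data.x, data.edge_index)
loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
val_acc = (
    (out[data.val_mask].argmax(dim=-1) == data.y[data.val_mask])
    .float()
    .mean()
)
```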

When working with a single graph with masking, the models observe the entire graph structure during training, including the nodes and edges they will later be evaluated on; such models are said to be trained in a transductive setting. The models covered so far use adjacency matrices and assume the entire graph structure is known, so when they produce embeddings for a single graph, they are less likely to generalize to unseen graphs.

Large Dynamic Graphs

It is acceptable for a model to lack generalizability to new graphs if the graph remains static over time. However, this becomes problematic when dealing with dynamically changing graphs. Additionally, conventional models are slow on large graphs because they operate on the full adjacency matrix. For large, dynamic graphs, such as a large user-item bipartite graph for rating prediction, we typically sample a subgraph of neighbors around a root node and train the model to make predictions for the root node using the embeddings of its sampled subgraph.
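
PyG packages this sampling strategy as NeighborLoader. Here is a minimal sketch, reusing Cora as a stand-in for a much larger graph; the fan-out sizes and batch size are arbitrary choices.

```python
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

data = Planetoid(root="data/Planetoid", name="Cora")[0]

# For each seed (root) node, sample up to 10 first-hop and
# 5 second-hop neighbors; each batch holds 128 seed nodes.
loader = NeighborLoader(
    data,
    num_neighbors=[10, 5],
    batch_size=128,
    input_nodes=data.train_mask,
)

for batch in loader:
    # The first batch.batch_size rows of batch.x correspond to the
    # seed nodes, so predictions and losses are taken on those rows.
    ...
```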

Using this method, models can be trained in an inductive setting, enabling generalization to unseen data and to subgraphs of a dynamically changing graph. This approach also results in smaller, faster models. Depending on the task, then, we must choose appropriate data preparation, models, training settings, and evaluation methods.

Conclusion

In this article, we covered graph-level, node-level, and edge-level tasks, the inductive and transductive settings, and how to build appropriate GNN pipelines with GNN APIs. Once again, it is crucial to identify the type of problem at hand and choose an appropriate GNN pipeline accordingly. For more information on using GNN APIs, we recommend consulting the official documentation referenced below.

Resources