Road to ML Engineer #32 - Graph Convolutional Networks

Last Edited: 1/1/2025

This blog post discusses graph convolutional networks.

In the last article, we discussed how the function that produces latent representations of a graph, $F$, needs to be permutation equivariant, which can be achieved by applying a permutation invariant local function $g$ row-wise. There are multiple permutation invariant functions we can use for $g$, but one of the simplest is to take the weighted sum of each feature over the node and its neighbors, which we have been calling convolution.
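
As a quick sanity check of that idea, here is a minimal sketch (with made-up features for a three-neighbor neighborhood) showing that a sum aggregator returns the same result no matter how the neighbors are ordered:

import tensorflow as tf

# Hypothetical features of a node's neighborhood: 3 neighbors, 2 features each.
X_neighbors = tf.constant([[1.0, 2.0],
                           [3.0, 4.0],
                           [5.0, 6.0]])

# Permute the rows (i.e., relabel the neighbors).
X_permuted = tf.gather(X_neighbors, [2, 0, 1])

# A sum aggregator produces the same result regardless of the ordering,
# which is what makes it a valid permutation invariant choice for g.
print(tf.reduce_sum(X_neighbors, axis=0).numpy())  # [ 9. 12.]
print(tf.reduce_sum(X_permuted, axis=0).numpy())   # [ 9. 12.]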

Graph Convolution

Graph convolution differs from the convolution operation we covered in the context of image processing in that graph convolution does not have a fixed kernel size to slide over. However, it is essentially the same in that it takes the weighted sum of a unit of data and all its neighbors. Using the adjacency matrix $A$, we can compute the weighted sum of neighbors quite conveniently by $AXW$, where $A$ is of size $(n, n)$, $X$, the node matrix, is of size $(n, d_{in})$, and $W$, the weight matrix, is of size $(d_{in}, d_{out})$. Here, $n$ is the number of nodes, $d_{in}$ is the node feature dimension, and $d_{out}$ is the number of kernels or the output feature dimension. The resulting sum will be of size $(n, d_{out})$.

However, $AXW$ is not a proper convolution, as it does not include the node itself in computing the weighted sum. To address this, we create a new matrix $\tilde{A}$ by adding the identity matrix to the adjacency matrix and compute $\tilde{A}XW$, making it a proper graph convolution. Notice that we could have set up $W$ of size $(n, n)$, with values depending on node pairs, and computed $W\tilde{A}X$. However, this operation is not permutation equivariant, as the permutation of $W$ must correspond to that of $\tilde{A}$. Hence, we compute $\tilde{A}XW$ with $W$ whose values correspond to the input and output features.
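
To make the shapes concrete, here is a minimal sketch with made-up sizes ($n = 3$, $d_{in} = 2$, $d_{out} = 4$) and a toy path graph, computing $\tilde{A}XW$:

import tensorflow as tf

n, d_in, d_out = 3, 2, 4  # made-up sizes for illustration

# A toy undirected graph: node 0 -- node 1, node 1 -- node 2.
A = tf.constant([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
A_tilde = A + tf.eye(n)               # add self-loops so each node includes itself

X = tf.random.normal((n, d_in))       # node features, (n, d_in)
W = tf.random.normal((d_in, d_out))   # shared weights, (d_in, d_out)

H = tf.matmul(tf.matmul(A_tilde, X), W)  # A_tilde X W
print(H.shape)  # (3, 4), i.e., (n, d_out)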

Graph Convolutional Network

To allow the model to learn complex and nonlinear relationships through backpropagation, we introduce a non-linear activation function $\sigma$ and compute $\sigma(\tilde{A}XW)$. However, the latent representations from the weighted sum can grow larger and larger, making it difficult for the model to learn appropriate weights. To normalize the weighted sum, we multiply it by the inverse of the degree matrix $D$, which contains the degree of each node in its diagonal entries, finally arriving at a graph convolutional network (GCN). (*The degree of a node is the number of its edges for an unweighted graph and the sum of its edge weights for a weighted graph.)

$$
H^{(0)} = X \\
H^{(l+1)} = \sigma(D^{-1}\tilde{A}H^{(l)}W^{(l)})
$$
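
As a small illustration of why the $D^{-1}$ term keeps the representations from growing, the sketch below (using a made-up three-node graph with self-loops) checks that every row of $D^{-1}\tilde{A}$ sums to one, so each node receives a mean, rather than an unbounded sum, over its neighborhood:

import tensorflow as tf

# Toy graph with self-loops already added.
A_tilde = tf.constant([[1.0, 1.0, 0.0],
                       [1.0, 1.0, 1.0],
                       [0.0, 1.0, 1.0]])

D_inv = tf.linalg.diag(1.0 / tf.reduce_sum(A_tilde, axis=-1))

# Each row of D^{-1} A_tilde sums to 1, so the aggregation is an average over
# the node and its neighbors rather than an unbounded sum.
print(tf.reduce_sum(tf.matmul(D_inv, A_tilde), axis=-1).numpy())  # ~[1. 1. 1.]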

At each layer $l$, GCN performs the computation above to produce a graph embedding $H^{(l+1)}$ of size $(n, d_{out})$. However, in practice, GCN tends to compute $\sigma(D^{-\frac{1}{2}}\tilde{A}D^{-\frac{1}{2}}H^{(l)}W^{(l)})$, which normalizes the value by $\sqrt{d_i d_j}$, the square root of the product of the degrees of the pair of nodes, instead of only the degree of the node of interest, and this empirically tends to perform better. In the context of $g$ and $F$, GCN can be expressed as follows.

$$
g(x_i, X_{N_i}) = \sigma\left(\sum_{j \in \{i\} \cup N_i} \frac{1}{\sqrt{d_i d_j}}Wx_j\right) \\
F(X, A) = \begin{bmatrix} - g(x_1, X_{N_1}) - \\ - g(x_2, X_{N_2}) - \\ \vdots \\ - g(x_n, X_{N_n}) - \end{bmatrix} = \sigma(D^{-\frac{1}{2}}\tilde{A}D^{-\frac{1}{2}}XW)
$$
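
The following sketch (again with a made-up three-node graph and random features) checks numerically that the row-wise definition of $g$ matches the matrix form $\sigma(D^{-\frac{1}{2}}\tilde{A}D^{-\frac{1}{2}}XW)$:

import tensorflow as tf

# Toy graph with self-loops, as above, plus random features and weights.
A_tilde = tf.constant([[1.0, 1.0, 0.0],
                       [1.0, 1.0, 1.0],
                       [0.0, 1.0, 1.0]])
X = tf.random.normal((3, 2))
W = tf.random.normal((2, 4))

d = tf.reduce_sum(A_tilde, axis=-1)            # degrees, including self-loops
D_inv_sqrt = tf.linalg.diag(1.0 / tf.sqrt(d))

# Matrix form: sigma(D^{-1/2} A_tilde D^{-1/2} X W), using ReLU as sigma.
H_matrix = tf.nn.relu(
    tf.matmul(tf.matmul(tf.matmul(tf.matmul(D_inv_sqrt, A_tilde), D_inv_sqrt), X), W))

# Row-wise form: g(x_i, X_{N_i}) = sigma(sum over j in {i} and N_i of x_j W / sqrt(d_i d_j)).
rows = []
for i in range(3):
    acc = tf.zeros((4,))
    for j in range(3):
        if A_tilde[i, j] > 0:
            acc += tf.tensordot(X[j], W, axes=1) / tf.sqrt(d[i] * d[j])
    rows.append(tf.nn.relu(acc))
H_rows = tf.stack(rows)

print(tf.reduce_max(tf.abs(H_matrix - H_rows)).numpy())  # ~0: the two forms agree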

Because its convolution operation is relatively simple, taking a normalized weighted sum over the node and its neighbors with weights that do not depend on the nodes of interest, GCN tends to perform better on homophilous graphs, where similar nodes are more likely to be connected.

Code Implementation

The following shows the TensorFlow implementation of a GCN layer.

import tensorflow as tf
from tensorflow.keras import layers


class GraphConvolutionLayer(layers.Layer):
    def __init__(self, d_in, d_out):
        super(GraphConvolutionLayer, self).__init__()
        self.d_in = d_in
        self.d_out = d_out
        # Shared weight matrix W of size (d_in, d_out).
        self.W = self.add_weight(shape=(d_in, d_out),
                                 initializer='glorot_uniform',
                                 trainable=True)

    def call(self, x, A):
        # A is assumed to already include self-loops, i.e., it plays the role of A_tilde.
        # Build the degree matrix D from A and compute D^{-1/2}.
        D = tf.reduce_sum(A, axis=-1)
        D = tf.linalg.diag(D)
        D_inv_sqrt = tf.linalg.inv(tf.sqrt(D))
        # Symmetric normalization: D^{-1/2} A D^{-1/2}.
        A = tf.matmul(tf.matmul(D_inv_sqrt, A), D_inv_sqrt)
        # Aggregate neighbor features, apply the weights, then the non-linearity.
        x = tf.matmul(A, x)
        x = tf.matmul(x, self.W)
        x = tf.nn.relu(x)
        return x

The implementation assumes that the degree matrix will not be provided and that the adjacency matrix passed in already includes self-loops (i.e., it is $\tilde{A}$). The latent representation can be passed to a subsequent classifier or regressor to perform various tasks on graphs. I highly recommend implementing the graph convolution layer in PyTorch for practice.
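
For completeness, here is a hypothetical usage sketch, assuming the GraphConvolutionLayer defined above is in scope and using a small random graph with self-loops, that stacks two layers to produce node embeddings:

import tensorflow as tf

n, d_in, d_hidden, d_out = 5, 8, 16, 4  # made-up sizes

# Random features and a random symmetric adjacency matrix with self-loops.
X = tf.random.normal((n, d_in))
A = tf.cast(tf.random.uniform((n, n)) > 0.5, tf.float32)
A = tf.maximum(A, tf.transpose(A))      # make it symmetric (undirected)
A = tf.maximum(A, tf.eye(n))            # add self-loops, as the layer expects

gcn1 = GraphConvolutionLayer(d_in, d_hidden)
gcn2 = GraphConvolutionLayer(d_hidden, d_out)

H = gcn2(gcn1(X, A), A)                 # stacked graph convolutions
print(H.shape)  # (5, 4): one embedding per node, ready for a downstream head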

Conclusion

In this article, we covered how convolution can be used as the permutation invariant aggregator to compose a permutation equivariant graph convolution layer, how a graph convolutional network is constructed by stacking such layers, and how it produces latent representations for downstream tasks. While demonstrating the mathematical operations and code implementation, I also mentioned how the weights for computing the normalized weighted sum do not depend on the node, making GCN simple and effective on homophilous graphs but not necessarily on other graphs. Thus, in the next article, we will look at more complex models that can work well on non-homophilous graphs as well.

Resources