Road to ML Engineer #32 - Graph Convolutional Networks

Last Edited: 1/1/2025

This blog post discusses graph convolutional networks.

In the last article, we discussed how the function that produces latent representations of a graph, $F$, needs to be permutation equivariant, which can be achieved by applying a permutation invariant local function $g$ row-wise. There are multiple permutation invariant functions we can use for $g$, but one of the simplest is to take the weighted sum of each feature over the node and its neighbors, which we have been calling convolution.
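
As a quick sanity check of that idea, here is a minimal sketch (with made-up features for a three-neighbor neighborhood) showing that a sum aggregator returns the same result no matter how the neighbors are ordered:

import tensorflow as tf

# Hypothetical features of a node's neighborhood: 3 neighbors, 2 features each.
X_neighbors = tf.constant([[1.0, 2.0],
                           [3.0, 4.0],
                           [5.0, 6.0]])

# Permute the rows (i.e., relabel the neighbors).
X_permuted = tf.gather(X_neighbors, [2, 0, 1])

# A sum aggregator produces the same result regardless of the ordering,
# which is what makes it a valid permutation invariant choice for g.
print(tf.reduce_sum(X_neighbors, axis=0).numpy())  # [ 9. 12.]
print(tf.reduce_sum(X_permuted, axis=0).numpy())   # [ 9. 12.]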

Graph Convolution

Graph convolution differs from the convolution operation we covered in the context of image processing in that graph convolution does not have a fixed kernel size to slide over. However, it is essentially the same in that it takes the weighted sum of a unit of data and all its neighbors. Using the adjacency matrix $A$, we can compute the weighted sum of neighbors quite conveniently by $AXW$, where $A$ is of size $(n, n)$, $X$, the node matrix, is of size $(n, d_{in})$, and $W$, the weight matrix, is of size $(d_{in}, d_{out})$. Here, $n$ is the number of nodes, $d_{in}$ is the node feature dimension, and $d_{out}$ is the number of kernels or the output feature dimension. The resulting sum will be of size $(n, d_{out})$.

However, $AXW$ is not a proper convolution, as it does not include the node itself in computing the weighted sum. To address this, we create a new matrix $\tilde{A}$ by adding the identity matrix to the adjacency matrix and compute $\tilde{A}XW$, making it a proper graph convolution. Notice that we could have set up $W$ of size $(n, n)$, with values depending on node pairs, and computed $W\tilde{A}X$. However, this operation is not permutation equivariant, as the permutation of $W$ must correspond to that of $\tilde{A}$. Hence, we compute $\tilde{A}XW$ with $W$ whose values correspond to the input and output features.
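
To make the shapes concrete, here is a minimal sketch with made-up sizes ($n = 3$, $d_{in} = 2$, $d_{out} = 4$) and a toy path graph, computing $\tilde{A}XW$:

import tensorflow as tf

n, d_in, d_out = 3, 2, 4  # made-up sizes for illustration

# A toy undirected graph: node 0 -- node 1, node 1 -- node 2.
A = tf.constant([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
A_tilde = A + tf.eye(n)               # add self-loops so each node includes itself

X = tf.random.normal((n, d_in))       # node features, (n, d_in)
W = tf.random.normal((d_in, d_out))   # shared weights, (d_in, d_out)

H = tf.matmul(tf.matmul(A_tilde, X), W)  # A_tilde X W
print(H.shape)  # (3, 4), i.e., (n, d_out)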

Graph Convolutional Network

To allow the model to learn complex and nonlinear relationships through backpropagation, we introduce a non-linear activation function $\sigma$ and compute $\sigma(\tilde{A}XW)$. However, the latent representations from the weighted sum can grow larger and larger, making it difficult for the model to learn appropriate weights. To normalize the weighted sum, we multiply it by the inverse of the degree matrix $D$, which contains the degree of each node in its diagonal entries, finally arriving at a graph convolutional network (GCN). (*The degree of a node is the number of its edges for an unweighted graph and the sum of its edge weights for a weighted graph.)

$$
H^{(0)} = X \\
H^{(l+1)} = \sigma(D^{-1}\tilde{A}H^{(l)}W^{(l)})
$$
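
As a small illustration of why the $D^{-1}$ term keeps the representations from growing, the sketch below (using a made-up three-node graph with self-loops) checks that every row of $D^{-1}\tilde{A}$ sums to one, so each node receives a mean, rather than an unbounded sum, over its neighborhood:

import tensorflow as tf

# Toy graph with self-loops already added.
A_tilde = tf.constant([[1.0, 1.0, 0.0],
                       [1.0, 1.0, 1.0],
                       [0.0, 1.0, 1.0]])

D_inv = tf.linalg.diag(1.0 / tf.reduce_sum(A_tilde, axis=-1))

# Each row of D^{-1} A_tilde sums to 1, so the aggregation is an average over
# the node and its neighbors rather than an unbounded sum.
print(tf.reduce_sum(tf.matmul(D_inv, A_tilde), axis=-1).numpy())  # ~[1. 1. 1.]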

At each layer $l$, GCN performs the computation above to produce a graph embedding $H^{(l+1)}$ of size $(n, d_{out})$. However, in practice, GCN tends to compute $\sigma(D^{-\frac{1}{2}}\tilde{A}D^{-\frac{1}{2}}H^{(l)}W^{(l)})$, which normalizes the value by $\sqrt{d_i d_j}$, the square root of the product of the degrees of the pair of nodes, instead of only the degree of the node of interest, and this empirically tends to perform better. In the context of $g$ and $F$, GCN can be expressed as follows.

$$
g(x_i, X_{N_i}) = \sigma\left(\sum_{j \in \{i\} \cup N_i} \frac{1}{\sqrt{d_i d_j}}Wx_j\right) \\
F(X, A) = \begin{bmatrix} - g(x_1, X_{N_1}) - \\ - g(x_2, X_{N_2}) - \\ \vdots \\ - g(x_n, X_{N_n}) - \end{bmatrix} = \sigma(D^{-\frac{1}{2}}\tilde{A}D^{-\frac{1}{2}}XW)
$$
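
The following sketch (again with a made-up three-node graph and random features) checks numerically that the row-wise definition of $g$ matches the matrix form $\sigma(D^{-\frac{1}{2}}\tilde{A}D^{-\frac{1}{2}}XW)$:

import tensorflow as tf

# Toy graph with self-loops, as above, plus random features and weights.
A_tilde = tf.constant([[1.0, 1.0, 0.0],
                       [1.0, 1.0, 1.0],
                       [0.0, 1.0, 1.0]])
X = tf.random.normal((3, 2))
W = tf.random.normal((2, 4))

d = tf.reduce_sum(A_tilde, axis=-1)            # degrees, including self-loops
D_inv_sqrt = tf.linalg.diag(1.0 / tf.sqrt(d))

# Matrix form: sigma(D^{-1/2} A_tilde D^{-1/2} X W), using ReLU as sigma.
H_matrix = tf.nn.relu(
    tf.matmul(tf.matmul(tf.matmul(tf.matmul(D_inv_sqrt, A_tilde), D_inv_sqrt), X), W))

# Row-wise form: g(x_i, X_{N_i}) = sigma(sum over j in {i} and N_i of x_j W / sqrt(d_i d_j)).
rows = []
for i in range(3):
    acc = tf.zeros((4,))
    for j in range(3):
        if A_tilde[i, j] > 0:
            acc += tf.tensordot(X[j], W, axes=1) / tf.sqrt(d[i] * d[j])
    rows.append(tf.nn.relu(acc))
H_rows = tf.stack(rows)

print(tf.reduce_max(tf.abs(H_matrix - H_rows)).numpy())  # ~0: the two forms agree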

Because its convolution operation is relatively simple, taking a normalized weighted sum over the node and its neighbors with weights that do not depend on the nodes of interest, GCN tends to perform better on homophilous graphs, where similar nodes are more likely to be connected.

Code Implementation

The following shows the TensorFlow implementation of a GCN layer.

import tensorflow as tf
from tensorflow.keras import layers


class GraphConvolutionLayer(layers.Layer):
    def __init__(self, d_in, d_out):
        super(GraphConvolutionLayer, self).__init__()
        self.d_in = d_in
        self.d_out = d_out
        # Shared weight matrix W of size (d_in, d_out).
        self.W = self.add_weight(shape=(d_in, d_out),
                                 initializer='glorot_uniform',
                                 trainable=True)

    def call(self, x, A):
        # A is assumed to already include self-loops, i.e., it plays the role of A_tilde.
        # Build the degree matrix D from A and compute D^{-1/2}.
        D = tf.reduce_sum(A, axis=-1)
        D = tf.linalg.diag(D)
        D_inv_sqrt = tf.linalg.inv(tf.sqrt(D))
        # Symmetric normalization: D^{-1/2} A D^{-1/2}.
        A = tf.matmul(tf.matmul(D_inv_sqrt, A), D_inv_sqrt)
        # Aggregate neighbor features, apply the weights, then the non-linearity.
        x = tf.matmul(A, x)
        x = tf.matmul(x, self.W)
        x = tf.nn.relu(x)
        return x

The implementation assumes that the degree matrix will not be provided and that the adjacency matrix passed in already includes self-loops (i.e., it is $\tilde{A}$). The latent representation can be passed to a subsequent classifier or regressor to perform various tasks on graphs. I highly recommend implementing the graph convolution layer in PyTorch for practice.
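
For completeness, here is a hypothetical usage sketch, assuming the GraphConvolutionLayer defined above is in scope and using a small random graph with self-loops, that stacks two layers to produce node embeddings:

import tensorflow as tf

n, d_in, d_hidden, d_out = 5, 8, 16, 4  # made-up sizes

# Random features and a random symmetric adjacency matrix with self-loops.
X = tf.random.normal((n, d_in))
A = tf.cast(tf.random.uniform((n, n)) > 0.5, tf.float32)
A = tf.maximum(A, tf.transpose(A))      # make it symmetric (undirected)
A = tf.maximum(A, tf.eye(n))            # add self-loops, as the layer expects

gcn1 = GraphConvolutionLayer(d_in, d_hidden)
gcn2 = GraphConvolutionLayer(d_hidden, d_out)

H = gcn2(gcn1(X, A), A)                 # stacked graph convolutions
print(H.shape)  # (5, 4): one embedding per node, ready for a downstream head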

Conclusion

In this article, we covered how convolution can be used as the permutation invariant aggregator to compose a permutation equivariant graph convolution layer, how a graph convolutional network is constructed by stacking such layers, and how it produces latent representations for downstream tasks. While demonstrating the mathematical operations and code implementation, I also mentioned how the weights for computing the normalized weighted sum do not depend on the node, making GCN simple and effective on homophilous graphs but not necessarily on other graphs. Thus, in the next article, we will look at more complex models that can work well on non-homophilous graphs as well.

Resources