in

# HashGNN: Deep Dive into Neo4j GDS’s New Node Embedding Algorithm | by Philipp Brunenberg | Aug, 2023

## On this article, we are going to discover alongside a small instance how HashGNN hashes graph nodes into an embedding area.

Should you desire watching a video on this, you are able to do so here.

HashGG (#GNN) is a node embedding method, which employs ideas of Message Passing Neural Networks (MPNN) to seize high-order proximity and node properties. It considerably speeds-up calculation compared to conventional Neural Networks by using an approximation method known as MinHashing. Due to this fact, it’s a hash-based method and introduces a trade-off between effectivity and accuracy. On this article, we are going to perceive what all of which means and can discover alongside a small instance how the algorithms works.

Many graph machine studying use-cases like hyperlink prediction and node classification require the calculation of similarities of nodes. In a graph context, these similarities are most expressive after they seize (i) the neighborhood (i.e. the graph construction) and (ii) the properties of the node to be embedded. Node embedding algorithms undertaking nodes right into a low-dimensional embedding area — i.e. they assign every node a numerical vector. These vectors — the embeddings — can be utilized for additional numerical predictive evaluation (e.g. machine studying algorithms). Embedding algorithms optimize for the metric: Nodes with the same graph context (neighborhood) and/or properties must be mapped shut within the embedding area. Graph embedding algorithms often make use of two basic steps: (i) Outline a mechanism to pattern the context of nodes (Random stroll in node2vec, k-fold transition matrix in FastRP), and (ii) subsequently scale back the dimensionality whereas preserving pairwise distances (SGD in node2vec, random projections in FastRP).

The clue about HashGNN is that it doesn’t require us to coach a Neural Community primarily based on a loss operate, as we must in a standard Message Passing Neural Community. As node embedding algorithms optimize for “related nodes must be shut within the embedding area”, the analysis of the loss includes calculating the true similarity of a node pair. That is then used as suggestions to the coaching, how correct the predictions are, and to regulate the weights accordingly. Usually, the cosine similarity is used as similarity measure.

HashGNN circumvents the mannequin coaching and really doesn’t make use of a Neural Community altogether. As an alternative of coaching weight matrices and defining a loss operate, it makes use of a randomized hashing scheme, which hashes node vectors to the identical signature with the chance of their similarity, which means that we are able to embed nodes with out the need of evaluating nodes instantly in opposition to one another (i.e. no must calculate a cosine similarity). This hashing method is named MinHashing and was initially outlined to approximate the similarity of two units with out evaluating them. As units are encoded as binary vectors, HashGNN requires a binary node illustration. With a view to perceive how this can be utilized to embed nodes of a basic graph, a number of methods are required. Let’s take a look.

To begin with, let’s discuss MinHashing. MinHashing is a locality-sensitive hashing method to approximate the Jaccard similarity of two units. The Jaccard similarity measures the overlap (intersection) of two units by dividing the intersection measurement by the variety of the distinctive components current (union) within the two units. It’s outlined on units, that are encoded as binary vectors: Every factor within the universe (the set of all components) is assigned a singular row index. If a selected set incorporates a component, that is represented as worth 1 within the respective row of the set’s vector. The MinHashing algorithm hashes every set’s binary vector independently and makes use of `Ok` hash capabilities to generate a `Ok`-dimensional signature for it. The intuitive rationalization of MinHashing is to randomly choose a non-zero factor `Ok` instances, by selecting the one with the smallest hash worth. This may yield the signature vector for the enter set. The attention-grabbing half is, that if we do that for 2 units with out evaluating them, they may hash to the identical signature with the chance of their Jaccard similarity (if `Ok` is massive sufficient). In different phrases: The chance converges in opposition to the Jaccard similarity.

Within the illustration: Instance units `s1` and `s2` are represented as binary vectors. We are able to simply compute the Jaccard similarity by evaluating the 2 and counting the rows the place each vectors have a 1. These are fairly easy operations, however the complexity resides within the pair-wise comparability of vectors if we’ve got many vectors.

Our universe `U` has measurement 6 and we select `Ok` (the variety of hash capabilities) to be 3. We are able to simply generate new hash capabilities by utilizing a easy system and bounds for `a`, `b` and `c`. Now, what we truly do is to hash the indices of our vectors (`1-6`), every regarding a single factor in our universe, with every of our hash capabilities. This may give us 3 random permutations of the indices and due to this fact components in our universe. Subsequently, we are able to take our set vectors `s1` and `s2` and use them as masks on our permuted options. For every permutation and set vector, we choose the index of the smallest hash worth, which is current within the set. This may yield two three-d vectors, one for every set, which known as the MinHash signature of the set.

MinHashing merely selects random options from the enter units, and we want the hash capabilities solely as a method to breed this randomness equally on all enter units. Now we have to make use of the identical set of hash capabilities on all vectors, in order that the signature values of two enter units collide if each have the chosen factor current. The signature values will collide with the chance of the units’ Jaccard similarities.

HashGNN makes use of a message passing scheme as outlined in Weisfeiler-Lehman Kernel Neural Networks (WLKNN) to seize high-order graph construction and node properties. It defines the context sampling technique as talked about earlier for HashGNN. The WLK runs in `T` iterations. In every iteration, it generates a brand new node vector for every node by combining the node’s present vector with the vectors of all instantly linked neighbors. Due to this fact, it’s thought of to go messages (node vectors) alongside the sides to neighboring nodes.

After `T` iterations, every node incorporates info of nodes at `T` hops distance (high-order). The calculation of the brand new node vector in iteration `t` basically aggregates all neighbor messages (from iteration `t-1`) right into a single neighbor message after which combines it with the node vector of the earlier iteration. Moreover, the WLKNN employs three neural networks (weight matrices and activation operate); (i) for the aggregated neighbor vector, (ii) for the node vector and (iii) for the mix of the 2. It’s a notable characteristic of the WLK that the calculation in iteration `t` is simply depending on the results of iteration `t-1`. Due to this fact, it may be thought of a Markov chain.

Let’s discover how HashGNN combines the 2 approaches to embed graph vectors effectively to embedding vectors. Identical to the WLKNN, the HashGNN algorithm runs in `T` iterations, calculating a brand new node vector for each node by aggregating the neighbor vectors and its personal node vector from the earlier iteration. Nevertheless, as a substitute of coaching three weight matrices, it makes use of three hashing schemes for locality-sensitive hashing. Every of the hashing schemes consists of `Ok` randomly constructed hashing capabilities to attract `Ok` random options from a binary vector.

In every iteration we carry out the next steps:

Step 1: Calculate the node’s signature vector: Min-hash (randomly choose `Ok` options) the node vector from the earlier iteration utilizing hashing scheme 3. Within the first iteration the nodes are initialized with their binary characteristic vectors (we’ll discuss the best way to binarize nodes later). The ensuing signature (or message) vector is the one which is handed alongside the sides to all neighbors. Due to this fact, we’ve got to do that for all nodes first in each iteration.

Step 2: Assemble the neighbor vector: In every node, we’ll obtain the signature vectors from all instantly linked neighbors and mixture them right into a single binary vector. Subsequently, we use hashing scheme 2 to pick `Ok` random options from the aggregated neighbor vector and name the outcome the neighbor vector.

Step 3: Mix node and neighbor vector into new node vector: Lastly, we use hashing scheme 1 to randomly choose `Ok` options from the node vector of the earlier iteration and mix the outcome with the neighbor vector. The ensuing vector is the brand new node vector which is the start line for the following iteration. Observe, that this isn’t the identical as in step 1: There, we apply hashing scheme 3 on the node vector to assemble the message/signature vector.

As we are able to see, the ensuing (new) node vector has affect of its personal node options (3 & 5), in addition to its neighbors’ options (2 & 5). After iteration 1 the node vector will seize info from neighbors with 1 hop distance. Nevertheless, as we use this as enter for iteration 2, it’ll already be influenced by options 2 hops away.

HashGNN was applied by the Neo4j GDS (Graph Information Science Library) and provides some helpful generalizations to the algorithm.

One necessary auxiliary step within the GDS is characteristic binarization. MinHashing and due to this fact HashGNN requires binary vectors as enter. Neo4j presents an extra preparation step to remodel real-valued node vectors into binary characteristic vectors. They make the most of a method known as hyperplane rounding. The algorithm works as follows:

Step 1: Outline node options: Outline the (real-valued) node properties for use as node options. That is finished utilizing the parameter `featureProperties`. We are going to name this the node enter vector `f`.

Step 2: Assemble random binary classifiers: Outline one hyperplane for every goal dimension. The variety of ensuing dimensions is managed by the parameter `dimensions`. A hyperplane is a high-dimensional aircraft and, so long as it resides within the origin, will be described solely by its regular vector `n`. The `n` vector is orthogonal to the aircraft’s floor and due to this fact describes its orientation. In our case the `n` vector must be of the identical dimensionality because the node enter vectors (`dim(f) = dim(n)`). To assemble a hyperplane, we merely draw `dim(f)` instances from a Gaussian distribution.

Step 3: Classify node vectors: Calculate the dot-product of the node enter vector and every hyperplane vector, which yields the angle between the hyperplane and the enter vector. Utilizing a `threshold` parameter, we are able to resolve whether or not the enter vector is above (1) or beneath (0) the hyperplane and assign the respective worth to the ensuing binary characteristic vector. That is precisely what additionally occurs in a binary classification — with the one distinction that we don’t iteratively optimize our hyperplane, however relatively use random Gaussian classifiers.

In essence, we draw one random Gaussian classifier for every goal dimension and set a threshold parameter. Then, we classify our enter vectors for every goal dimension and acquire a `d`-dimensional binary vector which we use as enter for HashGNN.

HashGNN makes use of locality-sensitive hashing to embed node vectors into an embedding area. Through the use of this system, it circumvents the computationally intensive, iterative coaching of a Neural Community (or different optimization), in addition to direct node comparisons. The authors of the paper report a 2–4 orders of magnitude sooner runtime than studying primarily based algorithms like SEAL and P-GNN, whereas nonetheless producing extremely comparable (in some circumstances even higher) accuracy.

HashGNN is applied within the Neo4j GDS (Graph Information Science Library) and might due to this fact be used out-of-the-box in your Neo4j graph. Within the subsequent put up, I’ll go into sensible particulars about the best way to use it, and what to bear in mind.

Thanks for stopping by and see you subsequent time. 🚀