Outlook
Machine learning models have become an integral part of today's business world. From spell correction to fraud detection to language models, systems of various types, sizes and requirements are being developed. These systems are usually based on data that is either sequential or made up of entirely unrelated records. However, in applications where relationships between different entities need to be represented, graphs are often used as the basic data structure. Examples include social networks, customer interactions, transactions and bank transfers, but also molecules, railroad networks or three-dimensional shapes. To incorporate these relationships into machine learning, so-called Graph Neural Networks (GNNs) were developed.
This blog post gives a brief insight into the options that AWS's GraphStorm package offers for quickly trying out different GNN architectures and testing them against different graph structures.
Graphs
Just like real networks and relationships, graphs are complex structures that can be modeled in a wide variety of ways. To illustrate this, a few graph-related terms are briefly explained here:
- Nodes correspond to entities in a network, such as IBAN numbers, people, companies, credit cards or even atoms or stations.
- Edges correspond to the relationships between nodes.
- In weighted graphs, properties such as distances are assigned to the edges.
- In directed graphs, the relationship between at least one pair of nodes runs in only one direction.
- Heterogeneous graphs contain different types of nodes. For example, customers, accounts, goods and companies can each be represented as separate nodes and connected whenever they occur together in a transaction. The result is a transaction network that can be searched for fraudsters, for example, or used to optimize logistics.
- Homogeneous graphs contain only one type of node.
If you want to build a graph from an existing data set, the number of possible design decisions (Which nodes are there? Which nodes are connected? When are nodes connected? ...) means that it is not always obvious which structure is the best and most informative for your use case. It is therefore often necessary to create several graph structures and test them with different GNN algorithms.
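To make these design decisions concrete, the following minimal sketch derives two different graph structures from the same records using plain Python containers. The transaction records, field names and relation names ("owns", "pays") are invented for illustration and are not taken from GraphStorm or any real data set.

```python
from collections import defaultdict

# Hypothetical transaction records; the field names are illustrative only.
transactions = [
    {"customer": "c1", "account": "a1", "company": "m1", "amount": 120.0},
    {"customer": "c2", "account": "a2", "company": "m1", "amount": 35.5},
    {"customer": "c1", "account": "a3", "company": "m2", "amount": 990.0},
]


def build_heterogeneous(records):
    """Build typed nodes and typed, directed, weighted edges."""
    nodes = defaultdict(set)   # node type -> set of node ids
    edges = defaultdict(list)  # (src type, relation, dst type) -> [(src, dst, weight)]
    for r in records:
        nodes["customer"].add(r["customer"])
        nodes["account"].add(r["account"])
        nodes["company"].add(r["company"])
        # Ownership carries no natural weight, so a constant 1.0 is used.
        edges[("customer", "owns", "account")].append(
            (r["customer"], r["account"], 1.0)
        )
        # The payment edge is weighted by the transaction amount.
        edges[("account", "pays", "company")].append(
            (r["account"], r["company"], r["amount"])
        )
    return dict(nodes), dict(edges)


def build_homogeneous(records):
    """Keep only customer nodes; link two customers who paid the same company."""
    by_company = defaultdict(set)
    for r in records:
        by_company[r["company"]].add(r["customer"])
    nodes = {r["customer"] for r in records}
    edges = set()
    for customers in by_company.values():
        ordered = sorted(customers)
        for i, u in enumerate(ordered):
            for v in ordered[i + 1:]:
                edges.add((u, v))  # undirected: each pair is stored once
    return nodes, edges


hetero_nodes, hetero_edges = build_heterogeneous(transactions)
homo_nodes, homo_edges = build_homogeneous(transactions)
```

Both graphs describe the same data, but they expose different information to a GNN: the heterogeneous variant keeps amounts and entity types, while the homogeneous variant only encodes which customers share a company.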
GraphStorm Features
GraphStorm is an open-source Python library for rapidly deploying graph machine learning algorithms, and it is actively being developed by AWS. It provides various pipelines for flexible training and inference, either on a local machine or on large clusters of multiple machines, such as AWS EC2 instances. Integration with SageMaker is also possible. The GraphStorm tools include Python scripts that transform the node and edge data into the required formats and, at the same time, partition the graph for use on multiple machines. This makes it possible to process even graphs with billions of nodes and edges.
GraphStorm provides several graph convolution and attention algorithms, which are selected in a configuration file in YAML format. Among other things, the number of layers, optional upstream or downstream multi-layer perceptrons, normalizations and early stopping can be configured there. Any setting in the configuration file can also be overridden using CLI flags. If you use only the provided algorithms, the trained model can also be used with AWS's Neptune ML for real-time predictions in combination with a graph database.
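The general pattern of file-based defaults that can be overridden on the command line can be sketched as follows. This is only an illustration of the idea, not GraphStorm's implementation: the dictionary stands in for a parsed YAML file, and all option names (`encoder`, `num_layers`, `hidden_size`, `early_stop`) are hypothetical placeholders rather than GraphStorm's actual keys or flags.

```python
import copy

# Stand-in for a parsed YAML configuration file. All keys are hypothetical
# placeholders, not GraphStorm's actual option names.
base_config = {
    "encoder": "rgcn",
    "num_layers": 2,
    "hidden_size": 128,
    "early_stop": True,
}


def apply_cli_overrides(config, argv):
    """Return a copy of config in which `--key value` pairs from argv win."""
    if len(argv) % 2 != 0:
        raise ValueError("expected --key value pairs")
    merged = copy.deepcopy(config)
    for flag, raw in zip(argv[::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected a flag, got {flag!r}")
        key = flag[2:].replace("-", "_")
        if key not in merged:
            raise KeyError(f"unknown option: {key}")
        default = merged[key]
        # Cast to the type of the default value. bool is checked first
        # because bool("false") would otherwise evaluate to True.
        if isinstance(default, bool):
            merged[key] = raw.lower() in ("1", "true", "yes")
        else:
            merged[key] = type(default)(raw)
    return merged


# Example: override two settings without touching the file-based defaults.
config = apply_cli_overrides(
    base_config, ["--num-layers", "3", "--early-stop", "false"]
)
```

Keeping the defaults in a file and passing only the differences as flags means a single base configuration can serve many experiments.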
How does GraphStorm support prototyping?
Implementing the neural network architecture and the associated training routines usually takes a lot of development time. In prototyping especially, a lot of effort goes either into writing adaptable code up front or into repeatedly adapting the code later. This is particularly true if training is to run on multiple GPUs spread across different servers. Since GraphStorm has already done this work and exposes it through a configuration file or CLI flags, the main focus can be on constructing the graph itself, so that a suitable graph structure can be found quickly.
What should be considered when using GraphStorm?
Since GraphStorm is still under active development, changes that are incompatible with existing code can occur at any time. In addition, because new features are added rapidly, the documentation is not always up to date. The architecture of the neural network is also not yet fully flexible. For example, it is not possible to chain different GNN algorithms one after the other or to choose a different size for each layer.
Conclusion
The AWS GraphStorm library for Python enables data scientists to efficiently train and test various neural network and graph architectures and to use them for inference. Because the architecture is controlled via configuration files or CLI flags, a wide variety of architectures can even be tested fully automatically using a hyperparameter search.
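Such an automated search could be driven by a small script like the sketch below. It is a generic grid search, not part of GraphStorm: the option names in the search space are hypothetical, and `run_trial` is a placeholder for whatever launches a training run (for example with the generated flags) and returns a validation score.

```python
from itertools import product


def grid(search_space):
    """Yield one override dict per combination in the search space."""
    keys = list(search_space)
    for values in product(*(search_space[k] for k in keys)):
        yield dict(zip(keys, values))


def to_cli_flags(overrides):
    """Render overrides as `--key value` tokens for a training command."""
    flags = []
    for key, value in overrides.items():
        flags += [f"--{key.replace('_', '-')}", str(value)]
    return flags


def search(search_space, run_trial):
    """Run every combination and return the best (score, overrides) pair."""
    best = None
    for overrides in grid(search_space):
        score = run_trial(overrides)
        if best is None or score > best[0]:
            best = (score, overrides)
    return best


# Hypothetical search space; the option names are placeholders.
space = {"num_layers": [1, 2, 3], "hidden_size": [64, 128]}


def fake_trial(overrides):
    """Dummy scoring function so the sketch runs without any training."""
    return overrides["num_layers"] * 10 - abs(overrides["hidden_size"] - 128) / 64


best_score, best_overrides = search(space, fake_trial)
```

In a real setup, `fake_trial` would be replaced by a function that starts a training job with `to_cli_flags(overrides)` appended to the command and reads back the resulting metric.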