Graph Neural Networks extend the learning bias imposed by Convolutional Neural Networks and Recurrent Neural Networks by generalising the concept of “proximity”, allowing us to have arbitrarily complex connections to handle not only traffic ahead or behind us, but also along adjacent and intersecting roads. In a Graph Neural Network, adjacent nodes pass messages to each other. By keeping this structure, we impose a locality bias: nodes find it easier to rely on adjacent nodes, since reaching them requires only a single message-passing step. These mechanisms allow Graph Neural Networks to capitalise on the connectivity structure of the road network more effectively.
Our experiments have demonstrated gains in predictive power from expanding to include adjacent roads that are not part of the main road. For example, think of how a jam on a side street can spill over to affect traffic on a larger road. By spanning multiple intersections, the model gains the ability to natively predict delays at turns, delays due to merging, and the overall traversal time in stop-and-go traffic. This ability of Graph Neural Networks to generalise over combinatorial spaces is what grants our modeling technique its power. Each Supersegment can vary in length and complexity, from simple two-segment routes to longer routes containing hundreds of nodes, yet all of them can be processed by the same Graph Neural Network model.
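The locality bias described above can be illustrated with a minimal sketch of one message-passing step. This is not the production model: the fixed averaging rule stands in for learned message and update functions, and the toy road graph, node names, and feature values are all hypothetical.

```python
def message_passing_step(features, edges):
    """One round of message passing on scalar node features.

    Each node averages the messages from its neighbours and mixes the result
    with its own value. A real Graph Neural Network would use learned
    functions here; the fixed 50/50 mix is an illustrative assumption.
    """
    incoming = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(features[src])

    updated = {}
    for node, value in features.items():
        messages = incoming[node]
        neighbour_mean = sum(messages) / len(messages) if messages else value
        updated[node] = 0.5 * value + 0.5 * neighbour_mean
    return updated


# Hypothetical toy network: main road a - b - c, with side street s joining at b.
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"), ("s", "b"), ("b", "s")]
features = {"a": 0.0, "b": 0.0, "c": 0.0, "s": 1.0}  # 1.0 marks a jam on the side street

after_one = message_passing_step(features, edges)
after_two = message_passing_step(after_one, edges)
```

After one step only the segment adjacent to the side street (b) has picked up the congestion signal; segments a and c, two hops away, need a second step before it reaches them.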
From basic research to production-ready machine learning models
A big challenge for a production machine learning system that is often overlooked in the academic setting involves the large variability that can exist across multiple training runs of the same model. While small differences in quality can simply be discarded as poor initialisations in more academic settings, these small inconsistencies can have a large impact when added together across millions of users. As such, making our Graph Neural Network robust to this variability in training took center stage as we pushed the model into production. We discovered that Graph Neural Networks are particularly sensitive to changes in the training curriculum – the primary cause of this instability being the large variability in graph structures used during training. A single batch of graphs could contain anything from small two-node graphs to large graphs of 100+ nodes.
After much trial and error, however, we developed an approach to solve this problem by adapting a novel reinforcement learning technique for use in a supervised setting.
In training a machine learning system, the learning rate specifies how ‘plastic’ – or changeable to new information – the system is. Researchers often reduce the learning rate of their models over time, as there is a tradeoff between learning new things and forgetting important features already learned, not unlike the progression from childhood to adulthood. We initially made use of an exponentially decaying learning rate schedule to stabilise our parameters after a pre-defined period of training. We also explored and analysed model ensembling techniques, which have proven effective in previous work, to see if we could reduce model variance between training runs.
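As a rough illustration of an exponentially decaying schedule, the sketch below uses the common form lr = initial_lr * decay_rate^(step / decay_steps). The function name and every numeric value are assumptions chosen for the example; the source does not state the settings that were actually used.

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    """Learning rate after `step` updates under smooth exponential decay.

    The rate is multiplied by `decay_rate` once every `decay_steps` updates.
    """
    return initial_lr * decay_rate ** (step / decay_steps)


# Hypothetical settings: start at 0.1 and halve the rate every 1,000 steps.
schedule = [exponential_decay(0.1, 0.5, 1000, s) for s in (0, 1000, 2000, 3000)]
```

The schedule is fixed before training starts, so the model cannot slow down earlier or later than planned, whatever the data turn out to require.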
In the end, the most successful approach to this problem was using MetaGradients to dynamically adapt the learning rate during training – effectively letting the system learn its own optimal learning rate schedule. By automatically adapting the learning rate while training, our model not only achieved higher quality than before, it also learned to decrease the learning rate automatically. This led to more stable results, enabling us to use our novel architecture in production.
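To give a feel for the idea of learning the learning rate, here is a toy sketch that differentiates the loss after one update with respect to the learning rate and takes a gradient step on the learning rate itself. It is a deliberately simplified, one-step stand-in on a one-dimensional quadratic, not the MetaGradient method used in production; the loss, starting values, and meta step size are all assumptions.

```python
def loss(w):
    """Toy objective with its minimum at w = 0."""
    return 0.5 * w ** 2


def grad(w):
    """Derivative of the toy objective."""
    return w


def train_with_adaptive_lr(w, lr, meta_lr, steps):
    """Gradient descent in which the learning rate is itself updated.

    For w_new = w - lr * grad(w), the chain rule gives
    d loss(w_new) / d lr = grad(w_new) * (-grad(w)).
    """
    lr_history = [lr]
    for _ in range(steps):
        g = grad(w)
        w_new = w - lr * g                 # ordinary parameter update
        meta_grad = grad(w_new) * (-g)     # sensitivity of the new loss to lr
        lr = lr - meta_lr * meta_grad      # gradient step on the learning rate
        w = w_new
        lr_history.append(lr)
    return w, lr, lr_history


# Hypothetical run: the initial learning rate is deliberately too large.
final_w, final_lr, lr_history = train_with_adaptive_lr(w=5.0, lr=1.8, meta_lr=0.01, steps=50)
```

In this toy run each update overshoots the minimum, so the meta-gradient pushes the learning rate down while the parameter still converges, loosely mirroring the behaviour described above.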
Making models generalise through customised loss functions
While the ultimate goal of our modeling system is to reduce errors in travel estimates, we found that making use of a linear combination of multiple loss functions (weighted appropriately) greatly increased the ability of the model to generalise. Specifically, we formulated a multi-loss objective making use of a regularising factor on the model weights, L_2 and L_1 losses on the global traversal times, as well as individual Huber and negative log-likelihood (NLL) losses for each node in the graph. By combining these losses we were able to guide our model and avoid overfitting on the training dataset. While our measurements of quality in training did not change, improvements seen during training translated more directly to held-out test sets and to our end-to-end experiments.
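A weighted combination of this kind might look like the sketch below. Several details are assumptions made only for illustration: the global traversal time is taken to be the sum of per-node times, the NLL is computed under a Gaussian with a predicted variance, the regulariser is a squared-weight penalty, and the coefficient values are invented.

```python
import math


def huber(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    magnitude = abs(error)
    if magnitude <= delta:
        return 0.5 * error ** 2
    return delta * (magnitude - 0.5 * delta)


def gaussian_nll(target, mean, variance):
    """Negative log-likelihood of `target` under a Gaussian (an assumed form)."""
    return 0.5 * (math.log(2 * math.pi * variance) + (target - mean) ** 2 / variance)


def combined_loss(pred_means, pred_vars, targets, model_weights, coeffs):
    """Linear combination of global, per-node, and regularisation terms."""
    global_error = sum(pred_means) - sum(targets)  # assumes total time = sum over nodes
    terms = {
        "l2_global": global_error ** 2,
        "l1_global": abs(global_error),
        "huber_node": sum(huber(p - t) for p, t in zip(pred_means, targets)),
        "nll_node": sum(gaussian_nll(t, m, v)
                        for t, m, v in zip(targets, pred_means, pred_vars)),
        "weight_reg": sum(w ** 2 for w in model_weights),
    }
    return sum(coeffs[name] * value for name, value in terms.items())


# Hypothetical coefficients and per-node traversal times in seconds.
coeffs = {"l2_global": 0.01, "l1_global": 0.1, "huber_node": 1.0,
          "nll_node": 0.5, "weight_reg": 0.001}
targets = [30.0, 45.0, 20.0]
good = combined_loss([30.0, 45.0, 20.0], [4.0] * 3, targets, [0.1, -0.2], coeffs)
bad = combined_loss([40.0, 60.0, 35.0], [4.0] * 3, targets, [0.1, -0.2], coeffs)
```

The global terms penalise error in the total traversal time, while the per-node terms keep individual segment estimates sensible even when errors would otherwise cancel out in the sum.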
Currently we are exploring whether the MetaGradient technique can also be used to vary the composition of the multi-component loss function during training, using the reduction in travel estimate errors as a guiding metric. This work is inspired by the MetaGradient efforts that have found success in reinforcement learning, and early experiments show promising results.
Thanks to our close and fruitful collaboration with the Google Maps team, we were able to apply these newly developed techniques at scale. Together, we were able to overcome both research challenges and production and scalability problems. In the end, the final model and techniques led to a successful launch, improving the accuracy of ETAs on Google Maps and Google Maps Platform APIs around the world.
Working at Google scale with cutting-edge research presents a unique set of challenges. If you’re interested in applying cutting-edge techniques such as Graph Neural Networks to address real-world problems, learn more about the team working on these problems here.
In collaboration with: Marc Nunkesser, Seongjae Lee, Xueying Guo, Austin Derrow-Pinion, David Wong, Peter Battaglia, Todd Hester, Petar Veličković, Vishal Gupta, Ang Li, Zhongwen Xu, Geoff Hulten, Jeffrey Hightower, Luis C. Cobo, Praveen Srinivasan & Harish Chandran.
Figures by Paulo Estriga & Adam Cain.