Linguae-ML Seminar

Machine Learning on Temporal Graphs

From Foundations to Emerging Frontiers

A Bit About Myself

🎓 Bachelor's and Master's in Computer Science at Imperial College and Cambridge (2015-2019)

🤖 PhD on Machine Learning on Graphs between Twitter and Imperial College (2019-2023)

🧬 ML Researcher on generative models for structural biology and drug discovery at Vant AI (2024-2025)

🐬 Postdoc on ML for decoding non-human communication at Sapienza

Why Should We Care About ML on Graphs?

Networks are everywhere

And graphs are a great way to model them

Functional Networks

Interaction Networks

Social Networks

Molecules

Image Credit: wolfram.com · gatton.uky.edu · Papo et al., Frontiers in Human Neuroscience 2014 · Madhavicmu / Wikimedia Commons CC-BY-SA-4.0

Syntax Trees

Cognitive Maps

What model should we use for graphs?

Modality

Data

Architecture

🖼️ Images

[Example image]

📝 Text

The cat sat on the mat

🕸️ Graphs

(Static) Graphs and Graph Tasks

Graphs: nodes, edges, and features

G = (V, E)

V = nodes

E = edges

A ∈ ℝ^{n×n}

A = adjacency matrix

X ∈ ℝ^{n×d}

X = node features
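
As a minimal sketch of the notation above, the snippet below builds A and X for a toy graph with NumPy. The edge list, the sizes n = 4 and d = 3, and the random feature values are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy graph: 4 nodes, undirected edges given as index pairs.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n, d = 4, 3

# Adjacency matrix A in R^{n x n}: A[i, j] = 1 if nodes i and j share an edge.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = 1.0
    A[j, i] = 1.0  # symmetric because the toy graph is undirected

# Node feature matrix X in R^{n x d}: one d-dimensional row per node.
rng = np.random.default_rng(0)
X = rng.random((n, d))

# Row sums of A give the number of neighbours of each node.
degrees = A.sum(axis=1)
```

A dense matrix is convenient for small examples; large graphs are usually stored as sparse edge lists instead.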

[Figure: example node feature values for X]

Tasks on Graphs

Graph classification

Predict a label for the whole graph.

Example: molecular property prediction, fraud detection, document classification.

Node classification

Predict labels for individual nodes using graph context.

Example: user attributes, protein function, paper topic prediction.

Link prediction

Score missing or future edges between pairs of nodes.

Example: recommendations, knowledge graph completion, social tie prediction.

(Static) Graph Neural Networks

Graph Neural Networks (GNNs)

Convolutional GNN

$\mathbf{H}^{(k)} = \sigma\!\left(\tilde{\mathbf{A}} \mathbf{H}^{(k-1)} \mathbf{W}^{(k)}\right)$

$\mathbf{H}^{(0)} = \mathbf{X}$
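
The convolutional update can be sketched in a few lines of NumPy. The slide does not define the normalised adjacency, so the symmetric normalisation with self-loops used here is an assumption, as are the ReLU non-linearity and the random, untrained weights.

```python
import numpy as np

def normalize_adjacency(A):
    """Assumed normalisation: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])      # add self-loops
    deg = A_hat.sum(axis=1)             # degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_tilde, H, W):
    """One layer: H_k = sigma(A_tilde @ H_{k-1} @ W_k), with sigma = ReLU."""
    return np.maximum(A_tilde @ H @ W, 0.0)

# Hypothetical 3-node path graph with 4 input features per node.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))     # H^(0) = X
W1 = rng.normal(size=(4, 8))    # untrained weights, illustration only
W2 = rng.normal(size=(8, 2))

A_tilde = normalize_adjacency(A)
H1 = gcn_layer(A_tilde, X, W1)
H2 = gcn_layer(A_tilde, H1, W2)  # two layers: each node sees its 2-hop neighbourhood
```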

Message-Passing GNN

$\mathbf{m}_i^{(k)} = \text{AGG}^{(k)}\!\left(\{\{\mathbf{h}_j^{(k-1)} : (i,j) \in E\}\}\right)$

$\mathbf{h}_i^{(k)} = \text{COM}^{(k)}\!\left(\mathbf{h}_i^{(k-1)}, \mathbf{m}_i^{(k)}\right)$

$\mathbf{h}_i^{(0)} = \mathbf{x}_i$
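
A minimal sketch of one message-passing layer follows. AGG and COM are left abstract on the slide; here AGG is assumed to be a sum over neighbours and COM a ReLU of two linear maps, and the weight names `W_self` and `W_msg` are hypothetical.

```python
import numpy as np

def message_passing_layer(H, edges, W_self, W_msg):
    """AGG = sum over neighbours, COM = ReLU(h_i W_self + m_i W_msg); both are assumptions."""
    M = np.zeros_like(H)
    for i, j in edges:          # (i, j) in E: node i aggregates from node j
        M[i] += H[j]
    return np.maximum(H @ W_self + M @ W_msg, 0.0)

# Path graph 0 - 1 - 2, stored as directed pairs in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
H0 = np.eye(3)                  # one-hot features make the result easy to read
W = np.eye(3)                   # identity weights, for illustration only

H1 = message_passing_layer(H0, edges, W, W)
```

With one-hot inputs and identity weights, row i of `H1` marks node i together with its direct neighbours, which makes the one-hop receptive field visible.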

Transformers are GNNs

On the fully connected graph

Graph Transformers

Transformer-style attention, but with graph structure injected

Core idea

Self-attention lets every node aggregate from all other nodes
Graph information enters through special positional encodings

$\mathrm{Attn}(i,j) \leftarrow q_i^\top k_j + b_{\mathrm{graph}}(i,j)$
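
The biased attention score can be sketched as below. The slide does not specify the bias term, so the choice made here (zero for self and adjacent pairs, a large negative value otherwise) is a hypothetical one; it restricts attention to graph neighbours. The score follows the slide's formula as written, without any additional scaling.

```python
import numpy as np

def graph_biased_attention(Q, K, V, B):
    """Softmax over scores q_i^T k_j + b_graph(i, j), applied row by row."""
    scores = Q @ K.T + B
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

# Hypothetical 3-node path graph 0 - 1 - 2.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

# Assumed bias: 0 for self and adjacent pairs, large negative otherwise.
B = np.where(A + np.eye(3) > 0, 0.0, -1e9)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, weights = graph_biased_attention(Q, K, V, B)
```

Setting B to all zeros recovers plain self-attention over the fully connected graph.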

Dynamic Graphs

Some Examples of Dynamic Graphs

Graphs changing over time

Social Networks

Interaction Networks

From Static to Dynamic Graphs

$G=(V,E,X)$

Static Graph

• No notion of time
• Examples: molecules, syntax trees

$G_t=(V,E,X_t)$

Spatio-Temporal Graph

• Fixed topology; changing features
• Regular time intervals
• Examples: traffic, weather sensors

$G(t)=\{x_{t_1},x_{t_2},\ldots\},\quad t_1\leq t_2\leq \ldots$

Continuous-Time DG (CTDG)

• Most general formulation
• Sequence of timestamped events
• Examples: social, financial

$G_t=(V_t,E_t,X_t)$

Discrete-Time DG (DTDG)

• Changing topology and features
• Regular time intervals → sequence of snapshots
• Examples: weekly trade networks

Less General

More General
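
To make the relationship between the continuous-time and discrete-time views concrete, the sketch below bins a stream of timestamped edge events into fixed-width snapshots. The event tuples `(u, v, t)`, the function name, and the interval are hypothetical.

```python
from collections import defaultdict

def events_to_snapshots(events, interval):
    """Group (u, v, t) events into snapshots of fixed width `interval`."""
    snapshots = defaultdict(set)
    for u, v, t in events:
        snapshots[int(t // interval)].add((u, v))  # snapshot index of time t
    return dict(sorted(snapshots.items()))

# Hypothetical event stream, sorted by timestamp.
events = [(0, 1, 0.5), (1, 2, 3.2), (0, 2, 3.9), (0, 1, 7.1)]
snaps = events_to_snapshots(events, interval=3.0)
```

Within a snapshot the exact timestamps and the order of events are discarded, which is one way to see why the continuous-time formulation is the more general of the two.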

Why Learning on Dynamic Graphs is Different

Temporal graph models must exploit the event history without collapsing it into one static snapshot

What changes

Events arrive in order: creation, deletion, update, interaction
The model should use both what happened and when it happened
New tasks ask what happens next, or when it happens

Why static GNNs are not enough

Event history G(t ≤ T) → latest snapshot only

Information loss: the last snapshot hides the evolution path
Inefficiency: every new event can force repeated computation
No timing prediction: static GNNs do not support predicting when something will happen

Models

Temporal Graph Model

Memory-based Models

Process events in chronological order with an RNN that keeps a separate hidden state per node
The hidden state (memory) is a compressed representation of all past interactions of a node
The memory is used directly as the temporal node embedding

Update time

Edge (u,v) builds a message from partner state, features & Δt; only the two touched nodes update their memory.

Predict time

Memories serve directly as node embeddings, fed into decoder with no extra computation at query time.
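
The update and predict steps described above can be sketched as follows. This is an illustration under stated assumptions rather than any specific published model: the class name `NodeMemory` is hypothetical, a vanilla tanh RNN cell stands in for whatever recurrent unit a real model would use, and the weights are random and untrained.

```python
import numpy as np

class NodeMemory:
    """One memory vector per node, updated only when the node takes part in an event."""

    def __init__(self, num_nodes, mem_dim, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.memory = np.zeros((num_nodes, mem_dim))
        self.last_update = np.zeros(num_nodes)
        # Message = [partner memory, edge features, time since last update].
        self.W_in = rng.normal(scale=0.5, size=(mem_dim + feat_dim + 1, mem_dim))
        self.W_h = rng.normal(scale=0.5, size=(mem_dim, mem_dim))

    def update(self, u, v, t, feat):
        """Update time: only the two touched nodes change."""
        new_state = {}
        for node, partner in ((u, v), (v, u)):
            dt = t - self.last_update[node]
            msg = np.concatenate([self.memory[partner], feat, [dt]])
            # Both messages are built from the memories before either is overwritten.
            new_state[node] = np.tanh(msg @ self.W_in + self.memory[node] @ self.W_h)
        for node, state in new_state.items():
            self.memory[node] = state
            self.last_update[node] = t

    def embed(self, node):
        """Predict time: the memory is returned as the embedding, with no extra computation."""
        return self.memory[node]

mem = NodeMemory(num_nodes=3, mem_dim=4, feat_dim=2)
mem.update(0, 1, t=1.0, feat=np.array([1.0, 0.0]))
mem.update(0, 1, t=2.5, feat=np.array([0.0, 1.0]))
z0 = mem.embed(0)
```

Node 2 never takes part in an event here, so its memory stays at its initial value no matter how much time passes.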

Memory-based Models: Pros and Cons

A strong online model, but not yet a full temporal graph encoder

Pros

Strong sequentiality inductive bias
Cheap online updates after each new event

Cons

Inactive nodes can become stale
Graph context is mostly local or indirect
Forced to process previous edges in sequential order

Graph-based Models

GNN on the graph of previous interactions, with timestamps as edge features

Update time

Edge (u,v,t) is simply appended to G(t) — no computation performed at event time.

Predict time

GNN runs on full G(t) to produce node embeddings; must re-run for every new query — expensive at inference.
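
A minimal sketch of this split between update time and predict time is shown below. The class `TemporalGraphStore`, the one-layer mean aggregation, and the use of the elapsed time as a linear edge feature are all assumptions made for illustration.

```python
import numpy as np

class TemporalGraphStore:
    """Stores timestamped edges; all computation is deferred to query time."""

    def __init__(self, X):
        self.X = X            # static node features, one row per node
        self.events = []      # list of (u, v, t)

    def add_event(self, u, v, t):
        self.events.append((u, v, t))   # update time: append only

    def embed(self, node, t_query, W_self, W_nbr, w_time):
        """One-layer mean aggregation over neighbours seen before t_query."""
        msgs = []
        for u, v, t in self.events:
            if t >= t_query:
                continue                 # never look into the future
            if u == node:
                nbr = v
            elif v == node:
                nbr = u
            else:
                continue
            # The elapsed time enters as an edge feature of the message.
            msgs.append(self.X[nbr] @ W_nbr + (t_query - t) * w_time)
        agg = np.mean(msgs, axis=0) if msgs else np.zeros(W_nbr.shape[1])
        return np.maximum(self.X[node] @ W_self + agg, 0.0)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.eye(2)                            # identity weights, illustration only
w_time = np.array([0.1, 0.1])

store = TemporalGraphStore(X)
store.add_event(0, 1, 1.0)
store.add_event(0, 2, 2.0)
z_early = store.embed(0, 1.5, W, W, w_time)   # sees only the first event
z_late = store.embed(0, 3.0, W, W, w_time)    # sees both events
```

The scan over all stored events on every query is what makes this family of models expensive at inference.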

Graph-based Models: Pros and Cons

Explicit graph structure mitigates staleness, but the GNN must re-run on every query

Pros

No need for sequential processing in training
Using the graph explicitly → mitigates staleness problem

Cons

Can only handle edge addition events
Need to re-run GNN after each new event → inefficient at inference

TGN: Temporal Graph Networks

Our work: a modular framework that combines memory-based and graph-based temporal learning

TGN contribution

General framework combining the best of memory-based and graph-based approaches
Updates node memories from events, then uses a GNN over the interaction graph to produce embeddings
Generalizes previous memory-based models and graph-based models in one notation
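
The combination of the two ideas can be sketched as follows. This is a heavily simplified illustration and not the TGN implementation: the class `TinyTGN` is hypothetical, a vanilla tanh cell replaces the memory updater, a mean over the most recent neighbours' memories replaces the GNN, and the weights are random and untrained.

```python
import numpy as np
from collections import defaultdict

class TinyTGN:
    """Memory update per event, plus one-hop aggregation over neighbours' memories."""

    def __init__(self, num_nodes, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.memory = np.zeros((num_nodes, dim))
        self.last_update = np.zeros(num_nodes)
        self.neighbours = defaultdict(list)        # node -> [(neighbour, t), ...]
        self.W_msg = rng.normal(scale=0.5, size=(dim + 1, dim))
        self.W_mem = rng.normal(scale=0.5, size=(dim, dim))

    def observe(self, u, v, t):
        """Memory-based part: update the memories of the two touched nodes."""
        updates = {}
        for node, partner in ((u, v), (v, u)):
            dt = t - self.last_update[node]
            msg = np.concatenate([self.memory[partner], [dt]])
            updates[node] = np.tanh(msg @ self.W_msg + self.memory[node] @ self.W_mem)
        for node, state in updates.items():
            self.memory[node] = state
            self.last_update[node] = t
        self.neighbours[u].append((v, t))
        self.neighbours[v].append((u, t))

    def embed(self, node, k=10):
        """Graph-based part: combine own memory with recent neighbours' memories."""
        recent = [n for n, _ in self.neighbours[node][-k:]]
        if not recent:
            return self.memory[node]
        return self.memory[node] + self.memory[recent].mean(axis=0)

model = TinyTGN(num_nodes=4, dim=3)
model.observe(0, 1, 1.0)
model.observe(1, 2, 2.0)
z0 = model.embed(0)   # node 0 was last updated at t = 1, but sees node 1's newer memory
```

Node 0's own memory is stale after the second event, yet its embedding still reflects that event through node 1, which is the staleness mitigation that the graph-based component provides.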

TGN: State-of-the-Art in 2020

High accuracy at a much lower per-epoch cost than previous methods

TGN: Still a Strong Baseline Today

Years of follow-up work, yet TGN remains hard to beat on standard benchmarks

Benchmarks

Temporal Graph Benchmark

Diverse, large datasets and unified evaluation

Key Findings

Previous datasets were too easy and saturated
Exposed that simple historical baselines can be surprisingly strong
Large new datasets make scalability part of the benchmark

9 datasets · 5 domains · up to 72M edges
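
To illustrate how simple a historical baseline can be, the sketch below implements a pure memorisation heuristic in the spirit of EdgeBank: it stores every edge seen so far and predicts a future edge only if the same pair has already interacted. This is the unlimited-memory variant as commonly described, written from scratch, and it is not the benchmark's reference implementation.

```python
class EdgeBank:
    """Memorise every observed edge; score 1.0 for seen pairs, 0.0 otherwise."""

    def __init__(self):
        self.seen = set()

    def update(self, u, v):
        self.seen.add((u, v))

    def score(self, u, v):
        return 1.0 if (u, v) in self.seen else 0.0

# Hypothetical training stream of (source, destination) pairs.
bank = EdgeBank()
for u, v in [(0, 1), (1, 2), (0, 1)]:
    bank.update(u, v)

repeat_score = bank.score(0, 1)   # recurring edge
novel_score = bank.score(2, 0)    # pair never seen before
```

A baseline like this has no parameters and no training, so it scores well whenever edges recur often and negatives are sampled at random.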

Temporal Graph Benchmark 2.0

TGB but for knowledge graphs

Key findings

Edge and relation type information is crucial for strong performance
Simple heuristic baselines remain competitive with more complex methods
Many methods fail to run on the largest datasets, making scalability a central result

8 datasets · 5 domains · up to 53M edges

Unifying Models for Discrete and Continuous Time Graphs

UTG: Unifying Snapshot & Event-Based Models

Huang, Poursafaei, Rabbany, Rabusseau, Rossi, LoG 2024

The problem

Snapshot (DTDG) and event-based (CTDG) models developed in isolation: limited cross-comparison and no unified evaluation.

Contribution

Input mapper: convert CTDG to snapshots and DTDG to events, so any model can run on any data
UTG training: use streaming training to make snapshot-based models operate on event streams
Output mapper: align predictions to either continuous-time or discrete-time tasks
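
One direction of the input mapper, from snapshots to an event stream, can be sketched as follows. How UTG assigns timestamps is not stated on the slide; stamping every edge with the start time of its snapshot is an assumption, and the function name is hypothetical.

```python
def snapshots_to_events(snapshots, interval):
    """Turn {snapshot_index: set of (u, v)} into a time-ordered list of (u, v, t)."""
    events = []
    for k in sorted(snapshots):
        t = k * interval                      # assumed: stamp with snapshot start time
        for u, v in sorted(snapshots[k]):     # deterministic order within a snapshot
            events.append((u, v, t))
    return events

# Hypothetical discrete-time graph with an empty snapshot at index 1.
snapshots = {0: {(0, 1)}, 2: {(1, 2), (0, 2)}}
events = snapshots_to_events(snapshots, interval=3.0)
```

All edges of a snapshot share one timestamp, so an event-based model run on this stream cannot recover any ordering inside a snapshot.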

UTG: Key Findings

Speed

Snapshot-based models are ≥ 10× faster at inference than most event-based models.

Performance

With UTG training, snapshot-based models match TGN & GraphMixer even on event-based (CTDG) datasets.

Insight

NAT & DyGFormer's edge comes from joint neighbourhood features, not from the event-based format: these can be added to snapshot models too.

Next Frontiers

Foundation Models for TG

Existing TG models are trained and tested on the same graph.

Can we train a single TG model that works on unseen graphs, ideally from different domains?

MiNT: Multi-Network Transfer Benchmark for Temporal Graph Learning

New dataset: 84 distinct ERC-20 token transaction networks
New framework to train existing TG models on multiple graphs simultaneously
Scaling law for TGs: model performance improves as it is trained on more networks

Can we use LMs directly on temporal graphs?

TGTalker

Translate temporal graph structure into natural language and feed it to a pre-trained LLM — no fine-tuning required.

Four prompt components

Background set — recent edges as context
Example set — 5-shot Q&A pairs
Query set — target edge to predict
Temporal neighbors — 1-hop recent neighbours
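
The four components can be assembled into a single text prompt along the following lines. The wording of the prompt, the function name `build_prompt`, and the tuple formats are hypothetical; the slide only names the components, not their exact serialisation.

```python
def build_prompt(background, examples, query, neighbours):
    """Assemble a link-prediction prompt from the four components (format is assumed)."""
    lines = ["Background edges (source, destination, time):"]
    lines += [f"({u}, {v}, {t})" for u, v, t in background]

    lines.append("Examples:")
    for (src, t), answer in examples:
        lines.append(f"Q: Which node will {src} interact with at time {t}? A: {answer}")

    src, t_query = query
    lines.append(f"Recent neighbours of {src}: {', '.join(map(str, neighbours))}")
    lines.append(f"Q: Which node will {src} interact with at time {t_query}? A:")
    return "\n".join(lines)

prompt = build_prompt(
    background=[(0, 1, 10), (2, 3, 12)],   # recent edges as context
    examples=[((0, 10), 1)],               # one Q&A pair here; the slide uses five
    query=(0, 15),                         # source node and query time
    neighbours=[1],                        # 1-hop recent neighbours of the source
)
```

The resulting string would then be sent to a pre-trained LLM, whose completion is parsed as the predicted destination node.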

TGTalker: Results & Explainability

Results

Competitive without fine-tuning
Consistently outperforms TGN and HTGN

10 explanation categories discovered

Most Recent Interaction ≈ EdgeBank heuristic
Most Frequent Destination ≈ PopTrack heuristic
Novel patterns: sequence logic, analogy-based inference

Resources for TGL

Blog posts, talks, surveys, libraries & datasets

📝 Blog Posts

Temporal Graph Learning in 2023

Temporal Graph Learning in 2024

🎙️ Talks & Reading Groups

Andy Huang Lecture Series

Temporal Graph Reading Group

My Current Directions

🔊 Communicating Sound Through Natural Language

A sender agent describes a sound; a receiver agent decodes the description back to audio

Original audio

→

Transmitted sentence

"At entry the envelope carries mid-power, punchy, and extreme-oscillation; through sustain and release, the trace returns swift-onset and aggressive..."

→

Reconstructed audio

Comparison

Waveforms

🐬 OpenWhistle Dataset

Very rich and unique dolphin dataset collected over 5 years
New benchmark for whistle detection and classification

Dataset	# Whistles	Vocalization hours	Time span (years)	Stable pod (# individuals)	Setting
OpenWhistle Pretraining	~180,000*	114.3	5.0	(5)	Semi-nat.
OpenWhistle Expert subset	8,354	1.9	0.42	(5)	Semi-nat.
DOLPHINFREE	4,600	7.3	2.0		Wild
Di Nardo et al., 2025	3,111	0.6	0.003	(7)	Captive
Watkins MMSD	566	N/R	70+		Wild
Korkmaz et al., 2023	~29,000*	6.8	0.07		Semi-nat.
Sicily Strait PAM	14,048	N/R	1.2		Wild
DCLDE 2011	6,011	0.7	4.0		Wild
SDWD	N/R	N/R	43+	(293)	Wild (C&R)

* Estimated from total vocalization duration and mean whistle duration.
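
The estimate described in the footnote amounts to dividing the total vocalization time by the mean whistle duration. A minimal sketch, assuming whistles do not overlap in time; the function name and the input values below are hypothetical and are not taken from the dataset.

```python
def estimate_whistle_count(total_vocalization_hours, mean_whistle_seconds):
    """Estimated count = total vocalization time / mean duration of one whistle."""
    return total_vocalization_hours * 3600.0 / mean_whistle_seconds

# Hypothetical inputs: 1 hour of vocalization, 2-second mean whistle.
n_est = estimate_whistle_count(1.0, 2.0)
```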

Thank You

Questions?

Method	Human readability	LLM-native transport	Semantic editing	Acoustic interpretability	Training-free	Generative decoding	Bandwidth efficiency
Lossless codec (FLAC, WAV)
Handcrafted descriptors (MFCC, spectral centroid)
Neural codec (EnCodec, SoundStream)
Audio-language tokenizers (AudioLM, TASTE)
Unconstrained text caption
LAC (this paper)