Róbert Csordás
I am a postdoctoral researcher at the Stanford NLP group, supervised by
Prof. Christopher Manning and
Prof. Christopher Potts.
Previously I did my PhD at the Swiss AI lab IDSIA, working with
Prof. Jürgen Schmidhuber.
I work on compositionality and systematic generalization, with the goal of improving the reasoning capabilities,
trustworthiness, and data efficiency of neural models. My goal is to shift the preference of neural networks to
learn general, soft algorithms when possible, instead of relying on fuzzy memorization of seen patterns.
Instead of relying on symbolic systems, I want to bias the networks to learn rules implicitly, without explicit supervision,
with minimal hardcoded structure, and without reducing their expressivity. I am also interested in
mechanistic interpretability, which can provide a deep understanding of how the models work and the bottlenecks causing the networks to fail in composition and generalization.
I consider the lack of compositionality and systematic generation to be the main obstacle to a more generally
applicable artificial intelligence.
During the summer of 2022, I did an internship at DeepMind.
Before starting my PhD, I received a master's degree from Budapest University of Technology and Economics and worked as a research scientist at AImotive on developing self driving cars.
Email  / 
CV  / 
GitHub  / 
Google Scholar  / 
Twitter  / 
Thesis
|
|
|
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations
Róbert Csordás,
Christopher Potts,
Christopher D. Manning,
Atticus Geiger
arXiv:2408.10920
pdf /
code /
bib
We show that RNNs, under certain conditions, prefer to store sequences in a magnitude-based encoding that violates the Linear Representation Hypothesis (LRH). This counterexample strongly indicates that interpretability research should not be confined by the LRH.
|
|
MoEUT: Mixture-of-Experts Universal Transformers
Róbert Csordás,
Kazuki Irie,
Jürgen Schmidhuber,
Christopher Potts,
Christopher D. Manning
arXiv:2405.16039
pdf /
code /
training code /
checkpoints /
bib
We propose a novel Mixture-of-Experts Universal Transformer using σ-MoE, SwitchHead, layer grouping, and a novel peri-layernorm. Our method slightly outperforms standard Transformers on language modeling and zero-shot downstream tasks with less compute and memory requirements.
|
|
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Róbert Csordás,
Piotr Piękos,
Kazuki Irie,
Jürgen Schmidhuber
arXiv:2312.07987
pdf /
code /
bib
We propose a novel MoE attention, which can match the performance of parameter-matched dense models with a fraction of the compute and memory requirements. We also present the "SwitchAll" model, where each layer is an MoE.
|
|
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Róbert Csordás,
Kazuki Irie,
Jürgen Schmidhuber
EMNLP Findings 2023
pdf /
code /
bib
We present different approximation methods for two-layer feedforward networks in a unified framework. Based on this, we develop a better-performing MoE, which matches or even outperforms the parameter-equivalent dense models.
|
|
Topological Neural Discrete Representation Learning à la Kohonen
Kazuki Irie*,
Róbert Csordás*,
Jürgen Schmidhuber
International Conference on Artificial Neural Networks (ICANN), 2024
pdf /
code /
bib
We show that vector quantization is a special case of self-organizing maps (SOMs). Using the SOM formulation proposed by Kohonen in his 1982 paper improves converge speed, makes the training more robust, and results in a topologically organized representation space.
|
|
Randomized Positional Encodings Boost Length Generalization of Transformers
Anian Ruoss,
Gregoire Deletang,
Tim Genewein,
Jordi Grau-Moya,
Róbert Csordás,
Mehdi Bennani,
Shane Legg,
Joel Veness
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
pdf /
bib
We show that the transformer's poor length generalization is linked to the positional encodings being out-of-distribution. We introduce a novel positional encoding that samples a randomized ordered subset of sinusoidal positional encodings. We show the befit of this positional encoding on various algorithmic tasks.
|
|
CTL++: Evaluating Generalization on Never-Seen Compositional Patterns of Known Functions, and Compatibility of Neural Representations
Róbert Csordás,
Kazuki Irie,
Jürgen Schmidhuber
Empirical Methods in Natural Language Processing (EMNLP), 2022
pdf /
code /
poster /
bib
We developed a new dataset for testing systematicity based on CTL by partitioning the data based on functional groups. Using this, we were able to show that Transformers naturally learn multiple, incompatible representations of the same symbol. As a result, the network fails when the symbol is fed to a function that has not seen that specific representation before.
|
|
A Generalist Neural Algorithmic Learner
Borja Ibarz,
Vitaly Kurin,
George Papamakarios,
Kyriacos Nikiforou,
Mehdi Bennani,
Róbert Csordás,
Andrew Dudzik,
Matko Bošnjak,
Alex Vitvitskyi,
Yulia Rubanova,
Andreea Deac,
Beatrice Bevilacqua,
Yaroslav Ganin,
Charles Blundell,
Petar Veličković
Learning on Graphs (LoG), 2022
pdf /
code /
bib
We train multi-task generalist reasoning architecture on the CLRS algorithmic reasoning benchmark that shares a single, universal processor among all tasks. Furthermore, we introduce numerous improvements to the previous best architecture, achieving new SOTA even in the single-task case.
|
|
The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention
Kazuki Irie*,
Róbert Csordás*,
Jürgen Schmidhuber
International Conference on Machine Learning (ICML), 2022
pdf /
code /
bib
Linear layers in neural networks (NNs) trained by gradient descent can be expressed as a key-value memory system which stores all training datapoints and the initial weights, and
produces outputs using unnormalised dot attention over the entire training experience. While this has been technically known since the ’60s, no prior work has effectively studied the
operations of NNs in such a form. We conduct experiments on this dual formulation and study the potential of directly visualising how an NN makes use of training patterns at test time,
as well as its limitations.
|
|
A Modern Self-Referential Weight Matrix That Learns to Modify Itself
Kazuki Irie,
Imanol Schlag,
Róbert Csordás,
Jürgen Schmidhuber
International Conference on Machine Learning (ICML), 2022
pdf /
code /
bib
The weight matrix (WM) of a neural network (NN) is its program which remains fixed after training. The WM or program of a self-referential NN, however, can keep rapidly modifying all of itself during runtime. In principle, such NNs are capable of recursive self-improvement. Here we propose a scalable self-referential WM (SRWM) that uses self-generated training patterns, outer products and the delta update rule to modify itself.
|
|
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization
Róbert Csordás,
Kazuki Irie,
Jürgen Schmidhuber
International Conference on Learning Representations (ICLR), 2022
pdf /
code /
slides /
poster /
bib
We look at Transformers as a system for routing relevant information to the right node/operation at the right time in the grid represented by its column.
To facilitate learning useful control flow, we propose two modifications to the Transformer architecture: copy gate and geometric attention. The resulting
Neural Data Router (NDR) architecture achieves length generalization compositional table lookup task, as well as generalization across computational depth
on the simple arithmetic task and ListOps.
|
|
The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers
Róbert Csordás,
Kazuki Irie,
Jürgen Schmidhuber
Empirical Methods in Natural Language Processing (EMNLP), 2021
pdf /
code /
slides /
poster /
bib
We improve the systematic generalization of Transformers on SCAN (0 -> 100% with length cutoff=26), CFQ (66 -> 81% on output length split),
PCFG (50 -> 85% on productivity split, 72 -> 96% on systematicity split), COGS (35 -> 81%), and Mathematics dataset, by revisiting model configurations as basic
as scaling of embeddings, early stopping, relative positional embedding, and weight sharing. We also show that relative positional embeddings largely mitigate the EOS
decision problem. Importantly, differences between these models are typically invisible on the IID data split, which calls for proper generalization validation sets.
|
|
Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
Kazuki Irie,
Imanol Schlag,
Róbert Csordás,
Jürgen Schmidhuber
Conference on Neural Information Processing Systems (NeurIPS), 2021
pdf /
code /
bib
Inspired by the effectiveness of Fast Weight Programmers in the context of Linear Transformers, in this work we explore the recurrent Fast Weight Programmers (FWPs), which exhibit advantageous properties of both Transformers and RNNs.
|
|
Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks
Róbert Csordás,
Sjoerd van Steenkiste,
Jürgen Schmidhuber
International Conference on Learning Representations (ICLR), 2021
pdf /
code /
slides /
poster /
bib
This paper presents a novel method based on learning binary weight masks to identify individual weights and subnets responsible for specific functions. We contribute an extensive study of emerging modularity in NNs that covers several standard architectures and datasets using this powerful tool. We demonstrate how common NNs fail to reuse submodules and offer new insights into systematic generalization on language tasks.
|
|
Improving Differentiable Neural Computers Through Memory Masking, De-allocation, and Link Distribution Sharpness Control
Róbert Csordás,
Jürgen Schmidhuber
International Conference on Learning Representations (ICLR), 2019
NeurIPS Workshop on Relational Representation Learning, 2018
pdf /
code /
slides /
poster /
bib
We propose three improvements for the DNC architecture, which significantly improves its performance on algorithmic reasoning tasks. First, the lack of key-value separation makes the address distribution dependent also on the stored value. Second,
DNC leaves deallocated data in the memory, which results in aliasing. Third, the temporal linkage matrix quickly degrades the sharpness of the address distribution. Our proposed fixes improve the mean error rate on the bAbI question answering dataset by 43%.
|
|
Improving Baselines in the Wild
Kazuki Irie,
Imanol Schlag,
Róbert Csordás,
Jürgen Schmidhuber
NeurIPS DistShift Workshop, 2021
pdf /
code
We present our critical observations on the iWildCam and FMoW datasets of the recently released WILDS benchmark. We show that (1) Conducting separate cross-validation for each evaluation metric is crucial for both datasets, (2) A weak correlation between validation and test performance might make model development difficult for iWildCam, (3) Minor changes in the training of hyper-parameters improve the baseline by a relatively large margin, (4) There is a strong correlation between certain domains and certain target labels.
|
Talks
- In May 2024 I gave a talk in Apple about my work
- In May 2024 I gave a talk in Adobe on Mixture-of-Experts with focus on Universal Transformers
- In March 2024 I gave a talk in Nvidia about my work
- In November 2022 I gave a talk at Rycolab at ETH Zürich on the Neural Sequence Models Theory group on how ideas from compositionality improve systematic generalization.
- In June 2022 I gave a talk on the Neural Sequence Models Theory group on how ideas from compositionality improve systematic generalization.
- In June 2022 I gave a talk on the Stanford NLP Seminar on how ideas from compositionality improve systematic generalization.
- In April 2022 I gave a talk for Jacob Andreas' group on how ideas from compositionality improve systematic generalization.
|
|