Publications

* denotes equal contribution and joint lead authorship.

2025
  1. Generalisation and Safety Critical Evaluations at Sharp Minima: A Geometric Reappraisal.
    Israel Mason-Williams, Gabryel Mason-Williams, and Helen Yannakoudakis.

    In Workshop on High-dimensional Learning Dynamics (HiLD), 2025.

    The geometric flatness of neural network minima has long been associated with desirable generalisation properties. In this paper, we extensively explore the hypothesis that robust, calibrated and functionally similar models sit at flatter minima, in line with prevailing understandings of the relationship between flatness and generalisation. Contrary to common assertions in the literature, we find a relationship between increased sharpness and improved generalisation, calibration and robustness in neural networks across architectures when using Sharpness-Aware Minimisation, augmentation and weight decay as regulariser controls. Our findings suggest that the role of increased sharpness should be considered independently for individual models when reasoning about the geometric properties of neural networks. We show that sharpness can be positively related to generalisation and safety-relevant properties when compared against the flatter minima found without our regularisation controls. Understanding these properties calls for a rethinking of the role of sharpness in geometric landscapes. (A minimal sketch of the SAM update used as one of the controls appears after this list.)
  2. Data Free Metrics Are Not Reparameterisation Invariant Under the Critical and Robust Layer Phenomena.
    Gabryel Mason-Williams, Israel Mason-Williams, and Fredrik Dahlqvist.

    In Workshop on High-dimensional Learning Dynamics (HiLD), 2025.

    Data-free methods for analysing and understanding the layers of neural networks have offered many metrics for quantifying notions of "strong" versus "weak" layers, with the promise of increased interpretability. We examine how robust data-free metrics are under random control conditions of critical and robust layers. Contrary to the literature, we find counter-examples that challenge the efficacy of data-free methods. We show that data-free metrics are not reparameterisation invariant under these conditions and lose predictive capacity across RMSE, the Pearson correlation coefficient and Kendall's Tau. Thus, we argue that to understand neural networks fundamentally, we must rigorously analyse the interactions between data, weights, and resulting functions that contribute to their outputs, contrary to traditional Random Matrix Theory perspectives. (A toy demonstration of a function-preserving reparameterisation that changes data-free metrics appears after this list.)
  3. Decomposed Learning: An Avenue for Mitigating Grokking.
    Gabryel Mason-Williams and Israel Mason-Williams.

    In Workshop on Methods and Opportunities at Small Scale (MOSS), 2025.

    Grokking is a delayed transition from memorisation to generalisation in neural networks. It challenges efficient learning, particularly in structured tasks and small-data regimes. We explore grokking in modular arithmetic from the perspective of a training pathology. We use Singular Value Decomposition (SVD) to modify the weight matrices of neural networks, changing the representation of the weight matrix $W$ into the product of three matrices, $U$, $\Sigma$ and $V^T$. Through empirical evaluations on the modular addition task, we show that this representation significantly reduces the effect of grokking and, in some cases, eliminates it. (A minimal sketch of such a decomposed layer appears after this list.)
  4. Reproducibility: The New Frontier in AI Governance.
    Israel Mason-Williams and Gabryel Mason-Williams.

    In Workshop on Technical AI Governance at ICML, 2025.

    AI policymakers are responsible for delivering effective governance mechanisms that can provide safe, aligned and trustworthy AI development. However, the information environment offered to policymakers is characterised by an unnecessarily low signal-to-noise ratio, favouring regulatory capture and creating deep uncertainty and divides over which risks should be prioritised from a governance perspective. We posit that the current speed of publication in AI, combined with the lack of strong scientific standards via weak reproducibility protocols, effectively erodes the power of policymakers to enact meaningful policy and governance protocols. Our paper outlines how AI research could adopt stricter reproducibility guidelines to assist governance endeavours and improve consensus on the risk landscapes posed by AI. We evaluate the forthcoming reproducibility crisis within AI research through the lens of reproducibility crises in other scientific domains, and provide a commentary on how adopting reproducibility protocols such as preregistration, increased statistical power and the publication of negative results can enable effective AI governance. While we maintain that AI governance must be reactive due to AI's significant societal implications, we argue that policymakers and governments must consider reproducibility protocols as a core tool in the governance arsenal and demand higher standards for AI research.
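
A minimal sketch of the Sharpness-Aware Minimisation (SAM) update referenced in entry 1, in PyTorch. This is an illustrative reimplementation of the standard two-pass SAM recipe, not the authors' experimental code; the `sam_step` helper and the neighbourhood radius `rho` are assumptions for illustration.

```python
import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One SAM step: climb to the approximate worst-case weights within
    an L2 ball of radius rho, take the gradient there, then descend
    from the original weights."""
    # First pass: gradients at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))

    # Perturb the weights along the normalised ascent direction.
    eps = [rho * g / (grad_norm + 1e-12) for g in grads]
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)

    # Second pass: gradients at the perturbed weights.
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    # Restore the original weights and apply the update.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```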
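
To make the reparameterisation point in entry 2 concrete, the NumPy toy below (an illustration assumed here, not taken from the paper) rescales adjacent layers of a ReLU network so that the computed function is unchanged while a data-free quantity, such as a weight norm, changes arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 32))   # first layer
W2 = rng.normal(size=(10, 64))   # second layer
relu = lambda z: np.maximum(z, 0.0)
f = lambda x, A, B: B @ relu(A @ x)

# ReLU is positively homogeneous, so scaling one layer up and the
# next layer down by the same factor leaves the function unchanged.
alpha = 10.0
W1s, W2s = alpha * W1, W2 / alpha

x = rng.normal(size=32)
print(np.allclose(f(x, W1, W2), f(x, W1s, W2s)))  # True: same function
print(np.linalg.norm(W1), np.linalg.norm(W1s))    # data-free metric differs by alpha
```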
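
Entry 3's decomposed representation can be sketched as a linear layer that stores and trains the factors $U$, $\Sigma$ and $V^T$ instead of $W$. The `DecomposedLinear` module below is a hypothetical minimal version; the paper's exact parameterisation and any constraints on the factors may differ.

```python
import torch

class DecomposedLinear(torch.nn.Module):
    """Linear map whose weight is stored as the SVD factors
    U, Sigma, V^T and trained in that decomposed form."""
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.U = torch.nn.Parameter(U)
        self.S = torch.nn.Parameter(S)
        self.Vh = torch.nn.Parameter(Vh)

    def forward(self, x):
        W = self.U @ torch.diag(self.S) @ self.Vh  # reconstruct W = U Sigma V^T
        return x @ W.T

layer = DecomposedLinear(torch.randn(16, 32))
out = layer(torch.randn(4, 32))  # identical function to the original weight at init
```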

2024

  1. Knowledge Distillation: The Functional Perspective.
    Israel Mason-Williams, Gabryel Mason-Williams*, and Mark Sandler.

    In Workshop on Science of Deep Learning at NeurIPS, 2024.

    Empirical findings of accuracy correlations between students and teachers in the knowledge distillation framework have served as supporting evidence for knowledge transfer. In this paper, we sought to explain and understand the knowledge transfer derived from knowledge distillation via functional similarity, hypothesising that knowledge distillation provides a functionally similar student to its teacher model. While we accept this hypothesis for two out of three architectures across a range of metrics for functional analysis against four controls, the results show that knowledge transfer is significant but less pronounced than expected, even under conditions that maximise opportunities for functional similarity. Furthermore, results from using Uniform and Gaussian noise as teachers suggest that the knowledge-sharing aspects of knowledge distillation inadequately describe the accuracy benefits witnessed when using the knowledge distillation training setup itself. Moreover, we show that knowledge distillation is not a compression mechanism but primarily a data-dependent training regulariser with, in the best case, a small capacity to transfer knowledge. (The standard distillation objective is sketched after this list.)
  2. Explicit Regularisation, Sharpness and Calibration.
    Israel Mason-Williams, Fredrik Ekholm*, and Ferenc Huszár.

    In Workshop on Science of Deep Learning at NeurIPS, 2024.

    We probe the relation between flatness, generalisation and calibration in neural networks, using explicit regularisation as a control variable. Our findings indicate that the flatness metrics surveyed fail to correlate positively with variation in generalisation or calibration. In fact, the correlation is often opposite to what has been hypothesised or claimed in prior work, with calibrated models typically existing at sharper minima compared to relative baselines; this relation holds across model classes and dataset complexities. (A minimal calibration-error sketch appears after this list.)
  3. Neural Network Compression: The Functional Perspective.
    Israel Mason-Williams.
    Compression techniques such as knowledge distillation, pruning and quantization reduce the computational costs of model inference and enable on-edge machine learning. The efficacy of compression methods is often evaluated through the proxies of accuracy and loss to understand the similarity of the compressed model to the original. This study explores the functional divergence between compressed and uncompressed models. The results indicate that quantization and pruning create models that are functionally similar to the original model. In contrast, knowledge distillation creates models that do not functionally approximate their teacher models; the distilled model instead exhibits the functional dissimilarity observed between independently trained models. Therefore, it is verified, via a functional understanding, that knowledge distillation is not a compression method, leading to its definition as a training regulariser, given that no knowledge is distilled from a teacher to a student. (A simple prediction-agreement measure of functional similarity is sketched after this list.)
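
The distillation setup analysed in entry 1 builds on the standard Hinton-style objective, sketched below in PyTorch. The temperature `T` and mixing weight `alpha` are illustrative defaults, not the paper's configuration.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """KL term between temperature-softened teacher and student
    distributions, mixed with cross-entropy on the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```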
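
Calibration in entry 2 is typically quantified with the expected calibration error (ECE); a minimal NumPy version follows. The equal-width binning scheme is a common illustrative choice, not necessarily the one used in the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Bin predictions by confidence and average the |accuracy - confidence|
    gap per bin, weighted by how many predictions land in each bin."""
    conf = probs.max(axis=1)                              # predicted-class confidence
    acc = (probs.argmax(axis=1) == labels).astype(float)  # per-example correctness
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return ece
```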
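
One simple way to read entry 3's notion of functional similarity is prediction agreement on a shared evaluation set, sketched below; the helper name and the use of top-1 agreement as the measure are illustrative assumptions rather than the study's exact metrics.

```python
import numpy as np

def prediction_agreement(logits_a, logits_b):
    """Fraction of inputs on which two models predict the same class --
    a simple functional-similarity measure between a compressed model
    and its uncompressed original."""
    return float((logits_a.argmax(axis=1) == logits_b.argmax(axis=1)).mean())
```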
