Outstanding Paper

For many differentially private algorithms, such as the prominent noisy stochastic gradient descent (DPSGD), the analysis needed to bound the privacy leakage of a single training run is well understood. However, few studies have reasoned about the privacy leakage resulting from the multiple training runs needed to fine tune the value of the training algorithm’s hyperparameters. In this work, we first illustrate how simply setting hyperparameters based on nonprivate training runs can leak private information. Motivated by this observation, we then provide privacy guarantees for hyperparameter search procedures within the framework of Renyi Differential Privacy. Our results improve and extend the work of Liu and Talwar (STOC 2019). Our analysis supports our previous observation that tuning hyperparameters does indeed leak private information, but we prove that, under certain assumptions, this leakage is modest, as long as each candidate training run needed to select hyperparameters is itself differentially private. 
Outstanding Paper

For many differentially private algorithms, such as the prominent noisy stochastic gradient descent (DPSGD), the analysis needed to bound the privacy leakage of a single training run is well understood. However, few studies have reasoned about the privacy leakage resulting from the multiple training runs needed to fine tune the value of the training algorithm’s hyperparameters. In this work, we first illustrate how simply setting hyperparameters based on nonprivate training runs can leak private information. Motivated by this observation, we then provide privacy guarantees for hyperparameter search procedures within the framework of Renyi Differential Privacy. Our results improve and extend the work of Liu and Talwar (STOC 2019). Our analysis supports our previous observation that tuning hyperparameters does indeed leak private information, but we prove that, under certain assumptions, this leakage is modest, as long as each candidate training run needed to select hyperparameters is itself differentially private. 
Outstanding Paper

Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or pooling layers, that progressively reduce the resolution of intermediate representations. This provides some shiftinvariance while reducing the computational complexity of the whole architecture. A critical hyperparameter of such layers is their stride: the integer factor of downsampling. As strides are not differentiable, finding the best configuration either requires crossvalidation or discrete optimization (e.g. architecture search), which rapidly become prohibitive as the search space grows exponentially with the number of downsampling layers. Hence, exploring this search space by gradient descent would allow finding better configurations at a lower computational cost. This work introduces DiffStride, the first downsampling layer with learnable strides. Our layer learns the size of a cropping mask in the Fourier domain, that effectively performs resizing in a differentiable way. Experiments on audio and image classification show the generality and effectiveness of our solution: we use DiffStride as a dropin replacement to standard downsampling layers and outperform them. In particular, we show that introducing our layer into a ResNet18 architecture allows keeping consistent high performance on CIFAR10, CIFAR100 and ImageNet even when training starts from poor random stride configurations. Moreover, formulating strides as learnable variables allows … 
Outstanding Paper

Convolutional neural networks typically contain several downsampling operators, such as strided convolutions or pooling layers, that progressively reduce the resolution of intermediate representations. This provides some shiftinvariance while reducing the computational complexity of the whole architecture. A critical hyperparameter of such layers is their stride: the integer factor of downsampling. As strides are not differentiable, finding the best configuration either requires crossvalidation or discrete optimization (e.g. architecture search), which rapidly become prohibitive as the search space grows exponentially with the number of downsampling layers. Hence, exploring this search space by gradient descent would allow finding better configurations at a lower computational cost. This work introduces DiffStride, the first downsampling layer with learnable strides. Our layer learns the size of a cropping mask in the Fourier domain, that effectively performs resizing in a differentiable way. Experiments on audio and image classification show the generality and effectiveness of our solution: we use DiffStride as a dropin replacement to standard downsampling layers and outperform them. In particular, we show that introducing our layer into a ResNet18 architecture allows keeping consistent high performance on CIFAR10, CIFAR100 and ImageNet even when training starts from poor random stride configurations. Moreover, formulating strides as learnable variables allows … 
Outstanding Paper

AnalyticDPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models
Diffusion probabilistic models (DPMs) represent a class of powerful generative models. Despite their success, the inference of DPMs is expensive since it generally needs to iterate over thousands of timesteps. A key problem in the inference is to estimate the variance in each timestep of the reverse process. In this work, we present a surprising result that both the optimal reverse variance and the corresponding optimal KL divergence of a DPM have analytic forms w.r.t. its score function. Building upon it, we propose \textit{AnalyticDPM}, a trainingfree inference framework that estimates the analytic forms of the variance and KL divergence using the Monte Carlo method and a pretrained scorebased model. Further, to correct the potential bias caused by the scorebased model, we derive both lower and upper bounds of the optimal variance and clip the estimate for a better result. Empirically, our analyticDPM improves the loglikelihood of various DPMs, produces highquality samples, and meanwhile enjoys a $20\times$ to $80\times$ speed up.

Outstanding Paper

Metalearning empowers artificial intelligence to increase its efficiency by learning how to learn. Unlocking this potential involves overcoming a challenging metaoptimisation problem. We propose an algorithm that tackles this problem by letting the metalearner teach itself. The algorithm first bootstraps a target from the metalearner, then optimises the metalearner by minimising the distance to that target under a chosen (pseudo)metric. Focusing on metalearning with gradients, we establish conditions that guarantee performance improvements and show that metric can be used to control metaoptimisation. Meanwhile, the bootstrapping mechanism can extend the effective metalearning horizon without requiring backpropagation through all updates. We achieve a new stateofthe art for modelfree agents on the Atari ALE benchmark and demonstrate that it yields both performance and efficiency gains in multitask metalearning. Finally, we explore how bootstrapping opens up new possibilities and find that it can metalearn efficient exploration in an epsilongreedy Qlearning agent  without backpropagating through the update rule. 
Outstanding Paper

Measuring the discrepancy between two probability distributions is a fundamental problem in machine learning and statistics. We propose a new class of discrepancies based on the optimal loss for a decision task  two distributions are different if the optimal decision loss is higher on their mixture than on each individual distribution. By suitably choosing the decision task, this generalizes the JensenShannon divergence and the maximum mean discrepancy family. We apply our approach to twosample tests, and on various benchmarks, we achieve superior test power compared to competing methods. In addition, a modeler can directly specify their preferences when comparing distributions through the decision loss. We apply this property to understanding the effects of climate change on different social and economic activities, evaluating sample quality, and selecting features targeting different decision tasks. 
Outstanding Paper

Characterizing the separation power of graph neural networks (GNNs) provides an understanding of their limitations for graph learning tasks. Results regarding separation power are, however, usually geared at specific GNNs architectures, and tools for understanding arbitrary GNN architectures are generally lacking. We provide an elegant way to easily obtain bounds on the separation power of GNNs in terms of the WeisfeilerLeman (WL) tests, which have become the yardstick to measure the separation power of GNNs. The crux is to view GNNs as expressions in a procedural tensor language describing the computations in the layers of the GNNs. Then, by a simple analysis of the obtained expressions, in terms of the number of indexes used and the nesting depth of summations, bounds on the separation power in terms of the WLtests readily follow. We use tensor language to define HigherOrder MessagePassing Neural Networks (or kMPNNs), a natural extension of MPNNs. Furthermore, the tensor language point of view allows for the derivation of universality results for classes of GNNs in a natural way. Our approach provides a toolbox with which GNN architecture designers can analyze the separation power of their GNNs, without needing to know the intricacies of the WLtests. We also … 
Outstanding Paper

Characterizing the separation power of graph neural networks (GNNs) provides an understanding of their limitations for graph learning tasks. Results regarding separation power are, however, usually geared at specific GNNs architectures, and tools for understanding arbitrary GNN architectures are generally lacking. We provide an elegant way to easily obtain bounds on the separation power of GNNs in terms of the WeisfeilerLeman (WL) tests, which have become the yardstick to measure the separation power of GNNs. The crux is to view GNNs as expressions in a procedural tensor language describing the computations in the layers of the GNNs. Then, by a simple analysis of the obtained expressions, in terms of the number of indexes used and the nesting depth of summations, bounds on the separation power in terms of the WLtests readily follow. We use tensor language to define HigherOrder MessagePassing Neural Networks (or kMPNNs), a natural extension of MPNNs. Furthermore, the tensor language point of view allows for the derivation of universality results for classes of GNNs in a natural way. Our approach provides a toolbox with which GNN architecture designers can analyze the separation power of their GNNs, without needing to know the intricacies of the WLtests. We also … 
Outstanding Paper

AnalyticDPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models
Diffusion probabilistic models (DPMs) represent a class of powerful generative models. Despite their success, the inference of DPMs is expensive since it generally needs to iterate over thousands of timesteps. A key problem in the inference is to estimate the variance in each timestep of the reverse process. In this work, we present a surprising result that both the optimal reverse variance and the corresponding optimal KL divergence of a DPM have analytic forms w.r.t. its score function. Building upon it, we propose \textit{AnalyticDPM}, a trainingfree inference framework that estimates the analytic forms of the variance and KL divergence using the Monte Carlo method and a pretrained scorebased model. Further, to correct the potential bias caused by the scorebased model, we derive both lower and upper bounds of the optimal variance and clip the estimate for a better result. Empirically, our analyticDPM improves the loglikelihood of various DPMs, produces highquality samples, and meanwhile enjoys a $20\times$ to $80\times$ speed up.

Outstanding Paper

Measuring the discrepancy between two probability distributions is a fundamental problem in machine learning and statistics. We propose a new class of discrepancies based on the optimal loss for a decision task  two distributions are different if the optimal decision loss is higher on their mixture than on each individual distribution. By suitably choosing the decision task, this generalizes the JensenShannon divergence and the maximum mean discrepancy family. We apply our approach to twosample tests, and on various benchmarks, we achieve superior test power compared to competing methods. In addition, a modeler can directly specify their preferences when comparing distributions through the decision loss. We apply this property to understanding the effects of climate change on different social and economic activities, evaluating sample quality, and selecting features targeting different decision tasks. 
Outstanding Paper

Metalearning empowers artificial intelligence to increase its efficiency by learning how to learn. Unlocking this potential involves overcoming a challenging metaoptimisation problem. We propose an algorithm that tackles this problem by letting the metalearner teach itself. The algorithm first bootstraps a target from the metalearner, then optimises the metalearner by minimising the distance to that target under a chosen (pseudo)metric. Focusing on metalearning with gradients, we establish conditions that guarantee performance improvements and show that metric can be used to control metaoptimisation. Meanwhile, the bootstrapping mechanism can extend the effective metalearning horizon without requiring backpropagation through all updates. We achieve a new stateofthe art for modelfree agents on the Atari ALE benchmark and demonstrate that it yields both performance and efficiency gains in multitask metalearning. Finally, we explore how bootstrapping opens up new possibilities and find that it can metalearn efficient exploration in an epsilongreedy Qlearning agent  without backpropagating through the update rule. 
Outstanding Paper

The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving crossentropy (CE) loss towards zero. During NC, lastlayer features collapse to their classmeans, both classifiers and classmeans collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearestclassmean decision rule. Recent works demonstrated that deep nets trained with mean squared error (MSE) loss perform comparably to those trained with CE. As a preliminary, we empirically establish that NC emerges in such MSEtrained deep nets as well through experiments on three canonical networks and five benchmark datasets. We provide, in a Google Colab notebook, PyTorch code for reproducing MSENC and CENC: https://colab.research.google.com/github/neuralcollapse/neuralcollapse/blob/main/neuralcollapse.ipynb. The analyticallytractable MSE loss offers more mathematical opportunities than the hardtoanalyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the lastlayer classifier is exactly the leastsquares classifier; and (B) a term capturing the deviation from this leastsquares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term(B) is negligible during training. … 
Outstanding Paper

The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving crossentropy (CE) loss towards zero. During NC, lastlayer features collapse to their classmeans, both classifiers and classmeans collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearestclassmean decision rule. Recent works demonstrated that deep nets trained with mean squared error (MSE) loss perform comparably to those trained with CE. As a preliminary, we empirically establish that NC emerges in such MSEtrained deep nets as well through experiments on three canonical networks and five benchmark datasets. We provide, in a Google Colab notebook, PyTorch code for reproducing MSENC and CENC: https://colab.research.google.com/github/neuralcollapse/neuralcollapse/blob/main/neuralcollapse.ipynb. The analyticallytractable MSE loss offers more mathematical opportunities than the hardtoanalyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the lastlayer classifier is exactly the leastsquares classifier; and (B) a term capturing the deviation from this leastsquares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term(B) is negligible during training. … 
Honorable Mention

Most graph neural networks (GNNs) use the message passing paradigm, in which node features are propagated on the input graph. Recent works pointed to the distortion of information flowing from distant nodes as a factor limiting the efficiency of message passing for tasks relying on longdistance interactions. This phenomenon, referred to as 'oversquashing', has been heuristically attributed to graph bottlenecks where the number of $k$hop neighbors grows rapidly with $k$. We provide a precise description of the oversquashing phenomenon in GNNs and analyze how it arises from bottlenecks in the graph. For this purpose, we introduce a new edgebased combinatorial curvature and prove that negatively curved edges are responsible for the oversquashing issue. We also propose and experimentally test a curvaturebased graph rewiring method to alleviate the oversquashing.

Honorable Mention

Most graph neural networks (GNNs) use the message passing paradigm, in which node features are propagated on the input graph. Recent works pointed to the distortion of information flowing from distant nodes as a factor limiting the efficiency of message passing for tasks relying on longdistance interactions. This phenomenon, referred to as 'oversquashing', has been heuristically attributed to graph bottlenecks where the number of $k$hop neighbors grows rapidly with $k$. We provide a precise description of the oversquashing phenomenon in GNNs and analyze how it arises from bottlenecks in the graph. For this purpose, we introduce a new edgebased combinatorial curvature and prove that negatively curved edges are responsible for the oversquashing issue. We also propose and experimentally test a curvaturebased graph rewiring method to alleviate the oversquashing.

Honorable Mention

Partial label learning (PLL) is an important problem that allows each training example to be labeled with a coarse candidate set, which well suits many realworld data annotation scenarios with label ambiguity. Despite the promise, the performance of PLL often lags behind the supervised counterpart. In this work, we bridge the gap by addressing two key research challenges in PLLrepresentation learning and label disambiguationin one coherent framework. Specifically, our proposed framework PiCO consists of a contrastive learning module along with a novel class prototypebased label disambiguation algorithm. PiCO produces closely aligned representations for examples from the same classes and facilitates label disambiguation. Theoretically, we show that these two components are mutually beneficial, and can be rigorously justified from an expectationmaximization (EM) algorithm perspective. Extensive experiments demonstrate that PiCO significantly outperforms the current stateoftheart approaches in PLL and even achieves comparable results to fully supervised learning. Code and data available: https://github.com/hbzju/PiCO. 
Honorable Mention

A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on longrange dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \( x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) \), and showed that for appropriate choices of the state matrix \( A \), this system could handle longrange dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning \( A \) with a lowrank correction, allowing it to be diagonalized stably and reducing the SSM to the wellstudied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, …

Honorable Mention

A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on longrange dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \( x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) \), and showed that for appropriate choices of the state matrix \( A \), this system could handle longrange dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning \( A \) with a lowrank correction, allowing it to be diagonalized stably and reducing the SSM to the wellstudied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, …

Honorable Mention

[ Virtual ] Partial label learning (PLL) is an important problem that allows each training example to be labeled with a coarse candidate set, which well suits many realworld data annotation scenarios with label ambiguity. Despite the promise, the performance of PLL often lags behind the supervised counterpart. In this work, we bridge the gap by addressing two key research challenges in PLLrepresentation learning and label disambiguationin one coherent framework. Specifically, our proposed framework PiCO consists of a contrastive learning module along with a novel class prototypebased label disambiguation algorithm. PiCO produces closely aligned representations for examples from the same classes and facilitates label disambiguation. Theoretically, we show that these two components are mutually beneficial, and can be rigorously justified from an expectationmaximization (EM) algorithm perspective. Extensive experiments demonstrate that PiCO significantly outperforms the current stateoftheart approaches in PLL and even achieves comparable results to fully supervised learning. Code and data available: https://github.com/hbzju/PiCO. 