Affinity Workshop
Tiny Papers Oral Session 2
Krystal Maughan · Thomas F Burns
Halle A 3
Schedule
What Does a Visual Formal Analysis of the World's 500 Most Famous Paintings Tell Us About Multimodal LLMs? (Oral)
This work introduces ArtQA, a new benchmark for multimodal LLMs through the lens of formal analysis of paintings. We focus on key elements such as line, shape, space, color, form, value, and texture, collectively referred to as the elements of art in visual formal analysis. ArtQA contains questions spanning 4 metrics, further divided into 16 fine-grained categories. We leverage the power of LLMs to generate VQA questions based on formal analysis of 500 renowned paintings. These questions undergo a rigorous filtering process by both model annotation and human experts, ensuring ArtQA's quality and reliability.
Muzi Tao · Saining Xie
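The abstract describes filtering LLM-generated questions by model annotation before human review. A minimal sketch of one plausible automatic filter (the data layout and agreement criterion are illustrative assumptions, not the paper's actual pipeline): keep a question only when independent model answers agree.

```python
# Hypothetical sketch of an automatic filtering pass over LLM-generated VQA
# items: retain a question only if all model-produced answers coincide.
# The dict layout and the exact-match criterion are assumptions.

def filter_by_model_agreement(items):
    """items: list of dicts with a 'question' and per-model 'answers'."""
    kept = []
    for item in items:
        answers = {a.strip().lower() for a in item["answers"]}
        if len(answers) == 1:  # all models gave the same answer
            kept.append(item)
    return kept

candidates = [
    {"question": "What is the dominant hue?", "answers": ["blue", "Blue"]},
    {"question": "Is the composition symmetric?", "answers": ["yes", "no"]},
]
survivors = filter_by_model_agreement(candidates)
```

Items that survive this pass would then go to human experts for the final check.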
Utilizing Cross-Version Consistency for Domain Adaptation: A Case Study on Music Audio (Oral)
Deep learning models are commonly trained on large annotated corpora, often in a specific domain, and generalizing to another domain without annotated data is usually challenging. In this paper, we address such unsupervised domain adaptation based on the teacher–student learning paradigm. For improved efficacy in the target domain, we propose to exploit cross-version scenarios, i.e., corresponding data pairs assumed to share the same yet unknown labels. More specifically, our idea is to compare teacher annotations across versions and use only the consistent annotations as labels for training the student model. Examples of cross-version data include the same text read by different speakers (in speech recognition) or the same character written by different writers (in handwritten text recognition). In our case study on music audio, versions are different recorded performances of the same composition, aligned with music synchronization techniques. Taking pitch estimation (a multi-label classification task) as an example, we show that enforcing consistency across versions during student training improves the transfer from a source domain (piano) to unseen and more complex target domains (singing/orchestra).
Lele Liu · Christof Weiß
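The core filtering step described in the abstract can be sketched as follows (our reading of the idea, not the authors' code): a teacher labels two aligned versions of the same piece, and only frames where the multi-label predictions agree are kept as pseudo-labels for the student.

```python
import numpy as np

# Cross-version consistency filtering: given teacher probabilities on two
# versions of the same composition, already aligned frame-by-frame by music
# synchronization, keep only frames whose binarized multi-label predictions
# agree across versions.

def consistent_pseudo_labels(teacher_a, teacher_b, threshold=0.5):
    """teacher_a, teacher_b: (frames, pitches) probability arrays."""
    labels_a = teacher_a >= threshold
    labels_b = teacher_b >= threshold
    agree = np.all(labels_a == labels_b, axis=1)  # per-frame agreement
    return labels_a[agree], agree

rng = np.random.default_rng(0)
a = rng.random((8, 4))
b = a.copy()
b[0] = 1.0 - b[0]  # make the first frame disagree across versions
labels, mask = consistent_pseudo_labels(a, b)
```

The surviving `labels` would then serve as training targets for the student model in the target domain.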
DFWLayer: Differentiable Frank-Wolfe Optimization Layer (Oral)
Differentiable optimization has received significant attention due to its foundational role in neural-network-based machine learning. This paper proposes a differentiable layer, the Differentiable Frank-Wolfe Layer (DFWLayer), obtained by rolling out the Frank-Wolfe method, a well-known algorithm that solves constrained optimization problems without projections or Hessian computations. This yields an efficient way of handling large-scale convex optimization problems with norm constraints. Experimental results demonstrate that the DFWLayer not only attains competitive accuracy in solutions and gradients but also consistently adheres to the constraints.
Zixuan Liu · Liu Liu · Xueqian Wang · Peilin Zhao
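The property DFWLayer exploits is that Frank-Wolfe only needs a linear minimization oracle, so norm constraints are handled without projections or Hessians. A minimal forward-pass sketch on an l1-ball constraint (plain NumPy; the paper's layer would unroll these same steps inside an autodiff framework):

```python
import numpy as np

# Frank-Wolfe for min 0.5 * ||Ax - b||^2 subject to ||x||_1 <= radius.
# The linear minimization oracle over the l1-ball is just a signed, scaled
# coordinate vector, so every iterate stays feasible by construction.

def frank_wolfe_l1(A, b, radius, steps=200):
    n = A.shape[1]
    x = np.zeros(n)                          # feasible starting point
    for _ in range(steps):
        grad = A.T @ (A @ x - b)             # gradient of the quadratic
        i = np.argmax(np.abs(grad))          # LMO: best vertex of the l1-ball
        s = np.zeros(n)
        s[i] = -radius * np.sign(grad[i])
        d = s - x                            # Frank-Wolfe direction
        denom = np.dot(A @ d, A @ d)
        if denom == 0.0:
            break
        gamma = np.clip(-np.dot(grad, d) / denom, 0.0, 1.0)  # exact line search
        x = x + gamma * d                    # convex combination stays feasible
    return x

A = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([0.3, 0.4])
x = frank_wolfe_l1(A, b, radius=1.0)         # unconstrained optimum (0.3, 0.2)
```

Because the iterate is always a convex combination of feasible points, the returned solution satisfies the norm constraint exactly, matching the constraint-adherence claim in the abstract.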
Revamp: Automated Simulations of Adversarial Attacks on Arbitrary Objects in Realistic Scenes (Oral)
Deep learning models, such as those used in autonomous vehicles, are vulnerable to adversarial attacks in which attackers place adversarial objects in the environment to induce incorrect detections. While generating such adversarial objects in the digital realm is well studied, transferring these attacks to the physical realm remains challenging, especially when accounting for real-world environmental factors. We address these challenges with REVAMP, a first-of-its-kind Python library for creating attack scenarios with arbitrary objects in scenes with realistic environmental factors such as lighting, reflection, and refraction. REVAMP empowers researchers and practitioners to swiftly explore diverse scenarios, offering a wide range of configurable options for experiment design and using differentiable rendering to replicate physically plausible adversarial objects. REVAMP is open source and available at https://anonymous.4open.science/r/revamp, and a demo video is available at https://youtu.be/ogCRO15R7-E.
Matthew Hull · Zijie Wang · Polo Chau
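The optimization loop behind differentiable-rendering attacks of this kind can be sketched schematically. Everything below is a mock stand-in, not REVAMP's API: `render` is a toy linear "renderer" over sampled environment conditions and `detector_score` a toy confidence, whereas the real library renders full scenes with lighting, reflection, and refraction.

```python
import numpy as np

# Schematic attack loop: sample an environment condition, render the scene
# with the current adversarial texture, and step the texture against the
# detector's confidence. Because the mock renderer is linear, the gradient
# of the score with respect to the texture is analytic.

rng = np.random.default_rng(1)

def render(texture, condition):
    return condition * texture              # mock differentiable renderer

def detector_score(image):
    return float(np.mean(image))            # mock detector confidence

texture = np.ones(16)                       # adversarial texture parameters
lr = 0.5
for _ in range(100):
    condition = rng.uniform(0.5, 1.5)       # sample environment factors
    image = render(texture, condition)
    # d score / d texture_j = condition / texture.size for the mock renderer
    grad = np.full_like(texture, condition / texture.size)
    texture -= lr * grad                    # suppress detector confidence
final = detector_score(render(texture, 1.0))
```

Sampling conditions each step is what makes the resulting texture robust to environmental variation, which is the hard part of transferring attacks to the physical world.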
CMFPN: Context Modeling Meets Feature Pyramid Network (Oral)
Feature fusion is a powerful technique that gives predictors access to a semantically rich representation of an image. Feature Pyramid Networks (FPNs) are the most widely used models for fusing features. However, the context within the FPN layers is inconsistent, leading to false predictions. This article addresses the context inconsistency in FPNs and proposes CMFPN, a new design that improves feature fusion by decoupling feature aggregation from context modeling. Experimental results on the COCO dataset show that CMFPN effectively resolves the context issues and improves Average Precision (AP) for object detection and instance segmentation by 2.30% and 1.7%, respectively.
Faroq AL-Tam · Muhammad AL-Qurishi · Thariq Khalid · Riad Souissi
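For context, the standard FPN top-down pathway that CMFPN builds on looks like this (this is the textbook FPN fusion, not the paper's CMFPN design; 1x1 convolutions are modeled as per-level matrices for brevity). Context carried down this pathway can differ between levels, which is the inconsistency the paper targets.

```python
import numpy as np

# Textbook FPN top-down fusion: project each backbone level with a 1x1
# "lateral" map, then merge it with the 2x-upsampled coarser level.

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(features, lateral_weights):
    """features: backbone maps ordered fine -> coarse, shapes (C_i, H_i, W_i).
    lateral_weights: per-level (C_out, C_i) matrices acting as 1x1 convs."""
    laterals = [np.einsum('oc,chw->ohw', W, f)
                for W, f in zip(lateral_weights, features)]
    out = [laterals[-1]]                       # start at the coarsest level
    for lat in reversed(laterals[:-1]):
        out.append(lat + upsample2x(out[-1]))  # merge with coarser context
    return out[::-1]                           # fine -> coarse again

rng = np.random.default_rng(0)
feats = [rng.random((8, 16, 16)), rng.random((16, 8, 8)), rng.random((32, 4, 4))]
Ws = [rng.random((4, 8)), rng.random((4, 16)), rng.random((4, 32))]
pyramid = fpn_topdown(feats, Ws)
```

In this baseline, aggregation (the additions) and context modeling are entangled in one pathway; CMFPN's stated contribution is to decouple the two.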
Lost in Translation: GANs' Inability to Generate Simple Probability Distributions (Oral)
Since their inception, Generative Adversarial Networks (GANs) have marked a triumph in generative modeling. Their capacity to mimic observations from unknown probability distributions has made them a widely used simulation tool. In typical applications, GANs simulate data rich in semantic information, such as images or text, from random noise. As such, it is reasonable to expect that large parametric models such as GANs should be able to estimate standard theoretical probability densities with ease. In this paper, based on a series of disillusioning experimental findings, we show that GANs often fail to induce the simplest of statistical transformations between distributions. For example, starting from standard Gaussian noise, GANs with 2-deep generators are unable to perform a positional translation. Supporting theoretical tests on the generated data further corroborate our rather unsettling conclusions.
Debanjan Dutta · Anish Chakrabarty · Swagatam Das
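A 1-D diagnostic of the kind that can expose such a failure is the empirical Wasserstein-1 distance, which for equal-sized sorted 1-D samples reduces to the mean absolute difference of order statistics (the specific distributions and thresholds below are illustrative, not the paper's experiments). If a GAN trained to map N(0, 1) noise to N(3, 1) data had learned the translation, this distance between its outputs and fresh target samples would be near zero.

```python
import numpy as np

# Empirical Wasserstein-1 distance between equal-sized 1-D samples:
# sort both samples and average the absolute coordinate-wise gaps.

def wasserstein1_1d(x, y):
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(0)
target = rng.normal(3.0, 1.0, 5000)
untranslated = rng.normal(0.0, 1.0, 5000)  # a generator stuck at the prior
matched = rng.normal(3.0, 1.0, 5000)       # what a successful GAN would emit
gap_fail = wasserstein1_1d(untranslated, target)  # close to the shift of 3
gap_ok = wasserstein1_1d(matched, target)         # close to 0
```

A large gap between generated and target samples, stable across training runs, is exactly the kind of quantitative evidence that corroborates the qualitative failure described in the abstract.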
Weighted Branch Aggregation Based Deep Learning Model for Track Detection in Autonomous Racing (Oral)
Intelligent track detection is a vital component of autonomous racing cars. We develop a novel Weighted Branch Aggregation based Convolutional Neural Network (WeBACNN) that accurately detects the track, is robust to the image blur caused by high speeds, and works independently of lane markings. The code and dataset for this work are available at (anonymous).
Shreya Ghosh · Yi-Huan Chen · Ching-Hsiang Huang · Abu Shafin Mohammad Mahdee Jameel · Aly El Gamal · Samuel Labi
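The abstract does not spell out the aggregation formula, so the following is only our guess at what "weighted branch aggregation" could denote: branch feature maps combined under learnable weights normalized by a softmax, letting the network emphasize whichever branch stays informative when, for example, the input is motion-blurred.

```python
import numpy as np

# Hypothetical weighted branch aggregation: softmax-normalized learnable
# logits weight each branch's feature map before summation.

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def aggregate_branches(branch_maps, logits):
    """branch_maps: equally shaped feature maps; logits: one scalar per branch."""
    w = softmax(np.asarray(logits, dtype=float))
    return sum(wi * m for wi, m in zip(w, branch_maps))

maps = [np.ones((4, 4)), 3 * np.ones((4, 4))]
fused = aggregate_branches(maps, logits=[0.0, 0.0])  # equal weights -> mean
```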
Parameter and Data Efficient Spectral Style-DCGAN (Oral)
We present a simple, highly parameter- and data-efficient adversarial network for unconditional face generation. Our method, Spectral Style-DCGAN (SSD), uses only 6.574 million parameters and 4739 dog faces from the Animal Faces HQ (AFHQ) dataset as training samples while preserving fidelity at low resolutions up to 64×64. Code available at Anonymous-repo.
Aryan Garg
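"Spectral" in the model's name most plausibly refers to spectral normalization (our assumption; the abstract does not define it): dividing a weight matrix by its largest singular value, estimated cheaply by power iteration, to keep the discriminator Lipschitz-constrained, a standard trick for stabilizing GAN training on small datasets.

```python
import numpy as np

# Spectral normalization of a weight matrix: estimate the top singular value
# by power iteration, then rescale the matrix so that value becomes 1.

def spectral_normalize(W, iters=50, eps=1e-12):
    u = np.ones(W.shape[0])
    for _ in range(iters):                  # power iteration on W
        v = W.T @ u
        v /= np.linalg.norm(v) + eps
        u = W @ v
        u /= np.linalg.norm(u) + eps
    sigma = float(u @ W @ v)                # estimated top singular value
    return W / sigma

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
Wn = spectral_normalize(W)
top_sv = np.linalg.svd(Wn, compute_uv=False)[0]  # close to 1 after rescaling
```

In a full model this rescaling would be applied to each discriminator layer's weights at every forward pass.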