Workshop
Will Synthetic Data Finally Solve the Data Access Problem?
Zheng Xu · Peter Kairouz · Herbie Bradley · Rachel Cummings · Giulia Fanti · Lipika Ramaswamy · Chulin Xie
Access to large-scale, high-quality data has been shown to be one of the most important factors in the performance of machine learning models. Recent work shows that large (language) models can benefit greatly from training on massive data from diverse (domain-specific) sources and from alignment with user intent. However, the use of certain data sources can raise privacy, fairness, copyright, and safety concerns. The impressive performance of generative artificial intelligence has popularized the use of synthetic data, and many recent works suggest that (guided) synthesis can be useful for both general-purpose and domain-specific applications. For example, Yu et al. 2024, Xie et al. 2024, and Hou et al. 2024 demonstrate promising preliminary results in synthesizing private-like data, while Wu et al. 2024 highlight existing gaps and challenges. As techniques like self-instruct (Wang et al. 2021) and self-alignment (Li et al. 2024) gain traction, researchers are questioning the implications of synthetic data (Alemohammad et al. 2023, Dohmatob et al. 2024, Shumailov et al. 2024). Will synthetic data ultimately solve the data access problem for machine learning? This workshop seeks to address this question by highlighting the limitations and opportunities of synthetic data. It aims to bring together researchers working on algorithms and applications of synthetic data, general data access for machine learning, and privacy-preserving methods such as federated learning and differential privacy, along with experts in large model training, to discuss lessons learned and chart important future directions.
Schedule
Time | Session | Type | Speakers
Sat 5:55 p.m. - 6:00 p.m. | Opening remarks | Intro | Zheng Xu
Sat 6:00 p.m. - 6:30 p.m. | From Synthetic Data to Digital Twins: The Next Frontier in Machine Learning (Invited talk: Mihaela van der Schaar) | Invited Talk | Mihaela van der Schaar
Sat 6:30 p.m. - 7:00 p.m. | Three morning spotlight talks | Spotlight Talks | Charlie Hou · Pan Li · Alisia Lupidi
Sat 7:00 p.m. - 7:30 p.m. | Break | - | -
Sat 7:30 p.m. - 8:00 p.m. | Model Collapse Does Not Mean What You Think (Invited talk: Sanmi Koyejo) | Invited Talk | Sanmi Koyejo
Sat 8:00 p.m. - 8:30 p.m. | Differentially private synthetic data: why, how and what's next (Invited talk: Natalia Ponomareva) | Invited Talk | Natalia Ponomareva
Sat 8:30 p.m. - 9:30 p.m. | Poster Session | Poster | -
Sat 9:30 p.m. - 10:30 p.m. | Lunch Break | - | -
Sat 10:30 p.m. - 11:30 p.m. | Panel Discussion | Panel | Lipika Ramaswamy · Matthias Gerstgrasser · Tao Lin · Mohamed El Amine Seddik · Karsten Kreis · Peter Kairouz
Sat 11:30 p.m. - 12:00 a.m. | SuperBPE: Tokenization across whitespaces for more efficient LLMs (Invited talk: Sewoong Oh) | Invited Talk | Sewoong Oh
Sun 12:00 a.m. - 12:30 a.m. | Break | - | -
Sun 12:30 a.m. - 1:00 a.m. | Three afternoon spotlight talks | Spotlight Talks | Shripad Gade · Haolin Wang · Giulia DeSalvo
Sun 1:00 a.m. - 1:30 a.m. | Grounding Medical LLMs in Clinical Narratives: Scalable and Participatory Synthesis of Plausible Patient Data (Invited talk: Mary-Anne Hartley) | Invited Talk | Mary-Anne Hartley
Sun 1:30 a.m. - 2:00 a.m. | TxT360 WORCS: an Open Recipe and Framework for Language Model Pretraining Data (Invited talk: Hector Zhengzhong Liu) | Invited Talk | Zhengzhong Liu
Sun 2:00 a.m. - 2:10 a.m. | Concluding remarks | Intro | Zheng Xu
- | Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources | Poster | Alisia Lupidi · Carlos Gemmell · Nicola Cancedda · Jane Dwivedi-Yu · Jason E Weston · Jakob Foerster · Roberta Raileanu · Maria Lomeli
- | Can LLMs Replace Economic Choice Prediction Labs? The Case of Language-based Persuasion Games | Poster | Eilam Shapira · Omer Madmon · Roi Reichart · Moshe Tennenholtz
- | Human-like compositional learning of visually-grounded concepts using synthetic data | Poster | Zijun Lin · M Ganesh Kumar · Cheston Tan
- | TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records | Poster | Hejie Cui · Alyssa Unell · Bowen Chen · Jason Fries · Emily Alsentzer · Sanmi Koyejo · Nigam Shah
- | Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection | Poster | Ksheeraja Raghavan · Samiran Gode · Ankit Parag Shah · Surabhi Raghavan · Wolfram Burgard · Bhiksha Raj · Rita Singh
- | Grounding QA Generation in Knowledge Graphs and Literature: A Scalable LLM Framework for Scientific Discovery | Poster | Marc Boubnovski Martell · Kaspar Märtens · Lawrence Phillips · Daniel Keitley · Maria Dermit · Julien Fauqueur
- | Compositional World Knowledge leads to High Utility Synthetic data | Poster | Sachit Gaudi · Gautam Sreekumar · Vishnu Boddeti
- | Is API Access to LLMs Useful for Generating Private Synthetic Tabular Data? | Poster | Marika Swanberg · Ryan McKenna · Edo Roth · Albert Cheu · Peter Kairouz
- | DIET-PATE: Knowledge Transfer in PATE without Public Data | Poster | Michel Meintz · Adam Dziedzic · Franziska Boenisch
- | Accelerating Differentially Private Federated Learning via Adaptive Extrapolation | Poster | Shokichi Takakura · Seng Pei Liew · Satoshi Hasegawa
- | [Tiny] Synthetic-based retrieval of patient medical data | Poster | Rinat Mullahmetov · Ilya Pershin
- | SyntheRela: A Benchmark For Synthetic Relational Database Generation | Poster | Martin Jurkovic · Valter Hudovernik · Erik Štrumbelj
- | Orchestrating Synthetic Data with Reasoning | Poster | Tim R. Davidson · Hamza Harkous · Benoit Seguin · Enrico Bacis · Cesar Ilharco
- | Breaking Focus: Contextual Distraction Curse in Large Language Models | Poster | Yanbo Wang · Zixiang Xu · Yue Huang · Chujie Gao · Siyuan Wu · Jiayi Ye · Xiuying Chen · Pin-Yu Chen · Xiangliang Zhang
- | LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation | Poster | Mufei Li · Viraj Shitole · Eli Chien · Changhai Man · Zhaodong Wang · Srinivas · Ying Zhang · Tushar Krishna · Pan Li
- | An Optimal Criterion for Steering Data Distributions to Achieve Exact Fairness | Poster | Mohit Sharma · Amit Jayant Deshpande · Chiranjib Bhattacharyya · Rajiv Ratn Shah
- | Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models | Poster | Muna Said · Aarib Zaidi · Rabia Usman · Sonia Okon · Praneeth Medepalli · Kevin Zhu · Vasu Sharma
- | Training-Free Safe Denoisers For Safe Use of Diffusion Models | Poster | Mingyu Kim · Dongjun Kim · Amman Yusuf · Stefano Ermon · Mijung Park
- | How Well Does Your Tabular Generator Learn the Structure of Tabular Data? | Poster | Xiangjian Jiang · Nikola Simidjievski · Mateja Jamnik
- | ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | Poster | Jieyu Zhang · Le Xue · Linxin Song · Jun Wang · Weikai Huang · Manli Shu · An Yan · Zixian Ma · Juan Carlos Niebles · Silvio Savarese · Caiming Xiong · Zeyuan Chen · Ranjay Krishna · Ran Xu
- | SoftSRV: Learn to generate targeted synthetic data. | Poster | Giulia DeSalvo · Jean-François Kagy · Lazaros Karydas · Afshin Rostamizadeh · Sanjiv Kumar
- | Synthetic Poisoning Attacks: The Impact of Poisoned MRI Image on U-Net Brain Tumor Segmentation | Poster | Tianhao Li · Tianyu Zeng · Yujia Zheng · Zhang Chulong · Jingyu Lu · Haotian Huang · Chuangxin Chu · Fang-Fang Yin · Zhenyu Yang
- | Out-of-Distribution Detection using Synthetic Data Generation | Poster | Momin Abbas · Muneeza Azmat · Raya Horesh · Mikhail Yurochkin
- | Synthetic Data for Blood Vessel Network Extraction | Poster | Joël Mathys · Andreas Plesner · Jorel Elmiger · Roger Wattenhofer
- | Efficient Randomized Experiments Using Foundation Models | Poster | Piersilvio De Bartolomeis · Javier Abad · Guanbo Wang · Konstantin Donhauser · Raymond Duch · Fanny Yang · Issa Dahabreh
- | [Tiny] Parameterized Synthetic Text Generation with SimpleStories | Poster | Lennart Finke · Juan Rodriguez · Thomas Dooms · Mat Allen · Thomas Marshall · Noa Nabeshima · Dan Braun
- | Can Transformers Learn Full Bayesian Inference In Context? | Poster | Arik Reuter · Tim G. J. Rudner · Vincent Fortuin · David Rügamer
- | Improved Density Ratio Estimation for Evaluating Synthetic Data Quality | Poster | Lukas Gruber · Markus Holzleitner · Sepp Hochreiter · Werner Zellinger
- | Augmented Conditioning Is Enough For Effective Training Image Generation | Poster | Jiahui Chen · Amy Zhang · Adriana Romero-Soriano
- | Synthetic Data Pruning in High Dimensions: A Random Matrix Perspective | Poster | Aymane El Firdoussi · Mohamed El Amine Seddik · Soufiane Hayou · Reda Alami · Ahmed Alzubaidi · Hakim Hacid
- | V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data | Poster | Rotem Shalev-Arkushin · Aharon Azulay · Tavi Halperin · Eitan Richardson · Amit Bermano · Ohad Fried
- | Stronger Models are NOT Always Stronger Teachers for Instruction Tuning | Poster | Zhangchen Xu · Fengqing Jiang · Luyao Niu · Bill Yuchen Lin · Radha Poovendran
- | [Tiny] Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation | Poster | Yunbo Long · Liming Xu · Alexandra Brintrup
- | Private Federated Learning using Preference-Optimized Synthetic Data | Poster | Charlie Hou · Mei-Yu Wang · Yige Zhu · Daniel Lazar · Giulia Fanti
- | Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Model | Poster | Zinan Lin · Tadas Baltrusaitis · Sergey Yekhanin
- | Towards Internet-Scale Training For Agents | Poster | Brandon Trabucco · Gunnar Sigurdsson · Robinson Piramuthu · Russ Salakhutdinov
- | Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions | Poster | Jiarui Zhang · Ollie Liu · Tianyu Yu · Jinyi Hu · Willie Neiswanger
- | Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation | Poster | Samuel Maddock · Shripad Gade · Graham Cormode · Will Bullock
- | [Tiny] Understanding the Impact of Data Domain Extraction on Synthetic Data Privacy | Poster | Georgi Ganev · Meenatchi Sundaram Muthu Selva Annamalai · Sofiane Mahiou · Emiliano De Cristofaro
- | Text to 3D Object Generation for Scalable Room Assembly | Poster | Sonia Laguna · Alberto Garcia-Garcia · Marie-Julie Rakotosaona · Stylianos Moschoglou · Leonhard Helminger · Sergio Orts-Escolano
- | TRIG-Bench: A Benchmark for Text-Rich Image Grounding | Poster | Ming Li · Ruiyi Zhang · Jian Chen · Tianyi Zhou
- | Empowering LLMs in Decision Games through Algorithmic Data Synthesis | Poster | Haolin Wang · Xueyan Li · Yazhe Niu · Shuai Hu · Hongsheng Li
- | Benchmarking Differentially Private Tabular Data Synthesis Algorithms | Poster | Kai Chen · Xiaochen Li · Chen Gong · Ryan McKenna · Tianhao Wang