About
SynthData @ ICLR 2025
Welcome to the Synthetic Data × Data Access Problem workshop co-located with ICLR 2025!
Access to large-scale, high-quality data has been shown to be one of the most important factors in the performance of machine learning models. Recent work shows that large (language) models can benefit greatly from training on massive data from diverse (domain-specific) sources and from alignment with user intent. However, the use of certain data sources can raise privacy, fairness, copyright, and safety concerns. The impressive performance of generative artificial intelligence has popularized the use of synthetic data, and many recent works suggest that (guided) synthesis can be useful for both general-purpose and domain-specific applications.
Will synthetic data ultimately solve the data access problem for machine learning? This workshop seeks to address this question by highlighting both the limitations and the opportunities of synthetic data. It aims to bring together researchers working on algorithms and applications of synthetic data, general data access for machine learning, and privacy-preserving methods such as federated learning and differential privacy, along with experts in large-model training, to discuss lessons learned and chart important future directions.
Topics of interest include, but are not limited to, the following:
- Risks and limitations of synthetic data.
- New algorithms for synthetic data generation.
- New applications of synthetic data (e.g., in healthcare, finance, gaming and simulation, education, scientific research, or autonomous systems).
- Synthetic data for model training and evaluation.
- Synthetic data for improving specific model capabilities (e.g., reasoning, math, coding).
- Synthetic data to address privacy, fairness, safety and other data concerns.
- Evaluation of synthetic data quality and models trained on synthetic data.
- Conditional and unconditional synthetic data generation.
- Fine-grained control of synthetic data generation.
- Data access with federated learning and privacy-preserving methods.
- New paradigms for accessing data for machine learning.
- Mixing synthetic and natural data.
Calls
Call for Papers
Important Dates
- Submission Due Date: February 6th, 2025, 4pm PT
- Notification of Acceptance: March 5th, 2025, AoE
- Free Registration Application Due: March 12th, 2025, AoE
- Camera-ready papers due: April 12th, 2025
- Workshop Dates: April 27th, 2025, Singapore
Submission Instructions
Submissions are processed in OpenReview. Submissions should be double-blind, no more than 6 pages long (excluding references), and follow the ICLR'25 template. An optional appendix of any length may be included at the end of the draft (after the references).
Our workshop does not have formal proceedings, i.e., it is non-archival. Accepted papers and their review comments will be posted publicly on OpenReview (after the end of the review process), while rejected and withdrawn papers and their reviews will remain private.
We welcome submissions of novel research, ongoing (incomplete) projects, drafts currently under review at other venues, and recently published results. In addition, we have the following policies.
- [Submission of previous conference and workshop papers] We request significant updates if the work has previously been presented at a major machine learning conference or workshop, or at any conference or workshop before February 1st, 2025.
- [Submission of previous journal papers] For work published in journals that has not been presented at conferences or workshops, we let the authors decide how novel it is for the community. Though the machine learning community moves fast, the workshop is inclusive of subareas that may move at a slower pace, and values submissions that represent fundamental, long-lasting research.
- [Dual submission to other workshops at the same time, e.g., another ICLR workshop] We generally discourage simultaneous dual submission to other workshops, as it would waste our program committee's efforts, and upon acceptance we request an in-person presentation as either a talk or a poster at our workshop. That being said, as our workshop is non-archival, we leave the final decision on dual submission to the authors.
Tiny Papers Submissions
[Remark] This year, ICLR is discontinuing the separate Tiny Papers track, and is instead requiring each workshop to accept short (3–5 pages in ICLR format, exact page length to be determined by each workshop) paper submissions, with an eye towards inclusion. Authors of these papers will be earmarked for potential funding from ICLR, but need to submit a separate application for Financial Assistance that evaluates their eligibility. This application for Financial Assistance to attend ICLR 2025 will become available at the beginning of February and close on March 2nd.
We encourage submission of short papers relevant to the workshop topics. Following the Tiny Papers Track at previous years' ICLR main conferences, we encourage submissions from historically underrepresented groups. Example topics include:
- An implementation of and experimentation with a novel (not published elsewhere) yet simple idea, or a modest and self-contained theoretical result
- A follow-up experiment to or re-analysis of a previously published paper
- A new perspective on a previously published paper
The tiny papers will be peer reviewed. Submissions should be double-blind, no more than 3 pages long (excluding references), and follow the ICLR'25 template. Use the same submission portal in OpenReview. In addition,
- Please clearly add the tag [Tiny] at the beginning of the submission title.
Camera Ready Instructions
Please keep using the ICLR template for the camera ready, and feel free to update the footnote/header in the template from the ICLR main conference to the workshop. We allow an extra page (i.e., max 7 pages for regular papers and max 4 pages for tiny papers) for the camera ready to properly address reviewers' comments and add author and acknowledgement information. The accepted paper PDF files will be released on OpenReview after the camera-ready deadline. The camera-ready draft can be updated by replacing the PDF file in OpenReview.
Presentation Instructions
All accepted papers are expected to be presented in person. While we aim to provide accessibility to virtual attendees of the workshop, we are not planning to provide support for virtual talks or posters.
All accepted papers are expected to have in-person posters, which should be in portrait orientation, up to A1 size: 23.4" W x 33.1" H (59.4 cm x 89.1 cm).
Each spotlight presentation, including Q&A, is 10 minutes.
See the ICLR poster instructions for on-site poster print services.
Awards
Best Paper Awards
The organizing committee will select the recipient(s) of the best paper award(s), which are supported by our sponsors.
Early Career Free Registration
The workshop can provide a limited number of free (full ICLR'25 conference) registrations to our attendees, prioritizing early-career students and promoting diversity, equity, and inclusion (DEI). If you are interested, please email us at synth-workshop-iclr25@googlegroups.com following these instructions:
- The email must be sent before March 12th to be considered.
- The email title must start with [Synth-ICLR25 free registration].
- Include link(s) to your accepted or submitted paper(s) at our workshop.
- Include a short paragraph describing why the registration is important for your research and career.
- (Optional) Include link(s) to your webpage and resume.
- The awardees will be announced on March 22nd.
Best Reviewers Free Registration
The workshop encourages high-quality reviews. We provide a limited number of free (full ICLR'25 conference) registrations for self-nominated reviewers who have written high-quality reviews. If you are interested, please email us at synth-workshop-iclr25@googlegroups.com following these instructions:
- The email must be sent before March 12th to be considered.
- The email title must start with [Synth-ICLR25 free registration: reviewer].
- Include link(s) to, or screenshots of, your reviews.
- The awardees will be announced on March 22nd.
Free Registration Awardees
Lennart Finke, Xiangjian Jiang, Martin Jurkovič, Muna Numan, Rotem Shalev-Arkushin, Yanbo Wang
Program
Workshop Program
- In-person location: Peridot 202 - 203, Singapore EXPO - 1 Expo Drive, Singapore 486150.
- ICLR page: https://iclr.cc/virtual/2025/workshop/24001
| Local Time (UTC+8) | Activity |
| --- | --- |
| 08:55 - 09:00 | Opening Remarks by Zheng Xu |
| 09:00 - 09:30 | (Remote) Invited Talk by Mihaela van der Schaar: From Synthetic Data to Digital Twins: The Next Frontier in Machine Learning |
| 09:30 - 09:40 | Spotlight Talk by Charlie Hou: Private Federated Learning using Preference-Optimized Synthetic Data |
| 09:40 - 09:50 | Spotlight Talk by Pan Li: LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation |
| 09:50 - 10:00 | Spotlight Talk by Alisia Lupidi: Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources |
| 10:00 - 10:30 | Break |
| 10:30 - 11:00 | Invited Talk by Sanmi Koyejo: Model Collapse Does Not Mean What You Think |
| 11:00 - 11:30 | Invited Talk by Natalia Ponomareva: Differentially private synthetic data: why, how and what's next |
| 11:30 - 12:30 | Poster Session |
| 12:30 - 13:30 | Lunch break |
| 13:30 - 14:30 | Panel Discussion by Lipika Ramaswamy, Matthias Gerstgrasser, Tao Lin, Mohamed El Amine Seddik, Karsten Kreis, Peter Kairouz |
| 14:30 - 15:00 | Invited Talk by Sewoong Oh: SuperBPE: Tokenization across whitespaces for more efficient LLMs |
| 15:00 - 15:30 | Break |
| 15:30 - 15:40 | Spotlight Talk by Haolin Wang: Empowering LLMs in Decision Games through Algorithmic Data Synthesis |
| 15:40 - 15:50 | Spotlight Talk by Shripad Gade: Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation |
| 15:50 - 16:00 | Spotlight Talk by Giulia DeSalvo: SoftSRV: Learn to generate targeted synthetic data |
| 16:00 - 16:30 | Invited Talk by Mary-Anne Hartley: Grounding Medical LLMs in Clinical Narratives: Scalable and Participatory Synthesis of Plausible Patient Data |
| 16:30 - 17:00 | Invited Talk by Hector Zhengzhong Liu: TxT360 WORCS: an Open Recipe and Framework for Language Model Pretraining Data |
| 17:00 - 17:05 | Concluding Remarks by Zheng Xu |
Talks
Invited Speakers

Mary-Anne Hartley
EPFL & Harvard-Chan & CMU-Africa
Sanmi Koyejo
Stanford
Sewoong Oh
University of Washington
Natalia Ponomareva
Google
Mihaela van der Schaar
University of Cambridge
Hector Liu
MBZUAI
Panel Discussion
Panelists

Lipika Ramaswamy
NVIDIA
Matthias Gerstgrasser
OpenAI
Tao Lin
Westlake University
Mohamed El Amine Seddik
Technology Innovation Institute
Karsten Kreis
NVIDIA
Peter Kairouz
Google
Accepted Papers
Spotlight Presentations
(Each talk, including Q&A, is 10 minutes)
Morning Session
- Private Federated Learning using Preference-Optimized Synthetic Data
  Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti.
- LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation
  Mufei Li, Viraj Shitole, Eli Chien, Changhai Man, Zhaodong Wang, Srinivas, Ying Zhang, Tushar Krishna, Pan Li.
- Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
  Alisia Maria Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Yu, Jason E Weston, Jakob Nicolaus Foerster, Roberta Raileanu, Maria Lomeli.
Afternoon Session
- Empowering LLMs in Decision Games through Algorithmic Data Synthesis
  Haolin Wang, Xueyan Li, Yazhe Niu, Shuai Hu, Hongsheng Li.
- Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation
  Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock.
- SoftSRV: Learn to generate targeted synthetic data.
  Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Afshin Rostamizadeh, Sanjiv Kumar.
Accepted Papers (OpenReview)
- Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models
  Muna Numan Said, Aarib Zaidi, Rabia Usman, Sonia Okon, Praneeth Medepalli, Kevin Zhu, Vasu Sharma, Sean O'Brien.
- Orchestrating Synthetic Data with Reasoning
  Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous.
- SyntheRela: A Benchmark For Synthetic Relational Database Generation
  Martin Jurkovic, Valter Hudovernik, Erik Štrumbelj.
- Towards Internet-Scale Training For Agents
  Brandon Trabucco, Gunnar A Sigurdsson, Robinson Piramuthu, Ruslan Salakhutdinov.
- Empowering LLMs in Decision Games through Algorithmic Data Synthesis
  Haolin Wang, Xueyan Li, Yazhe Niu, Shuai Hu, Hongsheng Li.
- Text to 3D Object Generation for Scalable Room Assembly
  Sonia Laguna, Alberto Garcia-Garcia, Marie-Julie Rakotosaona, Stylianos Moschoglou, Leonhard Helminger, Sergio Orts-Escolano.
- An Optimal Criterion for Steering Data Distributions to Achieve Exact Fairness
  Mohit Sharma, Amit Deshpande, Chiranjib Bhattacharyya, Rajiv Ratn Shah.
- Training-Free Safe Denoisers For Safe Use of Diffusion Models
  Mingyu Kim, Dongjun Kim, Amman Yusuf, Stefano Ermon, Mijung Park.
- Breaking Focus: Contextual Distraction Curse in Large Language Models
  Yanbo Wang, Zixiang Xu, Yue Huang, Chujie Gao, Siyuan Wu, Jiayi Ye, Xiuying Chen, Pin-Yu Chen, Xiangliang Zhang.
- [Tiny] Synthetic-based retrieval of patient medical data
  Rinat Mullahmetov, Ilya Pershin.
- Compositional World Knowledge leads to High Utility Synthetic data
  Sachit Gaudi, Gautam Sreekumar, Vishnu Boddeti.
- Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Model
  Zinan Lin, Tadas Baltrusaitis, Sergey Yekhanin.
- Synthetic Poisoning Attacks: The Impact of Poisoned MRI Image on U-Net Brain Tumor Segmentation
  Tianhao Li, Tianyu Zeng, Yujia Zheng, Zhang Chulong, Jingyu Lu, Haotian Huang, Chuangxin Chu, Fang-Fang Yin, Zhenyu Yang.
- [Tiny] Understanding the Impact of Data Domain Extraction on Synthetic Data Privacy
  Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Sofiane Mahiou, Emiliano De Cristofaro.
- Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
  Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger.
- Is API Access to LLMs Useful for Generating Private Synthetic Tabular Data?
  Marika Swanberg, Ryan McKenna, Edo Roth, Albert Cheu, Peter Kairouz.
- Augmented Conditioning Is Enough For Effective Training Image Generation
  Jiahui Chen, Amy Zhang, Adriana Romero-Soriano.
- Grounding QA Generation in Knowledge Graphs and Literature: A Scalable LLM Framework for Scientific Discovery
  Marc Boubnovski Martell, Kaspar Märtens, Lawrence Phillips, Daniel Keitley, Maria Dermit, Julien Fauqueur.
- Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation
  Yunbo Long, Liming Xu, Alexandra Brintrup.
- Stronger Models are NOT Always Stronger Teachers for Instruction Tuning
  Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran.
- [Tiny] Parameterized Synthetic Text Generation with SimpleStories
  Lennart Finke, Thomas Dooms, Mat Allen, Juan Diego Rodriguez, Noa Nabeshima, Dan Braun.
- Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection
  Ksheeraja Raghavan, Samiran Gode, Ankit Shah, Surabhi Raghavan, Wolfram Burgard, Bhiksha Raj, Rita Singh.
- Efficient Randomized Experiments Using Foundation Models
  Piersilvio De Bartolomeis, Javier Abad, Guanbo Wang, Konstantin Donhauser, Raymond M Duch, Fanny Yang, Issa Dahabreh.
- Synthetic Data for Blood Vessel Network Extraction
  Joël Mathys, Andreas Plesner, Jorel Elmiger, Roger Wattenhofer.
- Private Federated Learning using Preference-Optimized Synthetic Data
  Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti.
- Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation
  Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock.
- Out-of-Distribution Detection using Synthetic Data Generation
  Momin Abbas, Muneeza Azmat, Raya Horesh, Mikhail Yurochkin.
- SoftSRV: Learn to generate targeted synthetic data.
  Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Afshin Rostamizadeh, Sanjiv Kumar.
- Improved Density Ratio Estimation for Evaluating Synthetic Data Quality
  Lukas Gruber, Markus Holzleitner, Sepp Hochreiter, Werner Zellinger.
- V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data
  Rotem Shalev Arkushin, Aharon Azulay, Tavi Halperin, Eitan Richardson, Amit Haim Bermano, Ohad Fried.
- Can Transformers Learn Full Bayesian Inference In Context?
  Arik Reuter, Tim G. J. Rudner, Vincent Fortuin, David Rügamer.
- Benchmarking Differentially Private Tabular Data Synthesis Algorithms
  Kai Chen, Xiaochen Li, Chen Gong, Ryan McKenna, Tianhao Wang.
- Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
  Alisia Maria Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Yu, Jason E Weston, Jakob Nicolaus Foerster, Roberta Raileanu, Maria Lomeli.
- TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records
  Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, Nigam Shah.
- Accelerating Differentially Private Federated Learning via Adaptive Extrapolation
  Shokichi Takakura, Seng Pei Liew, Satoshi Hasegawa.
- DIET-PATE: Knowledge Transfer in PATE without Public Data
  Michel Meintz, Adam Dziedzic, Franziska Boenisch.
- LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation
  Mufei Li, Viraj Shitole, Eli Chien, Changhai Man, Zhaodong Wang, Srinivas, Ying Zhang, Tushar Krishna, Pan Li.
- Human-like compositional learning of visually-grounded concepts using synthetic data
  Zijun Lin, M Ganesh Kumar, Cheston Tan.
- Can LLMs Replace Economic Choice Prediction Labs? The Case of Language-based Persuasion Games
  Eilam Shapira, Omer Madmon, Roi Reichart, Moshe Tennenholtz.
- TRIG-Bench: A Benchmark for Text-Rich Image Grounding
  Ming Li, Ruiyi Zhang, Jian Chen, Tianyi Zhou.
- Synthetic Data Pruning in High Dimensions: A Random Matrix Perspective
  Aymane El Firdoussi, Mohamed El Amine Seddik, Soufiane Hayou, Reda Alami, Ahmed Alzubaidi, Hakim Hacid.
- How Well Does Your Tabular Generator Learn the Structure of Tabular Data?
  Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik.
Organization
Workshop Organizers

Herbie Bradley
UK AI Safety Institute
Rachel Cummings
Columbia University
Giulia Fanti
Carnegie Mellon University
Peter Kairouz
Google
Review
Review Guide
Please take a look at the ICLR'25 reviewer guide. This workshop accepts regular submissions of up to 6 pages and tiny papers of up to 3 pages, both excluding references and appendices. See the Call for Papers section for submission formatting.
- Review period: February 7th, 2025 to February 26th, 2025, AoE