About
SynthData @ ICLR 2025
Welcome to the Synthetic Data × Data Access Problem workshop co-located with ICLR 2025!
Access to large-scale, high-quality data has been shown to be one of the most important factors in the performance of machine learning models. Recent work shows that large (language) models can benefit greatly from training on massive data from diverse (domain-specific) sources and from alignment with user intent. However, the use of certain data sources can raise privacy, fairness, copyright, and safety concerns. The impressive performance of generative artificial intelligence has popularized the use of synthetic data, and many recent works suggest that (guided) synthesis can be useful for both general-purpose and domain-specific applications.
Will synthetic data ultimately solve the data access problem for machine learning? This workshop seeks to address this question by highlighting both the limitations and the opportunities of synthetic data. It aims to bring together researchers working on algorithms and applications of synthetic data, general data access for machine learning, and privacy-preserving methods such as federated learning and differential privacy, along with experts in large-model training, to discuss lessons learned and chart important future directions.
Topics of interest include, but are not limited to, the following:
- Risks and limitations of synthetic data.
- New algorithms for synthetic data generation.
- New applications of synthetic data (e.g., in healthcare, finance, gaming and simulation, education, scientific research, or autonomous systems).
- Synthetic data for model training and evaluation.
- Synthetic data for improving specific model capabilities (e.g., reasoning, math, coding).
- Synthetic data to address privacy, fairness, safety and other data concerns.
- Evaluation of synthetic data quality and models trained on synthetic data.
- Conditional and unconditional synthetic data generation.
- Fine-grained control of synthetic data generation.
- Data access with federated learning and privacy-preserving methods.
- New paradigms for accessing data for machine learning.
- Mixing synthetic and natural data.
Calls
Call for Papers
Important Dates
- Submission Due Date: February 6th, 2025, 4pm PT
- Notification of Acceptance: March 5th, 2025, AoE
- Free Registration Application Due: March 12th, 2025, AoE
- Camera-ready papers due: April 12th, 2025
- Workshop Dates: April 27th, 2025, Singapore
Submission Instructions
Submissions are processed in OpenReview. Submissions should be double-blind, no more than 6 pages long (excluding references), and follow the ICLR'25 template. An optional appendix of any length may be included at the end of the draft (after the references).
Our workshop does not have formal proceedings, i.e., it is non-archival. Accepted papers and their review comments will be posted publicly on OpenReview (after the end of the review process), while rejected and withdrawn papers and their reviews will remain private.
We welcome submissions of novel research, ongoing (incomplete) projects, drafts currently under review at other venues, and recently published results. In addition, we have the following policies.
- [Submission of previous conference and workshop papers] We request significant updates if the work has previously been presented at a major machine learning conference or workshop, or at any conference or workshop before February 1st, 2025.
- [Submission of previous journal papers] For work published in journals that has not been presented at conferences or workshops, we let the authors decide how novel it is for the community. Though the machine learning community moves fast, the workshop is inclusive of subareas that may move at a slower pace, and values submissions that represent fundamental, long-lasting research.
- [Dual submission to other workshops at the same time, e.g., another ICLR workshop] We generally discourage simultaneous dual submission to other workshops, as it would waste our program committee's efforts, and upon acceptance we request an in-person presentation as either a talk or a poster at our workshop. That being said, as our workshop is non-archival, we leave the final decision on dual submission to the authors.
Tiny Papers Submissions
[Remark] This year, ICLR is discontinuing the separate Tiny Papers track, and is instead requiring each workshop to accept short (3–5 pages in ICLR format, exact page length to be determined by each workshop) paper submissions, with an eye towards inclusion. Authors of these papers will be earmarked for potential funding from ICLR, but need to submit a separate application for Financial Assistance that evaluates their eligibility. This application for Financial Assistance to attend ICLR 2025 will become available at the beginning of February and close on March 2nd.
We encourage submission of short papers relevant to the workshop topics. Following the Tiny Papers Track at previous years' ICLR main conferences, we encourage submissions from historically underrepresented groups. Example topics include:
- An implementation of and experimentation with a novel (not published elsewhere) yet simple idea, or a modest and self-contained theoretical result
- A follow-up experiment to or re-analysis of a previously published paper
- A new perspective on a previously published paper
The tiny papers will be peer reviewed. Submissions should be double-blind, no more than 3 pages long (excluding references), and follow the ICLR'25 template. Use the same submission portal in OpenReview. In addition,
- Please clearly add the tag [Tiny] at the beginning of the submission title.
Camera Ready Instructions
Please keep using the ICLR template for the camera ready, and feel free to update the footnote/header in the template from the ICLR main conference to the workshop. We allow an extra page (i.e., max 7 pages for regular papers and max 4 pages for tiny papers) for the camera ready to properly address reviewers' comments and add author and acknowledgement information. The accepted paper PDF files will be released on OpenReview after the camera-ready deadline. The camera-ready draft can be updated by replacing the PDF file in OpenReview.
Presentation Instructions
All accepted papers are expected to be presented in person. While we aim to provide accessibility to virtual attendees of the workshop, we are not planning to provide support for virtual talks or posters.
All accepted papers are expected to have in-person posters, which should be in portrait orientation, up to A1 size: 23.4" W x 33.1" H (59.4 cm x 89.1 cm).
Each spotlight presentation, including Q&A, is 10 minutes.
See the ICLR poster instructions for on-site poster print services.
Awards
Best Paper Awards
The organizing committee will select the recipient(s) of the best paper award(s), which are supported by our sponsors.
Early Career Free Registration
The workshop can provide a limited number of free (full ICLR'25 conference) registrations to our attendees, prioritizing early-career students and promoting diversity, equity, and inclusion (DEI). If you are interested, please email us at synth-workshop-iclr25@googlegroups.com following these instructions:
- The email must be sent before March 12th to be considered.
- The email title must start with [Synth-ICLR25 free registration].
- Include link(s) to your accepted or submitted paper(s) at our workshop.
- Include a short paragraph describing why the registration is important for your research and career.
- (Optional) Include link(s) to your webpage and resume.
- The awardees will be announced on March 22nd.
Best Reviewers Free Registration
The workshop encourages high-quality reviews. We provide a limited number of free (full ICLR'25 conference) registrations for self-nominated reviewers who have written high-quality reviews. If you are interested, please email us at synth-workshop-iclr25@googlegroups.com following these instructions:
- The email must be sent before March 12th to be considered.
- The email title must start with [Synth-ICLR25 free registration: reviewer].
- Include link(s) to, or screenshots of, your reviews.
- The awardees will be announced on March 22nd.
Free Registration Awardees
Lennart Finke, Xiangjian Jiang, Martin Jurkovič, Muna Numan, Rotem Shalev-Arkushin, Yanbo Wang
Program
Workshop Program
- In-person location: Peridot 202 - 203, Singapore EXPO - 1 Expo Drive, Singapore 486150.
- ICLR page: https://iclr.cc/virtual/2025/workshop/24001
| Local Time (UTC+8) | Activity |
| --- | --- |
| 08:55 - 09:00 | Opening Remarks by Zheng Xu |
| 09:00 - 09:30 | (Remote) Invited Talk by Mihaela van der Schaar: From Synthetic Data to Digital Twins: The Next Frontier in Machine Learning |
| 09:30 - 09:40 | Spotlight Talk by Charlie Hou: Private Federated Learning using Preference-Optimized Synthetic Data |
| 09:40 - 09:50 | Spotlight Talk by Pan Li: LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation |
| 09:50 - 10:00 | Spotlight Talk by Alisia Lupidi: Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources |
| 10:00 - 10:30 | Break |
| 10:30 - 11:00 | Invited Talk by Sanmi Koyejo: Model Collapse Does Not Mean What You Think |
| 11:00 - 11:30 | Invited Talk by Natalia Ponomareva: Differentially private synthetic data: why, how and what's next |
| 11:30 - 12:30 | Poster Session |
| 12:30 - 13:30 | Lunch break |
| 13:30 - 14:30 | Panel Discussion by Lipika Ramaswamy, Matthias Gerstgrasser, Tao Lin, Mohamed El Amine Seddik, Karsten Kreis, Peter Kairouz |
| 14:30 - 15:00 | Invited Talk by Sewoong Oh: SuperBPE: Tokenization across whitespaces for more efficient LLMs |
| 15:00 - 15:30 | Break |
| 15:30 - 15:40 | Spotlight Talk by Haolin Wang: Empowering LLMs in Decision Games through Algorithmic Data Synthesis |
| 15:40 - 15:50 | Spotlight Talk by Shripad Gade: Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation |
| 15:50 - 16:00 | Spotlight Talk by Giulia DeSalvo: SoftSRV: Learn to generate targeted synthetic data |
| 16:00 - 16:30 | Invited Talk by Mary-Anne Hartley: Grounding Medical LLMs in Clinical Narratives: Scalable and Participatory Synthesis of Plausible Patient Data |
| 16:30 - 17:00 | Invited Talk by Hector Zhengzhong Liu: TxT360 WORCS: an Open Recipe and Framework for Language Model Pretraining Data |
| 17:00 - 17:05 | Concluding Remarks by Zheng Xu |
Talks
Invited Speakers

Mary-Anne Hartley
EPFL & Harvard-Chan & CMU-Africa
Sanmi Koyejo
Stanford
Sewoong Oh
University of Washington
Natalia Ponomareva
Google
Mihaela van der Schaar
University of Cambridge
Hector Liu
MBZUAI
Panel Discussion
Panelists

Lipika Ramaswamy
NVIDIA
Matthias Gerstgrasser
OpenAI
Tao Lin
Westlake University
Mohamed El Amine Seddik
Technology Innovation Institute
Karsten Kreis
NVIDIA
Peter Kairouz
Google
Accepted Papers
Spotlight Presentations
(Each talk, including Q&A, is 10 minutes)
Morning Session
- Private Federated Learning using Preference-Optimized Synthetic Data
  Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti.
- LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation
  Mufei Li, Viraj Shitole, Eli Chien, Changhai Man, Zhaodong Wang, Srinivas, Ying Zhang, Tushar Krishna, Pan Li.
- Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
  Alisia Maria Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Yu, Jason E Weston, Jakob Nicolaus Foerster, Roberta Raileanu, Maria Lomeli.
Afternoon Session
- Empowering LLMs in Decision Games through Algorithmic Data Synthesis
  Haolin Wang, Xueyan Li, Yazhe Niu, Shuai Hu, Hongsheng Li.
- Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation
  Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock.
- SoftSRV: Learn to generate targeted synthetic data.
  Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Afshin Rostamizadeh, Sanjiv Kumar.
Accepted Papers (OpenReview)
- Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models
  Muna Numan Said, Aarib Zaidi, Rabia Usman, Sonia Okon, Praneeth Medepalli, Kevin Zhu, Vasu Sharma, Sean O'Brien.
- Orchestrating Synthetic Data with Reasoning
  Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous.
- SyntheRela: A Benchmark For Synthetic Relational Database Generation
  Martin Jurkovic, Valter Hudovernik, Erik Štrumbelj.
- Towards Internet-Scale Training For Agents
  Brandon Trabucco, Gunnar A Sigurdsson, Robinson Piramuthu, Ruslan Salakhutdinov.
- Empowering LLMs in Decision Games through Algorithmic Data Synthesis
  Haolin Wang, Xueyan Li, Yazhe Niu, Shuai Hu, Hongsheng Li.
- Text to 3D Object Generation for Scalable Room Assembly
  Sonia Laguna, Alberto Garcia-Garcia, Marie-Julie Rakotosaona, Stylianos Moschoglou, Leonhard Helminger, Sergio Orts-Escolano.
- An Optimal Criterion for Steering Data Distributions to Achieve Exact Fairness
  Mohit Sharma, Amit Deshpande, Chiranjib Bhattacharyya, Rajiv Ratn Shah.
- Training-Free Safe Denoisers For Safe Use of Diffusion Models
  Mingyu Kim, Dongjun Kim, Amman Yusuf, Stefano Ermon, Mijung Park.
- Breaking Focus: Contextual Distraction Curse in Large Language Models
  Yanbo Wang, Zixiang Xu, Yue Huang, Chujie Gao, Siyuan Wu, Jiayi Ye, Xiuying Chen, Pin-Yu Chen, Xiangliang Zhang.
- [Tiny] Synthetic-based retrieval of patient medical data
  Rinat Mullahmetov, Ilya Pershin.
- Compositional World Knowledge leads to High Utility Synthetic data
  Sachit Gaudi, Gautam Sreekumar, Vishnu Boddeti.
- Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Model
  Zinan Lin, Tadas Baltrusaitis, Sergey Yekhanin.
- Synthetic Poisoning Attacks: The Impact of Poisoned MRI Image on U-Net Brain Tumor Segmentation
  Tianhao Li, Tianyu Zeng, Yujia Zheng, Zhang Chulong, Jingyu Lu, Haotian Huang, Chuangxin Chu, Fang-Fang Yin, Zhenyu Yang.
- [Tiny] Understanding the Impact of Data Domain Extraction on Synthetic Data Privacy
  Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Sofiane Mahiou, Emiliano De Cristofaro.
- Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
  Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger.
- Is API Access to LLMs Useful for Generating Private Synthetic Tabular Data?
  Marika Swanberg, Ryan McKenna, Edo Roth, Albert Cheu, Peter Kairouz.
- Augmented Conditioning Is Enough For Effective Training Image Generation
  Jiahui Chen, Amy Zhang, Adriana Romero-Soriano.
- Grounding QA Generation in Knowledge Graphs and Literature: A Scalable LLM Framework for Scientific Discovery
  Marc Boubnovski Martell, Kaspar Märtens, Lawrence Phillips, Daniel Keitley, Maria Dermit, Julien Fauqueur.
- Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation
  Yunbo Long, Liming Xu, Alexandra Brintrup.
- Stronger Models are NOT Always Stronger Teachers for Instruction Tuning
  Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran.
- [Tiny] Parameterized Synthetic Text Generation with SimpleStories
  Lennart Finke, Thomas Dooms, Mat Allen, Juan Diego Rodriguez, Noa Nabeshima, Dan Braun.
- Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection
  Ksheeraja Raghavan, Samiran Gode, Ankit Shah, Surabhi Raghavan, Wolfram Burgard, Bhiksha Raj, Rita Singh.
- Efficient Randomized Experiments Using Foundation Models
  Piersilvio De Bartolomeis, Javier Abad, Guanbo Wang, Konstantin Donhauser, Raymond M Duch, Fanny Yang, Issa Dahabreh.
- Synthetic Data for Blood Vessel Network Extraction
  Joël Mathys, Andreas Plesner, Jorel Elmiger, Roger Wattenhofer.
- Private Federated Learning using Preference-Optimized Synthetic Data
  Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti.
- Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation
  Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock.
- Out-of-Distribution Detection using Synthetic Data Generation
  Momin Abbas, Muneeza Azmat, Raya Horesh, Mikhail Yurochkin.
- SoftSRV: Learn to generate targeted synthetic data.
  Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Afshin Rostamizadeh, Sanjiv Kumar.
- Improved Density Ratio Estimation for Evaluating Synthetic Data Quality
  Lukas Gruber, Markus Holzleitner, Sepp Hochreiter, Werner Zellinger.
- V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data
  Rotem Shalev Arkushin, Aharon Azulay, Tavi Halperin, Eitan Richardson, Amit Haim Bermano, Ohad Fried.
- Can Transformers Learn Full Bayesian Inference In Context?
  Arik Reuter, Tim G. J. Rudner, Vincent Fortuin, David Rügamer.
- Benchmarking Differentially Private Tabular Data Synthesis Algorithms
  Kai Chen, Xiaochen Li, Chen Gong, Ryan McKenna, Tianhao Wang.
- Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
  Alisia Maria Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Yu, Jason E Weston, Jakob Nicolaus Foerster, Roberta Raileanu, Maria Lomeli.
- TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records
  Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, Nigam Shah.
- Accelerating Differentially Private Federated Learning via Adaptive Extrapolation
  Shokichi Takakura, Seng Pei Liew, Satoshi Hasegawa.
- DIET-PATE: Knowledge Transfer in PATE without Public Data
  Michel Meintz, Adam Dziedzic, Franziska Boenisch.
- LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation
  Mufei Li, Viraj Shitole, Eli Chien, Changhai Man, Zhaodong Wang, Srinivas, Ying Zhang, Tushar Krishna, Pan Li.
- Human-like compositional learning of visually-grounded concepts using synthetic data
  Zijun Lin, M Ganesh Kumar, Cheston Tan.
- Can LLMs Replace Economic Choice Prediction Labs? The Case of Language-based Persuasion Games
  Eilam Shapira, Omer Madmon, Roi Reichart, Moshe Tennenholtz.
- TRIG-Bench: A Benchmark for Text-Rich Image Grounding
  Ming Li, Ruiyi Zhang, Jian Chen, Tianyi Zhou.
- Synthetic Data Pruning in High Dimensions: A Random Matrix Perspective
  Aymane El Firdoussi, Mohamed El Amine Seddik, Soufiane Hayou, Reda Alami, Ahmed Alzubaidi, Hakim Hacid.
- How Well Does Your Tabular Generator Learn the Structure of Tabular Data?
  Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik.
Organization
Workshop Organizers

Herbie Bradley
UK AI Safety Institute
Rachel Cummings
Columbia University
Giulia Fanti
Carnegie Mellon University
Peter Kairouz
Google
Review
Review Guide
Please take a look at the ICLR'25 reviewer guide. This workshop accepts regular submissions of up to 6 pages and tiny papers of up to 3 pages, both excluding references and appendices. See the Call for Papers section for submission formatting.
- Review period: February 7th, 2025 to February 26th, 2025, AoE