About

SynthData @ ICLR 2025

Welcome to the Synthetic Data × Data Access Problem workshop co-located with ICLR 2025!

Access to large-scale, high-quality data has proven to be one of the most important factors in the performance of machine learning models. Recent work shows that large (language) models benefit greatly from training on massive data drawn from diverse (domain-specific) sources and from aligning with user intent. However, the use of certain data sources can raise privacy, fairness, copyright, and safety concerns. The impressive performance of generative artificial intelligence has popularized the use of synthetic data, and many recent works suggest that (guided) synthesis can be useful for both general-purpose and domain-specific applications.

Will synthetic data ultimately solve the data access problem for machine learning? This workshop seeks to address this question by highlighting both the limitations and the opportunities of synthetic data. It aims to bring together researchers working on algorithms and applications of synthetic data, general data access for machine learning, privacy-preserving methods such as federated learning and differential privacy, and large-model training, to discuss lessons learned and chart important future directions.


Topics of interest include, but are not limited to, the following:

  • Risks and limitations of synthetic data.
  • New algorithms for synthetic data generation.
  • New applications of using synthetic data (e.g., in healthcare, finance, gaming and simulation, education, scientific research, or autonomous systems).
  • Synthetic data for model training and evaluation.
  • Synthetic data for improving specific model capabilities (e.g., reasoning, math, coding).
  • Synthetic data to address privacy, fairness, safety and other data concerns.
  • Evaluation of synthetic data quality and models trained on synthetic data.
  • Conditional and unconditional synthetic data generation.
  • Fine-grained control of synthetic data generation.
  • Data access with federated learning and privacy-preserving methods.
  • New paradigms for accessing data for machine learning.
  • Mixing synthetic and natural data.

Calls

Call for Papers

Important Dates
  • Submission Due Date: February 6th, 2025, 4pm PT
  • Notification of Acceptance: March 5th, 2025, AoE
  • Free Registration Application Due: March 12th, 2025 AoE
  • Camera-ready papers due: April 12th, 2025
  • Workshop Dates: April 27th, 2025, Singapore

Submission Instructions

Submissions are handled through OpenReview. Submissions should be double-blind, no more than 6 pages long (excluding references), and should follow the ICLR'25 template. An optional appendix of any length may be placed at the end of the draft (after the references).

Our workshop does not have formal proceedings, i.e., it is non-archival. Accepted papers and their review comments will be posted publicly on OpenReview (after the end of the review process), while rejected and withdrawn papers and their reviews will remain private.

We welcome submissions presenting novel research, ongoing (incomplete) projects, drafts currently under review at other venues, as well as recently published results. In addition, we have the following policies.

  • [Submissions based on previous conference and workshop papers] We request significant updates if the work has previously been presented at major machine learning conferences or workshops, or at any conference or workshop before February 1st, 2025.
  • [Submissions based on previous journal papers] For work published in journals that has not been presented at conferences or workshops, we let the authors decide how novel it is for the community. Though the machine learning community moves fast, the workshop is inclusive of subareas that may move at a slower pace, and values submissions that stand for fundamental, long-lasting research.
  • [Dual submission to other workshops at the same time, e.g., another ICLR workshop] We generally discourage simultaneous submission to other workshops, as it would waste our program committee's effort, and upon acceptance we request an in-person presentation (talk or poster) at our workshop. That said, as our workshop is non-archival, we leave the final decision on dual submission to the authors.

Tiny Papers Submissions

[Remark] This year, ICLR is discontinuing the separate Tiny Papers track and is instead requiring each workshop to accept short paper submissions (3–5 pages in ICLR format, exact page length to be determined by each workshop), with an eye towards inclusion. Authors of these papers will be earmarked for potential funding from ICLR, but must submit a separate application for Financial Assistance that evaluates their eligibility. The application for Financial Assistance to attend ICLR 2025 will become available at the beginning of February and close on March 2nd.

We encourage submission of short papers relevant to the workshop topics. Following the Tiny Papers Track of previous years' ICLR main conferences, we particularly encourage submissions from historically underrepresented groups. Example topics include:

  • An implementation and experimentation of a novel (not published elsewhere) yet simple idea, or a modest and self-contained theoretical result
  • A follow-up experiment to or re-analysis of a previously published paper
  • A new perspective on a previously published paper

The tiny papers will be peer reviewed. Submissions should be double-blind, no more than 3 pages long (excluding references), and should follow the ICLR'25 template. Use the same submission portal in OpenReview. In addition,

  • Please clearly add a tag [Tiny] at the beginning of the submission title.

Camera Ready Instructions

Please keep using the ICLR template for the camera ready, and feel free to update the footnote/header in the template from the ICLR main conference to the workshop. We allow an extra page (i.e., a maximum of 7 pages for regular papers and 4 pages for tiny papers) for the camera ready, to properly address reviewers' comments and add author and acknowledgement information. The accepted paper PDF files will be released on OpenReview after the camera-ready deadline. Camera-ready drafts can be updated by replacing the PDF file in OpenReview.

Presentation Instructions

All accepted papers are expected to be presented in person. While we aim to provide accessibility to virtual attendees of the workshop, we are not planning to provide support for virtual talks or posters.

All accepted papers are expected to have in-person posters, which should be in portrait orientation, up to A1 size: 23.4" W x 33.1" H (59.4 cm W x 84.1 cm H).

Each spotlight presentation including QA is 10 min.

See ICLR poster instructions for onsite poster print services.

Awards

Best Paper Awards

The organizing committee will select the best paper award(s), supported by our sponsors.

Early Career Free Registration

The workshop can provide a limited number of free (full ICLR'25 conference) registrations to our attendees, prioritizing early-career students and promoting diversity, equity, and inclusion (DEI). If you are interested, please email us at synth-workshop-iclr25@googlegroups.com following these instructions:

  • The email has to be sent before March 12th to be considered.
  • The email title must start with [Synth-ICLR25 free registration].
  • Include link(s) to your paper(s) accepted by or submitted to our workshop.
  • Include a short paragraph describing why the registration is important for your research and career.
  • (Optional) Include link(s) to your webpage and resume.
  • The awardees will be announced on March 22nd.

Best Reviewers Free Registration

The workshop encourages high-quality reviews. We provide a limited number of free (full ICLR'25 conference) registrations for self-nominated reviewers who have written high-quality reviews. If you are interested, please email us at synth-workshop-iclr25@googlegroups.com following these instructions:

  • The email has to be sent before March 12th to be considered.
  • The email title must start with [Synth-ICLR25 free registration: reviewer].
  • Include link(s) to, or screenshots of, your reviews.
  • The awardees will be announced on March 22nd.

Free Registration Awardees

Lennart Finke, Xiangjian Jiang, Martin Jurkovič, Muna Numan, Rotem Shalev-Arkushin, Yanbo Wang

Program

Workshop Program


Local Time (UTC+8) Activity
08:55AM - 09:00AM Opening Remarks by Zheng Xu
09:00AM - 09:30AM (Remote) Invited Talk by Mihaela van der Schaar: From Synthetic Data to Digital Twins: The Next Frontier in Machine Learning
09:30AM - 09:40AM Spotlight Talk by Charlie Hou: Private Federated Learning using Preference-Optimized Synthetic Data
09:40AM - 09:50AM Spotlight Talk by Pan Li: LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation
09:50AM - 10:00AM Spotlight Talk by Alisia Lupidi: Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
10:00AM - 10:30AM Break
10:30AM - 11:00AM Invited Talk by Sanmi Koyejo: Model Collapse Does Not Mean What You Think
11:00AM - 11:30AM Invited Talk by Natalia Ponomareva: Differentially private synthetic data: why, how and what's next
11:30AM - 12:30PM Poster Session
12:30PM - 01:30PM Lunch break
01:30PM - 02:30PM Panel Discussion by Lipika Ramaswamy, Matthias Gerstgrasser, Tao Lin, Mohamed El Amine Seddik, Karsten Kreis, Peter Kairouz
02:30PM - 03:00PM Invited Talk by Sewoong Oh: SuperBPE: Tokenization across whitespaces for more efficient LLMs
03:00PM - 03:30PM Break
03:30PM - 03:40PM Spotlight Talk by Haolin Wang: Empowering LLMs in Decision Games through Algorithmic Data Synthesis
03:40PM - 03:50PM Spotlight Talk by Shripad Gade: Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation
03:50PM - 04:00PM Spotlight Talk by Giulia DeSalvo: SoftSRV: Learn to generate targeted synthetic data.
04:00PM - 04:30PM Invited Talk by Mary-Anne Hartley: Grounding Medical LLMs in Clinical Narratives: Scalable and Participatory Synthesis of Plausible Patient Data
04:30PM - 05:00PM Invited Talk by Hector Zhengzhong Liu: TxT360 WORCS: an Open Recipe and Framework for Language Model Pretraining Data
05:00PM - 05:05PM Concluding Remarks by Zheng Xu

Talks

Invited Speakers

Mary-Anne Hartley

EPFL & Harvard-Chan & CMU-Africa

Sanmi Koyejo

Stanford

Sewoong Oh

University of Washington

Mihaela van der Schaar

University of Cambridge

Hector Liu

MBZUAI

Panel Discussion

Panelists

Tao Lin

Westlake University

Mohamed El Amine Seddik

Technology Innovation Institute

Karsten Kreis

NVIDIA

Peter Kairouz

Google

Accepted Papers

Spotlight Presentations

(Each talk including QA is 10 min)

Morning Session

  • Private Federated Learning using Preference-Optimized Synthetic Data
    Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti.
  • LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation
    Mufei Li, Viraj Shitole, Eli Chien, Changhai Man, Zhaodong Wang, Srinivas, Ying Zhang, Tushar Krishna, Pan Li.
  • Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
    Alisia Maria Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Yu, Jason E Weston, Jakob Nicolaus Foerster, Roberta Raileanu, Maria Lomeli.

Afternoon Session

  • Empowering LLMs in Decision Games through Algorithmic Data Synthesis
    Haolin Wang, Xueyan Li, Yazhe Niu, Shuai Hu, Hongsheng Li.
  • Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation
    Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock.
  • SoftSRV: Learn to generate targeted synthetic data.
    Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Afshin Rostamizadeh, Sanjiv Kumar.

Accepted Papers (OpenReview)

  • Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models
    Muna Numan Said, Aarib Zaidi, Rabia Usman, Sonia Okon, Praneeth Medepalli, Kevin Zhu, Vasu Sharma, Sean O'Brien.
  • Orchestrating Synthetic Data with Reasoning
    Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous.
  • SyntheRela: A Benchmark For Synthetic Relational Database Generation
    Martin Jurkovic, Valter Hudovernik, Erik Štrumbelj.
  • Towards Internet-Scale Training For Agents
    Brandon Trabucco, Gunnar A Sigurdsson, Robinson Piramuthu, Ruslan Salakhutdinov.
  • Empowering LLMs in Decision Games through Algorithmic Data Synthesis
    Haolin Wang, Xueyan Li, Yazhe Niu, Shuai Hu, Hongsheng Li.
  • Text to 3D Object Generation for Scalable Room Assembly
    Sonia Laguna, Alberto Garcia-Garcia, Marie-Julie Rakotosaona, Stylianos Moschoglou, Leonhard Helminger, Sergio Orts-Escolano.
  • An Optimal Criterion for Steering Data Distributions to Achieve Exact Fairness
    Mohit Sharma, Amit Deshpande, Chiranjib Bhattacharyya, Rajiv Ratn Shah.
  • Training-Free Safe Denoisers For Safe Use of Diffusion Models
    Mingyu Kim, Dongjun Kim, Amman Yusuf, Stefano Ermon, Mijung Park.
  • Breaking Focus: Contextual Distraction Curse in Large Language Models
    Yanbo Wang, Zixiang Xu, Yue Huang, Chujie Gao, Siyuan Wu, Jiayi Ye, Xiuying Chen, Pin-Yu Chen, Xiangliang Zhang.
  • [Tiny] Synthetic-based retrieval of patient medical data
    Rinat Mullahmetov, Ilya Pershin.
  • Compositional World Knowledge leads to High Utility Synthetic data
    Sachit Gaudi, Gautam Sreekumar, Vishnu Boddeti.
  • Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Models
    Zinan Lin, Tadas Baltrusaitis, Sergey Yekhanin.
  • Synthetic Poisoning Attacks: The Impact of Poisoned MRI Image on U-Net Brain Tumor Segmentation
    Tianhao Li, Tianyu Zeng, Yujia Zheng, Chulong Zhang, Jingyu Lu, Haotian Huang, Chuangxin Chu, Fang-Fang Yin, Zhenyu Yang.
  • [Tiny] Understanding the Impact of Data Domain Extraction on Synthetic Data Privacy
    Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Sofiane Mahiou, Emiliano De Cristofaro.
  • Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
    Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger.
  • Is API Access to LLMs Useful for Generating Private Synthetic Tabular Data?
    Marika Swanberg, Ryan McKenna, Edo Roth, Albert Cheu, Peter Kairouz.
  • Augmented Conditioning Is Enough For Effective Training Image Generation
    Jiahui Chen, Amy Zhang, Adriana Romero-Soriano.
  • Grounding QA Generation in Knowledge Graphs and Literature: A Scalable LLM Framework for Scientific Discovery
    Marc Boubnovski Martell, Kaspar Märtens, Lawrence Phillips, Daniel Keitley, Maria Dermit, Julien Fauqueur.
  • Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation
    Yunbo Long, Liming Xu, Alexandra Brintrup.
  • Stronger Models are NOT Always Stronger Teachers for Instruction Tuning
    Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran.
  • [Tiny] Parameterized Synthetic Text Generation with SimpleStories
    Lennart Finke, Thomas Dooms, Mat Allen, Juan Diego Rodriguez, Noa Nabeshima, Dan Braun.
  • Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection
    Ksheeraja Raghavan, Samiran Gode, Ankit Shah, Surabhi Raghavan, Wolfram Burgard, Bhiksha Raj, Rita Singh.
  • Efficient Randomized Experiments Using Foundation Models
    Piersilvio De Bartolomeis, Javier Abad, Guanbo Wang, Konstantin Donhauser, Raymond M Duch, Fanny Yang, Issa Dahabreh.
  • Synthetic Data for Blood Vessel Network Extraction
    Joël Mathys, Andreas Plesner, Jorel Elmiger, Roger Wattenhofer.
  • Private Federated Learning using Preference-Optimized Synthetic Data
    Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti.
  • Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation
    Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock.
  • Out-of-Distribution Detection using Synthetic Data Generation
    Momin Abbas, Muneeza Azmat, Raya Horesh, Mikhail Yurochkin.
  • SoftSRV: Learn to generate targeted synthetic data.
    Giulia DeSalvo, Jean-François Kagy, Lazaros Karydas, Afshin Rostamizadeh, Sanjiv Kumar.
  • Improved Density Ratio Estimation for Evaluating Synthetic Data Quality
    Lukas Gruber, Markus Holzleitner, Sepp Hochreiter, Werner Zellinger.
  • V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data
    Rotem Shalev Arkushin, Aharon Azulay, Tavi Halperin, Eitan Richardson, Amit Haim Bermano, Ohad Fried.
  • Can Transformers Learn Full Bayesian Inference In Context?
    Arik Reuter, Tim G. J. Rudner, Vincent Fortuin, David Rügamer.
  • Benchmarking Differentially Private Tabular Data Synthesis Algorithms
    Kai Chen, Xiaochen Li, Chen GONG, Ryan McKenna, Tianhao Wang.
  • Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
    Alisia Maria Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Yu, Jason E Weston, Jakob Nicolaus Foerster, Roberta Raileanu, Maria Lomeli.
  • TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records
    Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, Nigam Shah.
  • Accelerating Differentially Private Federated Learning via Adaptive Extrapolation
    Shokichi Takakura, Seng Pei Liew, Satoshi Hasegawa.
  • DIET-PATE: Knowledge Transfer in PATE without Public Data
    Michel Meintz, Adam Dziedzic, Franziska Boenisch.
  • LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation
    Mufei Li, Viraj Shitole, Eli Chien, Changhai Man, Zhaodong Wang, Srinivas, Ying Zhang, Tushar Krishna, Pan Li.
  • Human-like compositional learning of visually-grounded concepts using synthetic data
    Zijun Lin, M Ganesh Kumar, Cheston Tan.
  • Can LLMs Replace Economic Choice Prediction Labs? The Case of Language-based Persuasion Games
    Eilam Shapira, Omer Madmon, Roi Reichart, Moshe Tennenholtz.
  • TRIG-Bench: A Benchmark for Text-Rich Image Grounding
    Ming Li, Ruiyi Zhang, Jian Chen, Tianyi Zhou.
  • Synthetic Data Pruning in High Dimensions: A Random Matrix Perspective
    Aymane El Firdoussi, Mohamed El Amine Seddik, Soufiane Hayou, Reda ALAMI, Ahmed Alzubaidi, Hakim Hacid.
  • How Well Does Your Tabular Generator Learn the Structure of Tabular Data?
    Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik.

Organization

Workshop Organizers

Herbie Bradley

UK AI Safety Institute

Rachel Cummings

Columbia University

Giulia Fanti

Carnegie Mellon University

Peter Kairouz

Google

Chulin Xie

University of Illinois Urbana-Champaign

Zheng Xu

Google

Review

Review Guide

Please take a look at the ICLR'25 reviewer guide. This workshop accepts regular submissions of up to 6 pages and tiny papers of up to 3 pages, both excluding references and appendices. See the CFP section for submission formatting.

  • Review period: February 7th, 2025 to February 26th, 2025, AoE

Program Committee
  • Ahmed M. Abdelmoniem (Queen Mary, University of London)
  • Ali Shahin Shamsabadi (Brave Software)
  • Alp Yurtsever (Umeå University)
  • Amr Abourayya (Universität Duisburg-Essen)
  • Anran Li (Nanyang Technological University)
  • Anshuman Suri (Northeastern University)
  • Antonious M. Girgis (University of California, Los Angeles)
  • Ang Li (University of Maryland, College Park)
  • Arun Ganesh (Google)
  • Avijit Mitra (University of Massachusetts, Amherst)
  • Benedikt Schesch (ETH Zurich)
  • Bing Luo (Duke Kunshan University)
  • Bowen Tan (Carnegie Mellon University)
  • Chejian Xu (University of Illinois at Urbana-Champaign)
  • Chuan Xu (INRIA)
  • Chulhee Yun (Korea Advanced Institute of Science & Technology)
  • Chulin Xie (University of Illinois at Urbana-Champaign)
  • Chunhui Zhang (Dartmouth College)
  • Daogao Liu (University of Washington)
  • Dhruv Nathawani (NVIDIA)
  • Edwige Cyffers (ISTA)
  • Emiliano De Cristofaro (University of California, Riverside)
  • Fan Mo (Huawei Technologies Ltd.)
  • Giulia Fanti (Carnegie Mellon University)
  • Graham Cormode (Facebook)
  • Guoyizhe Wei (Johns Hopkins University)
  • Haibo Yang (Rochester Institute of Technology)
  • Haonan Duan (University of Toronto)
  • Huseyin A Inan (Microsoft)
  • James Bell-Clark (Google)
  • Jiayuan Ye (National University of Singapore)
  • Jiayi Wang (Oak Ridge National Laboratory)
  • Jinhyun So (Daegu Gyeongbuk Institute of Science and Technology)
  • Kai Yue (North Carolina State University)
  • Krishna Pillutla (IIT Madras)
  • Kumar Kshitij Patel (Toyota Technological Institute at Chicago)
  • Lie He (Shanghai University of Finance and Economics)
  • Lingxiao Wang (Toyota Technological Institute at Chicago)
  • Lorenzo Sani (University of Cambridge)
  • Lydia Zakynthinou (University of California, Berkeley)
  • Mikko A. Heikkilä (University of Helsinki)
  • Ming Li (University of Maryland, College Park)
  • Olga Ohrimenko (University of Melbourne)
  • Robin Staab (ETH Zurich)
  • Ryan McKenna (Google)
  • Salma Kharrat (King Abdullah University of Science and Technology)
  • Sangyun Lee (Carnegie Mellon University)
  • Sebastian U Stich (CISPA Helmholtz Center for Information Security)
  • Sina Alemohammad (Rice University)
  • Sonia Laguna (ETH Zurich)
  • Swanand Kadhe (International Business Machines)
  • Tahseen Rabbani (Yale University)
  • Tong Wu (Princeton University)
  • Weiwei Kong (Google)
  • Yi Zhou (International Business Machines)
  • Yu-Xiang Wang (University of California, San Diego)
  • Yuzheng Hu (University of Illinois at Urbana-Champaign)
  • Zhenyu Sun (Northwestern University)
  • Zidi Xiong (Harvard University)
  • Zheng Xu (Google)

Sponsors

Google                  NVIDIA