About
DATA-FM @ ICLR 2026
Welcome to the Navigating and Addressing Data Problems for Foundation Models Workshop (DATA-FM), co-located with ICLR 2026!
Foundation models (FMs) continue to progress rapidly, with advances in reasoning, multimodal understanding and generation, and emerging agentic behaviors. These developments rely on increasingly diverse forms of data, including large-scale pre-training corpora; post-training data such as instruction, preference, reasoning, and multi-turn interaction traces; aligned multimodal datasets; and high-quality synthetic data throughout the pipeline. As reliance on broad and heterogeneous data sources grows, longstanding challenges in curation, attribution, copyright, privacy, fairness, safety, and evaluation have become more pressing. Understanding and improving the data layer is now a central scientific and engineering priority for the next generation of FMs.
Building on the success of the previous two editions (DPFM @ ICLR 2024 and DATA-FM @ ICLR 2025), the 3rd DATA-FM workshop aims to deepen a principled understanding of data challenges across the FM pipeline. We welcome a broad community of participants, including but not limited to researchers and engineers working on pre-training, post-training, multimodality, and agentic systems; experts in law, policy, and economics; and practitioners from industry, including frontier labs and startups. Our goal is to clarify emerging data problems, identify actionable research opportunities, and foster interdisciplinary collaboration toward a more rigorous and responsible data ecosystem for AI.
Topics of interest include, but are not limited to:
- Data curation: collection, cleaning, deduplication, selection, and mixture optimization
- Data attribution, provenance, and valuation
- Data marketplaces and emerging economic models for data exchange
- Data scarcity, discovery, and sourcing strategies
- Synthetic data generation: quality, diversity, and mitigation of model collapse
- Principled methodologies for model evaluation and benchmark design
- Small-scale experimentation for guiding large-scale training (e.g., scaling laws, μP)
- Data-centric approaches to alignment and AI safety
- Responsible data practices: privacy, security, copyright, and fairness
- Legal, regulatory, and governance frameworks for data in foundation models
Calls
Call for Papers
Important Dates
- Submission Deadline: Feb 6th, 2026, AoE
- Notification of Acceptance: March 1st, 2026, AoE
- Camera-ready Deadline: March 10th, 2026, AoE
- Workshop Date & Location: April 26th/27th, 2026 @ Rio de Janeiro, Brazil
Regular Submission Instructions
Regular submissions may be research or position papers. All submissions are handled through OpenReview and must be anonymized for double-blind review. Papers should be no more than 10 pages (excluding references) and follow the Overleaf template adapted from ICLR. An optional appendix of any length may be included after the references.
Our workshop does not have formal proceedings, i.e., it is non-archival. Accepted papers and their reviews will be posted publicly on OpenReview after the review process concludes, while rejected and withdrawn papers and their reviews will remain private.
We welcome submissions presenting novel research, ongoing or incomplete projects, manuscripts currently under review at other venues, as well as recently published results. In addition, we adopt the following policies:
- [Submissions based on previous conference papers] We allow submissions that have been accepted at major machine learning conferences within one year of ICLR 2026 (i.e., after May 2025), including papers recently accepted to the ICLR 2026 main conference. However, as workshops are primarily intended to showcase novel or ongoing research, submissions based on previously published work may be deprioritized for oral presentations.
- [Submissions based on previous journal papers] For work published in journals, we leave it to the authors to assess the novelty and relevance of the submission to the community. While the machine learning field moves quickly, this workshop aims to be inclusive of subareas that may progress at a different pace, and it values contributions that emphasize fundamental and long-lasting research.
Short Paper Submission Instructions (3–5 pages)
Starting in 2025, ICLR discontinued the separate “Tiny Papers” track and instead requires each workshop to accept short paper submissions (3–5 pages in ICLR format, with the exact page length determined by each workshop), with an eye towards inclusion; see https://iclr.cc/Conferences/2025/CallForTinyPapers for the history of the ICLR Tiny Papers initiative. Authors of these papers will be earmarked for potential funding from ICLR, but they must submit a separate application for Financial Assistance, which evaluates their eligibility. This application for Financial Assistance to attend ICLR 2026 will become available on https://iclr.cc/Conferences/2026/ at the beginning of February and close in early March.
Building on last year's practice, our workshop continues to welcome short paper submissions intended to support underrepresented, under-resourced, and early-career researchers who may not yet have the means to submit full papers. This track is intended for work at the early stages of a project: for example, a concise but self-contained theoretical result, a novel observation from preliminary experiments, or a fresh perspective on an existing problem. The goal is to foster early-stage ideas and provide a platform for researchers to receive constructive feedback and guidance as they develop their work further.
Short papers will be peer reviewed. Submissions should be anonymized, 3–5 pages long (excluding references), submitted through the same OpenReview portal, and formatted with the same Overleaf template. In addition, please add the tag [Short] at the beginning of the submission title.
In accordance with ICLR policy, AI-generated papers are not permitted in the short paper track.
Author-Reviewer Policy
The workshop program committee plays an important role in identifying and giving feedback on up-and-coming work that would most benefit from discussion and visibility at the workshop. To sustain our review and program selection processes, we expect at least one author of each submitted paper to volunteer as a reviewer for the DATA-FM 2026 workshop.
Large Language Model Usage Policy
DATA-FM 2026 adheres to the ICLR 2026 policies on large language model (LLM) usage: https://blog.iclr.cc/2025/08/26/policies-on-large-language-model-usage-at-iclr-2026/.
In particular, authors may use LLM-based tools to assist with writing, editing, coding, or experimentation, provided that any such use is disclosed, and that all human authors take full responsibility for the content and originality of the submission.
Talks
Invited Speakers
Maria De-Arteaga
ESADE Business School
Kelvin Guu
Google DeepMind
Hanna Hajishirzi
AI2 / University of Washington
Junyang Lin
Alibaba Qwen
Organization
Workshop Organizers
Luxi He
Princeton University
Yuzheng Hu
University of Illinois Urbana-Champaign
Martin Jaggi
EPFL
Ruoxi Jia
Virginia Tech
Pratyush Maini
DatologyAI / CMU
Monica Ribero
Google
Jiachen (Tianhao) Wang
Princeton University