Marc Marone
I’m a final-year PhD student at Johns Hopkins University, advised by Benjamin Van Durme. My research centers on understanding the datasets that enable foundation models – from knowledge acquisition in pretraining to systematic data pipelines for reasoning, code, and multilingual capabilities. I recently interned at Meta on the team that builds datasets for large-scale models (like Llama), and before that at Databricks Mosaic Research, also working on LLM datasets. My PhD work at JHU explores what models learn from pretraining and builds tools for understanding datasets: this work won an Outstanding Paper Award at CoLM 2024, and I recently released a state-of-the-art multilingual encoder that is better, faster, and more multilingual than comparable models. See News for more!
On the Job Market!
Some research problems I’m interested in:
- How can we efficiently query large datasets for LLMs (Data Portraits)? How does large data influence models as knowledge sources (Dated Data, outstanding paper award at CoLM 2024)?
- What do we need to build a highly multilingual encoder (mmBERT)? How do encoders and decoders differ when trained on the same data (Ettin)?
- How can we find high quality web data?
- How can we efficiently balance compute used to create datasets with compute spent on training models? Are there favorable FLOP tradeoffs?
Before coming to Johns Hopkins, I worked with the research group at Microsoft Translate (under Hany Hassan Awadalla). My undergraduate degree is from Georgia Tech, where I worked in Jacob Eisenstein’s lab. I also spent lots of time organizing educational events and campus outreach groups. I previously interned at Microsoft Semantic Machines, and my other work experience includes undergrad SWE internships at Microsoft and a quantitative trading firm.
news
| Date | News |
|---|---|
| Sep 2025 | We released mmBERT (Multilingual ModernBERT)! 100k downloads on Hugging Face |
| May 2025 | Interning with the pretraining and data team at Meta Superintelligence Labs |
| Apr 2025 | With Jack Zhang, released BloomScrub! This follows our line of work on quoting and copying from training data (EMNLP 2025, EACL 2024, NAACL 2025). |
| Oct 2024 | Attending CoLM 2024. Dated Data wins Outstanding Paper 🏆! |
| Jul 2024 | Dated Data accepted @ CoLM 2024 - tweets |
| May 2024 | Interning with the LLM data team @ Databricks/MosaicML |
| Feb 2024 | StarCoder2 and The Stack v2 released - paper link |
| May 2023 | StarCoder, a 15B-parameter open-source code LLM, uses my Data Portraits for membership testing! |
| Mar 2023 | Released Data Portraits for dataset documentation! Twitter thread. Presented @ NeurIPS 2023! |
| May 2021 | Interning at Microsoft Semantic Machines - encoders for multilingual semantic parsing |