Marc Marone


I’m a final-year PhD student at Johns Hopkins University, advised by Benjamin Van Durme. My research centers on understanding the datasets that enable foundation models, from knowledge acquisition in pretraining to systematic data pipelines for reasoning, code, and multilingual capabilities. I recently interned at Meta on the team that builds datasets for large-scale models (like Llama), and before that at Databricks Mosaic Research, also working on LLM datasets. My PhD work at JHU explores knowledge from pretraining and tools for understanding datasets: our academic work was honored at CoLM 2024, and I recently released a state-of-the-art multilingual encoder that is better, faster, and more multilingual than comparable models. See News for more!

On the Job Market!

I'm looking for full-time opportunities in data-focused research. Topics like large-scale curation, synthetic data, and models for emerging modalities are of special interest. Please reach out!

Resume · Contact

Some research problems I’m interested in:

  • How can we efficiently query large datasets for LLMs (Data Portraits)? How does large data influence models as knowledge sources (Dated Data, outstanding paper award at CoLM 2024)?
  • What do we need to build a highly multilingual encoder (mmBERT)? How do encoders and decoders differ when trained on the same data (Ettin)?
  • How can we find high quality web data?
  • How can we efficiently balance compute used to create datasets with compute spent on training models? Are there favorable FLOP tradeoffs?
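The membership-testing question behind Data Portraits is often answered space-efficiently with a Bloom filter over text n-grams: hash each n-gram of the corpus into a shared bit array, then query new text against it. The sketch below is illustrative only (it is not the actual Data Portraits implementation; all names and parameters here are my own):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for approximate set membership.

    Queries may return false positives but never false negatives,
    which is the usual trade-off for corpus-scale membership tests.
    """

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def char_ngrams(text: str, n: int = 50, stride: int = 25):
    """Overlapping character n-grams, the unit typically hashed for text."""
    return [text[i:i + n] for i in range(0, max(len(text) - n + 1, 1), stride)]
```

To index a corpus, call `add` on every n-gram of every document; to test whether a candidate string overlaps the corpus, check its n-grams with `in`. The filter stores only bits, so the corpus itself never needs to be kept or scanned at query time.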

Before coming to Johns Hopkins, I worked with the research group behind Microsoft Translator (under Hany Hassan Awadalla). My undergraduate degree is from Georgia Tech, where I worked in Jacob Eisenstein’s lab and spent lots of time organizing educational events and campus outreach groups. I previously interned at Microsoft Semantic Machines, and my other work experience includes undergraduate SWE internships at Microsoft and a quantitative trading firm.

news

Sep 2025 We release mmBERT, a multilingual ModernBERT! 100k+ downloads on HF
May 2025 Interning with the pretraining and data team at Meta Superintelligence Labs
Apr 2025 With Jack Zhang, we released BloomScrub! This continues our line of work on quoting and copying from training data (EMNLP 2025, EACL 2024, NAACL 2025).
Oct 2024 Attending CoLM 2024. Dated Data wins Outstanding Paper 🏆!
Jul 2024 Dated Data accepted @ CoLM 2024 - tweets
May 2024 Interning with the LLM data team @ Databricks/MosaicML
Feb 2024 StarCoder2 and The Stack v2 released - paper link
May 2023 StarCoder, a 15B param open-source code LLM, uses my Data Portraits for membership testing!
Mar 2023 Released Data Portraits for dataset documentation! Twitter thread. Presented @ NeurIPS 2023!
May 2021 Interning at Microsoft Semantic Machines, working on encoders for multilingual semantic parsing