Marc Marone

I’m a PhD student at Johns Hopkins University advised by Benjamin Van Durme. My research interests center on datasets for language understanding systems - from knowledge in large language models to machine translation. I’ve worked on projects in data documentation, code generation, multilinguality, and efficient finetuning. I’m currently a Research Scientist Intern at Databricks Mosaic Research, working with the datasets team.

Some research problems I’m currently interested in:

  • How can we efficiently understand the contents of large datasets for LLMs? (Data Portraits)
  • How does pretraining data influence large language models, especially as knowledge sources? (Dated Data, Outstanding Paper award at CoLM 2024; According-To)
  • What features indicate high quality web data sources?
  • How can we use synthetic data for pre-training, post-training, and evaluation?

Before coming to Johns Hopkins, I worked with the research group at Microsoft Translate under Hany Hassan. My undergraduate degree is from Georgia Tech, where I worked in Jacob Eisenstein’s lab and spent lots of time organizing educational events and campus outreach groups. I previously interned at Microsoft Semantic Machines. My other work experience includes undergrad SWE internships at Microsoft and a quantitative trading firm.

I’m interested in opportunities around large-scale datasets or other exciting data-centric research problems in NLP - please reach out!

news

Oct 2024 Attending CoLM 2024. Dated Data wins Outstanding Paper 🏆!
Jul 2024 Dated Data accepted @ CoLM 2024 - tweets
May 2024 Interning with the LLM data team @ Databricks/MosaicML
Feb 2024 StarCoder2 and The Stack v2 released - paper link
May 2023 StarCoder, a 15B param open-source code LLM, uses my Data Portraits for membership testing!
Mar 2023 Released Data Portraits for dataset documentation! Twitter thread. Presented @ NeurIPS 2023!
May 2021 Interning at Microsoft Semantic Machines - encoders for multilingual semantic parsing
Aug 2020 Started my PhD @ JHU