Marc Marone
I’m a PhD student at Johns Hopkins University advised by Benjamin Van Durme. My research interests center on datasets for language understanding systems, from knowledge in large language models to machine translation. I’ve worked on projects in data documentation, code generation, multilinguality, and efficient finetuning. I’m currently a Research Scientist Intern at Databricks Mosaic Research, working with the datasets team.
Some research problems I’m currently interested in:
- How can we efficiently understand the contents of large datasets for LLMs? (Data Portraits)
- How does pretraining data influence large language models, especially as knowledge sources? (Dated Data, Outstanding Paper Award at CoLM 2024; According-To)
- What features indicate high quality web data sources?
- How can we use synthetic data for pre-training, post-training, and evaluation?
Before coming to Johns Hopkins, I worked with the research group at Microsoft Translate under Hany Hassan. My undergraduate degree is from Georgia Tech, where I worked in Jacob Eisenstein’s lab. I also spent lots of time organizing educational events and campus outreach groups. I previously interned at Microsoft Semantic Machines. My other work experience includes undergrad SWE internships at Microsoft and a quantitative trading firm.
I’m interested in opportunities around large-scale datasets or other exciting data-centric research problems in NLP. Please reach out!
news
Oct 2024 | Attending CoLM 2024. Dated Data wins Outstanding Paper 🏆! |
---|---|
Jul 2024 | Dated Data accepted @ CoLM 2024 - tweets |
May 2024 | Interning with the LLM data team @ Databricks/MosaicML |
Feb 2024 | StarCoder2 and The Stack v2 released - paper link |
May 2023 | StarCoder, a 15B param open-source code LLM, uses my Data Portraits for membership testing! |
Mar 2023 | Released Data Portraits for dataset documentation! Twitter thread. Presented @ NeurIPS 2023! |
May 2021 | Interning at Microsoft Semantic Machines - encoders for multilingual semantic parsing |
Aug 2020 | Started my PhD @ JHU |