Marc Marone


I’m a PhD student at Johns Hopkins University advised by Benjamin Van Durme. I’m interested in natural language processing and large datasets. I’ve worked on projects in data documentation, code generation, multilinguality, and efficient finetuning. I’ve recently become interested in building efficient data structures and tooling around large-scale NLP models and datasets, especially for analyzing LLMs as knowledge sources.

Before coming to Johns Hopkins, I worked with the research group at Microsoft Translate. My undergraduate degree is from Georgia Tech where I worked in Jacob Eisenstein’s lab. I also spent lots of time organizing educational events and campus outreach groups.

I previously interned at Microsoft Semantic Machines. My other work experience includes research at Microsoft Translate (under Hany Hassan) and undergraduate software engineering internships at Microsoft and a quantitative trading firm.

I’m interested in opportunities around large-scale dataset curation, efficient generation, or other exciting research problems in NLP. Please reach out!


Dec 2023 Presenting Data Portraits at NeurIPS 2023!
May 2023 StarCoder, a 15B param open-source code LLM, uses my Data Portraits for membership testing!
Mar 2023 Released Data Portraits for dataset documentation! Twitter thread.
Apr 2022 Paper on Pretrained Models for Federated Learning accepted @ NAACL 2022
Aug 2021 Paper on Cross-Lingual IE @ EMNLP 2021
May 2021 Interning at Microsoft Semantic Machines - encoders for multilingual semantic parsing
Aug 2020 Started my PhD @ JHU