Hello, I’m Mayee!

I am a PhD student in Computer Science at Stanford University, advised by Prof. Christopher Ré and part of the Hazy Research Lab.

I’m interested in studying and improving the fundamentals of modern machine learning through data (often known as data-centric AI). On the model training side, I work on data mixing, synthetic data, data representations, and data labeling. On the inference side, I work on test-time algorithms, such as ensembling and routing, that produce higher-quality model generations. Currently, I am thinking about how to develop and operationalize a more principled understanding of how models learn from data: What skills does data teach the model? Does it matter if the data is synthetic or real?

Previously, I graduated summa cum laude from Princeton University with a concentration in Operations Research and Financial Engineering (ORFE) and a certificate in Applications of Computing. At Princeton, I worked with Prof. Elad Hazan and Prof. Miklos Racz.

Please get in touch with me via email if you would like to chat about research or collaboration!

Publications and Preprints

For a chronological list of my publications, please see my Google Scholar/CV.

Training Data

Aioli: A Unified Optimization Framework for Language Model Data Mixing
Mayee F. Chen*, Michael Y. Hu*, Nicholas Lourie, Kyunghyun Cho, Christopher Ré. International Conference on Learning Representations (ICLR), 2025.
paper | code
DataComp-LM: In search of the next generation of training sets for language models
Jeffrey Li, …, Ludwig Schmidt, Vaishaal Shankar (59 authors, including Mayee F. Chen). Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024.
paper | code | project page
Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates
Avanika Narayan*, Mayee F. Chen*, Kush Bhatia, Christopher Ré. Conference on Language Modeling (COLM), 2024.
paper
Skill-it! A data-driven skills framework for understanding and training language models.
Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, Christopher Ré. Conference on Neural Information Processing Systems (NeurIPS), 2023. Spotlight (top 3.1% of submissions).
paper | AllenAI talk | code

Test-time Improvements

Archon: An Architecture Search Framework for Inference-Time Techniques
Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Kumar Guha, E. Kelly Buchanan, Mayee F. Chen, Neel Guha, Christopher Ré, Azalia Mirhoseini. In submission, 2024.
paper
Smoothie: Label Free Language Model Routing
Neel Guha*, Mayee F. Chen*, Trevor Chow, Ishan S. Khare, Christopher Ré. Conference on Neural Information Processing Systems (NeurIPS), 2024.
paper | code | blog
Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification.
Neel Guha*, Mayee F. Chen*, Kush Bhatia*, Azalia Mirhoseini, Frederic Sala, Christopher Ré. Conference on Neural Information Processing Systems (NeurIPS), 2023.
paper
Ask Me Anything: A simple strategy for prompting language models
Simran Arora*, Avanika Narayan*, Mayee F. Chen, Laurel J. Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, Christopher Ré. International Conference on Learning Representations (ICLR), 2023. Notable top 25% of acceptances.
paper | code

Data Labeling

Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision
Mayee F. Chen*, Daniel Y. Fu*, Dyah Adila, Michael Zhang, Frederic Sala, Kayvon Fatahalian, and Christopher Ré. Uncertainty in Artificial Intelligence (UAI), 2022. Best Student Paper Runner-Up Award, Oral Presentation.
paper | code | slides | blog | Snorkel talk
Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation.
Mayee F. Chen*, Benjamin Cohen-Wang*, Steve Mussmann, Frederic Sala, and Christopher Ré. Artificial Intelligence and Statistics (AISTATS), 2021.
paper | slides
Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods.
Daniel Y. Fu*, Mayee F. Chen*, Frederic Sala, Sarah M. Hooper, Kayvon Fatahalian, and Christopher Ré. International Conference on Machine Learning (ICML), 2020.
paper | code | video | blog

Data Representations

Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning
Mayee F. Chen*, Daniel Y. Fu*, Avanika Narayan, Michael Zhang, Zhao Song, Kayvon Fatahalian, and Christopher Ré. International Conference on Machine Learning (ICML), 2022.
paper | code | blog
TABi: Type-Aware Bi-encoders for End-to-End Entity Retrieval
Megan E. Leszczynski, Daniel Y. Fu, Mayee F. Chen, and Christopher Ré. To appear in Findings of the Association for Computational Linguistics (ACL), 2022.
paper | code | blog
The Details Matter: Preventing Class Collapse in Supervised Contrastive Learning
Mayee F. Chen*, Daniel Y. Fu*, Michael Zhang, Kayvon Fatahalian, and Christopher Ré. AAAI Workshop on Artificial Intelligence with Biased or Scarce Data, 2022. Best Paper Award.
paper | code

Science/Health Applications

A case for reframing automated medical image classification as segmentation.
Sarah Hooper, Mayee F. Chen, Khaled Kamal Saab, Kush Bhatia, Curtis Langlotz, Christopher Ré. Conference on Neural Information Processing Systems (NeurIPS), 2023.
Anomaly Detection with Multiple Reference Datasets
Mayee F. Chen, Benjamin Nachman, Frederic Sala. Journal of High Energy Physics (JHEP), 2023. Machine Learning and the Physical Sciences (ML4PS) Workshop at NeurIPS, 2022.
paper | code
Reducing Reliance on Spurious Features in Medical Image Classification with Spatial Specificity.
Khaled Saab, Sarah M. Hooper, Mayee F. Chen, Michael Zhang, Daniel Rubin, Christopher Ré. Machine Learning for Healthcare (MLHC), 2022.
paper | code

Model Evaluation

Mandoline: Model Evaluation under Distribution Shift
Mayee F. Chen*, Karan Goel*, Nimit Sohoni*, Fait Poms, Kayvon Fatahalian, and Christopher Ré. International Conference on Machine Learning (ICML), 2021.
paper | code | slides | MedAI talk

Older

An Adversarial Model of Network Disruption: Maximizing Disagreement and Polarization in Social Networks.
Mayee F. Chen and Miklos Z. Racz. IEEE Transactions on Network Science and Engineering (TNSE), 2021.
paper | code

Effect of Rotational Grazing on Plant and Animal Production.
Mayee F. Chen and Junping Shi. Journal of Mathematical Biosciences and Engineering, vol. 15, no. 2. 2018.
paper | slides
Efficient GCD Computation for Big Integers on Xeon Phi Coprocessor.
Jie Chen, William Watson, and Mayee F. Chen. IEEE Conference on Networking, Architecture, and Storage (NAS). 2014.
paper | slides

Mayee Chen