Hello, I’m Mayee!
I am a PhD student in Computer Science at Stanford University, advised by Prof. Christopher Ré and part of the Hazy Research Lab.
I’m interested in studying and improving the fundamentals of modern machine learning through data (often known as data-centric AI). On the model training side, I work on data mixing, synthetic data, data representations, and data labeling. On the inference side, I work on test-time algorithms that produce higher-quality model generations, such as ensembling and routing. Currently, I am thinking about how to develop and operationalize a more principled understanding of how models learn from data (e.g., what skills does data teach the model, and does it matter whether the data is synthetic or real?).
Previously, I graduated summa cum laude from Princeton University with a concentration in Operations Research and Financial Engineering (ORFE) and a certificate in Applications of Computing, where I worked with Prof. Elad Hazan and Prof. Miklos Racz.
Please get in touch with me via email if you would like to chat about research or collaboration!
Publications and Preprints
For my publications in chronological order, please check out my Google Scholar/CV.
Training Data
Aioli: A Unified Optimization Framework for Language Model Data Mixing
Mayee F. Chen*, Michael Y. Hu*, Nicholas Lourie, Kyunghyun Cho, Christopher Ré. In submission, 2024.
paper | code
DataComp-LM: In search of the next generation of training sets for language models
Jeffrey Li, …, Ludwig Schmidt, Vaishaal Shankar (59 authors, including Mayee F. Chen). Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024.
paper | code | project page
Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates
Avanika Narayan*, Mayee F. Chen*, Kush Bhatia, Christopher Ré. Conference on Language Modeling (COLM), 2024.
paper
Skill-it! A data-driven skills framework for understanding and training language models.
Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, Christopher Ré. Conference on Neural Information Processing Systems (NeurIPS), 2023. Spotlight (top 3.1% of submissions).
paper | AllenAI talk | code
Test-time Improvements
Archon: An Architecture Search Framework for Inference-Time Techniques
Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, Nahum Maru, Hristo Todorov, Etash Kumar Guha, E. Kelly Buchanan, Mayee F. Chen, Neel Guha, Christopher Ré, Azalia Mirhoseini. In submission, 2024.
Smoothie: Label Free Language Model Routing
Neel Guha*, Mayee F. Chen*, Trevor Chow, Ishan S. Khare, Christopher Ré. Conference on Neural Information Processing Systems (NeurIPS), 2024.
Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification.
Neel Guha*, Mayee F. Chen*, Kush Bhatia*, Azalia Mirhoseini, Frederic Sala, Christopher Ré. Conference on Neural Information Processing Systems (NeurIPS), 2023.
paper
Ask Me Anything: A simple strategy for prompting language models
Simran Arora*, Avanika Narayan*, Mayee F. Chen, Laurel J. Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, Christopher Ré. International Conference on Learning Representations (ICLR), 2023. Notable top 25% of acceptances.
paper | code
Data Labeling
Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision
Mayee F. Chen*, Daniel Y. Fu*, Dyah Adila, Michael Zhang, Frederic Sala, Kayvon Fatahalian, and Christopher Ré. Uncertainty in Artificial Intelligence (UAI), 2022. Best Student Paper Runner-Up Award, Oral Presentation.
paper | code | slides | blog | Snorkel talk
Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation.
Mayee F. Chen*, Benjamin Cohen-Wang*, Steve Mussmann, Frederic Sala, and Christopher Ré. Artificial Intelligence and Statistics (AISTATS), 2021.
paper | slides
Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods.
Daniel Y. Fu*, Mayee F. Chen*, Frederic Sala, Sarah M. Hooper, Kayvon Fatahalian, and Christopher Ré. International Conference on Machine Learning (ICML), 2020.
paper | code | video | blog
Data Representations
Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning
Mayee F. Chen*, Daniel Y. Fu*, Avanika Narayan, Michael Zhang, Zhao Song, Kayvon Fatahalian, and Christopher Ré. International Conference on Machine Learning (ICML), 2022.
paper | code | blog
TABi: Type-Aware Bi-encoders for End-to-End Entity Retrieval
Megan E. Leszczynski, Daniel Y. Fu, Mayee F. Chen, and Christopher Ré. Findings of the Association for Computational Linguistics (ACL), 2022.
paper | code | blog
The Details Matter: Preventing Class Collapse in Supervised Contrastive Learning
Mayee F. Chen*, Daniel Y. Fu*, Michael Zhang, Kayvon Fatahalian, and Christopher Ré. AAAI Workshop on Artificial Intelligence with Biased or Scarce Data, 2022. Best Paper Award.
paper | code
Science/Health Applications
A case for reframing automated medical image classification as segmentation.
Sarah Hooper, Mayee F. Chen, Khaled Kamal Saab, Kush Bhatia, Curtis Langlotz, Christopher Ré. Conference on Neural Information Processing Systems (NeurIPS), 2023.
Anomaly Detection with Multiple Reference Datasets
Mayee F. Chen, Benjamin Nachman, Frederic Sala. Journal of High Energy Physics (JHEP), 2023. Machine Learning and the Physical Sciences (ML4PS) Workshop at NeurIPS, 2022.
paper | code
Reducing Reliance on Spurious Features in Medical Image Classification with Spatial Specificity.
Khaled Saab, Sarah M. Hooper, Mayee F. Chen, Michael Zhang, Daniel Rubin, Christopher Ré. Machine Learning for Healthcare (MLHC), 2022.
paper | code
Model Evaluation
Mandoline: Model Evaluation under Distribution Shift
Mayee F. Chen*, Karan Goel*, Nimit Sohoni*, Fait Poms, Kayvon Fatahalian, and Christopher Ré. International Conference on Machine Learning (ICML), 2021.
paper | code | slides | MedAI talk
Older
An Adversarial Model of Network Disruption: Maximizing Disagreement and Polarization in Social Networks.
Mayee F. Chen and Miklos Z. Racz. IEEE Transactions on Network Science and Engineering (TNSE), 2021.
paper | code
Effect of Rotational Grazing on Plant and Animal Production.
Mayee F. Chen and Junping Shi. Journal of Mathematical Biosciences and Engineering, vol. 15, no. 2, 2018.
paper | slides
Efficient GCD Computation for Big Integers on Xeon Phi Coprocessor.
Jie Chen, William Watson, and Mayee F. Chen. IEEE Conference on Networking, Architecture, and Storage (NAS). 2014.
paper | slides