Research

As a Ph.D. student at Georgetown University, I was advised by Lisa Singh, and was part of the Georgetown DataLab and the Massive Data Institute. My research area revolved around adapting topic models to social media data, and culminated in my doctoral thesis, Modernizing Topic Models: Accounting for Noise, Time, and Domain Knowledge. My research was focused on improving the interpretability of topics by humans, as well as the overall performance of topic models on short documents such as social media posts and survey data. In the course of our research, I got to interact with a lot of very cool data sets in close to real time, including data sets pertaining to the 2016 and 2020 presidential elections, the Covid-19 pandemic, and others.

The main contribution of my thesis was the invention of topic-noise models, which combine a topic-word distribution with a global noise distribution to effectively filter noise from topics generated on social media data. We have since adapted topic-noise models to incorporate a temporal aspect (Dynamic Topic-Noise Models), and to incorporate the domain knowledge of the user (Guided Topic-Noise Models). The relevant papers can be found on the Publications page.

On the way to creating topic-noise models, we created a framework for preprocessing text data for topic modeling (TextPrep), and a few other topic models which were based on a graph structure instead of a probability distribution, like the Dirichlet for LDA and our topic-noise models.

We put together all of our topic models (as well as our own implementations of a couple very popular topic models) in a python package called GDTM (Georgetown Datalab Topic Models), which can be found on Github.

We also published a survey of topic models called The Evolution of Topic Modeling, which carefully traces the origins of different facets of modern topic models to give a better understanding of how we came to be where we are today.

My current main research interests include topic modeling and finding places where LLMs are beaten by smaller, more specialized models.