I am currently a Ph.D. student at Georgetown. My research interests include Topic Models, Data Mining, Machine Learning, Data Science, and Text Mining. I hold a Master's degree in Computer Science from Georgetown University and a Bachelor's in Computer Science from Boston University. My thesis work improved on existing topic modeling algorithms to enable tracking topics through time and to make them more accurate in noisy mediums such as social media. I work closely with my advisor, Lisa Singh.
I first became interested in Data Mining during my Junior year of college at Boston University when I took a class with Evimaria Terzi. I followed up that first Data Mining class with several more, and worked on multiple Data Mining projects under Evimaria's advisement.
When I'm not doing CS research, you can find me surfing, skiing, kitesurfing, or playing ice hockey. I am a co-captain of the Georgetown Club Ice Hockey Team.
You can reach me through email at rjc111 (at) georgetown (dot) edu
I am currently doing research with Lisa Singh at Georgetown, and am part of her Data Mining group. My specific research area is in dynamic topic models.
Over the past five years, I have conducted research in Data Mining and Machine Learning. Below you can find a description of each project.
Classic state-of-the-art topic models like LDA, DMM, and other graphical models tend to perform poorly on social media data. We are currently working on a topic model that does not rely on assuming a probability distribution over the words in a data set in order to discover topics. Our hope is to have an online, temporal version of a social media topic model ready to go before the 2020 Presidential Election cycle gets into full swing.
A huge problem that we encounter with social media data is the relentless noisiness of these relatively short documents. The noise inhibits even the best topic models. We are currently looking into how different preprocessing methods affect the accuracy of topic models, and hope to produce a preprocessing methodology and set of best practices for dealing with social media text. We also hope to expand on my thesis research with a generative temporal topic model for noisy mediums, as opposed to a graph-based approach.
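To make the preprocessing comparison concrete, here is a minimal sketch of one pipeline of the kind we evaluate. The specific steps and stopword list are illustrative assumptions, not our finalized methodology:

```python
import re

def preprocess_tweet(text, stopwords=frozenset({"the", "a", "an", "rt"})):
    """One example preprocessing pipeline: strip URLs, @mentions, and
    punctuation, lowercase, then drop stopwords and single characters."""
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove @mentions
    text = re.sub(r"[^a-zA-Z#\s]", " ", text)   # keep letters and hashtags
    return [t for t in text.lower().split() if t not in stopwords and len(t) > 1]
```

Swapping in or out individual steps like these (URL stripping, hashtag retention, stopword removal) is exactly the kind of variation whose effect on topic accuracy we measure.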
Thesis Abstract: In the modern era, data is being created faster than ever. Social media, in particular, churns out hundreds of millions of short documents a day. It would be useful to understand the underlying topics being discussed on popular channels of social media, and how those discussions evolve over time. There exist state-of-the-art topic models that accurately classify texts large and small, but few attempt to follow topics through time, and many are adversely affected by the large amount of noise in social media documents. We propose Topic Flow Model (TFM), a graph-theoretic temporal topic model that identifies topics as they emerge, and tracks them through time as they persist, diminish, and re-emerge. TFM identifies topic words by capturing the changing relationship strength of words over time, and offers solutions for dealing with flood words, i.e., domain-specific words that pollute topics. We conduct an extensive empirical analysis of TFM on Twitter data, newspaper articles, and synthetic data and find that the topic accuracy and signal-to-noise ratio are better than state-of-the-art methods. The thesis can be found here: paper.
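The core signal TFM builds on, the changing relationship strength of word pairs over time, can be illustrated with a toy sketch. This is not the actual TFM algorithm, just a simplified demonstration of the underlying idea using document co-occurrence counts as edge weights; the threshold `min_gain` is an illustrative parameter:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(docs):
    """Edge weight for a word pair = number of documents containing both."""
    weights = Counter()
    for doc in docs:
        for pair in combinations(sorted(set(doc)), 2):
            weights[pair] += 1
    return weights

def strengthening_pairs(prev_docs, curr_docs, min_gain=2):
    """Word pairs whose relationship strength grew between two time steps."""
    prev, curr = cooccurrence_graph(prev_docs), cooccurrence_graph(curr_docs)
    return {p: curr[p] - prev[p] for p in curr if curr[p] - prev[p] >= min_gain}
```

Pairs whose edge weight strengthens sharply between time steps are candidates for an emerging topic, while pairs whose weight decays suggest a diminishing one.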
This project was inspired by Sanaz Bahargam, a Ph.D. candidate at Boston University. The film industry is volatile and unpredictable. It's very hard to be consistently successful, and many producers lose a lot of money trying to create a blockbuster. With that in mind, are there communities of actors and directors in Hollywood who can virtually guarantee a success? To test this, I used a large dataset from IMDB, composed almost exclusively of movies made in Hollywood (a few were from England and other parts of the world). I uncovered small communities in the IMDB dataset using spectral analysis, and analyzed each community in turn. A community was considered productive in ratings and/or profits if its average movie rating or profit exceeded the overall average by more than one standard deviation. Out of roughly five hundred communities, only thirteen were productive in both ratings and profits.
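The productivity criterion is simple to state in code. A minimal sketch, assuming per-community averages have already been computed (the input format here is hypothetical):

```python
import statistics

def productive_communities(stats):
    """stats: {community_id: (avg_rating, avg_profit)}.
    A community is productive on a metric if its average exceeds the
    overall mean of that metric plus one standard deviation."""
    ratings = [r for r, _ in stats.values()]
    profits = [p for _, p in stats.values()]
    r_cut = statistics.mean(ratings) + statistics.stdev(ratings)
    p_cut = statistics.mean(profits) + statistics.stdev(profits)
    return [cid for cid, (r, p) in stats.items() if r > r_cut and p > p_cut]
```

With mean-plus-one-standard-deviation cutoffs on both metrics, very few communities clear both bars, which is consistent with only thirteen of roughly five hundred qualifying.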
In this project I created a system to store and rank authors in the DBLP database. The goal was to find the best potential peer-reviewers for papers and grant proposals for any given subject. We combined ranking algorithms with a specialized max-k coverage algorithm to achieve our goal of ranking each author dynamically for each query to the system. You can see every messy bit of code that went into creating it on my Github.
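The standard greedy approximation for max-k coverage gives the flavor of the reviewer-selection step. The actual system combined this with ranking and a specialized variant, so treat this as a sketch under simplified assumptions (each author reduced to a set of expertise topics):

```python
def greedy_max_k_coverage(candidates, topics, k):
    """candidates: {author: set of expertise topics}.
    Greedily pick up to k authors covering the most query topics."""
    remaining, chosen = set(topics), []
    pool = dict(candidates)
    for _ in range(k):
        if not remaining or not pool:
            break
        best = max(pool, key=lambda a: len(pool[a] & remaining))
        if not pool[best] & remaining:
            break  # no candidate covers anything new
        chosen.append(best)
        remaining -= pool.pop(best)
    return chosen
```

The greedy rule, always take the author covering the most still-uncovered topics, achieves the classic (1 - 1/e) approximation guarantee for this NP-hard problem.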
This project was inspired by the team-formation problem, specifically forming groups of students such that individual learning is optimal. A random grouping of students can achieve some level of success, but getting as close to optimal as possible takes a lot more effort. From an abstract point of view, the goal is to maximize the score of a cluster. In the past, it was shown that you can achieve a high score for a single cluster, but nobody had attempted to achieve a clustering in which every cluster scores high and the scores are even across clusters. The problem is NP-Hard, so I came up with an approximation heuristic that beats the random grouping: it maximizes the score of each individual cluster while minimizing the standard deviation of the cluster scores.
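One simple way to realize that objective is a swap-based local search. This is an illustrative sketch, not my actual heuristic: the combined objective (total score minus a weighted standard-deviation penalty) and the `score_fn` abstraction are assumptions made for the example:

```python
import statistics
from itertools import product

def balanced_objective(clusters, score_fn, balance_weight=1.0):
    """Total cluster score minus a penalty for uneven cluster scores."""
    scores = [score_fn(c) for c in clusters]
    return sum(scores) - balance_weight * statistics.stdev(scores)

def improve_by_swaps(clusters, score_fn, passes=10):
    """Greedy local search: keep any member swap between two clusters
    that improves the balanced objective; stop when none helps."""
    best = balanced_objective(clusters, score_fn)
    for _ in range(passes):
        improved = False
        for i, j in product(range(len(clusters)), repeat=2):
            if i >= j:
                continue
            for a in range(len(clusters[i])):
                for b in range(len(clusters[j])):
                    clusters[i][a], clusters[j][b] = clusters[j][b], clusters[i][a]
                    cand = balanced_objective(clusters, score_fn)
                    if cand > best:
                        best, improved = cand, True
                    else:  # swap didn't help, undo it
                        clusters[i][a], clusters[j][b] = clusters[j][b], clusters[i][a]
        if not improved:
            break
    return clusters, best
```

Because the standard-deviation term penalizes any cluster falling behind, the search is pushed toward groupings that are both high-scoring and even.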
I sometimes work on projects that have nothing to do with anything in particular. They don't fall into the category of research, and they might fall into the category of academics, but they almost certainly fall into one or more of the following categories:
To navigate to any project, just click the title of that project.
The guys at Spotfund wanted a real-time dashboard to see all of their app's metrics, especially Users, Donations, and every possible way to look at those two stats. Instead of creating a chart for each specific slice of data, I created a generic, reusable chart that could display each field we care about. That's not cool... What was cool about it was that it could seamlessly switch between each field and date range with minimal calls to the API. I shared the code on GitHub (with Spotfund's permission, of course), so that anyone can use the chart for their own purposes.
Here's what it looks like with minimal styling effort:
I found out that Netflix has a genre feature that lets you view all the movies in a (very) specific genre, all on one page. I quickly acquired (through a bit of scraping) a couple hundred of the best genre codes, and made a little app that lets you spin a wheel and get a random genre to watch. Give the wheel a spin by clicking on the title.
Some fellow graduate students and I decided that it would be interesting to see whether there was any location-based bias on Yelp and TripAdvisor, specifically for restaurants. We chose restaurants because they seemed to be the only entities on the sites that both locals and visitors would regularly use. We thought that locals would be more positive, you know, because support your local businesses and stuff. We were wrong. Check it all out by clicking on the title.
This Fall (2019), I am the TA for Data Science. In the Spring 2016 semester, I was the TA for Introduction to Databases. In the Fall of 2016, I was the TA for Data Analytics. I have also helped teach at the GU Women Coders events since 2016. All of the above classes were taught by Lisa Singh.
The interesting ones, in reverse chronological order.