Topic Modeling – part 1 | DHUM 70002 Digital Humanities: Methods and Practices (Spring 2021)

I’m going to try to offer an accessible explanation of the topic modeling I’ve been doing over the last couple of weeks. First, a massive shoutout to Joanne for scraping the dataset for us, to Martin and Kevin for their help with the cleaning/sorting/tagging, and to Micki for her suggestion to topic model just the syllabi (the “syllabus” is basically an abstract for each case) instead of the full text of each court decision. The data gathering/production has been an awesome group effort, and the topic model just feels like one piece where I’m steering the ship.

This will be Part 1, covering what LDA topic modeling is, and my first attempt at doing it on this dataset. I’ll link to the R code once we finalize which parts of our project are public and private. For now it lives only in our private data repository, but I’m happy to share the code with anyone privately if that’s of interest. Part 2 will cover how I tweaked the model to improve the results.

So, what’s a topic model? The one I’m using is Matt Jockers’ LDA model, which he explains with a lunch buffet metaphor in this blog post. I’ll be honest… it’s not my favorite, but worth a read for a first taste. I’ll see if I can do any better here, though, by walking through my experience.

Pre-firstly, I do some basic text cleaning. I take out all the punctuation, the numbers, and the extra whitespace, and make all letters lowercase (otherwise “taco” and “taco,” and “Taco” and “TACO!” are all different words). I also remove common words, based on a pre-determined, widely recognized list of stop words. For this project, if I leave in words like “and”, “because”, and “he”, it will be hard to get clarity about words that are more interesting to me, like “communism” or “obscenity.”
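For anyone who thinks better in code, here is a minimal sketch of that cleaning step in Python (not the R code used for this project). The function name `clean_text` and the tiny stop word list are made up for illustration; a real run would use a fuller, standard stop word list.

```python
import re

# A deliberately tiny, illustrative stop word list (an assumption for this sketch).
STOP_WORDS = {"and", "because", "he", "the", "a", "of", "is", "to", "in"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation and digits, collapse whitespace, drop stop words."""
    text = text.lower()
    # Replace anything that is not a-z or whitespace with a space, so that
    # "taco," and "TACO!" both end up as "taco". Note this also drops accented letters.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments collapses runs of whitespace.
    tokens = text.split()
    return [tok for tok in tokens if tok not in stop_words]

print(clean_text("Taco, taco and TACO! He ate 3 tacos because of the obscenity."))
```

The result is a plain list of lowercase words per document, which is the shape the later steps assume.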

Firstly, I set some parameters for the model. (After that I basically just run a couple lines of code and then wait a few minutes for the computer to execute a series of computations, so “understanding topic modeling” is largely about knowing what the parameters mean.) One parameter I set is how many topics I want the model to return. This step is, as far as I can tell, a bit soft — i.e. I have a feeling that 20 is a more appropriate number of topics than 5 or 100, because Joanne and I brainstormed the topics we saw in the cases and came up with about 12-15. I want to give the model enough breathing room that it comes up with different, distinct topics (for example, to separate “school” and “flag” as two topics) but not so much that we lose the thread of a common idea (I don’t want “school” and “teacher” in two different topics). I started with a number that felt right, 20, knowing that I could adjust after looking at the topics I got.

Another parameter I set is how many times I want the model to sample over the topics. (This is the Gibbs Sampling part of Jockers’ blog post, FYI.) The model starts by taking all the non-stop words that occur at least 5 times across the whole corpus, and dumping that list into 20 bins in 20 different random orders. This is the first iteration of our 20 topics: 20 groups of the same words, but randomly ordered as if the words were all equally unrelated. The topics will eventually be ranked from top to bottom, so words at the top are more “important” and words at the bottom are less important in that topic. Every word in that vocabulary is present in each topic; the words are just ordered by the statistical likelihood that they occur together. In the end, I’ll take the top 10 words from each topic, but I could also examine the top 50 or top 200 if that would be useful to me.

Back to our 20 random buckets: because I set this parameter to 5,000, the model basically “re-creates” the topics 5,000 times, decreasing the amount of randomness with each iteration. Randomness is decreased by examining one word at a time, in the context of its original document (“which other words are present in that syllabus?”) and its current topic (“which other words are important in this topic?”).
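To make the “start random, then re-sample one word at a time” idea concrete, here is a toy collapsed Gibbs sampler in plain Python. It is a sketch of one common way LDA is implemented, not the code or library behind this project: each word occurrence starts with a random topic and is then re-assigned by looking at its document and at the topic’s current words. The function name `gibbs_lda`, the smoothing values `alpha` and `beta`, and the toy documents are all assumptions chosen for illustration.

```python
import random

def gibbs_lda(docs, num_topics, iterations, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]  # tokens in doc d assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics  # total tokens assigned to each topic
    assignments = []

    # Start from pure randomness: every word occurrence gets a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take this word occurrence out of its current topic...
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # ...then weigh each topic by "how much of this document is in the
                # topic?" times "how strongly does this word belong to the topic?"
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word

# Invented toy "syllabi", already cleaned and tokenized.
toy_docs = [
    ["communist", "party", "soviet", "party", "control"],
    ["school", "teacher", "flag", "school"],
    ["communist", "soviet", "foreign", "control"],
    ["teacher", "school", "flag", "flag"],
]
doc_topic, topic_word = gibbs_lda(toy_docs, num_topics=2, iterations=200)
print(doc_topic)
```

The two returned tables are the raw material for everything that follows: `topic_word` says which words matter in each topic, and `doc_topic` says how much of each document belongs to each topic.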

So… the model eventually ends up computing that, for instance, the word “communist” co-occurs disproportionately often with the words “party”, “foreign”, “control”, and “soviet,” and those all rise to the top of one of the topics. I can export a spreadsheet that has 20 columns, each with the top 10 words in the topic. Here’s a selection of the topics my first model returned:

[Image: spreadsheet of 6 different topics listed out]
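Pulling the “top 10 words” for each topic is just a sort. Here is a small Python sketch, assuming each topic is stored as a dictionary of word counts; the function name `top_words` and the counts are invented for illustration and are not output from my model.

```python
def top_words(topic_word_counts, n=10):
    """Return the n highest-count words for each topic, most important first."""
    return [
        # Sort by count (descending), breaking ties alphabetically.
        [word for word, _ in sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]]
        for counts in topic_word_counts
    ]

# Hypothetical counts for two topics, made up purely for this example.
toy_counts = [
    {"communist": 40, "party": 31, "soviet": 18, "school": 1, "flag": 0},
    {"communist": 0, "party": 2, "soviet": 0, "school": 27, "flag": 22},
]
for k, words in enumerate(top_words(toy_counts, n=3), start=1):
    print(f"Topic {k}: {', '.join(words)}")
```

Changing `n` is all it takes to look at the top 50 or top 200 words instead.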

A human could read a case syllabus and tell you that that particular freedom of speech case is about communism and foreign policy; the computer can compare its words to all the other words it could have had, but didn’t, and tell you that it has a relatively high proportion of Topic 11.

This brings us to the next step. At this point, we have 20 topics. We also want to know how those topics are distributed over the 573 case syllabi. Not every case will fit perfectly into one of those 20 topics, of course, and some may be a mix of multiple topics. That’s totally fine, and I can make a second spreadsheet to explore that. This one is much bigger: each column still represents one topic, so I have one “obscenity” column, one “communism” column, up to my 20 columns. Now, though, each row is for one of the 573 case syllabi, and each cell in the row is the proportion of the case taken up by the corresponding topic. Each row adds up to 1 (i.e. 100%), giving us the distribution for each case across 20 topics.
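The “each row adds up to 1” part is simple arithmetic: divide each topic’s count in a document by that document’s total. Here is a minimal Python sketch, assuming we already have a per-document count of words assigned to each topic; `topic_proportions` and the numbers are hypothetical.

```python
def topic_proportions(doc_topic_counts):
    """Turn per-document topic counts into rows of proportions that each sum to 1."""
    rows = []
    for counts in doc_topic_counts:
        total = sum(counts)
        if total == 0:
            # An empty document carries no information; spread it evenly.
            rows.append([1 / len(counts)] * len(counts))
        else:
            rows.append([count / total for count in counts])
    return rows

# Made-up counts: 2 documents (rows) by 3 topics (columns).
toy_doc_topic = [
    [8, 2, 0],
    [1, 1, 2],
]
for row in topic_proportions(toy_doc_topic):
    print([round(p, 2) for p in row])
```

In the first toy row, 8 of the 10 words sit in the first topic, so that cell comes out as 0.8.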

[Image: spreadsheet showing relative percentages of sample topics in several rows of syllabi]

Here, we can see a few cases that could be labeled as “obscenity” or “communism” cases, because they have relatively large proportions of those topics. It’s possible to use this data to figure out if two topics consistently occur together, or if a topic ebbs and flows in importance over time. (For example, in the screenshot above, all of those cases register a 0 for the “university” topic. That is an important speech topic, but it’s not “popular” until later cases.)
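Both of those questions can be asked of the distribution spreadsheet with very little code. The Python sketch below is one possible approach, not an analysis I have run: it uses a Pearson correlation between two topic columns to check whether they tend to occur together, and a per-decade average to see a topic rise or fall. The years and proportions are invented, and the function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no meaningful correlation
    return cov / sqrt(var_x * var_y)

def mean_by_decade(years, proportions):
    """Average a topic's proportion across the cases decided in each decade."""
    buckets = defaultdict(list)
    for year, p in zip(years, proportions):
        buckets[year // 10 * 10].append(p)
    return {decade: sum(vals) / len(vals) for decade, vals in sorted(buckets.items())}

# Invented example data: five cases, two topic columns.
years = [1951, 1957, 1969, 1972, 1978]
university = [0.0, 0.0, 0.1, 0.3, 0.5]
communism = [0.6, 0.5, 0.2, 0.0, 0.1]

print(pearson(university, communism))     # negative: one topic fades as the other grows
print(mean_by_decade(years, university))  # the "university" topic rising over time
```

A correlation near +1 would suggest two topics travel together, and a value near -1 that one tends to appear where the other does not.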

These topics and topic distribution are a promising start, but Topics 14 and 15 in particular tell me that I can make the model better. That will be part 2!