Here’s Part 2 of my walkthrough of the topic modeling process that makes up one part of the Freedom of Speech* project. (Part 1 here.) I’ll cover how I improved the topic model after showing the initial results to my group members.
At this point, I’ve run three topic models with the goal of increasing the clarity and specificity of results each time. The number of topics seems about right, so my main focus has been on removing words that didn’t help the topics tell us anything about the cases. For example, here are the topics from the first model:

It was an exciting start, particularly since topics like 1, 10, and 11 immediately speak to (un)protected speech themes we’ve been talking about for weeks: broadcasting/advertising, obscenity, and communism/McCarthyism. Another part of the fun is seeing words that expand or clarify our understanding of a topic. It’s not surprising that the words “foreign” and “control” are in the communism topic, or that the sexual/obscenity topic includes the word “children.” It does, however, help to solidify our understanding of the motivations behind these battlegrounds: communism in speech matters because the U.S. government cares about the impacts of foreign influence; obscenity in speech matters because the U.S. government cares about protecting the rights (and souls) of children.
On the other hand, topics like 7 and 14 are essentially useless for telling us about themes in the cases, since for the most part they just include high-frequency court-related words (“plaintiff”, “justice”, “court”, “district”, etc.) that aren’t on a general-purpose stop word list. Topic 15 also includes several Justices’ names: Blackmun, Brennan, and Rehnquist. I took these words out and ran the model again:
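To make the word-removal step concrete, here is a minimal sketch of filtering a domain-specific stop word list out of tokenized text before modeling. It is not the code I actually ran: the names `COURT_STOP_WORDS` and `remove_stop_words` are made up for illustration, and the list contains only the words mentioned above.

```python
# Hypothetical domain-specific stop words, taken from the examples in this post.
# A real list would be longer and would sit alongside a general-purpose list.
COURT_STOP_WORDS = {
    "plaintiff", "justice", "court", "district",
    "blackmun", "brennan", "rehnquist",
}


def remove_stop_words(tokens, stop_words=COURT_STOP_WORDS):
    """Return the tokens that are not in stop_words (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Invented sample tokens, just to show the filter in action.
sample = ["The", "Court", "held", "the", "broadcast", "obscene",
          "Brennan", "dissenting"]
cleaned = remove_stop_words(sample)
print(cleaned)
```

The filtered tokens would then go back into whatever topic modeling tool is in use, so the slots those words occupied in each topic are free for more descriptive terms.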

These results were better, particularly since every “unhelpful” word we remove from the model makes room for a more interesting one. As a perfect example, I took out the word “statute” and it was replaced in the obscenity topic (now called V18) with the word “minors,” a much more descriptive word for that topic. Topic 10 gives a clearer picture of broadcasting regulations than its corresponding Topic 1 in the first model.
New topics also appear: Topic 16 shows a new topic about fraud/soliciting/telemarketing. Topic 17 brings together the words “flag”, “symbol”, “peace” and “group.”
BUT, another limitation of the first and second models arises: Joanne pointed out that Topic 14 contains court-specific language that makes an interesting group of “court verbs” but doesn’t help us with thematic topics. So, one more time! Here are the topics in the third model:

Topic 11 shows us a “libel” topic, an important battleground of free speech that was missing from earlier models. Topic 13 also brings out a new thread with “university”, “students”, “message”, and “viewpoint”.
There are some more words that could be taken out (“John” and “jr”, and what is “FALSE”??), but for now, this is the dataset! Each topic is represented in some percentage (often 0) in every case, so the dataset we’ll use for describing whether a case is about obscenity includes that information. We’ll have to see what the threshold is, i.e., if a case is 30% obscenity topic, does that make it an obscenity case? What about 50%? That’s a task for this week, and one I’m excited to share with our group as well. (Subtext: there is also a boatload of web development work to do and I’m grateful we can share the load of this data work yayyyy)
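The threshold question can be sketched in a few lines. This is only an illustration under assumptions: the case names and topic shares below are invented, and `label_cases` is a hypothetical helper, not part of our dataset or pipeline.

```python
def label_cases(topic_shares, topic, threshold):
    """Return the cases whose share of `topic` is at least `threshold`.

    topic_shares maps a case name to a dict of {topic name: share between 0 and 1}.
    A topic missing from a case's dict is treated as a share of 0.
    """
    return [
        case
        for case, shares in topic_shares.items()
        if shares.get(topic, 0.0) >= threshold
    ]


# Invented example data: each case is some mix of topics.
shares = {
    "case_a": {"obscenity": 0.55, "broadcasting": 0.10},
    "case_b": {"obscenity": 0.32, "libel": 0.40},
    "case_c": {"libel": 0.90},
}

at_30 = label_cases(shares, "obscenity", 0.30)
at_50 = label_cases(shares, "obscenity", 0.50)
print(at_30, at_50)
```

Running the same function at a few different thresholds and comparing which cases move in and out of the “obscenity” group is one simple way to see how sensitive the labels are to that choice.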



Fascinating, Eva! I can’t wait to see Thursday’s presentation.