Author Archives: Eva Sibinga

Post-showcase FOS* reflection | personal blog

Now that the showcase is done and our papers are handed in, my reflections on the Freedom of Speech* project are taking a different form. The intensely action-oriented thoughts and feelings of the last six weeks were marked by a combination of stress and optimism: there was so much work to be done, but I always found my self-talk leaning towards “it’s going to happen” (until the last day, when it switched to a tired but satisfied-enough “it’s going to be what it is”).

Now, without the deadline of a presentation and audience looming, I’m much less stressed about the project. But without the motivation of that deadline, I’m less optimistic about its future. As a recovering perfectionist, I am now okay with the idea that a project can be useful and valid without functioning 100% seamlessly. Still, I want the tool to work as well as possible so that people can use it without encountering faulty information or frustrating glitches, and that’s definitely not the case yet. It’s hard to imagine going back to it, though, without my whole group there to share in the process.

It’s led me to wonder if the site is a beta version or a rough draft. To me, calling it a beta version implies that the same site will be reworked and updated, and the final product would look and function similarly, but at a higher level. Calling it a rough draft (which is increasingly how I feel about the Explore page) opens the door to many more revisions, not just functionality upgrades. For one thing, I’m not nearly as excited about the topic model results now as I was when I first ran the model, and I wonder if they really add much value to the Explore page. The topic model proved a fun exploratory analysis tool, but I think its merits are more limited than I first imagined, especially in the face of the manual data grouping that Martin and Joanne did to make the eras on the timeline.

Thinking about that now brings me back to my first semester in this program — I took a GIS course at Hunter College (Intro to Cartography and Geovisualization), and my major takeaway was that good maps are 10% GIS and 90% context and design. The actual “truth” of the coordinates and GIS layers is obviously important to get right, but the vast majority of communicating the argument or point of a map comes from everything else: colors, labels, statistical breaks, symbology, and maybe most importantly, context. That’s how I’m feeling about the topic model: however cool or interesting the results are, they’re probably only about 10% of the way towards making a good end product. I didn’t leave enough time to build out that context or consider how I would lead a user through the topic model, and in the end, I find Martin’s context-rich descriptions of the eras of free speech to be a much more compelling part of the site.

I think the topic model has potential, but implementing it in a way that really helps users to gain new knowledge about freedom of speech cases would require, at the least:

  • including the full 10-word topic somewhere, not just a representative title, so that users can see the words that make up a topic
  • including a quick definition of what topic modeling is
  • omitting the topic model data point for cases where it adds little to no value, as in cases where the top topic is a mix of unspecific words (this is true of a lot of cases) or cases that don’t have a strong one or two topics
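The third point in the list above could be sketched roughly as follows. This is only an illustration, not the site's code: the function name `topic_to_display`, the 0.4 cutoff, and the "court procedure" label are all hypothetical choices made for the example.

```python
from typing import Optional

# Hypothetical labels for topics made up mostly of unspecific court vocabulary.
GENERIC_TOPICS = {"court procedure"}


def topic_to_display(shares: dict[str, float], min_share: float = 0.4) -> Optional[str]:
    """Return the label of a case's top topic, or None when it is weak or generic."""
    if not shares:
        return None
    label, share = max(shares.items(), key=lambda item: item[1])
    if share < min_share or label in GENERIC_TOPICS:
        return None  # better to show nothing than a misleading label
    return label


# Made-up topic shares for three imaginary cases.
strong_case = {"obscenity": 0.62, "broadcasting": 0.20, "court procedure": 0.18}
mixed_case = {"obscenity": 0.30, "broadcasting": 0.35, "court procedure": 0.35}
generic_case = {"court procedure": 0.70, "libel": 0.30}
```

With these made-up numbers, only `strong_case` would get a topic label; the other two would simply omit the data point.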

Those are my thoughts for the future of the Explore page topic model, which in truth probably isn’t even top of my list to fix up (looking at you, wrong case showing up in the case modal). As far as deadlines go… I’m definitely taking the coming week off. Five fully online courses later, I maybe feel more beat than I ever have at the end of a school year and am 100% ready for the semester to be over: for some summer, some days outside without opening my computer, and some sleep.

After that, maybe I’ll hit up our group’s Discord server to run my tech edits by them, set a one week deadline, and see if I can squeeze a few more days of “it’s going to happen” energy into the project.

Catchup blog #1

I think I have 5 personal blogs left to write, which pretty much coincides with having built an almost-fully functional website in the last… 3 weeks. I honestly haven’t had much to write since my topic modeling posts — the update each week is that I am neck-deep in coding, which is a state that I alternately love and hate.

I hate it because it involves hours and hours of sitting, often so engaged in a problem that I forget to move or drink water for an unreasonable amount of time. I hate it because it leaves my body restless and my brain knackered. I hate it because the emotional landscape sometimes involves great swathes of frustration with just pinpricks of triumph before I turn to the next tangled problem.

I love it, though, because it really is like learning a new language. It’s a language of functionality and precision, and of breaking the problem I want to solve down into a set of tasks that a computer can accomplish. I’ve enjoyed learning over the course of my degree that there’s rarely just one way to do that. There may be one way that’s the most performant, one that’s the most mobile-friendly, one the most visually pleasing, or one the easiest. Beyond that, there may be one way that accomplishes what you really intend and one that seems logically sound but ultimately fails (for example, select all the women in this dataset and select all the not men in this dataset are equally easy tasks for a computer but certainly not equal questions for a researcher — intention is key, as is a data structure that allows you to ask exactly the question you want).
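To make the "women" versus "not men" example concrete, here is a tiny Python sketch. The records and the `gender` field are invented for illustration; the point is only that the two filters are equally easy to write and return different groups.

```python
# Made-up records; None stands for a gender that was never recorded.
people = [
    {"name": "A", "gender": "woman"},
    {"name": "B", "gender": "man"},
    {"name": "C", "gender": "nonbinary"},
    {"name": "D", "gender": None},
]

# "Select all the women": only records explicitly coded as women.
women = [p["name"] for p in people if p["gender"] == "woman"]

# "Select all the not men": also sweeps in nonbinary people and missing data.
not_men = [p["name"] for p in people if p["gender"] != "man"]
```

Here `women` holds one name and `not_men` holds three, so which filter is "right" depends entirely on the question being asked.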

Sometimes these different “best ways to break down a task” overlap, sometimes they don’t. Sometimes I have the knowledge and bandwidth to make an informed decision or improve on an old way of doing things; other times I’m so relieved to hack out a way that just WORKS, I don’t even think about the others. That’s definitely a plus one for collaboration, since looking at other people’s code often teaches me about the ways I haven’t chosen.

I love coding because I get to not only think about all of that, but actually do it. I hope it never gets old for me to write out text commands and see them bring dynamic shapes and colors and movement and information to life online.

FOS* Group Project update: crunch edition

This group project update from last week is coming in not-so-hot because we’re working hard to get everything squared away for Thursday! There may have been some scope creep, and some of us are wondering for the second year in a row what happened to April…

We met all together on Monday to finalize presentation details and discuss feedback from the class practice session. Kevin has been working hard on the presentation and a few last minute asks for the website (like a little courthouse icon to denote landmark cases). Martin has been helping with the presentation script and with some important last minute manual data work. Joanne is continuing to work on making our text data perfectly ready for web display, as well as working on the map and timeline. And Eva is working on the sort/filter/display functions for the website’s explore page.

Topic Modeling – part 2

Here’s Part 2 of my walkthrough of the topic modeling process that makes up one part of the Freedom of Speech* project. (Part 1 here.) I’ll cover how I improved the topic model after showing the initial results to my group members.

At this point, I’ve run three topic models with the goal of increasing the clarity and specificity of results each time. The number of topics seems about right, and my main focus was on removing words that didn’t contribute to the topics as knowledge-producing documents. For example, here are the topics from the first model:

Topics from the 1st model

It was an exciting start, particularly since topics like 1, 10, and 11 immediately speak to (un)protected speech themes we’ve been talking about for weeks: broadcasting/advertising, obscenity, and communism/McCarthyism. Another part of the fun is also seeing words that expand or clarify our understanding of a topic. It’s not surprising that the words “foreign” and “control” are in the communism topic, or that the sexual/obscenity topic includes the word “children.” It does, however, help to solidify our understanding of the motivations of these battlegrounds: communism in speech matters because the U.S. government cares about the impacts of foreign influence; obscenity in speech matters because the U.S. government cares about protecting the rights (and souls) of children.

On the other hand, topics like 7 and 14 are essentially useless for telling us about themes in the cases, since for the most part they just include high-incidence court-related words (“plaintiff”, “justice”, “court”, “district”, etc.) that aren’t on a general purpose stop word list. Topic 15 also includes several Justices’ names: Blackmun, Brennan, and Rehnquist. I took these words out and ran the model again:

Topic results for the 2nd model.
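Removing court-specific words before re-running a model could look something like the sketch below. The project's actual code is in R and isn't shown in this post, so this Python version is an assumption-laden illustration: the set contains only the words named above, and `drop_custom_stop_words` is a hypothetical helper.

```python
# Only the words named in this post; the real list was presumably longer.
COURT_STOP_WORDS = {
    "plaintiff", "justice", "court", "district",
    "blackmun", "brennan", "rehnquist",
}


def drop_custom_stop_words(tokens, extra_stop_words=COURT_STOP_WORDS):
    """Remove domain-specific stop words, ignoring capitalization."""
    return [t for t in tokens if t.lower() not in extra_stop_words]


tokens = ["court", "held", "obscenity", "statute", "Brennan", "minors"]
kept = drop_custom_stop_words(tokens)
```

Each later round of the model would just add more words to the set (for example "statute") and run the same filter again.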

These results were better, particularly since every “unhelpful” word we remove from the model makes room for a more interesting one. For a perfect example, I took out the word “statute” and it’s replaced in the obscenity topic (now called V18) with the word “minors,” a much more descriptive word for that topic. Topic 10 distills a clearer picture of the topic about broadcasting regulations, compared to its corresponding Topic 1 in the first model.

New topics also appear: Topic 16 shows a new topic about fraud/soliciting/telemarketing. Topic 17 brings together the words “flag”, “symbol”, “peace” and “group.”

BUT, another limitation of the first and second models arises: Joanne pointed out that Topic 14 contains court-specific language that makes an interesting group of “court verbs” but doesn’t help us with thematic topics. So, one more time! Here are the topics in the third model:

3rd set of topics

Topic 11 shows us a “libel” topic, an important battleground of free speech that was missing from earlier models. Topic 13 also brings out a new thread with “university”, “students”, “message”, and “viewpoint”.

There are some more words that could be taken out (“John” and “jr”, and what is “FALSE”??), but for now, this is the dataset! Each topic is represented in some percentage (often 0) in every case, so the dataset we’ll use for describing whether a case is about obscenity includes that information. We’ll have to see what the threshold is, i.e., if a case is 30% obscenity topic, does that make it an obscenity case? What about 50%? That’s a task for this week, and one I’m excited to share with our group as well. (Subtext: there is also a boatload of web development work to do and I’m grateful we can share the load of this data work yayyyy)
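One way to explore the threshold question is to count which cases qualify at different cutoffs. A minimal sketch, assuming the per-case topic shares are available as a dictionary; the case names and numbers here are placeholders, not real results.

```python
def cases_at_threshold(case_shares, topic, threshold):
    """List the cases whose share of `topic` is at least `threshold`."""
    return [
        name
        for name, shares in case_shares.items()
        if shares.get(topic, 0.0) >= threshold
    ]


# Placeholder data: three imaginary cases and their obscenity-topic shares.
case_shares = {
    "Case A": {"obscenity": 0.55},
    "Case B": {"obscenity": 0.32},
    "Case C": {"obscenity": 0.0},
}

at_30 = cases_at_threshold(case_shares, "obscenity", 0.30)
at_50 = cases_at_threshold(case_shares, "obscenity", 0.50)
```

Comparing `at_30` and `at_50` against a human reading of the same cases is one way to judge which cutoff matches what "an obscenity case" should mean.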

Topic Modeling – part 1

I’m going to try to offer an accessible explanation of the topic modeling I’ve been doing over the last couple of weeks. First, a massive shoutout to Joanne for scraping the dataset for us, to Martin and Kevin for their help with the cleaning/sorting/tagging, and to Micki for her suggestion to topic model just the syllabi (the “syllabus” is basically an abstract for each case) instead of the full text of each court decision. The data gathering/production has been an awesome group effort, and the topic model just feels like one piece where I’m steering the ship.

This will be Part 1, covering what LDA topic modeling is, and my first attempt at doing it on this dataset. I’ll link to the R code once we finalize which parts of our project are public and private — for now that’s only in our private data repository but I’m happy to share the code with anyone privately if that’s of interest. Part 2 will cover how I tweaked the model to improve the results.

So, what’s a topic model? The one I’m using is Matt Jockers’ LDA model, which he explains with a lunch buffet metaphor in this blog post. I’ll be honest… it’s not my favorite, but worth a read for a first taste. I’ll see if I can do any better here, though, by walking through my experience.

Pre-firstly, I do some basic text cleaning. I take out all the punctuation, the numbers, and the extra whitespace, and make all letters lowercase (otherwise “taco” and “taco,” and “Taco” and “TACO!” are all different words). I also remove common words, based on a pre-determined/widely recognized list of stop words. For this project, if I leave in words like “and”, “because”, and “he”, it will be hard to get clarity about words that are more interesting to me, like “communism” or “obscenity.”
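The cleaning steps just described can be sketched in a few lines. This is a Python illustration rather than the R code used for the project, and the stop word list is a tiny stand-in for a real general-purpose list.

```python
import re

# A tiny illustrative stop word list; a real one has hundreds of entries.
STOP_WORDS = {"and", "because", "he", "the", "a", "of", "to", "is"}


def clean_text(text: str, stop_words=STOP_WORDS) -> list[str]:
    """Lowercase, strip punctuation and numbers, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    tokens = text.split()
    return [t for t in tokens if t not in stop_words]


tokens = clean_text("Taco, taco and TACO! 12 tacos")
```

After cleaning, the three spellings of "taco" are the same token, while "tacos" stays separate because this sketch does no stemming.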

Firstly, I set some parameters for the model. (After that I basically just run a couple lines of code and then wait a few minutes for the computer to execute a series of computations, so “understanding topic modeling” is largely about knowing what the parameters mean.) One parameter I set is how many topics I want the model to return. This step is, as far as I can tell, a bit soft — i.e. I have a feeling that 20 is a more appropriate number of topics than 5 or 100, because Joanne and I brainstormed the topics we saw in the cases and came up with about 12-15. I want to give the model enough breathing room that it comes up with different, distinct topics (for example, to separate “school” and “flag” as two topics) but not so much that we lose the thread of a common idea (I don’t want “school” and “teacher” in two different topics). I started with a number that felt right, 20, knowing that I could adjust after looking at the topics I got.

Another parameter I set is how many times I want the model to sample over the topics. (This is the Gibbs Sampling part of Jockers’ blog post, FYI.) The model starts by taking all the non-stop words that occur at least 5 times across the whole corpus, and dumping that list into 20 bins in 20 different random orders. This is the first iteration of our 20 topics: 20 groups of the same words, but randomly ordered as if the words were all equally unrelated. The topics will eventually be ranked from top to bottom, so words at the top are more “important” and words at the bottom are less important in that topic. All words in the whole corpus are present in each topic; they are just ordered by the statistical likelihood that they occur together. In the end, I’ll end up taking the top 10 words from each topic, but I could also examine the top 50 or top 200 if that would be useful to me.

Back to our 20 random buckets: because I set this parameter to 5,000, the model basically “re-creates” the topics 5,000 times, decreasing the amount of randomness with each iteration. Randomness is decreased by examining one word at a time, in the context of its original document (“which other words are present in that syllabus?”) and its current topic (“which other words are important in this topic?”).
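For readers who want to see the moving parts, here is a toy collapsed Gibbs sampler for LDA in plain Python. It is a sketch under stated assumptions, not the R code or library used for this project: the function name `lda_gibbs`, the smoothing values `alpha` and `beta`, and the three tiny documents are all invented for illustration. `n_topics` plays the role of the 20 topics and `n_iter` the role of the 5,000 iterations.

```python
import random


def lda_gibbs(docs, n_topics, n_iter, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    # Count tables: topics per document, words per topic, total words per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start by assigning every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Take this word occurrence out of the counts...
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # ...then weigh each topic by "how much does this document use
                # the topic" times "how much does the topic use this word".
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


docs = [
    ["communist", "party", "soviet", "party"],
    ["obscenity", "minors", "obscenity"],
    ["party", "soviet", "communist"],
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, n_iter=50)
```

The two questions in the comments mirror the two contexts described above: the word's original document and its current topic. On a corpus this small the resulting topics are not meaningful; the sketch only shows the mechanics.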

So… the model eventually ends up computing that, for instance, the word “communist” co-occurs disproportionately often with the words “party”, “foreign”, “control”, and “soviet,” and those all rise to the top of one of the topics. I can export a spreadsheet that has 20 columns, each with the top 10 words in the topic. Here’s a selection of the topics my first model returned:

spreadsheet of 6 different topics listed out
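Pulling the top words out of one topic is a simple ranking step. A minimal sketch, assuming a topic is stored as a dictionary of word weights; the counts below are invented, and `top_words` is a hypothetical helper rather than a function from the project.

```python
def top_words(topic_word_counts, n=10):
    """Return the n highest-weighted words in a topic (ties broken alphabetically)."""
    ranked = sorted(topic_word_counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Invented weights for one imaginary topic.
communism_topic = {"communist": 9, "party": 7, "soviet": 4, "foreign": 4, "school": 1}
top_three = top_words(communism_topic, n=3)
```

Changing `n` to 50 or 200 gives the longer views of a topic mentioned earlier.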

A human could read a case syllabus and tell you that that particular freedom of speech case is about communism and foreign policy; the computer can compare its words to all the other words it could have had, but didn’t, and tell you that it has a relatively high proportion of Topic 11.

This brings us to the next step. At this point, we have 20 topics — we also want to know how those topics are distributed over the 573 case syllabi. Not every case will fit perfectly into one of those 20 topics, of course, and some may be a mix of multiple topics. That’s totally fine, and I can make a second spreadsheet to explore that. This one is much bigger: each column still represents one topic, so I have one “obscenity” column, one “communism” column, up to my 20 columns. Now, though, each row is for one of the 573 case syllabi, and each cell in the row is the percentage of the case taken up by the corresponding topic. Each row adds up to 1 (i.e. 100%), giving us the distribution for each case across 20 topics.
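The "each row adds up to 1" property comes from dividing a case's per-topic word counts by their total. A small sketch, with made-up counts and a hypothetical helper name:

```python
def topic_proportions(topic_counts):
    """Turn one case's per-topic word counts into shares that sum to 1."""
    total = sum(topic_counts)
    if total == 0:
        # A syllabus with no usable words has no topic distribution.
        return [0.0] * len(topic_counts)
    return [count / total for count in topic_counts]


# Made-up counts for one case across three topics.
row = topic_proportions([3, 1, 0])
```

Here the first topic takes three quarters of the case and the third takes none, which is the kind of row the second spreadsheet holds for each syllabus.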

spreadsheet showing relative percentages of sample topics in several rows of syllabi

Here, we can see a few cases that might potentially be labeled as “obscenity” or “communism” cases, because they have relatively large amounts of those topics. It’s possible to use this data to figure out if two topics consistently occur together, or if a topic ebbs or flows in importance. (For example, in the screenshot above, all of those cases register a 0 for the “university” topic — that is an important speech topic, but it’s not “popular” until later cases.)
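Checking whether a topic ebbs or flows could be as simple as averaging its share per decade. This sketch assumes each case record carries a decision year and its topic shares; the field names `year` and `shares` and all of the numbers are invented for illustration.

```python
from collections import defaultdict


def mean_share_by_decade(cases, topic):
    """Average a topic's share across the cases decided in each decade."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for case in cases:
        decade = case["year"] // 10 * 10
        totals[decade] += case["shares"].get(topic, 0.0)
        counts[decade] += 1
    return {decade: totals[decade] / counts[decade] for decade in sorted(totals)}


# Invented records, not real cases.
cases = [
    {"year": 1952, "shares": {"university": 0.0}},
    {"year": 1957, "shares": {"university": 0.0}},
    {"year": 1995, "shares": {"university": 0.5}},
    {"year": 1998, "shares": {"university": 0.25}},
]
trend = mean_share_by_decade(cases, "university")
```

A topic that sits at zero for early decades and rises later would show up directly in the resulting dictionary.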

These topics and topic distribution are a promising start, but Topics 14 and 15 in particular tell me that I can make the model better. That will be part 2!

FOS* group project update #1

This week, the FOS* team got our website and our two social media sites up and running. Joanne put in some crucial legwork on our dataset, Kevin designed our landing page and starter style guide, Eva built the landing page, and Martin sorted out our social media posting schedule and organizational software. All of us also made and/or critiqued each other’s memes as we worked on bulking up the content we have to share.

We set some very ambitious goals for making a complete, clean dataset, as well as for finishing the website’s mid-fi wireframes; this is obviously the best way to lead up to saying those aren’t done yet, but they’re being done properly. As a group, we’re not concerned with being behind on those deadlines, since 1. we baked in enough time to be “behind”, 2. there is no way of rushing quality data here, and 3. more time on the data means more time available for designing.

We’ll meet, as usual, on Sunday. This week, Joanne outlined some QC and data-sorting tasks that are best done manually, so we’ll apply an hour of our time together towards that. We like to work in a Discord voice channel, which has a great coworking vibe and relevant tech capabilities without the complication and screen fatigue of videoconferencing. 

By the end of Spring Break, we’ll aim to have:

  • our dataset finished
  • Observable prototypes for visualizations
  • Mid-fi wireframes complete for the website
  • 2-3 posts per week on our social media + engagement with other relevant accounts
  • Outreach to CUNY law students (and others, potentially)

Personal Blog: landing page dev + social media content

This was a good week! I got to work a lot on getting our landing page launched, as well as make some content for our social media sites.

The landing page was very fun, not least of all because of Kevin’s great design vision and prototypes. Design-from-scratch is one aspect I find very daunting in my own web design, so it was a treat to re-make Kevin’s designs in HTML/CSS rather than having to come up with everything on my own as I’m used to. Joanne also helped to get the interactivity up and running, and her help and company were yet more bonuses of group project work.

I was struck by the difference of what “web developer” means when it’s not “designer/developer” — instead of designing as I build (my own bad habit/lack of a better way), I got to work out a puzzle of how to translate Kevin’s designs into responsive pieces of a web page. The 2D design was taken care of, and my job felt more like getting it to feel “natural” on the webpage so that, for example, if you change the size of the window, the content resizes instead of getting stuck in 2002. I’m personally very interested in the ways in which web design aims to replicate real-life forms of sensory input (i.e. the mouse icon “senses” and changes to a pointing hand when you mouse over something clickable, in the same way that your real fingers might sense the edge of a table and tell you it’s solid if they brush against it), so it was fun for me to have more time to think about making Kevin’s designs web-live and responsive.

The social media content remains a secret for now, but suffice it to say it’s been an enjoyable challenge both to nail down the tone I want and to say everything I want to in few enough words to keep internet-level attention. Very excited to get some posts up on our instagram and twitter pages soon!

Think deeply and make stuff

When I started the DAV program in the Fall of 2019, I was faced with the ostensibly unfortunate reality during registration that there was not a single open course in my program. I ended up signing up for two DH courses and one GIS class at Hunter College. In the end, it was a semester marked by transformative thinking: about data (what is it? where does it come from? who makes it? should it actually be called capta?); about categorizations (is it possible to reconcile the inherent messiness of our world with the binaries required to communicate through digital means? who is left in and left out when we decide which structures compartmentalize the world? or, even more important, do we recognize categorization as a subjective, historically situated decision, not a reflection of “inherent truth”?); about visualizations (is there an inherent lie in representing 3D space on a 2D map? what about an inherent lie in representing data sets visually? or an inherent truth in legibility and access?). So many questions!

Since that semester, I’ve taken 6 classes in the DAV program. I’ve continued to think about, and be pushed on, these and other crucial questions about humanist data inquiry. But they’ve more often been from the DAV perspective — that is, the goal was generally to produce data analysis and data visualizations as the deliverables. The questions are important and absolutely considered in the process, but they are on some level incidental to classes that expect a final product of shareable insight via data, rather than, for example, a paper or a round table discussion as the fruits of knowledge production.

The return to the DH side this semester has made me realize how much more action-oriented I’ve become in the last year. I’m constantly thinking about how my work relates to the news, to my life, to the U.S. at large, to the jobs I’d like to have in one year, or five, or ten. Right now I’m both enjoying the opportunity to think deeply about data management plans and group dynamics, and raring to get started on making stuff. In the last week or two, the raring-to-go side has been shouting louder and louder.

I’m happy to have both sides. My background is largely in well-funded academic spaces, where talking about data and equity can happen without the urgency of needing to actually get a project done on a deadline. At worst, this has at times led me to feel like the work I do in private, academic spaces feels irrelevant to the work that’s needed in public spaces. The DAV program is, for me, a great antidote to that. I guess this is all to say, I’m not “move fast and break things,” I’m “think deeply and make stuff.” And I’m ready to make stuff!

Eva’s bio & contributions

Eva Sibinga is in her final semester of the Grad Center’s Data Analysis & Visualization program. Her research interests include data ethics, the intersection of race and technology, and the application of feminist theory to contemporary data questions. With a background in English Literature and Visual Art, her approach to data analysis and visualization is motivated by a desire to expand the way we tell stories and understand the world through our own eyes and others’.

Eva is one half of Freedom of Speech*’s core data and developer team. She will also support the project’s outreach effort, and hopes to improve her understanding and skills in UX/UI design by applying some time and effort there as well.

Personal Blog: some disorganized data documentation thoughts

I’m going to use this week’s blog post as an opportunity to try and stay abreast of some data analysis documentation, i.e. what we did on the data side of things. This week saw a few productive screen-share/voice-channel sessions for our group; the two I’ll focus on were 1. Saturday’s data understanding/wrangling session and 2. Sunday’s BeautifulSoup web scraping session.

One of Saturday’s important questions was: what is the scope of our data set? Which cases are in, and which are out? Some answers: we talked (and clicked) through it, and decided to include cases from both the modern and legacy databases, filtering by Issue Area 3 – “First Amendment”. This means there are some extra cases that are First Amendment but not freedom of speech related, but for now a little extra data is preferable to cutting out data that we may later want.
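The scoping decision amounts to a row filter on the case table. A minimal sketch using Python's standard `csv` module; the column names `usCite` and `issueArea` are assumptions about how the .csv is laid out, and the sample rows are placeholders rather than real cases.

```python
import csv
import io

# Placeholder rows; the column names are assumed, not confirmed by this post.
SAMPLE_CSV = """\
caseId,usCite,issueArea
case-001,100 U.S. 1,3
case-002,100 U.S. 50,8
case-003,101 U.S. 7,3
"""


def first_amendment_rows(csv_text, issue_area="3"):
    """Keep only the rows coded with the given issue area."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["issueArea"].strip() == issue_area]


kept_ids = [row["caseId"] for row in first_amendment_rows(SAMPLE_CSV)]
```

Filtering on the broad issue area keeps some cases that are not about speech, which matches the "a little extra data is preferable" choice above.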

Another important question was: From what source will we gather the SCOTUS decision language? Answer: We chose Justia because it shows the full SCOTUS opinion, including dissent, searchable by the U.S. Citation.

I sort of expected this process + web-scraping to take just one meeting, but it took a long time (for me at least) to make myself familiar with the actual language of the dataset. I went back and forth a lot between the WashU Law code book and our .csv file as we began just by understanding what information is available in the data and how/from where we would scrape the SCOTUS decision language for each case.

Once we had decided on Justia, we spent time on Sunday examining the Justia website. In order to scrape our content, we needed to know 1. the URL for each case’s webpage and 2. where on the webpage the actual case content is (“where” means figuring out how the relevant HTML containers are labeled/classed).

Figuring out part 1 wasn’t too hard. All the Justia SCOTUS cases have the same URL format, as determined by the U.S. Citation. For example, 391 U.S. 367 corresponds to: https://supreme.justia.com/cases/federal/us/391/367/. A minor data detail: accessing each number independently required separating the single citation column into two distinct columns.
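Turning a citation into its Justia URL can be sketched as below. The URL pattern comes from the example in this post; the function name `justia_url` and the regular expression are illustrative, and this version splits the citation in code instead of relying on two separate columns.

```python
import re


def justia_url(us_cite: str) -> str:
    """Build the Justia URL for a citation such as '391 U.S. 367'."""
    match = re.fullmatch(r"\s*(\d+)\s+U\.\s?S\.\s+(\d+)\s*", us_cite)
    if match is None:
        raise ValueError(f"not a U.S. Reports citation: {us_cite!r}")
    volume, page = match.groups()
    return f"https://supreme.justia.com/cases/federal/us/{volume}/{page}/"


url = justia_url("391 U.S. 367")
```

Raising an error on anything that does not look like a volume and page number makes malformed rows easy to spot before any scraping starts.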

Part 2 was a little more challenging, but Joanne and I basically… hacked it out until it worked! Sometimes it’s like that.
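The general idea of part 2, finding the labeled container and pulling out its text, can be shown on a made-up page. Note the swap: the actual session used BeautifulSoup, while this sketch uses Python's built-in `html.parser` so it runs without extra installs. The `id="opinion"` label and the sample HTML are hypothetical and are not Justia's real markup.

```python
from html.parser import HTMLParser


class ContainerText(HTMLParser):
    """Collect the text found inside the element with a given id.

    Limitation: unclosed void tags such as <br> inside the container would
    throw off the depth count, so this is only suitable for simple markup.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # 0 means we are outside the target container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


# A made-up page; the container id is hypothetical.
SAMPLE_HTML = (
    '<html><body><div id="nav">Menu</div>'
    '<div id="opinion"><p>First paragraph.</p><p>Second paragraph.</p></div>'
    "<footer>Footer</footer></body></html>"
)

parser = ContainerText("opinion")
parser.feed(SAMPLE_HTML)
opinion_text = " ".join(parser.chunks)
```

The navigation and footer text are skipped because they sit outside the target container, which is the whole reason for working out how the containers are labeled first.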

Finally, I think it’s worth noting that Martin and Kevin both joined Joanne and me for the first session, and Kevin chose not to join for the second. Martin did join, and listened/observed for over 2 hours with only a few sentences of input as Joanne and I mostly just worked through the questions we had. This reflects exactly the kind of group dynamic I want to be in: that group members feel comfortable both opting in AND opting out of meetings as makes sense for their roles and personal goals. Fully online collaboration sometimes makes it feel like everyone is invited to every meeting and therefore obligated to attend every meeting and… I hate that. I’m all in for setting boundaries and protecting personal time as a key ingredient to successful collaborative relationships, at the same time that we create an environment where all feel welcome and invited to participate in satisfying ways.