Freedom Of Speech: Data Management Plan | DHUM 70002 Digital Humanities: Methods and Practices (Spring 2021)

1. Data Collection

What data will you collect or create?
- Our court case metadata, and the basis for our initial filtering, come from the Washington University in St. Louis (WUSTL) law school database, also known as The Supreme Court Database (SCD). The dataset consists of two .csv files that we have downloaded: (1) cases up until 1945 and (2) cases post-1945. We have stitched these two .csv files together in chronological order. Each case also carries the actual, court-published opinion texts, scraped from the Justia website and attached to the dataset.
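The stitching step described above could be sketched roughly as follows. This is a minimal illustration rather than our actual notebook code: the column names (`caseId`, `term`, `caseName`) and the sample rows are assumptions made up for the example, and it uses only the Python standard library.

```python
import csv
import io


def stitch_chronologically(*csv_texts, sort_key="term"):
    """Concatenate CSV tables that share a header and sort rows by year.

    `sort_key` is an assumed column holding the year of each case.
    """
    rows = []
    for text in csv_texts:
        rows.extend(csv.DictReader(io.StringIO(text)))
    # sorted() is stable, so rows from the same year keep their file order.
    return sorted(rows, key=lambda row: int(row[sort_key]))


# Tiny made-up samples standing in for the two downloaded .csv files.
pre_1945_csv = (
    "caseId,term,caseName\n"
    "A-2,1919,Example v. Two\n"
    "A-1,1803,Example v. One\n"
)
post_1945_csv = (
    "caseId,term,caseName\n"
    "B-1,1969,Example v. Three\n"
)

combined = stitch_chronologically(pre_1945_csv, post_1945_csv)
print([row["caseId"] for row in combined])  # ['A-1', 'A-2', 'B-1']
```

Sorting after concatenation, instead of relying on file order, keeps the result chronological even if one file is not already sorted.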
How will the data be collected or created?
- The SCD download is larger than the set of cases we are interested in, so we filter it in two steps. First, we filter for First Amendment cases. Second, we narrow those to cases that address the Freedom of Speech specifically. These cases will be taken from the book Landmark Supreme Court Cases: The Most Influential Decisions of the Supreme Court of the United States (Vol. 3. 2nd ed.) by Richard A. Leiter and Roy M. Mersky. In its complete form, the data will be a JSON file with nested features.
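A rough sketch of the filtering and the nested JSON shape follows. Everything named here is an assumption for illustration: the field names (`issueArea`, `usCite`, and so on), the issue-area code used for the First Amendment, the sample cases, and the layout of each record. The final schema may differ.

```python
import json

# Assumed code marking the First Amendment issue area in the metadata.
FIRST_AMENDMENT = "3"


def build_records(cases, landmark_cites, opinions):
    """Keep First Amendment cases that are on the landmark list and nest
    each case's metadata and opinion texts into a single record."""
    records = []
    for case in cases:
        if case["issueArea"] != FIRST_AMENDMENT:
            continue  # step 1: First Amendment filter
        if case["usCite"] not in landmark_cites:
            continue  # step 2: landmark Freedom of Speech filter
        records.append({
            "caseId": case["caseId"],
            "metadata": {
                "caseName": case["caseName"],
                "term": int(case["term"]),
                "usCite": case["usCite"],
            },
            # Opinion texts scraped separately, keyed by case id.
            "opinions": opinions.get(case["caseId"], []),
        })
    return records


# Made-up rows: only X-1 passes both filters.
cases = [
    {"caseId": "X-1", "caseName": "Example v. One", "term": "1919",
     "usCite": "100 U.S. 1", "issueArea": "3"},
    {"caseId": "X-2", "caseName": "Example v. Two", "term": "1950",
     "usCite": "200 U.S. 2", "issueArea": "8"},
    {"caseId": "X-3", "caseName": "Example v. Three", "term": "1969",
     "usCite": "300 U.S. 3", "issueArea": "3"},
]
landmark_cites = {"100 U.S. 1", "200 U.S. 2"}
opinions = {"X-1": [{"type": "majority", "text": "placeholder text"}]}

records = build_records(cases, landmark_cites, opinions)
print(json.dumps(records, indent=2))
```

Matching on a citation string is only one possible way to join the book's list to the database; a case name or identifier would work the same way.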
Is it possible to regenerate the data? What are the implications for your research if the data are lost or become unusable later?
- It is possible to regenerate our data by repeating the steps described above and rescraping Justia for the opinion texts. We have backups of all of the processes we created to produce our final, usable dataset.
What are the tools or software you will be using to create/process/analyze/visualize the data?
- We will be using a Jupyter Notebook to create, process, and analyze the data. Some visualizations will be sketched out using a mixture of matplotlib/seaborn for initial topic modeling/text analysis, and then reimagined in d3.js.

2. Documentation and Metadata

What documentation and metadata will accompany the data?
- We will use README.md files to document our data’s features and the processes we work with. This documentation will include: a codebook that combines SCD’s codebook with the one we create for the features we generate while building out the existing database; notes on our cleaning, scraping, and filtering; and links to SCD, Justia, and the textbook we’re using for analysis purposes (Landmark Supreme Court Cases).
- To ensure good project and data documentation, we will take notes throughout the creation, cleaning, and processing phases and then compile those notes into README files.
- Each of us, with respect to our roles in the project, will be responsible for an aspect of data management.
- We will name our directories and files using descriptive, category-based terms, and because our data is already mostly created, we will follow the standards of the existing datasets.

3. Ethics and Legal Compliance

How will you manage any ethical issues?
- Managing ethical issues largely entails deferring to experts when possible—especially through Landmark Supreme Court Cases—rather than deciding ourselves what “counts” as a landmark case. We will also aim to be transparent about where our data is from and any decisions we have made in curating it.
How will you manage copyright and Intellectual Property Rights (IP/IPR) issues?
- The SCD allows us to download and transform the data, with attribution. Regarding the Landmark Cases textbook, we will be reaching out to the publisher in order to request permission to use the content therein.

4. Storage and Backup

How will the data be stored and backed up during the research?
- To use Stephen Zweibel’s language, we will use GitHub for “far,” flash drives for “near,” and our computers for “here.” This should provide sufficient storage. For backup, each of us will keep a copy of the data locally as well as on a flash drive, just in case.
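One way to confirm that the “here,” “near,” and “far” copies have not drifted apart is to compare file checksums. The sketch below is an assumption about how that check could be done, not a step we have committed to. The file names are hypothetical, and temporary files stand in for the real copies.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(paths):
    """True when every file in `paths` has identical contents."""
    return len({sha256_of(path) for path in paths}) == 1


# Temporary files stand in for a local copy and a flash-drive copy.
with tempfile.TemporaryDirectory() as tmp:
    here = Path(tmp) / "here.json"
    near = Path(tmp) / "near.json"
    here.write_text('{"cases": []}')
    near.write_text('{"cases": []}')
    in_sync = copies_match([here, near])

    # Simulate one copy being edited without the other.
    near.write_text('{"cases": [1]}')
    drifted = not copies_match([here, near])

print(in_sync, drifted)  # True True
```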
How will you manage access and security?
- Only our group can edit and manage our GitHub repository/organization, so only our group can control it. The repository can still be forked, but forks will not mutate the root data.

5. Selection and Preservation

Which data are of long-term value and should be retained, shared, and/or preserved?
- After cleaning, processing, and finalizing the dataset, we will keep it in our GitHub repository for the foreseeable future.
What is the long-term preservation plan for the dataset?
- GitHub will also be used for long-term preservation.

6. Data Sharing

How will you share the data?
- We will make the data available immediately upon upload via the Freedom of Speech* site. The data itself will be a public, raw dataset available through GitHub.
Are any restrictions on data sharing required?
- Our case data is in the public domain, but we may have to restrict our data if we are given a limited license by the publishers of the Landmark Cases textbook.
- **need to check copyright**
Who is your possible audience? Who may use the data now, or later?
- Our possible audience includes anyone interested in First Amendment rights, law students who specialize in Constitutional Law, and members of the general public who want to know more about the fundamentals of legal studies. We also hope to reach a more “casual” or amorphous audience on social media platforms like Twitter, where Freedom of Speech is a relevant/hot topic.
What tools/software are required to access your data?
- No special tools or software are required to access the data. Anyone with a web browser can access it through our GitHub page, which will open the data in a new window.

7. Responsibilities and Resources

Who will be responsible for data management?
- Each member of our team will exercise due diligence in implementation of the standards outlined above.
What resources will you require to deliver your plan?
- We have all the resources we need to deliver our plan: Discord for communication, Jupyter Notebooks + Python for code and processing, Figma for design ideation and UX/UI wireframing/prototyping, Observable Notebooks for prototyping visualizations in d3, and GitHub for storage and hosting.