Tag Archives: data management plan

Lisa’s Public Journal – Week 8 – More Research

This week was research and more research. Bret made me aware that the NYPL will send books to GC students, so I made a request for a copy of the Jenkins Worth biography and the planning report that the NYC Common Council created to commemorate the memorial. It can take up to three weeks for the request to be processed, so I am unsure I will be able to rely on getting it in time. However, it’s good to know that I will likely get it at some point.

We need to populate our website’s top navigation timeline with more data. My research focus for this week was on the broader history of New York City. Bri had already done super work recording some of the great moments in the NYC deathscape, like the fact that the city banned burials below 14th Street in 1839, or that the state allowed the purchase of tax-free land for cemeteries in 1847. I focused on demographics (for example, how huge surges in population growth affected culture) and infrastructure (how the opening of the Erie Canal gave NYC a direct water route to the Midwest). In our team meeting tonight we agreed that we would need to winnow down the number of data points that we’ll use in the timeline, but that we’ll also want to memorialize the bulk of the content in another way.

The biggest challenge is focus. I started reading Burrows and Wallace’s “Gotham: A History of New York City to 1898” on Friday night, promising to keep to the index and just focus on the deathscape … six hours later I had completely forgotten my promise and was well into just reading! So now, I have mapped out a set number of hours a week, in three-hour chunks, through 15 April. I have also fired up Zotero, to keep me on the mission. [There is nothing like tweezing citations to focus the mind.] I love reading and doing research, but with such a short timeframe I know I need to keep it on the core aims of the project. Discipline!

“Gotham: A History of New York City to 1898” or where Lisa spent her Friday night.

Lisa’s Public Journal – Week Six — Data Management

Our team has had several meetings where we have touched on different aspects of data management. We are using GitHub for development, which has good version control for the codebase. That system is self-contained and doesn’t require that we create a system from whole cloth. However, we have also agreed to a naming convention for files: names are limited to letters, numbers, and the underscore symbol, and each name includes the creation date.
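
A naming rule like this can be checked automatically. The following is a minimal Python sketch, not the team's actual tooling. The exact layout is an assumption: a descriptive stem, an underscore, the creation date written as YYYYMMDD, and then a file extension (the post does not say where the date goes or how it is formatted). The function name `is_valid_filename` is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout (not stated in the post): <stem>_<YYYYMMDD>.<extension>
# The stem may contain only letters, numbers, and underscores.
FILENAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+_(\d{8})\.[A-Za-z0-9]+$")


def is_valid_filename(name: str) -> bool:
    """Return True if the name follows the assumed convention."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return False
    try:
        # Reject eight-digit strings that are not real calendar dates.
        datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return False
    return True
```

Under these assumptions, `is_valid_filename("timeline_draft_20210315.csv")` is accepted, while a name containing a space or an impossible date is rejected.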

A bigger challenge is how to collaborate on content creation asynchronously. We have settled on using a shared Google spreadsheet, with each tab being a different aspect of the website: Splash page, timeline page, about page, et cetera. We have also agreed on using Tidy Data* standards for the tracking of all assets. The Tidy Data standard, when applied to a spreadsheet, uses the structure below:

Every column is a variable.
Every row is an observation.
Every cell is a single value.
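
To make the third rule concrete, here is a small Python sketch of one common tidying step: splitting a cell that holds several values into one row per value. The sample rows, the `tag` column contents, and the helper name `separate_rows` are all invented for illustration and are not taken from the team's spreadsheet.

```python
def separate_rows(rows, column, sep=";"):
    """Split a multi-valued cell into one row per value.

    `rows` is a list of dictionaries (one per spreadsheet row).
    Every other column is copied unchanged into each new row.
    """
    tidy = []
    for row in rows:
        for value in row[column].split(sep):
            value = value.strip()
            if value:  # skip empty fragments such as a trailing separator
                tidy.append({**row, column: value})
    return tidy


# Hypothetical example: the first row breaks the "single value" rule.
untidy = [
    {"name": "example_site_a", "tag": "park; memorial"},
    {"name": "example_site_b", "tag": "historical"},
]
tidy = separate_rows(untidy, "tag")
```

After the call, `tidy` holds three rows, each with exactly one tag, so every cell is a single value again.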

We expect to start populating the spreadsheet soon so that our developer has some content to push into our site’s wireframe. Our expectation is to have the bulk of our research completed by 15 April. While the spreadsheet will host information about the different data sources, the assets we create will be hosted elsewhere: pictures on Wikimedia Commons and audio files on Soundcloud. In this way, we can extend the life of the assets without the worry of having to pay for their storage.

Our team leader is hosting all our shared Google files on their personal Google account. Archives are being saved to the Library on our team Commons site. It will be interesting to see how collaborating in a shared spreadsheet works over time. My biggest fear is that we inadvertently lose information. However, my hope is that by agreeing to conform to these standards at the beginning of the project, we will avoid our developer losing time managing the uploads during the production process.

*SOURCE: https://vita.had.co.nz/papers/tidy-data.html and https://vita.had.co.nz/papers/tidy-data.pdf.

Bianca’s weekly post: 4×6 nostalgia

Reflections on how (if) data management can improve my research practices:

Although our group (ReadingRebus) chose to manage our project data through a combination of Google Folders/Docs and a Google Sheet tracking each team member’s contribution on a timeline, I thought I’d explore further a more task-designated tool that I came across when we were still in the project planning stage.

Trello’s design appealed to me because it looked as if it might replicate my ancient dissertation-research logging practices: writing each quotation or idea on a 4×6 index card, with a bibliographic shorthand in one corner and the topic/chapter designation in the other. I shuffled them endlessly as I reorganized my argument and its supporting materials, bound them with elastic bands, and stored them in shoe boxes.

And then I threw all the cards away when the dissertation was done. For some time, I regretted that I hadn’t used pre-punched catalogue cards so I could claim a discarded library file cabinet to house them, forever, hauling the wall-sized furniture around the world with my increasing boxes of books and dead-tree teaching files . . . . (The cards would still be valueless but the cabinets have appreciated considerably.)

Optimistic for a lighter equivalent, I sketched out two projects on Trello, one where I created my own board, another where I used a template. The former is for a manuscript that I’m revising, but where some of the material will need to be broken up into smaller sections or go into a different chapter. I can see how the column-of-post-its design could help me imagine different sequences and move the bits and pieces accordingly. The latter is a new project, based off a short presentation that I both need to expand and to write up in a different form. Again, having a preexisting structure (in this case a sequence of slides), makes it easy to envision the project as a growing series of equal components or steps.

However, trying to use Trello in the early stages of a project, where I don’t know where I’m heading, may result in one endless data column that can only be sorted after I start understanding my claims. Before that I just need somewhere to dump things, and perhaps a DM program will just require unnecessary preemptive organization. For example, the Trello templates are a wash. There are far too many options, most of which are meant for non-academic projects or daily tasks. And even if one opts for a very simple template, as I did for my second project, one wastes time eliminating features that one doesn’t want or need (inspirational photos of up-tilted faces for the “TODO” (sic) column or images of human cartwheels for “DONE”).

None of this addresses the real challenge: envisioning the “deliverables” of a scholarly research project and approximating how long it will take to find the data to support one’s hazy vision of a possible conclusion or to replace it with something more plausible.

If a project starts with a question (e.g. “Might early modern women have become literate by sewing letters rather than writing them?”), one might be able to chart the areas where one should look and assign them to a timeline, but one could hardly work backwards from an anticipated answer (beyond “Yes/No”) through the stages of the project to a fixed start date. A program like Zotero, which uses data that won’t get tossed, in a preexisting structure (i.e. a standardized bibliography of works consulted), might be a better tether for the amorphous beast of potentially useful/useless matter that is raw research.

Hence the short answer for me is that I’m not sure data management–at least as found in conventional data management programs–can improve my research practices. However, it certainly should help shape that accumulated research into a legible form by a fixed date–say–into a conference paper due March 15th?! We’ll see . . . .

Freedom Of Speech: Data Management Plan

1. Data Collection

What data will you collect or create?
- Our metadata about court cases, and our initial filtering, comes from the Washington University in St. Louis (WUSTL) law school database, also known as The Supreme Court Database (SCD). The dataset consists of two .csv files that we have downloaded, which cover (1) cases up until 1945 and (2) cases after 1945. These two .csv files have been stitched together in chronological order. Each case is also paired with the actual, court-published opinion text, scraped from the Justia website and attached to the dataset.
How will the data be collected or created?
- After having downloaded our data from the SCD, the next step is to filter by First Amendment. The download produced a larger dataset than we are actually interested in, so the next step is to filter for cases that address the Freedom of Speech specifically. These cases will be taken from the book Landmark Supreme Court Cases: The Most Influential Decisions of the Supreme Court of the United States (Vol. 3. 2nd ed.) by Richard A. Leiter and Roy M. Mersky. In its complete form, the data will be a JSON file with nested features.
Is it possible to regenerate the data? What are the implications for your research if the data are lost or become unusable later?
- It is possible to regenerate our data by recreating the steps mentioned above and rescraping Justia for the opinion texts. We have backups of all of the processes we have created in order to get our final, usable dataset.
What are the tools or software you will be using to create/process/analyze/visualize the data?
- We will be using a Jupyter Notebook to create, process, and analyze the data. Some visualizations will be sketched out using a mixture of matplotlib/seaborn for initial topic modeling/text analysis, and then reimagined in d3.js.
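
The pipeline described in this section (stitch two .csv files chronologically, keep only the cases named in the book, attach opinion text, and emit nested JSON) might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the team's notebook: the column names `case_name` and `date_decided`, the ISO date format, the function names, and the placeholder cases are all hypothetical, and the opinion texts are passed in as a plain dictionary rather than scraped.

```python
import csv
import io
import json


def load_cases(csv_text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def build_dataset(early_csv, modern_csv, landmark_names, opinions):
    """Combine both files, filter to landmark cases, and nest the opinion text."""
    cases = load_cases(early_csv) + load_cases(modern_csv)
    # Assumes ISO dates (YYYY-MM-DD), which sort chronologically as strings.
    cases.sort(key=lambda case: case["date_decided"])
    selected = [case for case in cases if case["case_name"] in landmark_names]
    return [
        {
            "case_name": case["case_name"],
            "date_decided": case["date_decided"],
            "opinion": {
                "source": "Justia",
                "text": opinions.get(case["case_name"], ""),
            },
        }
        for case in selected
    ]


# Placeholder data only; these are not real cases or real column names.
early = "case_name,date_decided\nPlaceholder v. Beta,1925-06-08\nPlaceholder v. Alpha,1919-03-03\n"
modern = "case_name,date_decided\nPlaceholder v. Gamma,1969-06-09\n"
landmarks = {"Placeholder v. Alpha", "Placeholder v. Gamma"}
opinions = {"Placeholder v. Alpha": "opinion text would go here"}

dataset = build_dataset(early, modern, landmarks, opinions)
print(json.dumps(dataset, indent=2))
```

Keeping each step as a small function makes it easier to rerun the whole process, which matters here because the plan relies on being able to regenerate the data from its sources.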

2. Documentation and Metadata

What documentation and metadata will accompany the data?
- We will use README.md files to document our data’s features and the processes we work with. These processes include: combining SCD’s codebook with the codebook we create for the features we generate in the process of building out the existing database; processing documentation of cleaning, scraping, and filtering; and linking to SCD, Justia, and the textbook we’re using for analysis purposes (Landmark Supreme Court Cases).
- In order to ensure good project and data documentation, we will be taking notes throughout the creation, cleaning, and processing phases and then compiling those notes into README files.
- Each of us, with respect to our roles in the project, will be responsible for an aspect of data management.
- We will use categorical descriptive terminology in order to name our directories and files, and because our data is already mostly created, we will be following the standards of the already-created datasets.

3. Ethics and Legal Compliance

How will you manage any ethical issues?
- Managing ethical issues largely entails deferring to experts when possible—especially through Landmark Supreme Court Cases—rather than deciding ourselves what “counts” as a landmark case. We will also aim to be transparent about where our data is from and any decisions we have made in curating it.
How will you manage copyright and Intellectual Property Rights (IP/IPR) issues?
- The SCD allows us to download and transform the data, with attribution. Regarding the Landmark Cases textbook, we will be reaching out to the publisher in order to request permission to use the content therein.

4. Storage and Backup

How will the data be stored and backed up during the research?
- To use Stephen Zweibel’s language, we will use GitHub for “far,” flash drives for “near,” and our computers for “here.” Doing this should allow for sufficient storage. Regarding backup, each of us will back the data up locally, as well as on flash drives just in case.
How will you manage access and security?
- Our GitHub repository/organization is closed to the public in terms of who can edit and manage the repository, so only our group can control it. The repository can still be forked, but the root data will not be mutated.

5. Selection and Preservation

Which data are of long-term value and should be retained, shared, and/or preserved?
- After cleaning, processing, and finalizing the dataset, we will be keeping it in our GitHub repository. We will be keeping it there for the foreseeable future.
What is the long-term preservation plan for the dataset?
- GitHub will also be used for long-term preservation.

6. Data Sharing

How will you share the data?
- We will make the data immediately available upon upload, via the Freedom of Speech* site. The data itself will be a public, raw dataset available through GitHub.
Are any restrictions on data sharing required?
- Our case data is in the public domain, but we may have to restrict our data if we are given a limited license by the publishers of the Landmark Cases textbook.
- **need to check copyright**
Who is your possible audience? Who may use the data now, or later?
- Our possible audience would be: any audience interested in First Amendment rights; law students who specialize in Constitutional Law; and members of the general public who want to know more about the fundamentals of legal studies; we also hope to reach a more “casual” or amorphous audience on social media platforms like Twitter, where Freedom of Speech is a relevant/hot topic.
What tools/software are required to access your data?
- No special tools or software are required to access the data; if the user has a web browser, they can access it through our GitHub page, which will open the data in a new window.

7. Responsibilities and Resources

Who will be responsible for data management?
- Each member of our team will exercise due diligence in implementation of the standards outlined above.
What resources will you require to deliver your plan?
- We have all the resources we need in order to deliver our plan: Discord for communication, Jupyter Notebooks + Python for code and processing, Figma for design ideation and UX/UI wireframing/prototyping, Observable Notebooks for prototyping visualizations in d3, and GitHub for storage and hosting.

Personal Blog: The First DMP Is the Deepest

Our deliverables these past two weeks have been very challenging to me. I really didn’t want to deliver them. But having gone through the process of creating a Work Plan and a Data Management Plan (DMP), I get it. It’s so much more fun having ideas and having your head in the clouds, but, again, I get it.

In my other class, we’ve been doing related work in reviewing other digital projects. To do so, we’ve been following a template put forth by Miriam Posner (see her video How Did They Make That), in which she asks: (1) What are the sources (or data)? (2) What did they do to them/how was the data processed? And (3) how is the project presented? I’m very glad to have watched this video before working on the DMP for this class because it helped me understand how vast a category “data” really is and to start thinking about it more expansively–beyond numbers and calculations.

Having now completed my first DMP, I have to believe future ones will be easier, or if not easier then at least feel more approachable. And I am optimistic this process will change how I write a proposal in the first place. With a better sense of the data I want/am able to collect, I think I will be able to start with stronger research questions.

It was also challenging thinking about how our group data will be stored and maintained–servers seem so far away and their capacity limitless. But I’m learning this is very far from true. So how long will our data be stored? We wrote 3 to 5 years, but in my heart I wrote for-ev-er.

Gif of Officer Saying For-Ev-Er

Mapping Cemeteries: Data Management Plan

1. What are the types of data that may be produced as part of this project?

Our project will generate data specific to five cemeteries, as well as data for the timeline visuals which will combine all five cemeteries data into one. We expect to have both primary (generated by team members) and secondary data (found by team members).

How will data be collected (e.g., instrumentation, observation, survey, etc.)?

- We are gathering data between February and mid-April 2021, based primarily on research from digital archives, journal articles, and other digital sources that we can access through our CUNY affiliations or that are freely available to the public.

Is it possible to regenerate the data? What are the implications for your research if the data are lost or became unusable later?

- Yes, our data (e.g., dates and locations) will be reproducible, though we may come to slightly different conclusions about it if our supporting text is also lost.

What types of data will be produced?

- images: historical images (based on licensing availability) and present-day images taken by our team members at the selected locations
- videos: taken by our team members at the selected locations
- audio: to increase accessibility to our text, interviews with funerary experts, and accompanying podcast documenting our process of building this project
- text: descriptions and narratives produced by our research
- location data: longitude and latitude of cemeteries to produce map pins
- code: to build our website
- metadata: for each page of our site, as well as each media item that appears on it to ensure searchability and accessibility
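
As an illustration of the location data item, the Python sketch below turns a longitude/latitude pair into a GeoJSON point feature, a format that mapping tools such as Mapbox can read. This is a sketch under assumptions rather than the team's implementation: the function names and the `name` property are hypothetical. One detail worth noting is that GeoJSON stores coordinates in longitude, latitude order.

```python
def make_pin(name, longitude, latitude):
    """Build a GeoJSON Feature for a single map pin."""
    if not (-180 <= longitude <= 180 and -90 <= latitude <= 90):
        raise ValueError("coordinates out of range")
    return {
        "type": "Feature",
        # GeoJSON order is [longitude, latitude], not [latitude, longitude].
        "geometry": {"type": "Point", "coordinates": [longitude, latitude]},
        "properties": {"name": name},
    }


def make_collection(pins):
    """Wrap several pins in a FeatureCollection, ready to serialize as JSON."""
    return {"type": "FeatureCollection", "features": list(pins)}
```

A call such as `make_collection([make_pin("example_cemetery", -73.99, 40.73)])` uses placeholder coordinates, not a surveyed location, and returns a dictionary that `json.dumps` can write to a file.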

What are the tools or software you will be using to create/process/analyze/visualize the data?

- Google Sheets, Google Docs, Mapbox, and timeline tools like TimelineJS or Vuetify

What are your access, storage, and backup strategies?

- All primary digital assets (images, videos, and audio) will be stored on Wikimedia Commons. Our main tool for storing data will be a spreadsheet, which all team members will fill out as they do their research. The spreadsheet will include multiple sheets:
  - general/map
  - horizontal timeline
  - vertical timeline
  - historical cemetery
  - cemetery repurposed as park
  - cemetery war memorial
  - cemetery rediscovered
  - established/other cemetery

2. What standards will you be using for data collection, documentation, description, and metadata?

The spreadsheet will reside on Bri’s Google Drive, in our GitHub repository (in CSV format, along with a link to the Google Drive in the Read.me file), and on team members’ local machines. Version control is built into the Google spreadsheet so we can see how and when the data is updated, and changes to our website code will also be versioned and saved within GitHub. We are also documenting our weekly contributions to the project via individual diary-like updates in our Mapping Cemeteries Commons group.

How do you document data collection procedures?

- We are noting all of our data collection via our shared Google sheet. Each sheet will include the following columns, a list that may expand:
  - name
  - custodian
  - caption
  - description
  - data type
  - purpose
  - tag
  - source link
  - citation
  - institution
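
The column list above can double as a simple schema when a sheet is exported to CSV. The Python sketch below is one way to do that and is not part of the plan itself: the function name `rows_to_csv` is hypothetical, and requiring every column to be present in every row is an assumption about how strict the team wants to be.

```python
import csv
import io

# Column names taken from the list above.
COLUMNS = [
    "name", "custodian", "caption", "description", "data type",
    "purpose", "tag", "source link", "citation", "institution",
]


def rows_to_csv(rows):
    """Write row dictionaries to CSV text, rejecting rows with missing columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        missing = [column for column in COLUMNS if column not in row]
        if missing:
            raise ValueError(f"row is missing columns: {missing}")
        # DictWriter itself raises ValueError for unexpected extra columns.
        writer.writerow(row)
    return buffer.getvalue()
```

Catching a missing or misspelled column at export time is cheaper than discovering it after the data has been pushed into the site.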

How will you ensure good project and data documentation? Who is responsible for implementing this data management plan?

- All team members are responsible for implementing this data management plan; our names will be next to all data we enter onto the sheet.

What directory and file naming conventions will you be using?

- We will follow Tidy Data and other best practices. All file names will use underscores (_) instead of spaces, and they will include dates to aid in version control. Information about our files will be included in a Read.me file with a data abstract, as well as a data dictionary as needed.

What project and data identifiers will be assigned?

- Data will be organized via cemetery/memorial location. Historical data we include in our vertical timeline will be organized separately.

Will you use disciplinary or community standards for data formatting, description, interoperability, or sharing for any of the data you collect?

- We will follow all disciplinary standards, and customize to our project needs as necessary.

3. What steps will you take to protect your or your participant’s security, privacy/confidentiality, intellectual property, or other rights? (Check current university policies for requirements.)

Who controls the data (e.g., PI, student, lab, University, funder), and at what level?

- Team members control the data.
- Google Docs reside under Bri’s account as she may be using this for future phases/capstone project.

Any special privacy or security requirements (e.g., personal data, high-security data)?

- We will make sure to use up-to-date software and upgrade as necessary to avoid any vulnerabilities. Additionally, no personal information will be stored on our site.

Do you have any embargo periods to uphold?

- No

4. If you allow others to reuse your data, how will the data be accessed and shared? What are the data sharing requirements your work is subject to (e.g., funder, journal)?

Who is your possible audience? Who may use the data now, or later?

- We are planning an initial “soft launch,” so our initial primary audiences are our classmates and attendees at the GC Digital Showcase.
- Going forward we expect our audience to include:
  - New York City historians, especially those interested in the macabre, necropolitics, and lesser-known or “forgotten” histories
  - Scholars and members of the public studying cemeteries and memory studies
  - People offering and interested in taking walking tours and practicing alternative forms of tourism
- Bri may expand on this project for future phases and/or for her capstone project.

When will you publish the data and where?

- We will share all of our data on GitHub, and media we create will be shared on Wikimedia. We will publish our data on our website, and we will also share our findings in Clio as a potential walking tour, with links back to our website.

What tools/software are required to access your data?

- Users will access our data via our public-facing website, social media posts, and Clio.

5. How will the data be archived for preservation and long-term access?

How long should the data be retained (e.g., 3-5 years, 10-20 years, permanently)?

- Our data will be retained for 3-5 years, at which point this DMP will be re-reviewed to determine whether longer-term access is required.

What file formats will you be using, or converting to? Are they sustainably accessible?

- Our data spreadsheet will be saved in CSV format, and a link to the Google Docs will be included in the Read.me file stored on GitHub. Text will live in Word and Google Docs, and be backed up in non-proprietary rich text formats. Images, video, and audio files will be saved as JPG or PNG, MP4, and FLAC files (or other non-proprietary formats), respectively. The non-proprietary formats will live in GitHub, and both proprietary and non-proprietary formats will be stored in our Mapping Cemeteries Commons group library.

Who will maintain my data for the long-term?

- Bri

Which data archives are your data appropriate for (subject-based? institutional)?

- Our data may be appropriate for archives focused on New York City history, New York–related migration studies, and Digital Humanities.

*Posted by Nadia, Lisa, Asma, lane, and Bri*

DHUM 70002 Digital Humanities: Methods and Practices (Spring 2021)