Your reading today was probably one of the more technical pieces you’ll read in this course. “Tidy Data” was written for statisticians and addresses the problem of organizing data. I find it a useful piece for introducing some of the concepts involved in data work and for introducing you to how statisticians think about data. These are the people we are stealing methods from, and it is important to understand their concerns and assumptions as we think about how to bring those methods and tools into the humanities.
So, my goals for today:
- Learn some of the vocabulary around data
- Practice thinking about data as something to be organized and operated on.
- Think a bit together about why one might want tidy data and what the costs of tidy data might be.
So who is Hadley Wickham and why does he get to define “tidy data”?
Hadley Wickham is a statistician with a PhD from Iowa State University. He did not create the statistical language “R,” but he has made a number of very popular libraries and tools that make R easier to use.
Who has heard of “R” before?
(R is a programming language designed for statistical computing → There are many programming languages that provide ways for humans to interact with the computer. Different languages have different strengths and were designed for different problems or use cases. R was built for data analysis, and is a popular tool for people doing computational analysis in the digital humanities. If you want to learn more about R in the digital humanities, check out Humanities Data in R, Text Analysis with R for Students of Literature, and Digital History Methods in R.)
- data table
  - Your data table is your spreadsheet. A spreadsheet is used to describe multiple examples of a particular type of thing.
- columns and rows → variables and observations
  - Columns are the vertical aspects of a spreadsheet.
  - Each column should contain one variable or aspect of what is being described.
  - Rows are the horizontal aspects of a spreadsheet.
  - Each row should contain one observation or example of what is being described.
- fixed variables
  - These are the columns where you record the known or contextual information.
- measured variables
  - These are the columns where you record the findings (the results of measuring).
- values
  - The value is the piece of data that is recorded in the table.
- features of messy data (Wickham)
  - Column headers are values, not variable names.
  - Multiple variables are stored in one column.
  - Variables are stored in both rows and columns.
  - Multiple types of observational units are stored in the same table.
  - A single observational unit is stored in multiple tables.
- principles of tidy data (Wickham)
  - one variable per column
  - each observation gets a row
  - each observation type gets a table
- normalized database
  - If you find that you have to repeat a lot of information on each row (observation), it may be time to normalize. This means extracting that repeating information into its own data table and using IDs to match the tables back up.
  - “Normalize” can also mean a range of statistical transformations in which the data is transformed for comparison or to fit a distribution (as in grading on a curve). That is not what we mean by normalize in this context.
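To make the idea of normalizing a table concrete, here is a minimal sketch. It is written in plain Python rather than R, the letters data is made up for illustration, and the `normalize` function and the column names are my own hypothetical choices, not anything from Wickham’s paper.

```python
# Hypothetical, made-up rows: each row is one letter (one observation),
# and the sender's details are repeated on every row.
letters = [
    {"letter_id": 1, "year": 1851, "sender": "A. Smith", "sender_city": "Boston"},
    {"letter_id": 2, "year": 1852, "sender": "A. Smith", "sender_city": "Boston"},
    {"letter_id": 3, "year": 1852, "sender": "B. Jones", "sender_city": "Albany"},
]


def normalize(rows):
    """Move the repeated sender details into their own table, linked by an ID."""
    sender_ids = {}    # (name, city) -> sender_id
    senders = []       # one row per distinct sender
    slim_letters = []  # one row per letter, pointing at a sender by ID
    for row in rows:
        key = (row["sender"], row["sender_city"])
        if key not in sender_ids:
            sender_ids[key] = len(senders) + 1
            senders.append(
                {"sender_id": sender_ids[key], "name": key[0], "city": key[1]}
            )
        slim_letters.append(
            {
                "letter_id": row["letter_id"],
                "year": row["year"],
                "sender_id": sender_ids[key],
            }
        )
    return slim_letters, senders


slim_letters, senders = normalize(letters)
for row in senders:
    print(row)
for row in slim_letters:
    print(row)
```

The sender’s name and city now appear once each in `senders`, and each letter carries only a `sender_id` that matches it back to the right sender.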
Terms that relate to Wickham’s R package (the concepts are useful; don’t worry too much about the names):
- melting
  - This is Wickham’s term for the process of transforming a messy data table into a clean table. While he does not like the terms, I find it useful to think in terms of moving from a “wide” table to a “long” table, where it is better to have multiple rows that refer to the same thing than to multiply columns to capture all of the aspects in one row.
- colvars
  - “column variable” → colvars are columns that are already properly organized, that is, columns that contain a single variable or aspect of what is being described. These are used to anchor the dataset as it is being “melted.”
- molten data
  - A data table that has been transformed into tidy data.
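Moving from a “wide” table to a “long” one can be sketched in a few lines. This is an illustration in plain Python, not Wickham’s R implementation; the `melt` function, its parameter names, the group labels, and the counts in the table are all made up for the example.

```python
# A "wide" table with made-up counts. The income brackets are column
# headers, but they are really values of a variable called "income".
wide = [
    {"religion": "Group A", "<$10k": 10, "$10-20k": 20},
    {"religion": "Group B", "<$10k": 5, "$10-20k": 15},
]


def melt(rows, colvars, var_name, value_name):
    """Turn every column that is not a colvar into its own row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column in colvars:
                continue  # colvars anchor the row; they are copied, not melted
            new_row = {name: row[name] for name in colvars}
            new_row[var_name] = column    # the old header becomes a value
            new_row[value_name] = value   # the old cell becomes the measurement
            long_rows.append(new_row)
    return long_rows


molten = melt(wide, colvars=["religion"], var_name="income", value_name="count")
for row in molten:
    print(row)
```

Each of the two wide rows becomes two long rows, one per income column, so the result holds four observations with three variables each: religion, income, and count.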
Data that has this spreadsheet structure is referred to as “structured data.” The primary alternative is “unstructured” – text, images, etc. – things that contain information but that do not conform to a regular, easily computable structure.
What do we do with structured, tidy data?
Tidy data also makes visualization and modeling easier.
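As a rough sketch of what a tidy table lets you do, here are three common operations (filtering, aggregating, and sorting) on a small long table. This is plain Python rather than R, the numbers are made up, and the column names are hypothetical.

```python
from collections import defaultdict

# A small tidy table with made-up counts: one observation per row,
# one variable per column.
tidy = [
    {"religion": "Group A", "income": "<$10k", "count": 10},
    {"religion": "Group A", "income": "$10-20k", "count": 20},
    {"religion": "Group B", "income": "<$10k", "count": 5},
    {"religion": "Group B", "income": "$10-20k", "count": 15},
]

# Filter: keep only the observations for one income bracket.
low_income = [row for row in tidy if row["income"] == "<$10k"]

# Aggregate: total count per religion, summing across income brackets.
totals = defaultdict(int)
for row in tidy:
    totals[row["religion"]] += row["count"]

# Sort: order the observations from largest count to smallest.
by_count = sorted(tidy, key=lambda row: row["count"], reverse=True)

print(low_income)
print(dict(totals))
print(by_count[0])
```

Because every row has the same shape, each operation is a single pass over the rows that refers to variables by name. In the wide format, the same questions would require knowing which columns happen to hold income brackets.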
Note: Tidy data is generally easier to work with. There are exceptions and places where different data organization is needed. It is a model for structuring information, and as will come up many times in this course, “all models are wrong; some are useful.”
All of these transformations can be done relatively quickly with a computer, but it takes a fair bit of learning to get up and running with it. For today, we are going to work with a small dataset and transform it by hand to understand the changes Wickham is talking about.
Go to https://docs.google.com/spreadsheets/d/1jMaFSVfac9-W8rL3LsEH1ko5OGfUpiXZLq_gyRygHUw/edit?usp=sharing, and create a copy. We will work through the first tab together.
Transform the table in the “Homework” tab from the Pew religious landscape study into a tidy table and post the tidy table on your blog.
Additionally, in your post, identify the variables, observations, and values in the table. What is one question that you can ask of the data in this format that would have been hard to ask in the original format?