Data Processing

“Raw data, like raw potatoes, usually require cleaning before use.”

— Ronald A. Thisted, Professor Emeritus - Departments of Statistics and Public Health Sciences, The University of Chicago

“Happy families are all alike; every unhappy family is unhappy in its own way.”

— Leo Tolstoy, Russian author known as one of the world’s greatest novelists.

“Tidy datasets are all alike, but every messy dataset is messy in its own way.”

— Hadley Wickham, Chief Scientist at Posit (formerly RStudio)

Good data organization and formatting is the foundation of any data project. Once the data are collected, you will need to process them so they can be used in your analyses or product development steps.

Data processing takes time - plan accordingly!

It’s commonly understood in the data science field that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003) - expect the same will be true for your project and prepare to invest time and resources accordingly.

The Resources section below links to resources that provide detailed technical guidance on how to process data using data science methods. Here, we discuss the why and how we go about this step with an equity lens.

When we talk about data processing here, we’re referring to the steps required to organize, format, and clean the data so it’s more efficient to use in future steps, and so others can consistently reproduce your steps - - this is commonly referred to as tidying data.

Why Tidy Data?

From a data science lens - Investing the time to properly tidy your data upfront makes the analyses or product development steps much more efficient and reproducible. This will cut down the time it takes in the long term to complete your analysis and product development steps (even if they need to evolve/change over time!), enabling you to spend more time on the actual data, science, management, and equity questions you have for your project (aka the good stuff!).

From an equity lens - having consistent and reproducible data makes it easy for others to replicate your work, which can increase transparency of your process and ultimately help build trust with your partners, the communities impacted by the management decisions associated with your project, and the public at large.

Additionally, working from a foundation of tidy data (and using consistent tools) can make it easier for others to collaborate with and contribute to your data project, which generally results in a better product in the end.

Processing Data with an Equity Lens

Keep your “raw data raw”: It’s important to keep (and back up) a version of your data that is untouched by your process (i.e. raw). We also recommend backing up a version of your data that is tidied and cleaned (i.e. what you will use for analysis). Not only does doing so protect you if your working version of the data are lost for some reason (e.g., computer crash, freak system failure) - but it also helps with reproducibility and transparency of your process.

Remove duplicate or irrelevant data: When you’re processing your data and making it tidy, think through which data from your sources are applicable to your question(s). Sometimes we don’t have the technical or logistical capacity to keep ALL of the data we are able to collect - so if you are going to remove data at this step, be sure to document WHY you’re removing it in a way that makes it easy for others to easily understand the reasoning behind the data decisions you’re making. Some reasons might include:

Data are outside of the geographic scope of the analysis (e.g. we’re focused on a particular region for the project, and we are removing data outside of that region so we can increase loading/analysis speeds)
Fields/columns from source data are not relevant to our analysis for XYZ reasons.

Transform your data into a tidy structure: Here you will reformat or restructure your dataset so it can be used in your future analyses or product development steps in a way that is efficient, effective, and meaningful for your project’s objectives. A common data transformation at this phase is converting a dataset from wide to long formats.

Graphic with a quote from Hadley Wickham at the top " Tidy data is a standard way of mapping the meaning of a dataset to its structure". Below the quote, on the left there is text that defines tidy data as a structure where: each variable forms a column, each observation forms a row, and each cell is a single measurement. To the right of the text is an example table describing pets and their colors, with three columns (ID, name, color), six observations (IDs 1-6) and each cell representing a different pet (e.g. id = 1, name = floof, color = gray; id = 2, name = max, color = black) — Graphic illustrating tidy data as a way to describe data that’s organized with a particular structure – a rectangular structure, where each variable has its own column, each observation has its own row, and each cell is a single measurement. Artwork by Allison Horst from the Openscapes Blog: Tidy data for efficiency, reproducibility, and collaboration by Julie Lowndes and Allison Horst.

Explore your data - but don’t analyze it: Once duplicate and irrelevant data are removed from the dataset, you may want to explore your data using simple visualizations or statistics (see the Data Exploration Checklist for ideas). This will help you find extreme outliers in your data, or otherwise incorrect or missing data that should be removed, separated, corrected, substituted or imputed before you begin your data analysis phase. Similar to when you removed duplicate or irrelevant data, you will want to document WHY you’re removing/correcting/imputing these data in a way that makes it easy for others to easily understand the reasoning behind the data decisions you’re making.

Any aggregation or preliminary analysis of your data should be completed during the analysis phase of the project. Keeping the values and aggregation of the data as they are at this phase increases reproducibility and transparency now and during future steps. Note that this is different from transforming data as described above (e.g. restructuring from wide to long format), which is a common and very useful part of the data processing phase.

You might be tempted to begin your data quality assessments as you’re looking at the data during this phase, but you’ll actually want to hold off on that for now. You’ll complete your data quality and assurance steps after all of your data have been tidied and you can look at all of your data holistically.

Additional Resources

Julia Lowndes and Allison Horst (2020) Tidy Data for reproducibility, efficiency, and collaboration. Openscapes blog.
College of Water Informatics Data Management Handbook - Collect and Process Section
Hadley Wickham. Tidy Data. Journal of Statistical Software
Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. R for Data Science (2e) - Data Tidying Chapter
The Py4DS Community. Python for Data Science - Tidy Data Chapter
Karl W. Broman, and Kara H Woo (2018) Data Organization in Spreadsheets. The American Statistician 72 (1). Available open access as a PeerJ preprint.