Data Processing

“Raw data, like raw potatoes, usually require cleaning before use.”

Ronald A. Thisted, Professor Emeritus - Departments of Statistics and Public Health Sciences, The University of Chicago

“Happy families are all alike; every unhappy family is unhappy in its own way.”

— Leo Tolstoy, Russian author known as one of the world’s greatest novelists.

“Tidy datasets are all alike, but every messy dataset is messy in its own way.”

Hadley Wickham, Chief Scientist at Posit (formerly RStudio)

Good data organization and formatting is the foundation of any data project. Once the data are collected, you will need to process them so they can be used in your analyses or product development steps.

Important

It’s commonly understood in the data science field that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003) - expect the same will be true for your project and prepare to invest time and resources accordingly.

The Resources section below links to resources that provide detailed technical guidance on how to process data using data science methods. Here, we discuss the why and how we go about this step with an equity lens.

When we talk about data processing here, we’re referring to the steps required to organize, format, and clean the data so it’s more efficient to use in future steps, and so others can consistently reproduce your steps - - this is commonly referred to as tidying data.

Why Tidy Data?

From a data science lens - Investing the time to properly tidy your data upfront makes the analyses or product development steps much more efficient and reproducible. This will cut down the time it takes in the long term to complete your analysis and product development steps (even if they need to evolve/change over time!), enabling you to spend more time on the actual data, science, management, and equity questions you have for your project (aka the good stuff!).

From an equity lens - having consistent and reproducible data makes it easy for others to replicate your work, which can increase transparency of your process and ultimately help build trust with your partners, the communities impacted by the management decisions associated with your project, and the public at large.

Additionally, working from a foundation of tidy data (and using consistent tools) can make it easier for others to collaborate with and contribute to your data project, which generally results in a better product in the end.

How to Tidy Data with an Equity Lens

When you’re processing your data and making it tidy, think through which data from your sources are applicable to your question(s). Sometimes we don’t have the technical or logistical capacity to keep ALL of the data we are able to collect - so if you are going to remove data at this step, be sure to document WHY you’re removing it in a way that makes it easy for others to easily understand the reasoning behind the data decisions you’re making. Some reasons might include:

  • Data are outside of the geographic scope of the analysis (e.g. we’re focused on a particular region for the project, and we are removing data outside of that region so we can increase loading/analysis speeds)
  • Fields/columns from source data are not relevant to our analysis for XYZ reasons.

Keep your “raw data raw”. Any aggregation or preliminary analysis of your data should be completed during the analysis phase of the project. Keeping the values and aggregation of the data as they are at this phase increases reproducibility and transparency now and during future steps. Note that this is different from reformatting/restructuring data from wide to long formats, which is a common and very useful part of the data processing/tidying phase.

You might be tempted to begin your data quality assessments as you’re looking at the data during this phase, but you’ll actually want to hold off on that for now. You’ll complete your data quality and assurance steps after all of your data have been tidied and you can look at all of your data holistically.

Resources