Keeping Your Data So Fresh and So Clean

Photo by Marcela Laskoski on Unsplash
Music video by Jill Scott performing Golden. © 2007 Hidden Beach Records

Cleaning Data

To clean data, I often spend lots of time:

  • Removing Duplicates — duplicates take up space, can slow the implementation of models, and may lead to erroneous insights.
  • Replacing Missing Values — while there is no optimal way to deal with missing values, learning how to handle missing values is a skill that budding data nerds should work on building up.
  • Replacing Placeholder values — placeholders like ‘—’, ‘NaN’, ‘?’, and ‘#’ consistently gum up the works and lead to errors, both seen and unseen, in analysis. If you are lucky, error messages abound; if you are supremely unlucky, you only notice after trying to visualize the correlation between dependent and independent variables and getting noticeably improbable graphs.
  • Casting datatypes to more appropriate datatypes — Often, this shows up when I have naively been treating a categorical variable (something with a fixed number of possible values) as though it were continuous (something that can take any value between a minimum and a maximum).
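
As a rough illustration of the casting step, here is a minimal sketch assuming pandas and a made-up data frame. The column names (bedrooms, price) and values are hypothetical, chosen only to show a numeric-looking column that is really categorical.

```python
import pandas as pd

# Hypothetical toy data: 'bedrooms' is stored as integers,
# but it only ever takes a handful of distinct values.
df = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 2],
    "price": [250000.0, 310000.0, 305000.0, 420000.0, 245000.0],
})

# Remember the original dtype so the change is visible.
dtype_before = df["bedrooms"].dtype

# Cast to pandas' 'category' dtype so later steps treat the column
# as a fixed set of levels rather than a continuous quantity.
df["bedrooms"] = df["bedrooms"].astype("category")

# The distinct levels pandas found in the column.
levels = list(df["bedrooms"].cat.categories)

print(dtype_before, "->", df["bedrooms"].dtype)
print(levels)
```

Whether a column deserves this treatment is a judgment call; a quick look at how many unique values it holds is usually a reasonable first hint.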

Where do I begin?

I start by looking at the shape of the data frame: I need to know how many rows and columns I am working with and will have to wrangle into submission. As I jam to the legendary vocalist from Philadelphia, Pennsylvania, I use the following:

  • df.shape — Review the dimensions of the data
  • df.info() — Review a concise summary of the data
  • df.head() — Review the first five rows of the data (you can change this number by passing a number n inside the parentheses; for example, df.head(10) will yield the first ten rows of the data).
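
Put together, those three calls might look like the sketch below. The data frame is a small hypothetical stand-in (the column names song, artist, and plays are invented for illustration); in practice df would come from something like pd.read_csv.

```python
import pandas as pd

# Hypothetical stand-in for a real dataset.
df = pd.DataFrame({
    "song": ["Golden", "Ghetto Story", "No Letting Go",
             "Temperature", "Romie", "Excalibur"],
    "artist": ["Jill Scott", "Baby Cham", "Wayne Wonder",
               "Sean Paul", "Beenie Man", "Dungeon Family"],
    "plays": [120, 95, 80, 150, 60, 75],
})

# df.shape is an attribute (no parentheses): a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# df.info() prints column names, non-null counts, and dtypes.
df.info()

# df.head(n) returns the first n rows (five when n is omitted).
first_rows = df.head(3)
print(first_rows)
```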

Onto the Next

After I have learned the basic features of the data, I check the entire data frame for placeholders, often with something that looks like the following:
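
One possible way to run that check is sketched here. It is not necessarily the exact code used for this article: the placeholder list and the toy columns (age, city, income) are assumptions made for illustration, and a real dataset may hide other symbols.

```python
import pandas as pd

# Assumed list of placeholder strings; adjust it to whatever the dataset uses.
placeholders = ["?", "#", "-", "--"]

# Hypothetical toy data with a couple of placeholders mixed in.
df = pd.DataFrame({
    "age": ["34", "?", "29", "41"],
    "city": ["Nassau", "Freeport", "#", "Nassau"],
    "income": [52000, 61000, 58000, 47000],
})

# isin() flags every cell that exactly matches a placeholder;
# summing the boolean frame gives a count per column.
placeholder_counts = df.isin(placeholders).sum()
print(placeholder_counts)

# value_counts() on a suspicious column shows odd entries alongside real ones.
print(df["age"].value_counts())
```

Because isin() only catches exact matches, placeholders padded with spaces (such as ‘ ?’) would slip through unless the strings are stripped first.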

The official video of “Ghetto Story (feat. Alicia Keys)” by Baby Cham off the album ‘Ghetto Story’- Atlantic Records

Reviewing the Changes

After identifying placeholders, I use steps like this while jamming to old-school Reggae:
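
A minimal sketch of what such steps could look like, assuming the same kind of hypothetical placeholder list and toy columns as before: swap the placeholders for NaN so pandas recognizes them as missing, then convert the affected column to a numeric type.

```python
import numpy as np
import pandas as pd

# Assumed placeholder strings and hypothetical toy data.
placeholders = ["?", "#", "-", "--"]
df = pd.DataFrame({
    "age": ["34", "?", "29", "41"],
    "city": ["Nassau", "Freeport", "#", "Nassau"],
})

# Replace every placeholder with NaN so pandas treats it as missing.
cleaned = df.replace(placeholders, np.nan)

# With the placeholder gone, the 'age' strings can be converted to numbers.
cleaned["age"] = pd.to_numeric(cleaned["age"])

# Review the changes: missing values per column and the new dtypes.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
print(cleaned.dtypes)
```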

Keeping the Groove Going

I typically transition to dirty South jams while removing duplicates and deciding whether I will drop rows from the data frame with null values or replace them with the median or mode. My favorite artists in that genre are from the Dungeon Family.
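
The choice between dropping and filling can be sketched as follows. The data is hypothetical, and using the median for the numeric column and the mode for the text column is just one reasonable pairing, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact duplicates,
# and each column has one missing value.
df = pd.DataFrame({
    "age": [34.0, 34.0, np.nan, 29.0, 41.0],
    "city": ["Nassau", "Nassau", "Freeport", None, "Nassau"],
})

# Remove exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates()

# Option A: drop every row that still contains a null value.
dropped = deduped.dropna()

# Option B: keep the rows and fill the gaps instead.
filled = deduped.copy()
# Numeric column: fill with the median, which is robust to outliers.
filled["age"] = filled["age"].fillna(filled["age"].median())
# Text column: fill with the mode (the most frequent value).
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(len(df), len(deduped), len(dropped))
print(filled)
```

Dropping is simpler but throws away information; filling keeps every row at the cost of inventing values, so the right call depends on how much data is missing.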

Provided to YouTube by Arista: “Excalibur” by Dungeon Family, Goodie Mob, and Big Rube, from Even In Darkness.

  1. Jill Scott — Golden
  2. Baby Cham featuring Alicia Keys — Ghetto Story
  3. Wayne Wonder — No Letting Go
  4. Sean Paul — Temperature
  5. Beenie Man — Romie
  6. Dungeon Family — Excalibur
  7. Erykah Badu — On and On
  8. Outkast — Hey Ya!
  9. Outkast — Rosa Parks
  10. Lauryn Hill — Ex Factor

What's my next step?

I would like to build more flexibility and variety into my playlist:

  • How do I use data science to create a better playlist?
  • I am currently working on a trip-planning tool for my upcoming trip to the Bahamas to celebrate our 50th year of independence.

Contact Me

If you would like to be updated with my latest articles, follow me on Medium. You can also connect with me on LinkedIn or email me at tenicka.norwood@gmail.com.

It's not about the intention, it's about the impact of your insights. I help non-traditional learners make the leap into data science and analysis.

Tenicka Terell Norwood
