Keeping Your Data So Fresh and So Clean

A Playlist for Squeaky Clean Data

I have been consuming lots of data science and analysis content, and one of my major takeaways is that no model you build will yield actionable insights if you feed it garbage data. One of my instructors holds that the first thing newbies need to learn is the GIGO principle. Well, he would put it more succinctly and say that it is better to spend 80% of your time cleaning your data than to build a shiny model and feed it garbage.

The GIGO (Garbage In, Garbage Out) principle holds that models are akin to humans: they are what they consume. If you feed your model unrefined data, you can expect your metrics to be dampened by the inaccuracies that compound along your data pipeline.

With this in mind, I created a playlist to keep my mind focused while cleaning data and thought that you might find it helpful if you are new to digging into data like I am.

Let’s start with data in the .csv file format and assume that we have already imported all of the libraries and visualization packages necessary to build a model.

My typical first step is to use pd.read_csv to create a data frame called df.

df = pd.read_csv('myfile.csv', index_col = 0)

After I have an initial data frame, I clean the data. The first song on my list, the one I am typically listening to while reading the .csv file and creating my initial data frame, is:

Jill Scott’s Golden

Cleaning Data

To clean data, I often spend lots of time:

Removing Duplicates: duplicates take up space, can slow the implementation of models, and may lead to erroneous insights.
Replacing Missing Values: while there is no single optimal way to deal with missing values, learning how to handle them is a skill that budding data nerds should work on building up.
Replacing Placeholder Values: placeholders like ‘-’, ‘NaN’, ‘?’, and ‘#’ consistently gum up the works and lead to errors both seen and unseen in analysis. If you are lucky, error messages abound; if you are supremely unlucky, you only notice after you try to visualize the correlation between dependent and independent variables and get noticeably improbable graphs.
Casting Datatypes to More Appropriate Datatypes: often, this shows up when I have naively been treating a categorical variable (think something with a fixed number of values) as though it were continuous (something that can have any value between a minimum and a maximum).
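As a minimal sketch of that last item, assuming a made-up data frame called houses in which a 1 to 5 condition rating was read in as plain integers, the cast might look something like this (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: "condition" is a 1-5 rating that pandas reads as integers.
houses = pd.DataFrame({
    "price": [250000, 310000, 199000, 420000],
    "condition": [3, 4, 3, 5],
})

# Cast the rating to a categorical dtype so it is no longer treated as continuous.
houses["condition"] = houses["condition"].astype("category")

# Number of distinct rating levels that actually appear in the data.
n_levels = houses["condition"].cat.categories.size
```

After the cast, houses.dtypes should show condition as category, which is a quick way to confirm that the change took.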

Where do I begin?

I look at the shape of the data frame first, because I need to know how many rows and columns I am working with and need to wrangle into submission. As I jam to the legendary vocalist from Philadelphia, Pennsylvania, I use the following:

df.shape: Review the dimensions of the data
df.info(): Review a concise summary of the data
df.head(): Review the first five rows of the data. You can change this number by passing a number n inside the parentheses; for example, df.head(10) will yield the first ten rows of the data.
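Here is a small sketch of those three calls on a toy data frame. The name sample and its contents are made up purely for illustration:

```python
import pandas as pd

# Hypothetical toy data, just to have something to inspect.
sample = pd.DataFrame({
    "track_id": [1, 2, 3],
    "plays": [120, 85, 240],
})

n_rows, n_cols = sample.shape  # shape is an attribute: (rows, columns)
sample.info()                  # prints column names, non-null counts, and dtypes
first_two = sample.head(2)     # first two rows instead of the default five
```

Note that shape has no parentheses because it is an attribute, while info() and head() are methods.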

Onto the Next

After I have learned the basic features of the data, I check the entire data frame for placeholders, often with something that looks like the following:

df.isin(['?', '#', 'NaN', 'null', 'N/A', '-']).any()
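To show what that check returns, here is a minimal sketch on a made-up data frame. The name raw, the column names, and the values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical data: "sqft" hides a "?" placeholder, "zipcode" is clean.
raw = pd.DataFrame({
    "sqft": ["1200", "?", "950"],
    "zipcode": ["98101", "98102", "98103"],
})

placeholders = ["?", "#", "NaN", "null", "N/A", "-"]

# True for each column that contains at least one placeholder string.
has_placeholder = raw.isin(placeholders).any()

# Count the placeholders per column to see how widespread the problem is.
placeholder_counts = raw.isin(placeholders).sum()
```

One caveat: with its default settings, pd.read_csv already converts strings such as 'NaN', 'null', and 'N/A' into real missing values, so isin will not find those. Running df.isna().sum() alongside this check covers that case.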

This is typically when I start listening to old-school reggae artists like Baby Cham, Capleton, and Sean Paul, who bring back the nostalgia of my high school projects in the physics lab with Mr. Green, who had the best rhythmic reggae soundtrack in the late 90s on the East Side of New Providence, bar none.

Baby Cham’s Ghetto Story (feat. Alicia Keys)

Reviewing the Changes

After identifying placeholders, I use steps like this while jamming to old-school Reggae:

df['column_with_placeholder'] = pd.to_numeric(df['column_with_placeholder'], errors = "coerce")
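A quick sketch of what errors="coerce" does, again on a made-up column (raw and sqft are hypothetical names): any value that cannot be parsed as a number becomes NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical column stored as strings because of the "?" placeholder.
raw = pd.DataFrame({"sqft": ["1200", "?", "950"]})

# Anything that cannot be parsed as a number (here "?") becomes NaN.
raw["sqft"] = pd.to_numeric(raw["sqft"], errors="coerce")

# The placeholder is now a real missing value that isna() can count.
n_missing = int(raw["sqft"].isna().sum())
```

The column ends up with a float dtype, since NaN is a floating-point value, and the placeholder can then be handled like any other missing value.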

Keeping the Groove Going

I typically transition to Dirty South jams while removing duplicates and deciding whether I will drop the rows with null values from the data frame or replace those values with the median or mode. My favorite artists in that genre are from the Dungeon Family.
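Both options can be sketched on a small made-up data frame. The name data, the columns, and the values below are assumptions for illustration; which option fits depends on how much data you can afford to lose:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one duplicated row and two missing values.
data = pd.DataFrame({
    "price": [100.0, 100.0, np.nan, 300.0, 500.0],
    "city": ["Nassau", "Nassau", "Freeport", None, "Nassau"],
})

# Remove exact duplicate rows, keeping the first occurrence.
deduped = data.drop_duplicates()

# Option 1: drop every row that still has a null value.
dropped = deduped.dropna()

# Option 2: fill numeric nulls with the median and categorical nulls with the mode.
filled = deduped.copy()
filled["price"] = filled["price"].fillna(filled["price"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
```

Dropping is simpler but shrinks the data, while filling keeps every row at the cost of introducing values that were never observed.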

Dungeon Family’s Excalibur, from the album Even In Darkness

My playlist typically looks like this:

Jill Scott — Golden
Baby Cham featuring Alicia Keys — Ghetto Story
Wayne Wonder — No Letting Go
Sean Paul — Temperature
Beenie Man — Romie
Dungeon Family — Excalibur
Erykah Badu — On and On
Outkast — Hey Ya!
Outkast — Rosa Parks
Lauryn Hill — Ex Factor

What’s my next step?

I would like to build more flexibility and variety into my playlist:

How do I use data science to create a better playlist?

I have been thinking about leveraging the YouTube API, but I have not settled on an approach yet, so for right now, I just add songs based on my mood.

I am also currently working on a trip planning tool for my upcoming trip to the Bahamas to help celebrate our 50th year of Independence.

Contact Me

If you would like to be updated with my latest articles, follow me on Medium. You can also connect with me on LinkedIn or email me at tenicka.norwood@gmail.com.