Published in MCD-UNISON

Exploratory Data Analysis and Automated EDA in Python using Sweetviz

A complete walkthrough of Exploratory Data Analysis in Python, and a way to automate it with the Sweetviz Python library.

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis is the process of examining a dataset and summarizing its key characteristics. EDA matters because if you are not familiar with the dataset you are working on, you cannot draw reliable conclusions from it.

Understanding EDA with an interesting use case in Python.

We will work with the Breast Cancer Wisconsin (Diagnostic) dataset, available on Kaggle. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.

Using this dataset we will try to understand the characteristics of the data and its descriptive measures with EDA and Automated EDA.

As a first step, let’s import the libraries needed.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

The next step is to load the Breast Cancer dataset from the CSV file to a Pandas DataFrame:

breast_cancer = pd.read_csv('cancer.csv')

1. Checking the features in the dataset.

We can use the head() method of the DataFrame to see its first rows.

breast_cancer.head()
First rows and columns of the dataset.

As we can see, the second column is “diagnosis” where “M” represents Malignant and “B” means Benign.

2. Checking the data type of each column and non-null count.

breast_cancer.info()
First 26 rows of the output.

We can see here that the dataset contains 569 rows (data points) and 32 columns: the “id”, the “diagnosis” label, and 30 numeric features.

The “id” column is an integer, the “diagnosis” column has the ‘object’ type (it is a categorical variable), and the remaining columns are floats.
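The same check can be done programmatically instead of reading the info() output by eye. The sketch below runs on a small made-up frame: the values are illustrative only, and the column name radius_mean is an assumption about how the CSV is labeled.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 11.4, 12.5],  # assumed column name
})

# Count how many columns share each data type.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# List the non-numeric columns, which are the ones that need encoding.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

On the real dataset, the only non-numeric column should be “diagnosis”.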

3. Encoding the labels for classification problems.

Let’s encode the “diagnosis” column to have all the columns in numerical format. We will encode “B” as 0 and “M” as 1.

label_encode = LabelEncoder()
labels = label_encode.fit_transform(breast_cancer['diagnosis'])
breast_cancer['target'] = labels
breast_cancer.drop(columns=['id', 'diagnosis'], inplace=True)

We encoded the “diagnosis” column, stored the result in a new column called “target”, and then removed “diagnosis”. We also removed “id” because it is only an identifier and tells us nothing about the diagnosis.
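LabelEncoder assigns integers in the sorted order of the class labels, which is why “B” becomes 0 and “M” becomes 1. If you prefer the encoding to be visible in the code, an explicit mapping is one alternative. This is a sketch on made-up labels and not the method used above:

```python
import pandas as pd

# Made-up labels standing in for the "diagnosis" column.
diagnosis = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

# Explicit mapping: Benign -> 0, Malignant -> 1.
mapping = {"B": 0, "M": 1}
target = diagnosis.map(mapping)

# map() yields NaN for labels missing from the mapping, so fail loudly if that happens.
assert target.notna().all(), "unexpected label in diagnosis column"
target = target.astype(int)

print(target.tolist())  # [1, 0, 0, 1, 0]
```

The explicit dictionary documents the encoding and catches unexpected labels, at the cost of having to list every class by hand.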

4. Checking for missing values.

Let’s check whether there are any missing values in the dataset.

breast_cancer.isnull().sum()
First few rows of the output.

As we can see, there are no missing values in this case.
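With around thirty columns, the isnull().sum() output is long to scan. A short sketch of how it could be condensed to a single total and a list of affected columns, using a made-up frame with one deliberately missing value:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in column "a"; values are made up.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

# Missing values per column, as in the step above.
missing_per_column = toy.isnull().sum()

# Keep only the columns that actually have missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]

# Single number for the whole frame; 0 means nothing is missing.
total_missing = int(missing_per_column.sum())

print(columns_with_missing)
print(total_missing)
```

For the breast cancer DataFrame used here, the total should be 0 and the filtered list should be empty.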

5. Descriptive summary of the dataset.

The next step is to get some statistical measures about the dataset. This is “Descriptive Statistics”, a numerical summary of the data. We can use the describe() function in Pandas for this.

breast_cancer.describe()
Showing a few columns of the output.

The main takeaway here is that, for most of the columns, the mean is larger than the median (the 50th percentile, shown in the 50% row). This suggests that those features are right-skewed.
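Comparing the mean and the median is only a rule of thumb, so it can help to compute the skewness directly as well. A minimal sketch on two made-up columns, one with a long right tail and one symmetric:

```python
import pandas as pd

# Made-up data: one column with a long right tail, one symmetric column.
toy = pd.DataFrame({
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 3.0, 20.0],
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Rule of thumb from describe(): is the mean above the median?
mean_above_median = toy.mean() > toy.median()

# Sample skewness per column; positive values indicate a right tail.
skewness = toy.skew()

print(mean_above_median)
print(skewness)
```

Applied to breast_cancer, the same two lines would show which features the rule of thumb flags and how strongly each one is skewed.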

6. Checking the distribution of the target variable.

The next step is to check how the rows are distributed across the target classes, to see whether the dataset is imbalanced.

breast_cancer['target'].value_counts()

There is a slight imbalance in the dataset: the number of Benign (0) cases is greater than the number of Malignant (1) cases.
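Raw counts are easier to judge as proportions. A small sketch using value_counts(normalize=True) on a made-up target column:

```python
import pandas as pd

# Made-up target column: three Benign (0) and two Malignant (1) cases.
target = pd.Series([0, 0, 0, 1, 1], name="target")

# Absolute number of rows per class.
counts = target.value_counts()

# Share of each class; the values sum to 1.
shares = target.value_counts(normalize=True)

print(counts)
print(shares)
```

The closer the shares are to each other, the more balanced the classes are.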

7. Grouping the data based on the target variable.

Let’s group the dataset based on the target variable. The rows are split into two groups, 0 and 1, representing Benign and Malignant, respectively. For each group we compute the mean of every column.

breast_cancer.groupby('target').mean()
Showing a few columns of the output.

This tells us that, for most of the features, the mean value is greater for Malignant cases than for Benign cases.
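To back up the word “most” with a number, the two rows of group means can be compared column by column. A sketch on a made-up frame, where the feature names and values are purely illustrative:

```python
import pandas as pd

# Made-up frame: "radius" is higher for class 1, "smooth" is not.
toy = pd.DataFrame({
    "target": [0, 0, 1, 1],
    "radius": [10.0, 12.0, 18.0, 20.0],
    "smooth": [0.10, 0.08, 0.09, 0.07],
})

# Mean of every feature within each class, as in the step above.
group_means = toy.groupby("target").mean()

# True where the Malignant (1) mean exceeds the Benign (0) mean.
higher_in_malignant = group_means.loc[1] > group_means.loc[0]

print(higher_in_malignant)
print(int(higher_in_malignant.sum()), "of", len(higher_in_malignant), "features")
```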

At this point, as a summary we have:

  • The dataset has 569 rows and 32 columns.
  • It has no missing values.
  • The data is right-skewed for most of the features.
  • There are more Benign cases than Malignant cases.
  • The mean value for most of the features is greater for Malignant cases than for Benign cases.

The EDA is not complete yet.

Data visualization is also needed to understand the data better. That topic won’t be covered in this post, but we will show a way to automate the EDA.

We will automate the EDA using Sweetviz, a Python library that generates high-density visualizations to kick-start your EDA.

8. Installing Sweetviz

We can install Sweetviz by using the pip install command given below.

pip install sweetviz

9. Analyzing Dataset.

We have already loaded the dataset and analyzed it manually, so the next step is to analyze the same DataFrame with Sweetviz.

# Import Sweetviz
import sweetviz as sv
# Analyze the dataset
breast_cancer_report = sv.analyze(breast_cancer)
# Write the report to an HTML file and display it
breast_cancer_report.show_html('cancer.html')
EDA Report.

And that’s all. As you can see, our EDA report is ready and contains a lot of information about every attribute. It is easy to understand, takes just three lines of code to produce, and is exported to an HTML file.
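Sweetviz can also relate every feature to a target column. The sketch below wraps that in a helper. The helper name make_report and the default file name are hypothetical, and it assumes that the target_feat argument of analyze() and the open_browser argument of show_html() behave as described in the Sweetviz README, so check them against your installed version.

```python
import pandas as pd


def make_report(df: pd.DataFrame, path: str = "cancer_target.html"):
    """Build a Sweetviz report that relates each feature to the "target" column.

    Hypothetical helper; the parameter names are assumed from the Sweetviz README.
    """
    # Imported here so that defining the helper does not require Sweetviz.
    import sweetviz as sv

    # target_feat asks Sweetviz to show how each feature relates to the target.
    report = sv.analyze(df, target_feat="target")

    # open_browser=False writes the file without opening a browser tab.
    report.show_html(path, open_browser=False)
    return report
```

Calling make_report(breast_cancer) should write the HTML file to the working directory without opening it.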

Sweetviz can also be used to compare two datasets, such as train and test data. For this example, we split the data into two parts: the rows from index 100 onward form the train dataset and the first 100 rows form the test dataset.

The compare() function of Sweetviz builds the comparison report. Its first argument is the train dataset and its second is the test dataset.

bc_comp = sv.compare(breast_cancer[100:], breast_cancer[:100])
bc_comp.show_html('Compare.html')
Comparison Analysis using Sweetviz.
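In the report, the two datasets are easier to tell apart when they carry names. The sketch below separates the split from the report and passes each frame together with a label. The helper names split_rows and compare_report are hypothetical, and the [DataFrame, "name"] form is assumed from the Sweetviz README.

```python
import pandas as pd


def split_rows(df: pd.DataFrame, n_test: int):
    """Return (train, test), where the first n_test rows form the test set."""
    test = df.iloc[:n_test]
    train = df.iloc[n_test:]
    return train, test


def compare_report(train: pd.DataFrame, test: pd.DataFrame, path: str = "Compare.html"):
    """Hypothetical helper: build a named train/test comparison report."""
    # Imported here so that split_rows works without Sweetviz installed.
    import sweetviz as sv

    # Each dataset is passed as [DataFrame, display name].
    report = sv.compare([train, "Train"], [test, "Test"])
    report.show_html(path, open_browser=False)
    return report


# Made-up frame to show the split; the real call would use breast_cancer and 100.
toy = pd.DataFrame({"target": [0, 1, 0, 1, 0], "radius": [10.0, 18.0, 11.0, 20.0, 12.0]})
train, test = split_rows(toy, n_test=2)
print(len(train), len(test))  # 3 2
```

A split by row position only gives comparable halves if the file is not sorted in some meaningful way, so a shuffled split is often the safer choice.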

Sweetviz provides many more functions, but these are the basics. There are also other libraries that automate the EDA process, such as Pandas Profiling.

I encourage you to explore the Sweetviz documentation to learn more.
