Global AI and Data Science

EDA: Exploratory Data Analysis with example in Jupyter notebook

By Shivam Solanki posted Wed February 19, 2020 05:35 PM

  

The goal of EDA is to leverage visualization tools, summary tables, and hypothesis testing to:

  • Provide summary level insight into a dataset.
  • Uncover underlying patterns and structures in your data.
  • Identify outliers, missing data, class balance, and other data-related issues.
  • Relate the available data to the business opportunity.

Let’s work through a case study based on the Online Retail data set, which is available through the UCI Machine Learning Repository. This is a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 (day/month/year) for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

The business scenario is that the management team wants to spend less time on projection models while forecasting revenue more accurately. Well-projected numbers are expected to help stabilize staffing and budget projections, which should have a beneficial ripple effect throughout the company.

The business metric can be defined as a function of the revenue gained through more accurate predictions.

Steps of the EDA Process:

1. Load data into pandas, NumPy or another similar tool and summarize the data

[Figure: Loading data into pandas]
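
As a rough sketch of this step, the snippet below reads a handful of rows into pandas and prints a summary. The column names are assumed to match the UCI Online Retail file; the sample rows, the `SAMPLE_CSV` string and the `load_retail` helper are made up for illustration and are not taken from the original notebook. With the real file you would pass its path instead of the in-memory buffer.

```python
import io

import pandas as pd

# Made-up rows in the assumed shape of the Online Retail data.
SAMPLE_CSV = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
A0001,S100,GIFT BOX,6,2010-12-01 08:26:00,2.50,17850,United Kingdom
A0001,S200,CANDLE HOLDER,2,2010-12-01 08:26:00,4.00,17850,United Kingdom
A0002,S100,GIFT BOX,10,2011-01-15 10:00:00,2.50,,France
A0003,S300,MUG,3,2011-01-20 14:30:00,5.00,13047,United Kingdom
"""


def load_retail(source):
    """Read transactions, keeping codes as text and parsing the timestamp."""
    return pd.read_csv(
        source,
        parse_dates=["InvoiceDate"],
        dtype={"InvoiceNo": str, "StockCode": str},
    )


df = load_retail(io.StringIO(SAMPLE_CSV))

print(df.shape)       # number of rows and columns
df.info()             # column types and non-null counts
print(df.describe())  # summary statistics for the numeric columns
```
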

2. Use tables, text and visualizations to tell the story that relates the business opportunity to the data

[Figure: Monthly revenue calculation]

Here, the data is used to calculate the monthly revenue of the online retail store. Because one goal of this case study is forecasting revenue, it is important to quantify revenue with a formula that can later be reused in supervised learning or hypothesis testing.
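
One way the monthly revenue could be computed is sketched below. It assumes revenue per line item is `Quantity * UnitPrice` and that the timestamp column is called `InvoiceDate`; the `monthly_revenue` helper and the sample values are illustrative and may differ from the formula used in the notebook.

```python
import pandas as pd

# Made-up transactions; only the columns needed for revenue are included.
transactions = pd.DataFrame(
    {
        "InvoiceDate": pd.to_datetime(
            ["2010-12-01", "2010-12-15", "2011-01-05", "2011-01-20"]
        ),
        "Quantity": [6, 2, 10, 3],
        "UnitPrice": [2.50, 4.00, 2.50, 5.00],
    }
)


def monthly_revenue(df):
    """Sum Quantity * UnitPrice per calendar month."""
    revenue = df["Quantity"] * df["UnitPrice"]
    month = df["InvoiceDate"].dt.to_period("M")
    return revenue.groupby(month).sum().rename("Revenue")


rev = monthly_revenue(transactions)
print(rev)
```

Keeping the calculation in a small function makes it easy to reuse the same definition of revenue later as a modelling target.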

3. Identify a strategy to deal with missing values:

Data integrity issues are often first identified during the Exploratory Data Analysis (EDA) process. After extracting the data, it is important to include quality assurance checks even on the first pass through the project workflow. The quality assurance step must check for duplicates and missing values. How missing values are handled generally depends on the category of missingness: MCAR (missing completely at random), MAR (missing at random), or MNAR (missing not at random). If the missing data are not MCAR, imputing values can increase bias, so it is very important to have a train/test split against which the effect of an imputation strategy can be checked.

[Figure: Data cleaning summary]
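
The quality checks described above might look something like the following. The `quality_report` helper and the sample frame are hypothetical, and dropping rows with a missing `CustomerID` is only one possible strategy, shown here for illustration; the right choice depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Made-up rows: the first two are exact duplicates, and some values are missing.
raw = pd.DataFrame(
    {
        "InvoiceNo": ["A0001", "A0001", "A0002", "A0003", "A0003"],
        "Description": ["GIFT BOX", "GIFT BOX", None, "MUG", "CANDLE"],
        "Quantity": [6, 6, 10, 3, 1],
        "CustomerID": [17850.0, 17850.0, np.nan, 13047.0, np.nan],
    }
)


def quality_report(df):
    """Count rows, fully duplicated rows and missing values per column."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
    }


report = quality_report(raw)
print(report)

# One possible cleaning pass: remove exact duplicates, then drop rows
# that have no customer identifier.
cleaned = raw.drop_duplicates().dropna(subset=["CustomerID"])
print(len(raw), "->", len(cleaned), "rows after cleaning")
```
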

4. Investigate the data and underlying business scenario with visualizations and hypothesis testing.

Jupyter notebooks are predominantly used for investigating the data with visualizations. However, best practice for a data scientist is to save as much code as possible in text files, whether as simple scripts, modules, or a Python package. This ensures reusability, allows for unit testing, and works naturally with version control.

For example, the plot_rev() function is called from the Python script data_visualization.py rather than rewriting the plotting code repeatedly in the Jupyter notebook.

[Figure: EDA charts with plot_rev() function]
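
The actual contents of data_visualization.py are not shown in this post, so the following is only a guess at what a function like plot_rev() could contain. The signature, the bar chart and the labels are all assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook

import matplotlib.pyplot as plt
import pandas as pd


def plot_rev(monthly, ax=None):
    """Hypothetical stand-in for plot_rev(): bar chart of revenue per month."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    labels = [str(period) for period in monthly.index]
    ax.bar(labels, monthly.to_numpy())
    ax.set_xlabel("Month")
    ax.set_ylabel("Revenue")
    ax.set_title("Monthly revenue")
    ax.tick_params(axis="x", rotation=45)
    return ax


# Made-up monthly totals, indexed by month.
monthly = pd.Series(
    [23.0, 40.0],
    index=pd.PeriodIndex(["2010-12", "2011-01"], freq="M"),
    name="Revenue",
)
ax = plot_rev(monthly)
```

If this lived in data_visualization.py, the notebook would only need `from data_visualization import plot_rev` followed by a single call, which keeps the notebook short and lets the function be unit tested on its own.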

The Jupyter notebook should be kept as a presentable component with minimal code. It can be used as a data scientist’s PowerPoint to tell the story of the initial findings on the data.
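
Step 4 also mentions hypothesis testing. The post does not say which test was used, so the sketch below shows just one option: a two-sided permutation test for a difference in mean invoice value between two groups of customers. The groups, the values and the `permutation_test` helper are invented for illustration.

```python
import numpy as np


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])

    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels and recompute the difference in means.
        shuffled = rng.permutation(pooled)
        diff = shuffled[: a.size].mean() - shuffled[a.size :].mean()
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up invoice totals for two hypothetical customer groups.
group_a = [10, 12, 11, 13, 12, 14]
group_b = [1, 2, 1, 3, 2, 2]

observed, p_value = permutation_test(group_a, group_b)
print(f"difference in means: {observed:.2f}, p-value: {p_value:.4f}")
```

A small p-value suggests the difference between the groups is unlikely to be explained by random labelling alone, which is the kind of evidence that can back up a pattern first spotted in a chart.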

5. Communicate your findings

There is no single right way to communicate EDA, but the minimum bar is that the data summaries, key findings, investigative process, and conclusions are made clear. Deliverables should also be concise.

One important deliverable could be the result of investigating the relationship between the relevant data, the target, and the business metric. For example, the revenue calculated in step 2 could be the target variable directly related to the business metric, and the EDA deliverable could substantiate a proposal for a supervised learning and/or forecasting model.

Follow this link to access the notebook for EDA



Comments

Thu February 20, 2020 11:14 AM

Thank you! always looking to learn and improve data science/engineering skills

Thu February 20, 2020 10:55 AM

@Nhi Diep I am glad that you liked it. I will be posting more of such blogs with examples and codes to make it easier to follow. Thanks!

Thu February 20, 2020 06:59 AM

Thank you for this!! Awesome walk through