Data Analysis

EDA

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

  1. maximize insight into a data set;
  2. uncover underlying structure;
  3. extract important variables;
  4. detect outliers and anomalies;
  5. test underlying assumptions;
  6. develop parsimonious models; and
  7. determine optimal factor settings.

 

  • EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and each focusing on one aspect of data characterization.
  • EDA encompasses a larger venue; it is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow, in favor of the more direct approach of letting the data itself reveal its underlying structure and model.
  • EDA is not a mere collection of techniques; it is a philosophy of how we dissect a data set, what we look for, how we look, and how we interpret. EDA does make heavy use of the techniques we call "statistical graphics", but it is not identical to statistical graphics.
  • Most EDA techniques are graphical, with a few quantitative techniques. The heavy reliance on graphics follows from EDA's main role, which is to explore open-mindedly: graphics give the analyst unparalleled power to do so, enticing the data to reveal its structural secrets and often yielding new, unsuspected insight.
  • Many data scientists will agree that it is very easy to get lost in data: the more you collect, study, and analyze, the more you want to explore. Rabbit holes of data are familiar places for data analysts and data scientists, who can spend hours extracting, modeling, and analyzing large datasets.
  • EDA techniques are either graphical or quantitative (non-graphical). Graphical methods summarize the data in a diagrammatic or visual way, whereas quantitative methods involve calculating summary statistics. Both types are further divided into univariate and multivariate methods.
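As an illustration of a quantitative, univariate technique, the sketch below computes a five-number summary (minimum, first quartile, median, third quartile, maximum). It uses only the Python standard library; the function name `five_number_summary` and the sample values are made up for this example, and the "inclusive" quartile method is just one of several common conventions.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" treats the data as the full population, so the
    # quartiles always lie between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical sample, invented for illustration only.
ages = [23, 25, 29, 31, 35, 38, 41, 44, 52]
summary = five_number_summary(ages)
print(summary)
```

The same five numbers are what a box plot draws, which is one way the quantitative and graphical sides of EDA connect.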

EDA Steps:-


  1. Data Sourcing
  2. Data Cleaning
  3. Univariate analysis
  4. Bivariate analysis
  5. Multivariate analysis
  6. Handling missing values
  7. Removing duplicates
  8. Outlier Treatment
  9. Normalizing and scaling (numerical variables)
  10. Encoding categorical variables (dummy variables)
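A few of the preparation steps above (removing duplicates, scaling a numerical variable, and encoding a categorical variable as dummy variables) can be sketched in plain Python. This is a minimal illustration, not a prescribed pipeline: the helper names, the `is_` column prefix, and the tiny data set are all assumptions made for the example, and min-max scaling is only one of several scaling choices.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each row (a dict), preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def min_max_scale(values):
    """Rescale numbers linearly so the smallest is 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode each category as a dict of 0/1 dummy variables."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Hypothetical records, invented for illustration only.
rows = [
    {"city": "Pune", "income": 40},
    {"city": "Delhi", "income": 60},
    {"city": "Pune", "income": 40},   # exact duplicate of the first row
    {"city": "Mumbai", "income": 80},
]

clean = drop_duplicates(rows)
scaled = min_max_scale([r["income"] for r in clean])
encoded = one_hot([r["city"] for r in clean])
print(len(clean), scaled, encoded[0])
```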

Types of Graphical Analysis:-


  1. Numerical vs. Numerical
    1. Scatterplot
    2. Line plot
    3. Heatmap for correlation
    4. Joint plot

  2. Categorical vs. Numerical
    1. Bar chart
    2. Categorical box plot
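A correlation heatmap is a colored rendering of a matrix of pairwise correlation coefficients. The sketch below computes that matrix with Pearson's coefficient using only the standard library; drawing it is left to whichever plotting library is in use. The function names and the three columns of data are hypothetical, chosen only to show one positive and one negative relationship.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every pair of column names to its Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numerical columns, invented for illustration only.
data = {
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 61, 70, 74],
    "errors": [9, 7, 6, 4, 2],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Values near +1 or -1 suggest a strong linear relationship worth inspecting with a scatterplot; values near 0 suggest no linear relationship, though a nonlinear one may still exist.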

Handling Missing Values:-

  1. Deleting rows with missing values
  2. Imputing missing data based on mean/median/mode
  3. Estimating missing data using ML models, e.g. k-nearest neighbours (kNN)
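The three strategies above can be sketched as follows, with `None` standing for a missing value. This is a simplified illustration under stated assumptions: the helper names and data are hypothetical, and the nearest-neighbour version assumes that only the target column has gaps and that distances are plain Euclidean distances on the remaining columns.

```python
import math
import statistics

def drop_missing(rows):
    """Strategy 1: delete every row that contains a missing value."""
    return [r for r in rows if None not in r]

def impute_column(values, strategy="mean"):
    """Strategy 2: fill gaps with the mean, median, or mode of observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def knn_impute(rows, target, k=2):
    """Strategy 3: fill rows[i][target] with the mean target value of the
    k complete rows closest to row i on the other columns."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(list(row))
            continue

        def distance(other, row=row):
            # Euclidean distance over every column except the target.
            return math.sqrt(sum((a - b) ** 2
                                 for i, (a, b) in enumerate(zip(row, other))
                                 if i != target))

        neighbours = sorted(complete, key=distance)[:k]
        new_row = list(row)
        new_row[target] = statistics.mean(n[target] for n in neighbours)
        filled.append(new_row)
    return filled

# Hypothetical (feature, target) rows; the third target value is missing.
rows = [(1.0, 10.0), (2.0, 20.0), (3.0, None), (10.0, 100.0)]

dropped = drop_missing(rows)
median_filled = impute_column([r[1] for r in rows], strategy="median")
knn_filled = knn_impute(rows, target=1, k=2)
print(dropped, median_filled, knn_filled[2])
```

Note how the choices differ on the same gap: the median fills in 20.0, while the two nearest neighbours give 15.0, because the distant row at 10.0 no longer influences the estimate.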

Outlier Detection:-

  1. Based on standard deviations away from the mean (continuous variables)
  2. Based on the inter-quartile range, IQR (numerical data; more robust than the standard-deviation rule when the distribution is skewed)
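Both rules can be sketched with the standard library. The cut-offs used here (3 standard deviations, and 1.5 times the inter-quartile range beyond the quartiles) are common conventions rather than values given in this text, and the function names and readings are hypothetical.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return []
    return [v for v in values if abs(v - mean) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical readings with one obviously extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]

by_zscore = zscore_outliers(readings)
by_iqr = iqr_outliers(readings)
print(by_zscore, by_iqr)
```

One design caveat: an extreme value inflates the mean and standard deviation it is measured against, so in small samples the standard-deviation rule can miss outliers that the IQR rule catches.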





 
