Data Analysis

Pandas

Pandas is an open-source Python library used for data handling in machine learning and data science work.

First, import pandas under an alias; pd is the conventional choice (import pandas as pd).

One of the most frequently used features of Pandas is reading data from a CSV file, a JSON file, or an SQL table.

For example, we can read a CSV file using the following syntax:

data_frame = pd.read_csv("location_of_the_file")
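As a minimal sketch of the same call, the example below reads CSV text held in memory instead of a file on disk, so it runs anywhere. The CSV content and the name csv_text are made up for illustration; read_csv accepts a file path or any file-like object.

```python
import io

import pandas as pd

# Hypothetical CSV content, kept in memory so no file on disk is needed.
csv_text = "Names,Lectures,Grades\nJohn,Mathematics,90\nDan,Chemistry,54\n"

# read_csv accepts a path or a file-like object such as StringIO.
data_frame = pd.read_csv(io.StringIO(csv_text))

print(data_frame.shape)   # (number of rows, number of columns)
print(data_frame.dtypes)  # column types inferred from the text
```

With a real file, the StringIO object would simply be replaced by the file's path.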

Series are one-dimensional labeled Pandas arrays that can hold any kind of data, including NaNs.
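A small sketch of what "labeled" and "even NaNs" mean in practice. The labels and values here are illustrative assumptions, not data from a real file.

```python
import numpy as np
import pandas as pd

# A Series with explicit labels; the Physics grade is missing (NaN).
grades = pd.Series([90, 54, np.nan], index=["Mathematics", "Chemistry", "Physics"])

print(grades["Chemistry"])   # look a value up by its label
print(grades.isna().sum())   # count the missing values
print(grades.mean())         # NaN is skipped by default
```

Because one value is NaN, Pandas stores the whole Series as floating-point numbers.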

import pandas as pd
import numpy as np

# Every column needs 18 entries (3 students x 6 lectures).
lectures = pd.Series(["Mathematics", "Chemistry", "Physics", "History", "Geography", "German"] * 3)
# Only five grades are given, so np.resize cycles through them to fill 18 rows.
grades = pd.Series(np.resize([90, 54, 77, 22, 25], 18))
classes = pd.Series(['A', 'B', 'C'] * 6)
credits = pd.Series(['1', '2', '6'] * 6)  # credits are stored as strings here
names = np.repeat(["John", "Dan", "Zac"], 6)  # each name repeated six times
retake = np.array(['Yes', 'No'] * 9)
df = pd.DataFrame({"Names": names, "Lectures": lectures, "Grades": grades,
                   "Classes": classes, "Credits": credits, "Retake": retake})
print(df.to_string(index=False))  # show the DataFrame without the index column

print(df.head(7))

head() retrieves the first rows of the DataFrame. By default it returns the first five rows, but you can pass the number of rows you want as an argument, as in head(7) above.
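The behaviour of head() can be checked on a tiny DataFrame. This is only a sketch; the column name n and the ten rows are assumptions chosen to make the row counts easy to verify.

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})  # ten rows, labeled 0 to 9

first_five = numbers.head()          # default: first five rows
first_three = numbers.head(3)        # explicit row count
all_but_last_two = numbers.head(-2)  # negative n: everything except the last two rows

print(len(first_five), len(first_three), len(all_but_last_two))
```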

DataFrames are similar to tabular data sources such as an Excel sheet, a CSV file, or an SQL table.
Besides being read from a file, a DataFrame can also be created from Pandas Series, as in the code above.
Pandas provides DataFrame slicing through the “loc” and “iloc” indexers.

print(df.loc[:10, ['Names', 'Lectures']])  # loc is label-based and includes the end label, so this returns rows 0 through 10 (eleven rows) with only the Names and Lectures columns.

iloc is position-based, so its arguments must be integers. Column names such as Names and Lectures won’t work here; pass their positions (0 and 1) instead, otherwise Pandas raises an error.

print(df.iloc[5:10, 1:3])  # rows at positions 5 to 9 and columns at positions 1 and 2 (Lectures and Grades); the end of each range is excluded.
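The difference between the two indexers is easiest to see side by side. The sketch below uses a small made-up DataFrame with the default integer index; the same slice bounds give a different number of rows depending on the indexer.

```python
import pandas as pd

# Small illustrative DataFrame; the values are assumptions.
df = pd.DataFrame({
    "Names": ["John", "Dan", "Zac", "John"],
    "Lectures": ["Mathematics", "Chemistry", "Physics", "History"],
    "Grades": [90, 54, 77, 22],
})

by_label = df.loc[0:2, ["Names", "Lectures"]]  # labels 0, 1 and 2: end label included
by_position = df.iloc[0:2, 0:2]                # positions 0 and 1: end position excluded

print(len(by_label), len(by_position))
```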

Let’s say John's parents want to learn more about their son’s performance at school. They want to see his lectures, the grades for these lectures, the number of credits earned, and whether he will need to retake an exam. We can simply slice the DataFrame created with the grades.csv file (which holds all the students’ academic records) and extract the information we need. For example:

Grades = df.loc[(df["Names"] == "John"), ["Lectures","Grades","Credits","Retake"]]

In the above code, we retrieve only those rows in which the “Names” column equals the given name, and keep only the four listed columns.
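The row selection above works through a boolean mask: the comparison produces a Series of True/False values, and loc keeps the rows marked True. A minimal sketch with made-up records, which also shows two conditions combined with &:

```python
import pandas as pd

# Illustrative records; the values are assumptions.
df = pd.DataFrame({
    "Names": ["John", "Dan", "John", "Zac"],
    "Lectures": ["Mathematics", "Chemistry", "History", "Physics"],
    "Grades": [90, 54, 22, 77],
    "Retake": ["No", "No", "Yes", "No"],
})

is_john = df["Names"] == "John"  # boolean Series, one value per row

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
john_retakes = df.loc[is_john & (df["Retake"] == "Yes"), ["Lectures", "Grades"]]

print(is_john.tolist())
print(john_retakes)
```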

You can also use the loc and iloc indexers to access individual rows in a Pandas DataFrame.

print(df.iloc[0])

This line returns the first row of the DataFrame as a Series.

The Pandas groupby function allows you to split data into groups based on some criteria. Grouping is usually done on rows, using the values of one or more columns as keys; splitting along the column axis has been deprecated in recent Pandas releases.

print(df.groupby(["Lectures","Names"]).first())

Using the above code, the data is divided into groups by the Lectures and Names columns: Lectures forms the first level of the grouping and Names the second. first() then returns the first row of each group.
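To make the two grouping levels concrete, here is a sketch on a four-row DataFrame with made-up values. It shows first() next to another common aggregation, mean(), on the same groups.

```python
import pandas as pd

# Illustrative data; John has two Mathematics grades.
df = pd.DataFrame({
    "Names": ["John", "John", "Dan", "Dan"],
    "Lectures": ["Mathematics", "Mathematics", "Mathematics", "Physics"],
    "Grades": [90, 70, 54, 77],
})

grouped = df.groupby(["Lectures", "Names"])

first_rows = grouped.first()            # first row of each (Lectures, Names) group
mean_grades = grouped["Grades"].mean()  # average grade per group

print(first_rows)
print(mean_grades)
```

The result is indexed by a two-level MultiIndex, with Lectures as the outer level and Names as the inner one.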

We can also iterate over a grouped object, as in the code below, which groups the rows by Classes.

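grouped_obj = df.groupby("Classes")  # group the rows by class
for key, item in grouped_obj:
    # key is the group label, item is the sub-DataFrame for that group
    if key == 'A':
        print("Key is: " + str(key))
        print(str(item), "\n\n")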
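When only one group is of interest, the loop and the if test can be replaced by get_group. The sketch below uses made-up data and shows both approaches: iterating to collect group sizes, and fetching class A directly.

```python
import pandas as pd

# Illustrative data; the values are assumptions.
df = pd.DataFrame({
    "Classes": ["A", "B", "A", "C"],
    "Grades": [90, 54, 77, 22],
})

grouped_obj = df.groupby("Classes")

# Iterating yields (key, sub-DataFrame) pairs, one per group.
sizes = {key: len(item) for key, item in grouped_obj}

# get_group returns the rows of a single group without a loop.
class_a = grouped_obj.get_group("A")

print(sizes)
print(class_a)
```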

You can also save data to a CSV file in the local directory with Pandas, using the code below.

df.to_csv('file1.csv')  # file1.csv is the name of the output file; to_csv writes the DataFrame as CSV (pass index=False to leave out the index column).
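A quick way to confirm that a save worked is to read the file back. This sketch writes to a temporary directory rather than the working directory so that it leaves nothing behind; the data is made up for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Names": ["John", "Dan"], "Grades": [90, 54]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file1.csv")
    df.to_csv(path, index=False)  # index=False keeps the row labels out of the file
    restored = pd.read_csv(path)  # read the file back while it still exists

print(restored)
```

Without index=False, the row labels are written as an extra unnamed first column, which then shows up as a column when the file is read again.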

Some of the important uses of Pandas are:

Data cleansing
Data fill
Data normalization
Merges and joins
Data visualization
Statistical analysis
Data inspection
Loading and saving data
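A few of these uses can be sketched in a handful of lines. The example below fills a missing grade with the column mean, applies min-max normalization, merges two tables, and prints summary statistics. The data, and the choice of mean filling and min-max scaling, are assumptions made for illustration; other strategies are equally valid.

```python
import numpy as np
import pandas as pd

# Illustrative tables; Dan's grade is missing.
grades = pd.DataFrame({"Names": ["John", "Dan", "Zac"], "Grades": [90.0, np.nan, 50.0]})
classes = pd.DataFrame({"Names": ["John", "Dan", "Zac"], "Classes": ["A", "B", "C"]})

# Data fill: replace the missing grade with the mean of the known grades.
grades["Grades"] = grades["Grades"].fillna(grades["Grades"].mean())

# Data normalization: min-max scaling to the range 0 to 1.
span = grades["Grades"].max() - grades["Grades"].min()
grades["Scaled"] = (grades["Grades"] - grades["Grades"].min()) / span

# Merges and joins: combine both tables on the shared Names column.
merged = grades.merge(classes, on="Names", how="inner")

# Statistical analysis and data inspection: summary of the numeric columns.
print(merged.describe())
```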

