Understanding & working with Pandas
Pandas
Introduction to Pandas
Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.
Pandas allows us to:
- Analyze big data and draw conclusions based on statistical theories.
- Clean messy data sets and make them readable and relevant. Relevant data is very important in data science.
- Answer questions about the data, such as: Is there a correlation between two or more columns? What is the average value? The max value? The min value?
- Manage different data sets by merging them, filtering them, etc.
Pandas is also able to delete rows that are not relevant, or contain wrong values, like empty or NULL values. This is called cleaning the data.
To start using Pandas, you first need to install it. You can install it using the following command:
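pip install pandas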
After installing pandas, you can import it as follows:
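import pandas as pd # pd is the conventional alias for Pandas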
Pandas Data Structures
Pandas provides two main data structures: Series and DataFrame. A Series is a one-dimensional array-like object that can hold any data type. A DataFrame is a two-dimensional table that can hold data of different types in columns.
Series
A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.
If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second has index 1, and so on. This label can be used to access a specified value. With the index argument, you can name your own labels; you can then access an item by its label. Code example :
import pandas as pd
import numpy as np # needed for np.nan

dataset1 = [1, 3, 5, np.nan, 6, 8]
df1 = pd.Series(dataset1, index=["a", "b", "c", "e", "f", "g"])
print(df1)
print(df1["a"]) # access a value by its label
We can also use key/value pair objects, like dictionaries, when creating a Series.
Note: the keys of the dictionary become the labels.
Code example :
import pandas as pd

dataset1 = {"Vehicle number": 1, "Wheels": 4, "Doors": 4}
df1 = pd.Series(dataset1)
print(df1)
DataFrame
Data sets in Pandas are usually multi-dimensional tables, called DataFrames. Series can be considered to be like a column in a table, whereas a DataFrame can be considered to be the table.
A DataFrame can be created in several ways. One way is to pass a dictionary of lists to the pd.DataFrame() function. Each key in the dictionary represents a column name, and each list represents the data in that column.
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'gender': ['F', 'M', 'M', 'M']}
df = pd.DataFrame(data)
print(df)
Code example :
import pandas as pd

dataset1 = {"Vehicle number": [1, 2, 3], "Wheels": [4, 2, 4], "Doors": [4, 0, 5]}
df1 = pd.DataFrame(dataset1, index=["Car", "Motorcycle", "Van"])
print(df1)
print(df1.loc["Car"]) # loc returns one or more specified rows by label
Data Import and Export
Pandas supports reading and writing data from and to various file formats, such as CSV, Excel, SQL databases, and more.
- CSV : A simple way to store big data sets is to use CSV (Comma Separated Values) files. CSV files contain plain text and are a well-known format that can be read by almost all software, including Pandas. To read a CSV file, use the pd.read_csv() function; it takes the path to the CSV file as an argument and returns a DataFrame. To write a DataFrame to a CSV file, use the to_csv() method. Code example :
import pandas as pd

## Importing a CSV file
df = pd.read_csv('data.csv')
print(df)

## Exporting a CSV file
df.to_csv('data.csv', index=False) # index=False tells Pandas not to write the row index to the CSV file
- Excel : To read an Excel file, use the pd.read_excel() function; it takes the path to the Excel file as an argument and returns a DataFrame. To write a DataFrame to an Excel file, use the to_excel() method. Note that reading and writing .xlsx files requires an engine such as openpyxl to be installed. Code example :
## Importing an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df)

## Exporting an Excel file
df.to_excel('data.xlsx', sheet_name='Sheet1', index=False)
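- SQL : To read data from a SQL database, you can use the pd.read_sql() function, passing an SQL query and a database connection. To write a DataFrame to a SQL table, use the to_sql() method. Code example :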
import sqlite3
import pandas as pd

## Importing from a SQL database
conn = sqlite3.connect('mydatabase.db')
query = 'SELECT * FROM mytable'
df = pd.read_sql(query, conn)
print(df)

## Exporting to a SQL table
df.to_sql('mytable', conn, if_exists='replace', index=False) # if_exists='replace' tells Pandas to replace the table if it already exists in the database
Getting a quick overview of the DataFrame's content
- The head() method is one of the most used methods for getting a quick overview of a DataFrame. It returns the headers and a specified number of rows, starting from the top.
- The tail() method returns the last rows of the DataFrame.
- The info() method gives you more information about the data set.
Code Example :
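A minimal sketch, assuming a data.csv file like the one used in the cleaning examples below:
import pandas as pd

df = pd.read_csv("data.csv")
print(df.head(10)) # first 10 rows (head() defaults to 5)
print(df.tail()) # last 5 rows
df.info() # prints column names, data types, non-null counts and memory usage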
Data Cleaning with Pandas
Data cleaning means fixing bad data in your data set; it is an essential step in data analysis. Pandas provides several functions and methods for cleaning and preparing data. Examples of bad data include empty cells, data in the wrong format, incorrect values, and duplicates.
Empty cells
Empty cells can potentially give you a wrong result when you analyze data. One way to deal with empty cells is to remove rows that contain empty cells. This usually works since data sets can be very large, and removing a few rows will not have a significant impact on the results.
To detect empty cells, we can use the isnull() method, which returns True for each empty cell. Code example :
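A minimal sketch, assuming the same data.csv file:
import pandas as pd

df = pd.read_csv("data.csv")
print(df.isnull().sum()) # number of empty cells in each column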
To remove empty cells, we can use the dropna() method. By default, dropna() returns a new DataFrame and does not change the original. If you want to change the original DataFrame, use the inplace=True argument. Code Example :
import pandas as pd

df = pd.read_csv("data.csv")
df.dropna(inplace=True) # remove all rows that contain empty cells
print(df.to_string())
Another way of dealing with empty cells is to insert a new value in place of the empty cell. This way you do not have to delete entire rows just because of some empty cells. The fillna() method allows us to replace empty cells with a value. To only replace empty values in one column, specify the column name. Code Example :
import pandas as pd
df = pd.read_csv("data.csv")
df.fillna(130, inplace=True) # replace all empty cells in the whole DataFrame
df["Calories"] = df["Calories"].fillna(130) # alternatively, replace empty cells in one column only
print(df.to_string())
You can also fill empty cells with the mean, median, or mode of the column. Pandas provides the mean(), median(), and mode() methods to calculate the respective values for a specified column.
- Mean : the average value (the sum of all values divided by the number of values).
- Median : the value in the middle, after you have sorted all values in ascending order.
- Mode : the value that appears most frequently.
Code Example :
import pandas as pd

df = pd.read_csv("data.csv")
meancal = df["Calories"].mean()
mediancal = df["Calories"].median()
modecal = df["Calories"].mode()[0] # mode() returns a Series; take the first value
print("The mean of calories is " + str(meancal) + ", the median is " + str(mediancal) + ", and the mode is " + str(modecal))
meandf = df["Calories"].fillna(meancal)
mediandf = df["Calories"].fillna(mediancal)
modedf = df["Calories"].fillna(modecal)
print(meandf.to_string())
print(mediandf.to_string())
print(modedf.to_string())
Data in wrong format
Cells with data in an incorrect format can make it difficult, or even impossible, to analyze the data.
To remedy this, you can either remove the affected rows or convert all cells in the column to the same format, as in the sketch below.
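A minimal sketch, assuming a hypothetical "Date" column stored as text:
import pandas as pd

df = pd.read_csv("data.csv")
df["Date"] = pd.to_datetime(df["Date"], errors="coerce") # convert to datetime; unparseable values become NaT
df.dropna(subset=["Date"], inplace=True) # remove rows that could not be converted
print(df.to_string())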
Incorrect Data
Incorrect data does not have to be empty cells or incorrect format, it can just be incorrect, like if someone entered 199 instead of 1.99. Sometimes you can spot incorrect data by looking at the data set because you have an expectation of what it should be.
If you take a look at our data set, you can see that in row 7 the duration is 450, but for all the other rows the duration is between 30 and 60. It doesn't have to be incorrect, but taking into consideration that this is the data set of someone's workout sessions, we can conclude that this person did not work out for 450 minutes.
One way to fix wrong values is to replace them with something else. For small data sets you might be able to replace the wrong data one by one, but not for large data sets. For larger data sets you can create rules, set boundaries for legal values, and replace any values that fall outside those boundaries.
Another way of handling incorrect data is to remove the rows that contain it. This way you do not have to find out what to replace it with, and there is a good chance you do not need those rows for your analysis.
Code Example :
import pandas as pd

df1 = pd.read_csv("data.csv")
df2 = pd.read_csv("data.csv")

for x in df1.index: # replace all Duration values above 120 with 120
    if df1.loc[x, "Duration"] > 120:
        df1.loc[x, "Duration"] = 120

for y in df2.index: # drop all rows where Duration is above 120
    if df2.loc[y, "Duration"] > 120:
        df2.drop(y, inplace=True)

print(df1.to_string())
print(df2.to_string())
Duplicates
Duplicate rows are rows that have been entered more than once. Taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method, which returns a Boolean value for each row. To remove duplicates, use the drop_duplicates() method. Code Example :
import pandas as pd

df1 = pd.read_csv("data.csv")
print(df1.duplicated().value_counts()) # count how many rows are duplicates (True) and how many are not (False)
df1.drop_duplicates(inplace=True) # drop all duplicate rows
print(df1.to_string())
Merging and Joining Data
Pandas provides several functions and methods for merging and joining DataFrames.
To merge two DataFrames based on a common column, you can use the merge() function.
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4], 'age': [25, 32, 18, 47]})
merged_df = pd.merge(df1, df2, on='id') # merge the DataFrames on the 'id' column
print(merged_df)
The merge() method merges the two DataFrames based on the 'id' column. The result is a new DataFrame with columns from both DataFrames.
To join two DataFrames on their index, you can use the join() method; here we first set 'id' as the index of both DataFrames.
df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4], 'age': [25, 32, 18, 47]})
joined_df = df1.set_index('id').join(df2.set_index('id')) # Join DataFrames on 'id' column
print(joined_df)
The join() method joins the two DataFrames on their shared 'id' index. The result is a new DataFrame with columns from both DataFrames.
Data Analysis in Pandas
Describe in Pandas
The describe() method in Pandas provides summary statistics of a DataFrame. It calculates several common statistics for each numerical column in the DataFrame, such as count, mean, standard deviation, minimum, maximum, and quartiles. Code Example :
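A minimal sketch, assuming the data.csv workout file used above:
import pandas as pd

df = pd.read_csv("data.csv")
print(df.describe()) # count, mean, std, min, quartiles and max for each numerical column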
Correlation in Pandas
The corr() method calculates the relationship between each pair of columns in your data set. Non-numeric columns are ignored (in recent Pandas versions you must request this explicitly with numeric_only=True). Code Example :
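A minimal sketch, again assuming the data.csv workout file:
import pandas as pd

df = pd.read_csv("data.csv")
print(df.corr(numeric_only=True)) # pairwise correlation of the numeric columns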
The result of the corr() method is a table of numbers that represent how strong the relationship between two columns is. The number varies from -1 to 1:
- 1 means a 1 to 1 relationship (a perfect correlation): for this data set, each time a value went up in the first column, the other one went up as well.
- 0.9 is also a good relationship: if you increase one value, the other will probably increase as well.
- -0.9 is just as good a relationship as 0.9, but if you increase one value, the other will probably go down.
- 0.2 means NOT a good relationship: if one value goes up, it does not mean that the other will.
What is a good correlation? It depends on the use, but it is safe to say you have to have at least 0.6 (or -0.6) to call it a good correlation.
Perfect Correlation: We can see that "Duration" and "Duration" got the number 1.000000, which makes sense: each column always has a perfect relationship with itself.
Good Correlation: "Duration" and "Calories" got a 0.922721 correlation, which is a very good correlation: we can predict that the longer you work out, the more calories you burn, and the other way around, if you burned a lot of calories, you probably had a long workout.
Bad Correlation: "Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad correlation, meaning that we cannot predict the max pulse by just looking at the duration of the workout, and vice versa.