Data Cleaning and Analysis with Pandas¶
A large food group wants to buy a local fast food restaurant, but before investing, it would like to analyze its data.
Import the library Pandas and the datasetresto.csv
¶
In [ ]:
Copied!
#import pandas
import pandas as pd
#chargement du dataset
dataset = pd.read_csv("food.csv")
print("dataset loaded")
#import pandas
import pandas as pd
#chargement du dataset
dataset = pd.read_csv("food.csv")
print("dataset loaded")
dataset loaded
Data exploration and cleaning¶
Display the first 5 lines of the dataset.¶
In [ ]:
Copied!
dataset.head(5)
dataset.head(5)
Out[ ]:
Unnamed: 0 | Unnamed: 0.1 | order_id | quantity | item_name | choice_description | total_price | |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
1 | 1 | 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
2 | 2 | 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
3 | 3 | 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
4 | 4 | 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
Display the last 5 lines of the dataset.¶
In [ ]:
Copied!
dataset.tail(5)
dataset.tail(5)
Out[ ]:
Unnamed: 0 | Unnamed: 0.1 | order_id | quantity | item_name | choice_description | total_price | |
---|---|---|---|---|---|---|---|
4617 | 4617 | 4617 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Sour ... | $11.75 |
4618 | 4618 | 4618 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese... | $11.75 |
4619 | 4619 | 4619 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $11.25 |
4620 | 4620 | 4620 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... | $8.75 |
4621 | 4621 | 4621 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $8.75 |
Display the columns of your dataset¶
In [ ]:
Copied!
dataset.columns
dataset.columns
Out[ ]:
Index(['Unnamed: 0', 'Unnamed: 0.1', 'order_id', 'quantity', 'item_name', 'choice_description', 'total_price'], dtype='object')
Check if there is nans or duplicates in your dataset¶
In [ ]:
Copied!
dataset.isnull().value_counts()
dataset.isnull().value_counts()
Out[ ]:
Unnamed: 0 Unnamed: 0.1 order_id quantity item_name choice_description total_price False False False False False False False 3376 True False 1246 dtype: int64
In [ ]:
Copied!
dataset.duplicated().value_counts()
dataset.duplicated().value_counts()
Out[ ]:
False 4622 dtype: int64
Fill your nans with "Not specified"¶
In [ ]:
Copied!
dataset["choice_description"].fillna("Not specified", inplace=True)
dataset["choice_description"].fillna("Not specified", inplace=True)
In [ ]:
Copied!
dataset.isnull().value_counts()
dataset.isnull().value_counts()
Out[ ]:
Unnamed: 0 Unnamed: 0.1 order_id quantity item_name choice_description total_price False False False False False False False 4622 dtype: int64
Check the type of your columns¶
In [ ]:
Copied!
dataset.dtypes
dataset.dtypes
Out[ ]:
order_id int64 quantity int64 item_name object choice_description object total_price float64 unique_price float64 dtype: object
In [ ]:
Copied!
dataset.info()
dataset.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4622 entries, 0 to 4621 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 4622 non-null int64 1 Unnamed: 0.1 4622 non-null int64 2 order_id 4622 non-null int64 3 quantity 4622 non-null int64 4 item_name 4622 non-null object 5 choice_description 4622 non-null object 6 total_price 4622 non-null object dtypes: int64(4), object(3) memory usage: 252.9+ KB
Replace the total_price format to float¶
In [ ]:
Copied!
dataset['total_price'] = dataset.total_price.str.replace('$', '').astype('float64')
dataset.head()
dataset['total_price'] = dataset.total_price.str.replace('$', '').astype('float64')
dataset.head()
<ipython-input-174-d0ca6c233a11>:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True. dataset['total_price'] = dataset.total_price.str.replace('$', '').astype('float64')
Out[ ]:
Unnamed: 0 | Unnamed: 0.1 | order_id | quantity | item_name | choice_description | total_price | |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 1 | Chips and Fresh Tomato Salsa | Not specified | 2.39 |
1 | 1 | 1 | 1 | 1 | Izze | [Clementine] | 3.39 |
2 | 2 | 2 | 1 | 1 | Nantucket Nectar | [Apple] | 3.39 |
3 | 3 | 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | Not specified | 2.39 |
4 | 4 | 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | 16.98 |
In [ ]:
Copied!
dataset.info()
dataset.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4622 entries, 0 to 4621 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 4622 non-null int64 1 Unnamed: 0.1 4622 non-null int64 2 order_id 4622 non-null int64 3 quantity 4622 non-null int64 4 item_name 4622 non-null object 5 choice_description 4622 non-null object 6 total_price 4622 non-null float64 dtypes: float64(1), int64(4), object(2) memory usage: 252.9+ KB
Create a new column to get the price by one unique item¶
In [ ]:
Copied!
dataset["unique_price"] =dataset.total_price / dataset.quantity
dataset.head(5)
dataset["unique_price"] =dataset.total_price / dataset.quantity
dataset.head(5)
Out[ ]:
Unnamed: 0 | Unnamed: 0.1 | order_id | quantity | item_name | choice_description | total_price | unique_price | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 1 | Chips and Fresh Tomato Salsa | Not specified | 2.39 | 2.39 |
1 | 1 | 1 | 1 | 1 | Izze | [Clementine] | 3.39 | 3.39 |
2 | 2 | 2 | 1 | 1 | Nantucket Nectar | [Apple] | 3.39 | 3.39 |
3 | 3 | 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | Not specified | 2.39 | 2.39 |
4 | 4 | 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | 16.98 | 8.49 |
Data Analysis¶
Display the statistical description of the dataset¶
In [ ]:
Copied!
dataset.describe()
dataset.describe()
Out[ ]:
Unnamed: 0 | Unnamed: 0.1 | order_id | quantity | total_price | unique_price | |
---|---|---|---|---|---|---|
count | 4622.000000 | 4622.000000 | 4622.000000 | 4622.000000 | 4622.000000 | 4622.000000 |
mean | 2310.500000 | 2310.500000 | 927.254868 | 1.075725 | 7.464336 | 7.084424 |
std | 1334.400802 | 1334.400802 | 528.890796 | 0.410186 | 4.245557 | 3.665487 |
min | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.090000 | 1.090000 |
25% | 1155.250000 | 1155.250000 | 477.250000 | 1.000000 | 3.390000 | 2.950000 |
50% | 2310.500000 | 2310.500000 | 926.000000 | 1.000000 | 8.750000 | 8.750000 |
75% | 3465.750000 | 3465.750000 | 1393.000000 | 1.000000 | 9.250000 | 9.250000 |
max | 4621.000000 | 4621.000000 | 1834.000000 | 15.000000 | 44.250000 | 11.890000 |
Display the 10 most ordered items. Hint : Use the method groupby¶
In [ ]:
Copied!
dataset.groupby("item_name").sum().sort_values(["quantity"], ascending=False)[:10]
dataset.groupby("item_name").sum().sort_values(["quantity"], ascending=False)[:10]
Out[ ]:
Unnamed: 0 | Unnamed: 0.1 | order_id | quantity | total_price | unique_price | |
---|---|---|---|---|---|---|
item_name | ||||||
Chicken Bowl | 1779909 | 1779909 | 713926 | 761 | 7342.73 | 7011.51 |
Chicken Burrito | 1238217 | 1238217 | 497303 | 591 | 5575.82 | 5240.20 |
Chips and Guacamole | 1121773 | 1121773 | 449959 | 506 | 2201.04 | 2081.55 |
Steak Burrito | 817795 | 817795 | 328437 | 386 | 3851.43 | 3677.58 |
Canned Soft Drink | 759990 | 759990 | 304753 | 351 | 438.75 | 376.25 |
Chips | 518704 | 518704 | 208004 | 230 | 494.34 | 453.49 |
Steak Bowl | 482721 | 482721 | 193752 | 221 | 2260.19 | 2159.07 |
Bottled Water | 439354 | 439354 | 175944 | 211 | 302.56 | 231.93 |
Chips and Fresh Tomato Salsa | 250144 | 250144 | 100419 | 130 | 361.36 | 302.36 |
Canned Soda | 189915 | 189915 | 76396 | 126 | 137.34 | 113.36 |
Display the total number of items ordered¶
In [ ]:
Copied!
dataset.quantity.sum()
dataset.quantity.sum()
Out[ ]:
4972
Display the company's total revenue¶
In [ ]:
Copied!
#chiffre d’affaire realisé par Chipotle
p = dataset.total_price
tr = sum(p)
print("The total revenue of the company is : {}€".format(tr))
#chiffre d’affaire realisé par Chipotle
p = dataset.total_price
tr = sum(p)
print("The total revenue of the company is : {}€".format(tr))
The total revenue of the company is : 34500.16000000046€
Display the average revenue per order¶
In [ ]:
Copied!
#revenu moyen par commande
dataset.groupby("order_id").sum().describe()
#revenu moyen par commande
dataset.groupby("order_id").sum().describe()
Out[ ]:
Unnamed: 0 | Unnamed: 0.1 | quantity | total_price | unique_price | |
---|---|---|---|---|---|
count | 1834.000000 | 1834.000000 | 1834.000000 | 1834.000000 | 1834.000000 |
mean | 5822.863141 | 5822.863141 | 2.711014 | 18.811429 | 17.853986 |
std | 4675.730261 | 4675.730261 | 1.677624 | 11.652512 | 10.377472 |
min | 4.000000 | 4.000000 | 1.000000 | 10.080000 | 7.400000 |
25% | 2464.000000 | 2464.000000 | 2.000000 | 12.572500 | 12.400000 |
50% | 5235.000000 | 5235.000000 | 2.000000 | 16.200000 | 14.550000 |
75% | 8022.000000 | 8022.000000 | 3.000000 | 21.960000 | 20.500000 |
max | 53245.000000 | 53245.000000 | 35.000000 | 205.250000 | 205.250000 |
Drop the column Unnamed: 0
and Unnamed: 0.1
¶
In [ ]:
Copied!
dataset.drop(['Unnamed: 0'], axis=1,inplace=True)
dataset.drop(['Unnamed: 0.1'], axis=1,inplace=True)
dataset.drop(['Unnamed: 0'], axis=1,inplace=True)
dataset.drop(['Unnamed: 0.1'], axis=1,inplace=True)
In [ ]:
Copied!
dataset.head()
dataset.head()
Out[ ]:
order_id | quantity | item_name | choice_description | total_price | unique_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | Chips and Fresh Tomato Salsa | Not specified | 2.39 | 2.39 |
1 | 1 | 1 | Izze | [Clementine] | 3.39 | 3.39 |
2 | 1 | 1 | Nantucket Nectar | [Apple] | 3.39 | 3.39 |
3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | Not specified | 2.39 | 2.39 |
4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | 16.98 | 8.49 |
Save your new Dataframe into a new CSV file¶
In [ ]:
Copied!
#save
dataset.to_csv("clean_dataset.csv", index=False)
#save
dataset.to_csv("clean_dataset.csv", index=False)
In [ ]:
Copied!