Data Cleaning and Analysis with Pandas¶

A large food group wants to buy a local fast food restaurant, but before investing, it would like to analyze its data.

Import the library Pandas and the dataset`resto.csv`¶

In [ ]:

Copied!





#import pandas
import pandas as pd 

#chargement du dataset
dataset = pd.read_csv("food.csv")
print("dataset loaded")
#import pandas
import pandas as pd 

#chargement du dataset
dataset = pd.read_csv("food.csv")
print("dataset loaded")

dataset loaded

Data exploration and cleaning¶

Display the first 5 lines of the dataset.¶

In [ ]:

Copied!

dataset.head(5)
dataset.head(5)

Out[ ]:

	Unnamed: 0	Unnamed: 0.1	order_id	quantity	item_name	choice_description	total_price
0	0	0	1	1	Chips and Fresh Tomato Salsa	NaN	$2.39
1	1	1	1	1	Izze	[Clementine]	$3.39
2	2	2	1	1	Nantucket Nectar	[Apple]	$3.39
3	3	3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	$2.39
4	4	4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	$16.98

Display the last 5 lines of the dataset.¶

In [ ]:

Copied!

dataset.tail(5)
dataset.tail(5)

Out[ ]:

	Unnamed: 0	Unnamed: 0.1	order_id	quantity	item_name	choice_description	total_price
4617	4617	4617	1833	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Sour ...	$11.75
4618	4618	4618	1833	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...	$11.75
4619	4619	4619	1834	1	Chicken Salad Bowl	[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...	$11.25
4620	4620	4620	1834	1	Chicken Salad Bowl	[Fresh Tomato Salsa, [Fajita Vegetables, Lettu...	$8.75
4621	4621	4621	1834	1	Chicken Salad Bowl	[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...	$8.75

Display the columns of your dataset¶

In [ ]:

Copied!

dataset.columns
dataset.columns

Out[ ]:

Index(['Unnamed: 0', 'Unnamed: 0.1', 'order_id', 'quantity', 'item_name',
       'choice_description', 'total_price'],
      dtype='object')

Check if there is nans or duplicates in your dataset¶

In [ ]:

Copied!

dataset.isnull().value_counts()
dataset.isnull().value_counts()

Out[ ]:

Unnamed: 0  Unnamed: 0.1  order_id  quantity  item_name  choice_description  total_price
False       False         False     False     False      False               False          3376
                                                         True                False          1246
dtype: int64

In [ ]:

Copied!

dataset.duplicated().value_counts()
dataset.duplicated().value_counts()

Out[ ]:

False    4622
dtype: int64

Fill your nans with "Not specified"¶

In [ ]:

Copied!

dataset["choice_description"].fillna("Not specified", inplace=True)
dataset["choice_description"].fillna("Not specified", inplace=True)

In [ ]:

Copied!

dataset.isnull().value_counts()
dataset.isnull().value_counts()

Out[ ]:

Unnamed: 0  Unnamed: 0.1  order_id  quantity  item_name  choice_description  total_price
False       False         False     False     False      False               False          4622
dtype: int64

Check the type of your columns¶

In [ ]:

Copied!

dataset.dtypes
dataset.dtypes

Out[ ]:

order_id                int64
quantity                int64
item_name              object
choice_description     object
total_price           float64
unique_price          float64
dtype: object

In [ ]:

Copied!

dataset.info()
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          4622 non-null   int64 
 1   Unnamed: 0.1        4622 non-null   int64 
 2   order_id            4622 non-null   int64 
 3   quantity            4622 non-null   int64 
 4   item_name           4622 non-null   object
 5   choice_description  4622 non-null   object
 6   total_price         4622 non-null   object
dtypes: int64(4), object(3)
memory usage: 252.9+ KB

Replace the total_price format to float¶

In [ ]:

Copied!

dataset['total_price'] = dataset.total_price.str.replace('$', '').astype('float64')

dataset.head()
dataset['total_price'] = dataset.total_price.str.replace('$', '').astype('float64')

dataset.head()

<ipython-input-174-d0ca6c233a11>:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  dataset['total_price'] = dataset.total_price.str.replace('$', '').astype('float64')

Out[ ]:

	Unnamed: 0	Unnamed: 0.1	order_id	quantity	item_name	choice_description	total_price
0	0	0	1	1	Chips and Fresh Tomato Salsa	Not specified	2.39
1	1	1	1	1	Izze	[Clementine]	3.39
2	2	2	1	1	Nantucket Nectar	[Apple]	3.39
3	3	3	1	1	Chips and Tomatillo-Green Chili Salsa	Not specified	2.39
4	4	4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98

In [ ]:

Copied!

dataset.info()
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          4622 non-null   int64  
 1   Unnamed: 0.1        4622 non-null   int64  
 2   order_id            4622 non-null   int64  
 3   quantity            4622 non-null   int64  
 4   item_name           4622 non-null   object 
 5   choice_description  4622 non-null   object 
 6   total_price         4622 non-null   float64
dtypes: float64(1), int64(4), object(2)
memory usage: 252.9+ KB

Create a new column to get the price by one unique item¶

In [ ]:

Copied!

dataset["unique_price"] =dataset.total_price / dataset.quantity
dataset.head(5)
dataset["unique_price"] =dataset.total_price / dataset.quantity
dataset.head(5)

Out[ ]:

	Unnamed: 0	Unnamed: 0.1	order_id	quantity	item_name	choice_description	total_price	unique_price
0	0	0	1	1	Chips and Fresh Tomato Salsa	Not specified	2.39	2.39
1	1	1	1	1	Izze	[Clementine]	3.39	3.39
2	2	2	1	1	Nantucket Nectar	[Apple]	3.39	3.39
3	3	3	1	1	Chips and Tomatillo-Green Chili Salsa	Not specified	2.39	2.39
4	4	4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98	8.49

Data Analysis¶

Display the statistical description of the dataset¶

In [ ]:

Copied!

dataset.describe()
dataset.describe()

Out[ ]:

	Unnamed: 0	Unnamed: 0.1	order_id	quantity	total_price	unique_price
count	4622.000000	4622.000000	4622.000000	4622.000000	4622.000000	4622.000000
mean	2310.500000	2310.500000	927.254868	1.075725	7.464336	7.084424
std	1334.400802	1334.400802	528.890796	0.410186	4.245557	3.665487
min	0.000000	0.000000	1.000000	1.000000	1.090000	1.090000
25%	1155.250000	1155.250000	477.250000	1.000000	3.390000	2.950000
50%	2310.500000	2310.500000	926.000000	1.000000	8.750000	8.750000
75%	3465.750000	3465.750000	1393.000000	1.000000	9.250000	9.250000
max	4621.000000	4621.000000	1834.000000	15.000000	44.250000	11.890000

Display the 10 most ordered items. Hint : Use the method groupby¶

In [ ]:

Copied!

dataset.groupby("item_name").sum().sort_values(["quantity"], ascending=False)[:10]
dataset.groupby("item_name").sum().sort_values(["quantity"], ascending=False)[:10]

Out[ ]:

	Unnamed: 0	Unnamed: 0.1	order_id	quantity	total_price	unique_price
item_name
Chicken Bowl	1779909	1779909	713926	761	7342.73	7011.51
Chicken Burrito	1238217	1238217	497303	591	5575.82	5240.20
Chips and Guacamole	1121773	1121773	449959	506	2201.04	2081.55
Steak Burrito	817795	817795	328437	386	3851.43	3677.58
Canned Soft Drink	759990	759990	304753	351	438.75	376.25
Chips	518704	518704	208004	230	494.34	453.49
Steak Bowl	482721	482721	193752	221	2260.19	2159.07
Bottled Water	439354	439354	175944	211	302.56	231.93
Chips and Fresh Tomato Salsa	250144	250144	100419	130	361.36	302.36
Canned Soda	189915	189915	76396	126	137.34	113.36

Display the total number of items ordered¶

In [ ]:

Copied!

dataset.quantity.sum()
dataset.quantity.sum()

Out[ ]:

Display the company's total revenue¶

In [ ]:

Copied!





#chiffre d’affaire realisé par Chipotle
p = dataset.total_price
tr = sum(p)
print("The total revenue of the company is : {}€".format(tr))
#chiffre d’affaire realisé par Chipotle
p = dataset.total_price
tr = sum(p)
print("The total revenue of the company is : {}€".format(tr))

The total revenue of the company is : 34500.16000000046€

Display the average revenue per order¶

In [ ]:

Copied!

#revenu moyen par commande 
dataset.groupby("order_id").sum().describe()
#revenu moyen par commande 
dataset.groupby("order_id").sum().describe()

Out[ ]:

	Unnamed: 0	Unnamed: 0.1	quantity	total_price	unique_price
count	1834.000000	1834.000000	1834.000000	1834.000000	1834.000000
mean	5822.863141	5822.863141	2.711014	18.811429	17.853986
std	4675.730261	4675.730261	1.677624	11.652512	10.377472
min	4.000000	4.000000	1.000000	10.080000	7.400000
25%	2464.000000	2464.000000	2.000000	12.572500	12.400000
50%	5235.000000	5235.000000	2.000000	16.200000	14.550000
75%	8022.000000	8022.000000	3.000000	21.960000	20.500000
max	53245.000000	53245.000000	35.000000	205.250000	205.250000

Drop the column `Unnamed: 0` and `Unnamed: 0.1`¶

In [ ]:

Copied!

dataset.drop(['Unnamed: 0'], axis=1,inplace=True)
dataset.drop(['Unnamed: 0.1'], axis=1,inplace=True)
dataset.drop(['Unnamed: 0'], axis=1,inplace=True)
dataset.drop(['Unnamed: 0.1'], axis=1,inplace=True)

In [ ]:

Copied!

dataset.head()
dataset.head()

Out[ ]:

	order_id	quantity	item_name	choice_description	total_price	unique_price
0	1	1	Chips and Fresh Tomato Salsa	Not specified	2.39	2.39
1	1	1	Izze	[Clementine]	3.39	3.39
2	1	1	Nantucket Nectar	[Apple]	3.39	3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	Not specified	2.39	2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98	8.49

Save your new Dataframe into a new CSV file¶

In [ ]:

Copied!

#save
dataset.to_csv("clean_dataset.csv", index=False)
#save
dataset.to_csv("clean_dataset.csv", index=False)

In [ ]:

Data Cleaning and Analysis with Pandas¶

Import the library Pandas and the datasetresto.csv¶

Data exploration and cleaning¶

Display the first 5 lines of the dataset.¶

Display the last 5 lines of the dataset.¶

Display the columns of your dataset¶

Check if there is nans or duplicates in your dataset¶

Fill your nans with "Not specified"¶

Check the type of your columns¶

Replace the total_price format to float¶

Create a new column to get the price by one unique item¶

Data Analysis¶

Display the statistical description of the dataset¶

Display the 10 most ordered items. Hint : Use the method groupby¶

Display the total number of items ordered¶

Display the company's total revenue¶

Display the average revenue per order¶

Drop the column Unnamed: 0 and Unnamed: 0.1¶

Save your new Dataframe into a new CSV file¶

Import the library Pandas and the dataset`resto.csv`¶

Drop the column `Unnamed: 0` and `Unnamed: 0.1`¶