Selenium like a Ninja¶
To be a real ninja scraper, you will have to build a custom Selenium driver 😎
Writing a custom Selenium driver offers several benefits, particularly when dealing with dynamic and complex web pages that may have measures to detect and block automated scraping. It provides a level of control and customization that is necessary for effective scraping of modern web applications.
Our goal 🚀¶
Steps to follow
The URL to use for this exercise is: https://www.welcometothejungle.com/fr/jobs?page=1&refinementList%5Bprofession_name.fr.Tech%5D%5B%5D=Data%20Analysis&refinementList%5Bcontract_type_names.fr%5D%5B%5D=CDI
A few things to notice:
- There are several pages of results, which you can browse simply by changing page=k in the URL. Each results page lists 30 job offers.
- Welcome to the Jungle has implemented anti-scraping measures. In particular, part of the HTML is hidden when the page is fetched with requests. You have to start scrolling the page to trigger the JavaScript that reveals the hidden content.
$\rightarrow$ BeautifulSoup alone is not enough to solve this problem. We need to simulate the behavior of a real person browsing the page with a mouse, so Selenium is the tool we need.
$\rightarrow$ Here is a function that simulates scrolling the page down to the i-th job offer:
def scroll(driver, i):
    # 250 px base offset, plus 140 px for each job offer to skip
    scroll_delta = 250 + 140 * i
    driver.execute_script(f"window.scrollBy(0, {scroll_delta})")
- Another anti-scraping measure concerns the class names, ids, and even the image links in the HTML. All of these names are random (e.g. class="sc-1flb27e-5 cdtiMs") and change every time the page loads.
$\rightarrow$ Some good news though: not all classes are random, some stay fixed. Even in the random names, some letters are fixed. So we can still rely on partial matches to target specific tags (e.g. the class of the header tag that contains the total number of results always starts with "hd").
$\rightarrow$ To exploit this weakness, target tags with a CSS selector, because CSS selectors can match an attribute by partial text. In Selenium 4 this is written driver.find_elements(By.CSS_SELECTOR, "header[class^='hd']") for all headers whose class starts with "hd" (the older find_elements_by_css_selector() method was removed in Selenium 4.3).
- In the end, we want to save the content of each job offer to a .txt file.
$\rightarrow$ So we will have to click on each job offer with Selenium's .click() method. For each job offer, the content is stored in a dictionary inside a <script> tag. We can use json.loads() to manipulate this dictionary, then save it to a .txt file with open() and .write().
- Save the content of each job offer in a Postgres database, then in MongoDB.
$\rightarrow$ Use a DataFrame as an intermediate structure.
$\rightarrow$ What is the problem with Postgres?
$\rightarrow$ How does MongoDB differ?
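The last steps of the list above (read the dictionary embedded in a <script> tag, parse it with json.loads(), write it to a .txt file) can be sketched offline. This is a minimal sketch, not the page's real markup: it assumes the job data sits in a <script type="application/ld+json"> tag, and the names JsonLdExtractor, extract_job and save_job as well as the sample HTML are made up for illustration. It uses the standard library's html.parser so it runs without a browser; with BeautifulSoup the lookup would be soup.find("script", type="application/ld+json").

```python
import json
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def extract_job(html):
    """Return the first JSON-LD block whose @type is JobPosting, or None."""
    parser = JsonLdExtractor()
    parser.feed(html)
    for block in parser.blocks:
        data = json.loads(block)
        if isinstance(data, dict) and data.get("@type") == "JobPosting":
            return data
    return None


def save_job(job, path):
    """Write one job dictionary to a text file as readable JSON."""
    text = json.dumps(job, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


# Made-up page source standing in for driver.page_source
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "JobPosting", "title": "Data Analyst"}
</script>
</head><body></body></html>
"""

job = extract_job(sample_html)
out_path = Path(tempfile.gettempdir()) / "job_0.txt"
save_job(job, out_path)
```

Keeping the parsing separate from the browser calls makes it easy to test on a saved page before running the full scrape.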
Custom Selenium driver¶
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
def initialize_driver(headers_list, proxy_list):
    options = Options()
    #select a random user-agent from the list
    user_agent = random.choice(headers_list)["User-Agent"]
    options.add_argument(f"user-agent={user_agent}")
    #select a random proxy from the list (a falsy entry means no proxy)
    proxy = random.choice(proxy_list)
    if proxy:
        options.add_argument(f"--proxy-server={proxy}")
    #add some common options
    options.add_argument("--headless")
    options.add_argument("--disable-extensions")
    options.add_argument("--ignore-certificate-errors")
    #initialize Chrome WebDriver with the specified options
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)
    #set implicit wait of 10sec
    driver.implicitly_wait(10)
    return driver
# Example usage (named driver, as in the rest of the notebook)
driver = initialize_driver(headers_list, proxy_list)
driver
<selenium.webdriver.chrome.webdriver.WebDriver (session="a8096916214b4e61e3b2cdc9b2f57dcc")>
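initialize_driver() expects headers_list and proxy_list to exist already, but neither is defined in this section. Judging from how they are indexed, headers_list is a list of dictionaries with a "User-Agent" key, and proxy_list is a list of proxy addresses in which a falsy entry means "no proxy". A minimal sketch under that assumption follows; the user-agent strings and the proxy address are placeholders, and build_chrome_args is a hypothetical helper that only mirrors the selection logic.

```python
import random

# Placeholder values: replace them with current user-agent strings and
# proxies that you are allowed to use.
headers_list = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"},
]
proxy_list = [
    None,                        # falsy entry: connect directly, no proxy flag
    "http://203.0.113.10:8080",  # documentation-range address, not a real proxy
]


def build_chrome_args(headers_list, proxy_list, rng=random):
    """Return the Chrome arguments chosen for one random identity."""
    args = [f"user-agent={rng.choice(headers_list)['User-Agent']}"]
    proxy = rng.choice(proxy_list)
    if proxy:
        args.append(f"--proxy-server={proxy}")
    return args


args = build_chrome_args(headers_list, proxy_list, rng=random.Random(0))
```

Passing a seeded random.Random makes the choice reproducible while debugging; in a real run the default random module is enough.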
Get the main page¶
Our goal here is very simple: write a Python function called MainPage that opens this URL: https://www.welcometothejungle.com/fr/jobs?page=1&configure%5Bfilters%5D=website.reference%3Awttj_fr&configure%5BhitsPerPage%5D=30&aroundQuery=France&refinementList%5Boffice.country_code%5D%5B%5D=FR&refinementList%5Bcontract_type_names.fr%5D%5B%5D=CDI&refinementList%5Bcontract_type_names.fr%5D%5B%5D=Stage&query=%22data%20analyst%22&range%5Bexperience_level_minimum%5D%5Bmin%5D=0&range%5Bexperience_level_minimum%5D%5Bmax%5D=1
and then sleeps for 3 seconds.
from time import sleep
# Exercise stub
def MainPage(driver, url):
    '''Go to the first page and sleep(3)'''
    #code here
# One possible solution
def MainPage(driver, url):
    '''Go to the first page and wait 3 seconds for it to load'''
    driver.get(url)
    sleep(3)
Get the number of offers per page¶
Write a Python function that returns the number of job offers on a page:
def nbOffers(driver):
    try:
        #code here
        pass
    except Exception as e:
        print("An error occurred in NB_OFFER:", str(e))
        return 0 # Or handle the exception as needed
Example usage of the nbOffers function:
url = f"https://www.welcometothejungle.com/fr/jobs?page=1&configure%5Bfilters%5D=website.reference%3Awttj_fr&configure%5BhitsPerPage%5D=30&aroundQuery=France&refinementList%5Boffice.country_code%5D%5B%5D=FR&refinementList%5Bcontract_type_names.fr%5D%5B%5D=CDI&refinementList%5Bcontract_type_names.fr%5D%5B%5D=Stage&query=%22data%20analyst%22&range%5Bexperience_level_minimum%5D%5Bmin%5D=0&range%5Bexperience_level_minimum%5D%5Bmax%5D=1"
MainPage(driver, url)
nb_offers = nbOffers(driver)
Then write a Python function that returns the total number of job posts across all pages:
def nbOffers_tot(driver):
    try:
        #code here
        pass
    except Exception as e:
        print("An error occurred in NB_OFFER TOTAL:", str(e))
        return 0 # Or handle the exception as needed
# open the page and get the number of offers
url = f"https://www.welcometothejungle.com/fr/jobs?page=1&configure%5Bfilters%5D=website.reference%3Awttj_fr&configure%5BhitsPerPage%5D=30&aroundQuery=France&refinementList%5Boffice.country_code%5D%5B%5D=FR&refinementList%5Bcontract_type_names.fr%5D%5B%5D=CDI&refinementList%5Bcontract_type_names.fr%5D%5B%5D=Stage&query=%22data%20analyst%22&range%5Bexperience_level_minimum%5D%5Bmin%5D=0&range%5Bexperience_level_minimum%5D%5Bmax%5D=1"
MainPage(driver, url)
nb_offers = nbOffers(driver)
nb_offerst = nbOffers_tot(driver)
print(f"\nNumbers offers tot : {nb_offerst} \nNumber of offers per page : {nb_offers}")
Numbers offers tot : 68
Number of offers per page : 30
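One possible way to fill in nbOffers_tot, given the earlier remark that the header holding the total number of results has a class starting with "hd": locate that header with a prefix selector, then pull the number out of its text. The exact text of the header is an assumption (something like "68 jobs"), so the parsing is kept apart from the browser call and can be checked on its own. parse_offer_count is a hypothetical helper name.

```python
import re


def parse_offer_count(text):
    """Return the first integer found in text, or 0 if there is none.

    Digit groups separated by a space or non-breaking space, as in
    "1 234", are read as a single number.
    """
    match = re.search(r"\d+(?:[\s\u00a0\u202f]\d{3})*", text)
    if match is None:
        return 0
    return int(re.sub(r"\D", "", match.group()))


# With a live driver (Selenium 4 syntax), the header text could be read like this:
#   from selenium.webdriver.common.by import By
#   headers = driver.find_elements(By.CSS_SELECTOR, "header[class^='hd']")
#   total = parse_offer_count(headers[0].text) if headers else 0

examples = [parse_offer_count(t) for t in ("68 jobs", "Jobs 1 234", "no number here")]
```

Returning 0 when nothing matches keeps the behavior in line with the except branch of the stub above.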
Click and GetText functions¶
Write a Python function that clicks on a given Selenium element:
def Click(driver, pos):
    '''Click on the link'''
    try:
        #code here
        pass
    except Exception as e:
        print("An error occurred in CLICK:", str(e))
        return 0 # Or handle the exception as needed
Then write a function that gets the text of a job post with BeautifulSoup and saves it into a list and a .txt file:
def GetText(driver, jobs):
    '''Parse the current job post, append it to jobs and save it to a .txt file'''
    sleep(3)
    try:
        #code here
        pass
    except Exception as e:
        print(f"Error HTML PARSING: {e}")
Put it into a loop 👨‍🍳👩‍🍳¶
Write a simple loop over the pages to put all the jobs into a list named jobs.
You can add a break statement while debugging, since a full run can take a long time 🤓
# scraping loop
write job : https://www.welcometothejungle.com/fr/companies/securitesociale/jobs/data-analyst-appui-au-pilotage-f-h_beauvais_LSS_jb46pYN?q=b26759548f42311cc511a99f6b39e87c&o=2290808 1/68
write job : https://www.welcometothejungle.com/fr/companies/pwc/jobs/senior-data-analyst-deals-m-a-lyon-cdi-h-f_neuilly-sur-seine?q=0ef9d769ff2a53f24418d1bb9396bc64&o=2255786 2/68
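A skeleton for this loop, as a sketch: since the page number is just the page=k query parameter and each page holds 30 offers, the number of pages follows from the total count. The helper names page_url and iter_pages are hypothetical, base_url is a shortened version of the search URL, and the scraping calls are left as comments because they depend on your own Click and GetText.

```python
import math
import re


def page_url(url, k):
    """Return url with its page=... query parameter set to k."""
    return re.sub(r"([?&]page=)\d+", rf"\g<1>{k}", url, count=1)


def iter_pages(url, total_offers, per_page=30):
    """Yield (page_number, page_url) for every results page."""
    n_pages = math.ceil(total_offers / per_page)
    for k in range(1, n_pages + 1):
        yield k, page_url(url, k)


# Shortened search URL, for illustration only
base_url = "https://www.welcometothejungle.com/fr/jobs?page=1&query=%22data%20analyst%22"

jobs = []
for k, url_k in iter_pages(base_url, total_offers=68):
    # One possible ordering of the calls written earlier:
    #   MainPage(driver, url_k)
    #   for pos in range(nbOffers(driver)):
    #       scroll(driver, pos)
    #       Click(driver, pos)
    #       GetText(driver, jobs)
    #   break  # uncomment while debugging to stop after the first page
    pass

pages = list(iter_pages(base_url, total_offers=68))
```

With 68 offers and 30 per page, the loop visits three pages, the last one holding the remaining 8 offers.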
Clean the data¶
Here our mission is simple: clean the data at least a little so that it can be inserted into a PostgreSQL database.
- Transform our job list into a pandas DataFrame
- Clean the text inside the description column
- Write a function called extract_salary_info() that splits the baseSalary column into ['minSalary', 'maxSalary', 'currency', 'salaryUnit']
- Extract the name variable inside the hiringOrganization column
- Extract the addressLocality variable inside the jobLocation column
- Drop the columns ['@context','baseSalary','educationRequirements','experienceRequirements','FAQPage']
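For the description column, the raw values are HTML fragments (they start with tags such as <p>). A minimal sketch of one way to strip the tags with the standard library follows; strip_html is a hypothetical helper and the sample string is made up. With BeautifulSoup, the equivalent would be BeautifulSoup(text, "html.parser").get_text(" ", strip=True).

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Keep only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value):
    """Return value without HTML tags and with whitespace collapsed.

    Non-string values (for example NaN for a missing description) are
    returned unchanged.
    """
    if not isinstance(value, str):
        return value
    collector = _TextCollector()
    collector.feed(value)
    collector.close()
    # Join text nodes with a space, then collapse runs of whitespace
    return re.sub(r"\s+", " ", " ".join(collector.parts)).strip()


cleaned = strip_html("<p>You will join <strong>the data team</strong>&nbsp;in Paris.</p>")

# With pandas:
#   jobs_df["description"] = jobs_df["description"].apply(strip_html)
```

Joining the text nodes with a space avoids gluing words together across tags such as </p><p>, at the cost of an occasional extra space inside a word that is split by inline markup.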
| | @context | @type | baseSalary | datePosted | description | employmentType | educationRequirements | experienceRequirements | hiringOrganization | industry | jobLocation | qualifications | title | validThrough | FAQPage |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | http://schema.org | JobPosting | {'@type': 'MonetaryAmount', 'currency': 'EUR',... | 2023-12-08T07:47:36.407881Z | <p>Le Data Analyst « Appui au Pilotage » prend... | FULL_TIME | {'@type': 'EducationalOccupationalCredential',... | {'@type': 'OccupationalExperienceRequirements'... | {'@type': 'Organization', 'name': 'La Sécurité... | Administration publique | [{'@type': 'Place', 'address': {'@type': 'Post... | Que ce soit avec SQL, Business Object XI, Exce... | Data Analyst Appui au Pilotage (F/H) | 2024-03-07T07:47:36.407Z | [{'@type': 'Question', 'name': 'Le télétravail... |
1 | http://schema.org | JobPosting | NaN | 2023-12-07T20:45:19.106393Z | <p>Vous souhaitez intégrer une équipe pluridis... | INTERN | {'@type': 'EducationalOccupationalCredential',... | {'@type': 'OccupationalExperienceRequirements'... | {'@type': 'Organization', 'name': 'Societe Gen... | Banque, FinTech / InsurTech, Finance | [{'@type': 'Place', 'address': {'@type': 'Post... | Vous préparez un Bac +4/5 en Ecole d'Ingénieur... | Data Analyst | 2024-03-06T20:45:19.106Z | [{'@type': 'Question', 'name': 'Le télétravail... |
2 | http://schema.org | JobPosting | {'@type': 'MonetaryAmount', 'currency': 'EUR',... | 2023-12-07T18:31:11.878063Z | <p>Tu seras chargé.e de transformer la donnée ... | INTERN | {'@type': 'EducationalOccupationalCredential',... | {'@type': 'OccupationalExperienceRequirements'... | {'@type': 'Organization', 'name': 'Hello Watt'... | Environnement / Développement durable, Energie... | [{'@type': 'Place', 'address': {'@type': 'Post... | Tu as d’excellentes compétences en Excel, VBA ... | Data Analyst (H/F) - Stage | 2024-03-06T18:31:11.878Z | [{'@type': 'Question', 'name': 'L'envoi d'un C... |
3 | http://schema.org | JobPosting | {'@type': 'MonetaryAmount', 'currency': 'EUR',... | 2023-12-07T13:00:47.450767Z | <p>Afin de mieux comprendre nos clients et leu... | INTERN | {'@type': 'EducationalOccupationalCredential',... | {'@type': 'OccupationalExperienceRequirements'... | {'@type': 'Organization', 'name': 'Matera', 's... | SaaS / Cloud Services, Immobilier commercial, ... | [{'@type': 'Place', 'address': {'@type': 'Post... | 😀 Idéalement,Tu es en dernière année d’école d... | Data Analyst - Stage de 6 mois | 2024-03-06T13:00:47.450Z | [{'@type': 'Question', 'name': 'Le télétravail... |
4 | http://schema.org | JobPosting | NaN | 2023-12-07T11:36:53.671284Z | <p><img loading="lazy" width="22" alt="🤔" src=... | FULL_TIME | {'@type': 'EducationalOccupationalCredential',... | {'@type': 'OccupationalExperienceRequirements'... | {'@type': 'Organization', 'name': 'Carrefour',... | Grande distribution, E-commerce, Grande consom... | [{'@type': 'Place', 'address': {'@type': 'Post... | 👥 Profil : De formation BAC+ 3 minimumVous dis... | Data Analyst (F/H) | 2024-03-06T11:36:53.671Z | [{'@type': 'Question', 'name': 'Le télétravail... |
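A possible shape for extract_salary_info(), plus the two nested lookups. The cells shown above are truncated, so the inner layout is an assumption: this sketch follows the usual schema.org JobPosting structure, in which baseSalary is a MonetaryAmount whose value holds minValue, maxValue and unitText, and jobLocation is a list of Place objects with an address. Check one full record before relying on it. organization_name and first_locality are hypothetical helper names, and the sample record is made up in that assumed layout.

```python
def extract_salary_info(base_salary):
    """Split a baseSalary cell into (minSalary, maxSalary, currency, salaryUnit).

    Missing salaries (NaN or None in the DataFrame) give four None values.
    """
    if not isinstance(base_salary, dict):
        return (None, None, None, None)
    value = base_salary.get("value")
    if not isinstance(value, dict):
        value = {}
    return (
        value.get("minValue"),
        value.get("maxValue"),
        base_salary.get("currency"),
        value.get("unitText"),
    )


def organization_name(hiring_organization):
    """Return the 'name' field of a hiringOrganization cell, or None."""
    if isinstance(hiring_organization, dict):
        return hiring_organization.get("name")
    return None


def first_locality(job_location):
    """Return addressLocality of the first place in a jobLocation cell, or None."""
    if isinstance(job_location, list) and job_location:
        address = job_location[0].get("address") or {}
        return address.get("addressLocality")
    return None


# Made-up record written in the assumed schema.org layout
sample = {
    "baseSalary": {
        "@type": "MonetaryAmount",
        "currency": "EUR",
        "value": {"@type": "QuantitativeValue", "minValue": 38000,
                  "maxValue": 45000, "unitText": "YEARLY"},
    },
    "hiringOrganization": {"@type": "Organization", "name": "La Sécurité Sociale"},
    "jobLocation": [{"@type": "Place",
                     "address": {"@type": "PostalAddress", "addressLocality": "Beauvais"}}],
}
salary = extract_salary_info(sample["baseSalary"])

# With pandas (jobs_df built from the jobs list):
#   cols = ["minSalary", "maxSalary", "currency", "salaryUnit"]
#   salary_df = pd.DataFrame(jobs_df["baseSalary"].map(extract_salary_info).tolist(),
#                            columns=cols, index=jobs_df.index)
#   jobs_df = jobs_df.join(salary_df)
```

Returning None for every field when the salary is missing keeps the four new columns aligned with the rows that have no baseSalary.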
Databases insertion¶
Our goal in this part is to insert our resulting data into a Postgres database 😎
We will use Docker to run the Postgres database with a single command:
docker run --name posttest -d -p 5432:5432 -e POSTGRES_PASSWORD=fred postgres:alpine
Your mission is simple: write the data to the database in a table named job_table!
You can use this sample code to connect to your database:
from sqlalchemy import create_engine
# Database credentials
user = 'postgres'
password = 'fred'
host = 'localhost' # the container port is published on this machine; use the server IP if PostgreSQL runs elsewhere
port = '5432' # default port for PostgreSQL used by our docker above
db = 'postgres'
# Create the connection
engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')
Then write a short script that connects to the database, lists all the tables in it, and runs a verification query (e.g., selecting the first 5 rows).
!pip install psycopg2 sqlalchemy
!docker run --name posttest -d -p 5432:5432 -e POSTGRES_PASSWORD=fred postgres:alpine
114d5b36e4f1667e98d5ec1b36c5541d248465e420ea8047b1824d3eab1d873b
jobs_df
| | @type | datePosted | description | employmentType | hiringOrganization | industry | jobLocation | qualifications | title | validThrough | minSalary | maxSalary | currency | salaryUnit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | JobPosting | 2023-12-08T07:47:36.407881Z | Le Data Analyst « Appui au Pilotage » prendra ... | FULL_TIME | La Sécurité Sociale | Administration publique | Beauvais | Que ce soit avec SQL, Business Object XI, Exce... | Data Analyst Appui au Pilotage (F/H) | 2024-03-07T07:47:36.407Z | 38000.0 | 45000.0 | EUR | YEARLY |
1 | JobPosting | 2023-12-07T20:45:19.106393Z | Vous souhaitez intégrer une équipe pluridiscip... | INTERN | Societe Generale | Banque, FinTech / InsurTech, Finance | Fontenay-Sous-Bois | Vous préparez un Bac +4/5 en Ecole d'Ingénieur... | Data Analyst | 2024-03-06T20:45:19.106Z | NaN | NaN | None | None |
2 | JobPosting | 2023-12-07T18:31:11.878063Z | Tu seras chargé.e de transformer la donnée uti... | INTERN | Hello Watt | Environnement / Développement durable, Energie... | Paris | Tu as d’excellentes compétences en Excel, VBA ... | Data Analyst (H/F) - Stage | 2024-03-06T18:31:11.878Z | 1000.0 | 1600.0 | EUR | MONTHLY |
3 | JobPosting | 2023-12-07T13:00:47.450767Z | Afin de mieux comprendre nos clients et leurs ... | INTERN | Matera | SaaS / Cloud Services, Immobilier commercial, ... | Paris | 😀 Idéalement,Tu es en dernière année d’école d... | Data Analyst - Stage de 6 mois | 2024-03-06T13:00:47.450Z | 1000.0 | 1200.0 | EUR | NONE |
4 | JobPosting | 2023-12-07T11:36:53.671284Z | Le saviez-vous ? : Nous rejoindre, c'est rejoi... | FULL_TIME | Carrefour | Grande distribution, E-commerce, Grande consom... | Mondeville | 👥 Profil : De formation BAC+ 3 minimumVous dis... | Data Analyst (F/H) | 2024-03-06T11:36:53.671Z | NaN | NaN | None | None |
Verification
import psycopg2
#connect to the PostgreSQL database
conn = psycopg2.connect(dbname='postgres', user=user, password=password, host=host, port=port)
#create a cursor object
cursor = conn.cursor()
Tables in the database:
('job_table',)
('postgres',)
Table 'job_table' exists.
First 5 rows of 'job_table':
| | @type | datePosted | description | employmentType | hiringOrganization | industry | jobLocation | qualifications | title | validThrough | minSalary | maxSalary | currency | salaryUnit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | JobPosting | 2023-12-08T07:47:36.407881Z | Le Data Analyst « Appui au Pilotage » prendra ... | FULL_TIME | La Sécurité Sociale | Administration publique | Beauvais | Que ce soit avec SQL, Business Object XI, Exce... | Data Analyst Appui au Pilotage (F/H) | 2024-03-07T07:47:36.407Z | 38000.0 | 45000.0 | EUR | YEARLY |
1 | JobPosting | 2023-12-07T20:45:19.106393Z | Vous souhaitez intégrer une équipe pluridiscip... | INTERN | Societe Generale | Banque, FinTech / InsurTech, Finance | Fontenay-Sous-Bois | Vous préparez un Bac +4/5 en Ecole d'Ingénieur... | Data Analyst | 2024-03-06T20:45:19.106Z | NaN | NaN | None | None |
2 | JobPosting | 2023-12-07T18:31:11.878063Z | Tu seras chargé.e de transformer la donnée uti... | INTERN | Hello Watt | Environnement / Développement durable, Energie... | Paris | Tu as d’excellentes compétences en Excel, VBA ... | Data Analyst (H/F) - Stage | 2024-03-06T18:31:11.878Z | 1000.0 | 1600.0 | EUR | MONTHLY |
3 | JobPosting | 2023-12-07T13:00:47.450767Z | Afin de mieux comprendre nos clients et leurs ... | INTERN | Matera | SaaS / Cloud Services, Immobilier commercial, ... | Paris | 😀 Idéalement,Tu es en dernière année d’école d... | Data Analyst - Stage de 6 mois | 2024-03-06T13:00:47.450Z | 1000.0 | 1200.0 | EUR | NONE |
4 | JobPosting | 2023-12-07T11:36:53.671284Z | Le saviez-vous ? : Nous rejoindre, c'est rejoi... | FULL_TIME | Carrefour | Grande distribution, E-commerce, Grande consom... | Mondeville | 👥 Profil : De formation BAC+ 3 minimumVous dis... | Data Analyst (F/H) | 2024-03-06T11:36:53.671Z | NaN | NaN | None | None |
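The verification cell stops after creating the cursor, while its output shows a table listing and the first rows. A sketch of the two queries involved is given here against an in-memory SQLite database. SQLite is a stand-in chosen so that the example runs without Docker; it is not the Postgres setup above, and the table contents are made up. On Postgres the listing query would read from information_schema.tables instead, as noted in the comments.

```python
import sqlite3

# Stand-in database: in-memory SQLite instead of the Postgres container
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Made-up rows playing the role of job_table
cursor.execute("CREATE TABLE job_table (title TEXT, currency TEXT, minSalary REAL)")
cursor.executemany(
    "INSERT INTO job_table VALUES (?, ?, ?)",
    [(f"Job {i}", "EUR" if i % 2 == 0 else None, 1000.0 * i) for i in range(8)],
)
conn.commit()

# 1) List the tables.
#    Postgres equivalent:
#    SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
tables = [row[0] for row in cursor.fetchall()]
print("Tables in the database:", tables)

# 2) Verification query: first 5 rows (the same SQL works on Postgres)
first_rows = []
if "job_table" in tables:
    print("Table 'job_table' exists.")
    cursor.execute("SELECT * FROM job_table LIMIT 5")
    first_rows = cursor.fetchall()
    for row in first_rows:
        print(row)

cursor.close()
conn.close()
```

With psycopg2, the cursor calls (execute, fetchall, close) are used the same way; only the connection and the table-listing query change.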
MongoDB¶
Same mission with MongoDB:
docker run -d --name example-mongo -p 27017:27017 mongo
Connect to the MongoDB database and run a dummy query, such as counting the documents whose currency is 'EUR'.
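A sketch of that dummy query. count_documents() is the PyMongo collection method for counting matches; the database and collection names (jobs_db, jobs) are assumptions, as is the connection string for the container started above. To keep the example runnable without a server, the counting function is exercised against a tiny stand-in object (FakeCollection, made up here), and the real calls are shown in comments.

```python
def count_by_currency(collection, currency="EUR"):
    """Count the documents whose 'currency' field equals currency."""
    return collection.count_documents({"currency": currency})


class FakeCollection:
    """Stand-in for a PyMongo collection, supporting equality filters only."""

    def __init__(self, documents):
        self.documents = list(documents)

    def count_documents(self, query):
        return sum(
            all(doc.get(key) == value for key, value in query.items())
            for doc in self.documents
        )


# Made-up documents: fields may differ from one document to the next
docs = [
    {"title": "Data Analyst", "currency": "EUR", "minSalary": 38000},
    {"title": "Data Analyst - Stage"},
    {"title": "Data Analyst (H/F)", "currency": "EUR", "minSalary": 1000},
]
n_eur = count_by_currency(FakeCollection(docs))

# Against the real container (names are assumptions):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   collection = client["jobs_db"]["jobs"]
#   collection.insert_many(jobs_df.to_dict("records"))
#   print(count_by_currency(collection))
```

Note that the second document has no currency field at all: a document store accepts records with different shapes, whereas a relational table needs every row to fit the same columns.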
#!pip install pymongo
!docker run -d --name example-mongo -p 27017:27017 mongo
from pymongo import MongoClient