Scrapy Introduction

Scrapy is a very popular Python framework designed specifically for web scraping and crawling. It ships with a huge amount of functionality ready to use out of the box and can easily be extended with open-source Scrapy extensions and middlewares.

Basically, Scrapy is a great option for building production-ready scrapers that can scrape the web at scale.

The good part is that, like every other popular Python tool, Scrapy has the power of its community behind it: there are lots of free resources and hands-on projects like the Scrapy Playbook. We will see this later, do not worry 😎

Install scrapy

First, of course, you need to install the Scrapy library. You can follow the official documentation or the instructions here for a complete Python virtualenv installation (which I encourage you to follow in order to keep your local machine clean 🤓).
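If you go the virtualenv route, a minimal setup could look like this (assuming Python 3 is installed; the environment name venv is arbitrary):

python3 -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
pip install scrapy
scrapy version             # quick sanity check that the install worked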

I recommend using an editor like Visual Studio Code (it's the one I will be using throughout this example) because we need to handle many files and do not want to navigate our terminal so often with the cd command.

First scrapy project

After setting up our virtual environment and installing Scrapy, we're ready to dive into the exciting part: creating our very first Scrapy project.

The project created by Scrapy will contain all our scraper code. Scrapy provides a predefined template to organize our scrapers efficiently.

To initiate a Scrapy project, we use the command:

scrapy startproject <project_name>

For our example, since our target is the Quotes to Scrape website, we'll name our project quotescrap. However, you can choose any name for your project.

scrapy startproject quotescrap

After running this command, if we list the contents of our directory using the ls command, we should see:

├── scrapy.cfg
└── quotescrap

Understanding the Structure of a Scrapy project

Let's take a moment to understand Scrapy's project structure, particularly what the scrapy startproject quotescrap command did. If you open this folder in a code editor like VS Code, you'll see the complete directory structure:

├── scrapy.cfg
└── quotescrap
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

This structure, created by Scrapy, is the foundation for our project and highlights five key components of every Scrapy project: Spiders, Items, Middlewares, Pipelines, and Settings. These components enable the creation of a versatile scraper.

We might not use all these files in a beginner's project, but it's worth understanding each one:

  • settings.py holds all project settings, such as pipeline and middleware activation, delays, and concurrency.
  • items.py defines a model for your scraped data.
  • pipelines.py processes items yielded by spiders, often used for data cleaning and storage.
  • middlewares.py is where request modifications and response handling occur.
  • scrapy.cfg is a configuration file for deployment settings.
  • spiders are the core component where the scraping logic resides: it's here that we will spend most of our time 😂

At this point you might have one question: why so much effort just to scrape some information from a webpage? Some key points to note:

  • Asynchronous Nature: Built on the Twisted framework, Scrapy's requests are non-blocking.
  • Spider Name: Each spider must have a unique name for identification.
  • Start Requests: The initial URLs for scraping are defined in start_requests().
  • Parse Method: This is where the response data is processed and extracted.
  • Scrapy's real power lies in its Items, Middlewares, Pipelines, and Settings, which provide a more structured approach compared to a typical Python Requests/BeautifulSoup scraper.
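To make these points concrete, here is a minimal, hypothetical spider sketch showing where name, start_requests(), and parse() fit (the name and URL are only for illustration; we will generate a real spider with Scrapy's tooling below):

import scrapy

class MinimalSpider(scrapy.Spider):
    name = 'minimal'  # unique name used to run the spider

    def start_requests(self):
        # the initial URLs are turned into Requests here
        yield scrapy.Request('http://quotes.toscrape.com', callback=self.parse)

    def parse(self, response):
        # the response is processed and data is extracted here
        self.logger.info('Fetched %s (%d bytes)', response.url, len(response.body))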

Scrapy Items

In Scrapy, Items serve as containers for structured data that we scrape. They enable easy cleaning, validation, and storage of scraped data with features like ItemLoaders, Item Pipelines, and Feed Exporters.

Advantages of using Scrapy Items include:

  1. Structuring and defining a clear schema for your data.
  2. Simplifying data cleaning and processing.
  3. Validating and deduplicating data, as well as monitoring data feeds.
  4. Streamlining data storage and exporting with Scrapy Feed Exports.
  5. Facilitating the use of Scrapy Item Pipelines & Item Loaders.

Scrapy Items are typically defined in the items.py file like this :

import scrapy

class QuotescrapItem(scrapy.Item):
    # define the fields for your item here like:
    quote = scrapy.Field()
    author = scrapy.Field()
    about = scrapy.Field()
    tags = scrapy.Field()

Inside your spider, instead of yielding a dictionary, you create and yield a new Item filled with the scraped data. We will see this later, do not worry 🤓

Scrapy Pipelines

Pipelines in Scrapy are where the scraped data passes through for cleaning, processing, validation, and storage.

Some functions of Scrapy Pipelines:

  • Clean the data (e.g., removing currency symbols from prices).
  • Format the data (e.g., converting strings to integers).
  • Enrich the data (e.g., converting relative links to absolute links).
  • Validate the data (e.g., ensuring a scraped price is viable).
  • Store data in various formats and locations.

We will see an example later of a pipeline storing data in databases.
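To make this concrete, here is a minimal sketch of what a cleaning pipeline could look like, assuming we simply lowercase the author field (it matches the QuotescrapPipeline we will register in settings.py later; adapt it to your own cleaning needs):

# pipelines.py (sketch of a cleaning pipeline)
from itemadapter import ItemAdapter

class QuotescrapPipeline:

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # example cleaning step: normalize the author name to lowercase
        if adapter.get('author'):
            adapter['author'] = adapter['author'].lower()
        return item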

Scrapy Middlewares

Scrapy Middlewares are components that process requests and responses within the Scrapy framework. They fall into two categories: Downloader Middlewares and Spider Middlewares.

Downloader middlewares act between the Scrapy Engine and the Downloader. They process requests going to the Downloader and responses coming back to the Engine.

Scrapy's default downloader middlewares are listed in the DOWNLOADER_MIDDLEWARES_BASE setting, which is part of Scrapy's built-in default settings:

# Scrapy default settings (not your settings.py)
DOWNLOADER_MIDDLEWARES_BASE = {
    # Middlewares listed here...
}

These middlewares handle a range of functions like timing out requests, managing headers, user agents, retries, cookies, caches, and response compression.

To disable any default middleware, set it to None under DOWNLOADER_MIDDLEWARES in your settings.py.
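For example, a settings.py snippet that disables Scrapy's built-in user-agent middleware could look like this (shown only as an illustration of the None convention):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}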

Custom middlewares can be created for specific tasks like altering requests, handling responses, retrying requests based on content, and more.

Code our first spider

Scrapy provides a number of different spider types; in this article we will cover the most common one, the generic Spider. Here are some of the most frequently used types:

  • Spider - Takes a list of start_urls and scrapes each one with a parse method.
  • CrawlSpider - Designed to crawl a full website by following any links it finds.
  • SitemapSpider - Designed to extract URLs from a sitemap.
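As a quick illustration of the CrawlSpider type, a hypothetical sketch could look like this (we will stick with the generic Spider in the rest of this tutorial):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuoteCrawlSpider(CrawlSpider):
    name = 'quotecrawler'  # hypothetical spider name
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    # follow pagination links and send every matching page to parse_page
    rules = (
        Rule(LinkExtractor(allow=r'/page/'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # extraction logic would go here
        pass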

To create a new generic spider, simply run the genspider command:

# syntax is --> scrapy genspider <name_of_spider> <website>
$ scrapy genspider quotespider quotes.toscrape.com

Then you will see that a new file has been added to the spiders folder inside your project; it looks like this:

import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotespider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        pass

Here we see that the genspider command has created a template spider for us to use in the form of a Spider class. This spider class contains:

  • name - a class attribute that gives a name to the spider. We will use this when running our spider later with scrapy crawl quotespider.
  • allowed_domains - a class attribute that tells Scrapy it should only ever scrape pages of the quotes.toscrape.com domain. This prevents the spider from going rogue and scraping lots of other websites. It is optional.
  • start_urls - a class attribute that tells Scrapy the first URL it should scrape. We will be changing this in a bit.
  • parse - the parse function is called after a response has been received from the target website.

To start using this spider we just need to insert our parsing code into the parse function. To do this, we need to write CSS/XPath selectors that extract the data we want from the page. We will use the Scrapy IPython shell to play with selectors and find the right ones.

The Scrapy IPython shell

To retrieve data from an HTML page, we will utilize XPath or CSS selectors. These selectors act as navigational tools guiding Scrapy through the DOM tree to pinpoint the required data.

Here we'll first employ some CSS selectors for data extraction, supported by the Scrapy Shell for crafting these selectors.

A very good feature of Scrapy is its integrated shell, which significantly aids in testing and fine-tuning your selectors. This tool allows for immediate testing of selectors in the terminal, eliminating the need to run the entire scraper.

To initiate the Scrapy shell, execute:

scrapy shell

For an enhanced shell experience with features like auto-completion and colored output, you can opt for IPython like this :

pip3 install ipython

Then, configure your scrapy.cfg file to use IPython:

## scrapy.cfg
[settings]
default = quotescrap.settings
shell = ipython

Once inside the Scrapy shell, you'll be greeted with an interface where you can play with your page, for example:

In [7]: response.css('ul.pager').get()
Out[7]: '<ul class="pager">\n            \n            \n            <li class="next">\n                <a href="/page/2/">Next <span aria-hidden="true">→</span></a>\n            </li>\n            \n        </ul>'

In [8]: response.css('ul.pager a').get()
Out[8]: '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

In [9]: response.css('ul.pager a::text').get()
Out[9]: 'Next '

In [10]: response.css('ul.pager a::attr(href)').get()
Out[10]: '/page/2/'

Play with selectors

The first thing we want to do is fetch the main page of the quotes site in our Scrapy shell.

fetch('https://quotes.toscrape.com/')

Then, as you can see from the output, the Scrapy shell has automatically saved the HTML response in the response variable:

In [1]: fetch('http://quotes.toscrape.com')
2023-12-10 11:42:51 [scrapy.core.engine] INFO: Spider opened

In [2]: response
Out[2]: <200 http://quotes.toscrape.com>

Find quote CSS Selectors

To find the correct CSS selectors to parse our quote details, we will first open the page in our browser's DevTools and inspect it. If you are not familiar with DevTools, it is very simple; you can check this quick tutorial.

Using the inspect element tool, hover over a quote and look at the ids and classes on the individual quote. In this case we can see that each quote sits inside its own <div class="col-md-8"> component. We can use this to reference our quotes.

You can also notice that <div class="col-md-8"> is not exactly the <div> we want; the one we want is the <div class="quote"> inside it. Let's practice a little with the Scrapy shell.

Now, using our Scrapy shell, we can see if we can extract the quote information with this command:

In [2]: response.css('div.col-md-8 div.quote')
Out[2]: 
[<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' col-md-8 ')]/descendant-or-self::*/div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' col-md-8 ')]/descendant-or-self::*/div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' col-md-8 ')]/descendant-or-self::*/div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 ...
]

Great, we have all the <div> elements containing our quotes. Now let's get the first one with the get() function:

In [5]: response.css('div.col-md-8 div.quote').get()
Out[5]: '<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> \n            \n            <a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n            \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/">world</a>\n            \n        </div>\n    </div>'

Now that we have found the DOM nodes that contain our quote items, we will grab all of them, save them into a variable, loop through them, and extract the data we need.

First, let's check how many quotes we matched with the following command:

In [6]: len(response.css('div.col-md-8 div.quote'))
Out[6]: 10

That verifies that all the quotes on the webpage are detected as we want. Now let's get them all 😻

In [7]: response.css('div.col-md-8 div.quote').getall()
Out[7]: 
['<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> \n            \n            <a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n            \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/">world</a>\n            \n        </div>\n    </div>',
 '<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>\n        <span>by <small class="author" itemprop="author">J.K. Rowling</small>\n        <a href="/author/J-K-Rowling">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="abilities,choices"> \n            \n            <a class="tag" href="/tag/abilities/page/1/">abilities</a>\n            \n            <a class="tag" href="/tag/choices/page/1/">choices</a>\n            \n        </div>\n    </div>',
 ...
]

Now you have all the information inside each <div>.

Extract quotes details

Now let's extract the quote text, author, and tags of each quote from the list of quotes.

When we update our spider code, we will loop through the list of quotes. However, to find the correct selectors we will first test the CSS selectors on the first element of the list, q[0].

In [9]: q = response.css('div.col-md-8 div.quote')

In [10]: q[0].get()
Out[10]: '<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> \n            \n            <a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n            \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/">world</a>\n            \n        </div>\n    </div>'

You can see that we have <span> tags inside our divs; let's explore this:

In [13]: q[0].css('span')
Out[13]: 
[<Selector query='descendant-or-self::span' data='<span class="text" itemprop="text">“T...'>,
 <Selector query='descendant-or-self::span' data='<span>by <small class="author" itempr...'>]

You can simply select one of these tags like this:

In [14]: q[0].css('span.text')
Out[14]: [<Selector query="descendant-or-self::span[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">“T...'>]

If you want only the text, just type:

q[0].css('span.text::text').get()

and you will have only the text part extracted from the HTML code. Isn't it wonderful 🥳

Now let's extract the link to the author page inside our quotes. You can easily notice that this information is hiding inside an <a> tag. With the same logic we can write:

In [19]: q[0].css('span a')
Out[19]: [<Selector query='descendant-or-self::span/descendant-or-self::*/a' data='<a href="/author/Albert-Einstein">(ab...'>]

What is interesting here is the href attribute; let's extract it like this:

In [20]: q[0].css('span a::attr(href)').get()
Out[20]: '/author/Albert-Einstein'
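The tags follow the same pattern: they are <a class="tag"> elements inside the tags <div>. Based on the HTML we saw above, a quick check should return something like this:

In [21]: q[0].css('div.tags a.tag::text').getall()
Out[21]: ['change', 'deep-thoughts', 'thinking', 'world']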

Code our spider

Now that we've found the correct CSS selectors, let's update our spider. Exit the Scrapy shell with the exit() command or just Ctrl+D, like in most shells.

Our updated Spider code should look like this:

import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotespider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        quote = response.css('div.col-md-8 div.quote')
        for q in quote:
            yield {
                'quote': q.css('span.text::text').get(),
                'author': q.css('small::text').get(),
                'about': 'http://quotes.toscrape.com' + q.css('span a').attrib['href'],
                'tags': {tmp: None for tmp in q.css('div.tags a.tag::text').getall()},
            }

Here, our spider does the following steps:

  1. Makes a request to 'https://quotes.toscrape.com/'.
  2. When it gets a response, it extracts all the quotes from the page using quote = response.css('div.col-md-8 div.quote')
  3. Loops through each quote and extracts the information using the CSS selectors we created.
  4. Yields (returns) these items so they can be output to the terminal and/or stored in a CSV, JSON file, database, etc.

Then you can run your spider with this command:

scrapy crawl quotespider 

You can also run the spider and write the information to a JSON file like this:

scrapy crawl quotespider -O quote.json

and you will see a quote.json file created inside your project directory. Nice, right? 🧙🏼‍♂️

Up to this point the code works well, yet it only scrapes quotes from the first page of our website, based on the URL specified in the start_urls variable. But you might notice that our website has more pages 😂

The next reasonable action is to navigate to the subsequent page, if available, and extract item data from there as well. Let's explore how to accomplish this.

First, we should revisit our Scrapy shell, fetch the page again, and determine the appropriate selector to identify the next page button. Same routine: fetch the page and inspect the sections of interest. Inside the <nav> tag you will see a <ul class="pager">.

Update your spider with the page handling. Be careful: inside this pager you also have the Previous button, not only Next. You can check this tutorial and adapt it for our case 🤓 (a possible sketch is shown right after the code below).

import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotespider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        quote = response.css('div.col-md-8 div.quote')
        for q in quote:
            yield {
                'quote': q.css('span.text::text').get(),
                'author': q.css('small::text').get(),
                'about': 'http://quotes.toscrape.com' + q.css('span a').attrib['href'],
                'tags': {tmp: None for tmp in q.css('div.tags a.tag::text').getall()},
            }

        #code here
        #next_page =
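If you want to compare with your own solution, one possible approach is a sketch like this, placed at the end of the parse method; it targets the li.next link specifically so the Previous button is never followed:

        next_page = response.css('ul.pager li.next a::attr(href)').get()
        if next_page is not None:
            # response.follow resolves the relative URL and schedules the next page
            yield response.follow(next_page, callback=self.parse)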

Use Items

As we said at the beginning of this tutorial, Items give you a way to structure your data and give it a clear schema. Let's code this for our project:

import scrapy

class QuotescrapItem(scrapy.Item):
    # define the fields for your item here like:
    quote = scrapy.Field()
    author = scrapy.Field()
    about = scrapy.Field()
    tags = scrapy.Field()

Then we can use it in our custom spider like this (remember to import it from items.py):

import scrapy

from quotescrap.items import QuotescrapItem

class QuoteSpider(scrapy.Spider):
    name = 'quotespider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        quote = response.css('div.col-md-8 div.quote')

        for q in quote:
            # create a fresh Item for each quote
            quote_item = QuotescrapItem()
            quote_item['quote'] = q.css('span.text::text').get()
            quote_item['author'] = q.css('small::text').get()
            quote_item['about'] = 'http://quotes.toscrape.com' + q.css('span a').attrib['href']
            quote_item['tags'] = {tmp: None for tmp in q.css('div.tags a.tag::text').getall()}
            yield quote_item

        #handle next page

You can note that it is not very different from our previous code; the general structure does not change much 🤓

Now that we have this, our next goal is to insert all our records into a database. For this we have to do two main things:

  1. Write a little code inside our pipelines.py file
  2. Register it in the settings.py file so it is applied when we run our crawler

Insert scraped data into databases

We saw earlier that we can save our scraped data into a JSON or CSV file with the Scrapy CLI like this:

scrapy crawl quotespider -O quote.json

or

scrapy crawl quotespider -O quote.csv

You have two options when using this command: a lowercase -o or an uppercase -O.

  • -o : appends new data to an existing file.
  • -O : overwrites any existing file with the same name with the current data.

But it is much more efficient and reliable to store our scraped data in a database! We will use Docker to quickly instantiate our databases, and we will go with MongoDB and Postgres for our little example.

MongoDB

You can find all the instructions to set up and connect to a MongoDB Docker container in the Selenium section of this course. Let's assume from here that you have a proper MongoDB container up and running on your local machine.

First you will need to install the MongoDB Python connector (pymongo) if it's not already in your environment. Then we need to edit our pipelines.py file and set up our mini pipeline.

First, we're going to import MongoClient into our pipelines.py file and create a pipeline skeleton; its __init__ method will connect to a database named quote_database:

# pipelines.py

from pymongo import MongoClient

class SaveToMongoPipeline:

    def __init__(self):
        pass

    def process_item(self, item, spider):
        return item

The full pipeline will do the following every time it gets activated by a spider:

  1. In __init__, connect to the MongoDB server and access our database quote_database (MongoDB creates it on first write if it doesn't exist)
  2. Access the collection quote_collection (similar to a table in a relational database)
  3. In process_item, insert each Item into that collection
  4. In close_spider, close the connection to free resources

# pipelines.py
from pymongo import MongoClient

class SaveToMongoPipeline:

    def __init__(self):
        #connect to the MongoDB server (default is localhost on port 27017)
        self.conn = MongoClient('0.0.0.0', 27017)
        #access the database (create it if it doesn't exist)
        self.db = self.conn['quote_database']
        #access the collection (similar to a table in relational databases)
        self.collection = self.db['quote_collection']

    def process_item(self, item, spider):
        #dump item into MongoDB
        pass

    def close_spider(self, spider):
        #close the connection to the database
        self.conn.close()

Save scraped Items into Mongo

Next, we're going to fill in the process_item function of our pipeline to dump our data into our Mongo database. Note that it uses ItemAdapter, so add from itemadapter import ItemAdapter at the top of pipelines.py:

    def process_item(self, item, spider):
        # Convert item to dict and insert into MongoDB
        self.collection.insert_one(ItemAdapter(item).asdict())
        return item

Run our pipeline

In order to run our pipeline we need to register it in our settings.py file:

ITEM_PIPELINES = {
    "quotescrap.pipelines.QuotescrapPipeline": 300,   # example cleaning pipeline
    "quotescrap.pipelines.SaveToMongoPipeline": 400,  # lower number = higher priority (runs earlier)
}

Now, when we run our quotespider it will save the scraped data into our database 🥳

Access MongoDB CLI

First, you need to find the container ID or name of your MongoDB container. You can do this by listing all running Docker containers:

docker ps

Look for the container running MongoDB and note down its CONTAINER ID or NAME. Then, access the MongoDB CLI within the container using:

docker exec -it <container_id_or_name> mongosh

Replace <container_id_or_name> with the actual ID or name of your MongoDB container. Once you're inside the MongoDB CLI, you can use MongoDB commands to interact with your database.

List all databases:

show dbs

Switch to your specific database:

use quote_database

List all collections in your database:

show collections

Query data from your collection:

db.quote_collection.find().pretty()

This command will display the contents of the quote_collection in a formatted way like this :

[
  {
    _id: ObjectId("6574c09fb48252134cbf1c23"),
    quote: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
    author: 'albert einstein',
    about: 'http://quotes.toscrape.com/author/Albert-Einstein',
    tags: {
      change: null,
      'deep-thoughts': null,
      thinking: null,
      world: null
    }
  },
...
]

Postgres

You can follow the same pattern with another database like PostgreSQL, easily, like this:

import psycopg2
import json
from itemadapter import ItemAdapter

class SavePostgreSQLPipeline:

    def __init__(self):
        # Connect to the PostgreSQL server
        # Update the connection details as per your PostgreSQL configuration
        self.conn = psycopg2.connect(
            host='0.0.0.0',
            dbname='postgres',
            user='postgres',
            password='fred'
        )
        self.cur = self.conn.cursor()

        # Create table if it doesn't exist
        self.cur.execute("""
        CREATE TABLE IF NOT EXISTS quotes (
            id SERIAL PRIMARY KEY,
            quote TEXT,
            author TEXT,
            about TEXT,
            tags JSONB
        )
        """)
        self.conn.commit()

    def process_item(self, item, spider):
        # Convert item to dict
        item_dict = ItemAdapter(item).asdict()

        # Insert data into the table
        self.cur.execute("""
        INSERT INTO quotes (quote, author, about, tags) VALUES (%s, %s, %s, %s)
        """, (item_dict['quote'], item_dict['author'], item_dict['about'], json.dumps(item_dict['tags'])))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the cursor and connection to the database
        self.cur.close()
        self.conn.close()

Then edit the settings.py file to add our new step:

# cleaning + save into Mongo & Postgres
ITEM_PIPELINES = {
    "quotescrap.pipelines.QuotescrapPipeline": 300,
    "quotescrap.pipelines.SaveToMongoPipeline": 400,
    "quotescrap.pipelines.SavePostgreSQLPipeline": 401,  # after saving to Mongo
}

You can run the same kind of verification as with Mongo, this time from inside the Postgres container 🤓
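For example, assuming your Postgres container runs with the default postgres user and database (as in the pipeline above), something like this should let you peek at the table:

docker exec -it <container_id_or_name> psql -U postgres
postgres=# SELECT quote, author FROM quotes LIMIT 5;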

Scaling

Add a user agent to our Scrapy requests

One option is to set a user-agent on every request your spider makes, by defining a User-Agent in the headers of your request like this:

## YourCustomSpider.py
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url=url, callback=self.parse,
                             headers={"User-Agent": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"})

A common good practice is to rotate through a list of user agents in our spider and use a random one with every request. You can also use fake browser headers in your web scraper to bypass more complex anti-bot defenses, like this:

    def start_requests(self):
        # requires: from random import randint at the top of the spider file
        user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
        'Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
        'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
        ]

        fake_browser_header = {
            "upgrade-insecure-requests": "1",
            "user-agent": "",
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "sec-ch-ua": "\".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"",
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": "\"Linux\"",
            "sec-fetch-site": "none",
            "sec-fetch-mod": "",
            "sec-fetch-user": "?1",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "fr-CH,fr;q=0.9,en-US;q=0.8,en;q=0.7"
        }

        for url in self.start_urls:
            userr = user_agent_list[randint(0, len(user_agent_list)-1)]
            fake_browser_header["user-agent"] = userr
            print(f"\n************ New User Agent *************\n{userr}\n")
            yield scrapy.Request(url=url, callback=self.parse, headers=fake_browser_header)

You can use generative AI like ChatGPT to generate fake headers and user agents, or use a proper tool such as the ScrapeOps Fake Browser Header API, a freemium API that returns a list of fake browser headers.

To use it you just need to send a request to their API endpoint to retrieve a list of user-agents but you first need an API key which you can get by signing up for a free account.

The best way to integrate the Fake Browser Headers API is to create a downloader middleware that adds fake browser headers to every request, but we will not go into that level of detail here. The idea is pretty much the same: edit the middlewares.py file according to your needs, then save the config inside your settings.py file like this:

#settings.py

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_FAKE_BROWSER_HEADER_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'quotescrap.middlewares.ScrapeOpsFakeBrowserHeaderAgentMiddleware': 400,
}

The same goes for proxy rotation: based on the same idea, you can use a different IP address for each request. You can find more about proxies and headers in the Selenium section. Keep this in mind, we will use it later for better scaling and monitoring of our crawler 🤖

Manage thousands of fake user agents with ScrapeOps

To take a closer look at the user agent concept, go to the Selenium section of this course; you will find more details there about headers and how to handle them in Python 😎

For a more scalable solution, I recommend this super tutorial on how to manage thousands of fake user agents here.

Deploy and schedule jobs with Scrapyd

Now that we have set up our crawler, our database, and some spicy techniques to scrape like a spy, it would be cool if we could run it in the cloud and manage jobs. You can find everything about Scrapyd in the official doc.

To run jobs using Scrapyd, we first need to eggify and deploy our Scrapy project to the Scrapyd server. To do this, there is a library called scrapyd-client that makes this process very simple. First, let's install scrapyd-client:

pip install git+https://github.com/scrapy/scrapyd-client.git

Once installed, navigate to the quotescrap project we want to deploy and open the scrapy.cfg file, which should be located in your project's root directory.

You should see something like this :

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = quotescrap.settings
shell = ipython 

[deploy]
url = http://localhost:6800/
project = quotescrap

Then run the following command in your Scrapy project's root directory:

scrapyd-deploy default

This will then eggify your Scrapy project and deploy it to your locally running Scrapyd server. You should get a result like this in your terminal if it was successful:

Packing version 1702212619
Deploying to project "quotescrap" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "benjamin.local", "status": "ok", "project": "quotescrap", "version": "1702212619", "spiders": 1}

Now your Scrapy project has been deployed to your Scrapyd server and is ready to be run. You can open the Scrapyd web GUI on port 6800 of your local machine and schedule a job as described in the doc; you will then see it appear in the Jobs section of the GUI.
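To schedule a run from the command line instead of the GUI, a request to Scrapyd's schedule.json endpoint should do the trick (using the project and spider names we defined earlier):

curl http://localhost:6800/schedule.json -d project=quotescrap -d spider=quotespider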

You can also set up a Scrapyd cluster on Heroku with this super GitHub project and deploy and run distributed spiders with a GUI.

Congrats, you have now set up your first job on a separate server 🥳

Monitoring

Follow this tutorial to install the ScrapeOps SDK and synchronize your script, then launch a job: you will see your spider appear on the dashboard.

You can also enjoy the nice ScrapeOps GUI features, like scheduling jobs and viewing some KPIs 😎