πŸ•·οΈ Scrapy Tutorial

Scrapy is a fast, open-source web crawling and scraping framework written in Python. It allows you to extract structured data from websites and is ideal for building large-scale web scrapers.


πŸš€ Why Use Scrapy?

  • Asynchronous and high-performance
  • Handles crawling, scraping, and data pipelines
  • Supports retries, throttling, and logging out of the box
  • Can export data to JSON, CSV, XML, or databases

πŸ› οΈ Installation

Install Scrapy using pip:

pip install scrapy

To verify:

scrapy --version

πŸ“ Creating a Scrapy Project

scrapy startproject quotes_scraper
cd quotes_scraper

This creates the following structure:

quotes_scraper/
β”œβ”€β”€ scrapy.cfg
└── quotes_scraper/
    β”œβ”€β”€ __init__.py
    β”œβ”€β”€ items.py
    β”œβ”€β”€ middlewares.py
    β”œβ”€β”€ pipelines.py
    β”œβ”€β”€ settings.py
    └── spiders/

🧠 Writing Your First Spider

Create a file in spiders/ called quotes_spider.py:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").get(),
                'author': quote.css("small.author::text").get(),
                'tags': quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

▢️ Running the Spider

From the root directory of your project:

scrapy crawl quotes

πŸ“€ Exporting Data

You can save the scraped data to a file. The -O flag overwrites the output file on each run (use lowercase -o to append instead):

scrapy crawl quotes -O quotes.json
scrapy crawl quotes -O quotes.csv

🧱 Scrapy Project Structure Overview

  • items.py – Define data models
  • pipelines.py – Clean/process data
  • middlewares.py – Modify requests/responses
  • settings.py – Configuration (user agents, delays, throttling, pipelines)
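As a sketch, a pipeline in pipelines.py is just a class with a process_item method. This hypothetical one normalizes whitespace in each item's text field:

```python
class NormalizeTextPipeline:
    """Strip leading/trailing whitespace from the 'text' field of each item."""

    def process_item(self, item, spider):
        # Items behave like dicts; return the (possibly modified) item
        # to pass it on to the next pipeline stage.
        if item.get("text"):
            item["text"] = item["text"].strip()
        return item
```

To activate it, register it in settings.py, e.g. `ITEM_PIPELINES = {"quotes_scraper.pipelines.NormalizeTextPipeline": 300}` (lower numbers run first).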

🧩 Useful Features

  • Request throttling: Respect site load using the DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED settings.
  • User-Agent spoofing: Pretend to be a real browser.
  • Login support: Handle login forms using FormRequest.
  • Middleware: Customize request/response behavior globally.
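Most of these features are plain settings in settings.py. A sketch of a polite configuration (the values are illustrative, not recommendations):

```python
# settings.py -- illustrative values
ROBOTSTXT_OBEY = True        # respect robots.txt
DOWNLOAD_DELAY = 1.0         # seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True  # adapt the delay to server response times
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
USER_AGENT = "quotes_scraper (+https://example.com/contact)"  # identify your bot
```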

⚠️ Tips & Ethics

  • Respect robots.txt
  • Don't hammer servers β€” use delays and rate limits
  • Never scrape private or copyrighted content without permission

πŸ“š Learn More


Ready to level up your scraping game? Try building spiders that:

  • Login and maintain sessions
  • Follow multiple link levels
  • Scrape multiple websites in one project

Happy crawling! πŸ•ΈοΈ