
πŸ€– Handling robots.txt in Web Scraping with Python

When scraping websites, one of the first and most important things you should do is check the site's robots.txt file. This file tells web crawlers what parts of the site are allowed or disallowed for automated access.

Beautiful Soup and requests do not automatically respect robots.txt, so you must handle it manually.


πŸ“„ What is robots.txt?

robots.txt is a simple text file hosted at the root of a domain (e.g., https://example.com/robots.txt) that defines rules for web crawlers about what URLs can or cannot be accessed.

Example:

User-agent: *
Disallow: /admin/
Allow: /
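
These rules block every crawler from anything under /admin/ and allow the rest of the site. You can verify that reading by feeding the rules directly to urllib.robotparser (covered in the next section); a quick sketch:

import urllib.robotparser

rules = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())  # parse rules from a string instead of fetching a URL

print(rp.can_fetch("*", "https://example.com/admin/settings"))  # False: under /admin/
print(rp.can_fetch("*", "https://example.com/about"))           # True: covered by Allow: /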

βœ… How to Check robots.txt in Python

Use Python’s built-in urllib.robotparser to parse and check access permissions.

πŸ”§ Example: Basic Usage

import urllib.robotparser
from urllib.parse import urlparse

def is_allowed(url, user_agent='*'):
    # Build the robots.txt URL for the target site's domain.
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    # Download and parse the rules, then ask whether this URL may be fetched.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    return rp.can_fetch(user_agent, url)

# Usage
url = "http://quotes.toscrape.com/page/1/"
if is_allowed(url):
    print("βœ… Allowed to scrape.")
else:
    print("🚫 Not allowed to scrape.")

πŸ§ͺ What if robots.txt is Missing?

Some websites don’t serve a robots.txt at all, so the request returns 404 Not Found. In that case there are no explicit rules, but that does not mean you have free access.

🧠 Proceed Cautiously

You can detect this case and make a judgment call:

import requests
from urllib.parse import urlparse

def check_robots_txt(url):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    try:
        response = requests.get(robots_url, timeout=5)
        if response.status_code == 200:
            print("βœ… robots.txt found.")
            return response.text
        elif response.status_code == 404:
            print("⚠️ robots.txt not found β€” proceed cautiously.")
            return None
        else:
            print(f"⚠️ Unexpected status {response.status_code}.")
            return None
    except Exception as e:
        print(f"❌ Error fetching robots.txt: {e}")
        return None
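
A quick usage sketch: if the helper returns text, it can be fed straight into RobotFileParser.parse(), so robots.txt is only downloaded once:

import urllib.robotparser

rules_text = check_robots_txt("http://quotes.toscrape.com/page/1/")

if rules_text is not None:
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules_text.splitlines())  # reuse the text we already downloaded
    print(rp.can_fetch("*", "http://quotes.toscrape.com/page/1/"))
else:
    # No robots.txt: no explicit rules, but still scrape politely.
    print("No rules found: apply conservative defaults (delays, low volume).")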

🧱 Ethical Scraping Practices

Even when allowed by robots.txt, you should follow best practices to avoid overwhelming servers or violating terms of service.

  • Respect robots.txt: always check it before scraping.
  • Add request delays: pause 1-5 seconds between requests with time.sleep (a Crawl-delay sketch follows this list).
  • Identify yourself: use a custom User-Agent string.
  • Avoid login-protected data: only scrape publicly accessible information.
  • Limit the data you collect: don't scrape entire sites blindly.
  • Cache scraped data: avoid repeated hits on the same URL (see the caching sketch after the headers example below).
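
Some sites also publish a Crawl-delay directive in robots.txt. urllib.robotparser can read it via crawl_delay() (Python 3.6+); a minimal sketch, where the 2-second fallback is our own assumption rather than anything the site specifies:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://quotes.toscrape.com/robots.txt")
rp.read()

# crawl_delay() returns the Crawl-delay value for the given user agent,
# or None if the site doesn't declare one.
delay = rp.crawl_delay("*") or 2  # assumed fallback of 2 seconds
print(f"Sleeping {delay} seconds between requests")
time.sleep(delay)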

🧾 Example with Headers & Delay

import time
import requests

headers = {
    # Identify your bot and give site owners a way to contact you.
    "User-Agent": "MyScraperBot/1.0 (contact@example.com)"
}

for page in range(1, 3):
    response = requests.get(f"http://quotes.toscrape.com/page/{page}/", headers=headers)
    print(response.status_code, response.url)
    time.sleep(2)  # wait 2 seconds before the next request
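
To illustrate the caching practice above, here is a minimal in-memory cache built on a plain dict (the fetch_once helper is hypothetical; a real project might persist responses to disk or use a dedicated caching library):

import requests

_cache = {}  # maps URL -> response body for the lifetime of the process

def fetch_once(url, headers=None):
    # Fetch a URL only the first time it is requested; afterwards,
    # serve the stored body instead of hitting the server again.
    if url not in _cache:
        response = requests.get(url, headers=headers, timeout=5)
        response.raise_for_status()
        _cache[url] = response.text
    return _cache[url]

html = fetch_once("http://quotes.toscrape.com/page/1/")
html_again = fetch_once("http://quotes.toscrape.com/page/1/")  # served from the cache, no second request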

❓ Can I Ignore robots.txt?

Technically, yes β€” Python will not stop you β€” but:

  • Ethically, you shouldn't.
  • Legally, scraping protected or copyrighted content may expose you to legal consequences.
  • Many websites track and block bad bots.

⚠️ Always respect the site's terms of use.


βœ… Summary

  • robots.txt disallows access: ❌ don’t scrape.
  • robots.txt allows access: βœ… you can scrape.
  • robots.txt is missing: ⚠️ proceed cautiously and ethically.

Before you scrape, check first. Scraping responsibly helps keep the web healthy and open for everyone.