🤖 Handling robots.txt in Web Scraping with Python
When scraping websites, one of the first and most important things you should do is check the site's robots.txt file. This file tells web crawlers what parts of the site are allowed or disallowed for automated access.
Beautiful Soup and requests do not automatically respect robots.txt, so you must handle it manually.
📄 What is robots.txt?
robots.txt is a simple text file hosted at the root of a domain (e.g., https://example.com/robots.txt) that defines rules for web crawlers about what URLs can or cannot be accessed.
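For illustration, a small robots.txt with hypothetical rules might look like this:

```text
User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot
Crawl-delay: 5
```

Each `User-agent` block names a crawler (`*` means all), and `Disallow` lines list paths that crawler should not fetch.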
How to Check robots.txt in Python
Use Python's built-in urllib.robotparser to parse and check access permissions.
🔧 Example: Basic Usage
```python
import urllib.robotparser
from urllib.parse import urlparse

def is_allowed(url, user_agent='*'):
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(base_url)
    rp.read()  # fetches and parses the robots.txt file
    return rp.can_fetch(user_agent, url)

# Usage
url = "http://quotes.toscrape.com/page/1/"
if is_allowed(url):
    print("✅ Allowed to scrape.")
else:
    print("🚫 Not allowed to scrape.")
```
🧪 What if robots.txt is Missing?
Some websites don't have a robots.txt at all, or it may return a 404 Not Found. In that case, there are no explicit rules, but that does not mean free access.
🧭 Proceed Cautiously
You can detect this case and make a judgment call:
```python
import requests
from urllib.parse import urlparse

def check_robots_txt(url):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    try:
        response = requests.get(robots_url, timeout=5)
        if response.status_code == 200:
            print("✅ robots.txt found.")
            return response.text
        elif response.status_code == 404:
            print("⚠️ robots.txt not found; proceed cautiously.")
            return None
        else:
            print(f"⚠️ Unexpected status {response.status_code}.")
            return None
    except requests.RequestException as e:
        print(f"❌ Error fetching robots.txt: {e}")
        return None
```
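One way (a sketch, with a hypothetical helper name) to tie the two steps together is to feed whatever the fetch returned into `RobotFileParser`, treating a missing file as "no explicit rules":

```python
import urllib.robotparser

def allowed_or_unknown(robots_text, url, user_agent="*"):
    """Decide access from an already-fetched robots.txt body.

    robots_text is the file's text, or None if it was missing (404).
    """
    if robots_text is None:
        # No robots.txt: nothing explicitly forbids access, but stay cautious.
        return True
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp.can_fetch(user_agent, url)
```

This keeps the network fetch and the policy decision separate, which also makes the decision logic easy to test offline.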
🧱 Ethical Scraping Practices
Even when allowed by robots.txt, you should follow best practices to avoid overwhelming servers or violating terms of service.
| Practice | How to Implement |
|---|---|
| Respect robots.txt | Always check before scraping |
| Add request delays | `time.sleep()` of 1–5 seconds between requests |
| Identify yourself | Use a custom User-Agent string |
| Avoid login-protected data | Only scrape publicly accessible info |
| Limit data collected | Don't scrape entire sites blindly |
| Cache scraped data | Prevent repeated hits on the same URL |
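The delay and caching rows above can be sketched in one small helper (the name `polite_get` and the plain-dict cache are my own choices, fine for a single run):

```python
import time
import requests

_cache = {}  # simple in-memory cache: url -> response text

def polite_get(url, delay=2.0,
               user_agent="MyScraperBot/1.0 (contact@example.com)"):
    """Fetch a URL with a per-request delay, caching results to avoid repeat hits."""
    if url in _cache:
        return _cache[url]  # cached: no network hit, no delay needed
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    response.raise_for_status()
    _cache[url] = response.text
    time.sleep(delay)  # be gentle: pause before the caller's next request
    return response.text
```

For long-running scrapers you would want an on-disk cache with expiry instead of a dict, but the shape is the same.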
🧾 Example with Headers & Delay
```python
import time
import requests

headers = {
    "User-Agent": "MyScraperBot/1.0 (contact@example.com)"
}

response = requests.get("http://quotes.toscrape.com", headers=headers)
time.sleep(2)  # wait 2 seconds before the next request
```
❓ Can I Ignore robots.txt?
Technically, yes (Python will not stop you), but:
- Ethically, you shouldn't.
- Legally, scraping protected or copyrighted content may expose you to legal consequences.
- Many websites track and block bad bots.
⚠️ Always respect the site's terms of use.
✅ Summary
| Situation | Action |
|---|---|
| robots.txt disallows access | ❌ Don't scrape |
| robots.txt allows access | ✅ You can scrape |
| robots.txt is missing | ⚠️ Proceed cautiously and ethically |
Before you scrape, check first. Scraping responsibly helps keep the web healthy and open for everyone.