
๐Ÿ›ก๏ธ Avoiding Bans While Scraping

Web scraping can trigger anti-bot protections if you're not careful. To avoid getting blocked or banned, follow these best practices:


🧠 Why You Might Get Blocked

Websites often use techniques to detect and block bots:

  • Sending too many requests in a short time
  • Using default library headers (Python's requests identifies itself by default; see the snippet after this list)
  • Not following robots.txt
  • Not executing the JavaScript that many anti-bot checks rely on
  • Accessing login-restricted or sensitive areas
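
To see the default-header problem concretely, the short snippet below prints the headers requests sends when you set none yourself; the User-Agent openly names the library, which many sites flag immediately.

import requests

# Headers sent by requests when you don't supply your own.
# The default User-Agent looks like "python-requests/2.x",
# an obvious bot signal to most anti-scraping systems.
print(requests.utils.default_headers())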

✅ Key Anti-Ban Techniques

1. 🔁 Rotate User-Agent Headers

User-Agent strings identify the browser/device making the request. Using the same one repeatedly (especially Python's default) is a red flag.

import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_3) AppleWebKit/605.1.15 Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/117.0",
    # Add more real user-agents
]

headers = {
    "User-Agent": random.choice(user_agents)
}

2. โณ Use Random Delays Between Requests

Fixed delays are predictable. Random delays simulate human browsing behavior.

import time
import random

delay = random.uniform(2, 5)
print(f"Sleeping for {delay:.2f} seconds")
time.sleep(delay)

Apply a delay like this after every request or interaction.


🧪 Combined Example with requests + BeautifulSoup

import time
import random
import requests
from bs4 import BeautifulSoup

urls = ["https://example.com/page1", "https://example.com/page2"]

user_agents = [
    "Mozilla/5.0 ...",
    "Mozilla/5.0 ...",
]

for url in urls:
    headers = {
        "User-Agent": random.choice(user_agents)
    }

    # A timeout keeps one stalled request from hanging the whole crawl.
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Parse content here...

    delay = random.uniform(2, 4)
    time.sleep(delay)

🧪 Example with Selenium

You can also rotate user agents in Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import random

user_agents = [
    "Mozilla/5.0 ...",
    "Mozilla/5.0 ..."
]

options = Options()
options.add_argument(f"user-agent={random.choice(user_agents)}")
# The Options.headless attribute is deprecated in Selenium 4; pass the flag instead
# (use plain "--headless" on older Chrome versions).
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")

Add delays with time.sleep() between actions as needed.
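
For example, here is a minimal sketch that reuses the driver created above and pauses a random interval between page visits (the URLs are placeholders):

import time
import random

pages = ["https://example.com/page1", "https://example.com/page2"]

for page in pages:
    driver.get(page)
    # ...extract what you need from driver.page_source here...
    time.sleep(random.uniform(2, 5))  # random pause before the next page

driver.quit()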


🧾 Other Useful Techniques

  • ✅ Respect robots.txt: always check whether scraping a page is allowed
  • 🌐 Use rotating proxies: helps get around IP-based rate limits and bans
  • 🔁 Retry logic: use try/except blocks to handle transient failures (see the sketch after this list)
  • 🛑 Rate limiting: never hammer a site; back off exponentially between retries
  • 🧑‍💻 Custom headers: mimic browser behavior more closely

โš ๏ธ Things to Avoid

  • โŒ Scraping login-required pages without permission
  • โŒ Running multiple threads with no throttling
  • โŒ Using scraping tools without setting headers
  • โŒ Ignoring robots.txt or site TOS

📦 Libraries That Help

  • fake-useragent: random, browser-like User-Agent strings
  • requests: simple static HTML scraping
  • Selenium: scraping JavaScript-rendered, dynamic content
  • cfscrape: Cloudflare bypass (largely unmaintained; cloudscraper is a common successor)
  • scrapy-rotating-proxies: proxy rotation in Scrapy
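
For instance, fake-useragent can replace the hand-maintained list from earlier; a minimal sketch (install it first with pip install fake-useragent):

from fake_useragent import UserAgent

ua = UserAgent()

headers = {
    "User-Agent": ua.random  # a different realistic browser string on each access
}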

๐Ÿ” Always Scrape Responsibly

Even if you can scrape a site, always ask:

  • Should you?
  • What data are you collecting?
  • Are you putting unnecessary load on the server?

Always identify yourself in your user agent or headers if you're doing heavy scraping for research or business.
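
A simple way to do that is a descriptive User-Agent; the project name, URL, and contact address below are placeholders:

headers = {
    # Identify the bot and give the site a way to reach you (placeholder values).
    "User-Agent": "MyResearchBot/1.0 (+https://example.com/bot-info; contact@example.com)"
}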


๐Ÿ Scrapy: Avoiding Bans & Being Polite

Scrapy includes built-in tools and middlewares to help you scrape responsibly and avoid getting banned.

1. Download Delay & Randomization

In your settings.py:

DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

With randomization enabled, Scrapy waits a random interval between 0.5x and 1.5x of DOWNLOAD_DELAY, rather than a fixed, predictable pause.

2. Rotate User Agents

Use the scrapy-user-agents middleware:

Install:

pip install scrapy-user-agents

Enable it in settings.py, disabling Scrapy's built-in UserAgentMiddleware so it doesn't override the rotated header:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

Alternatively, write a custom middleware to rotate user agents.
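
A minimal sketch of such a custom downloader middleware (the class name and UA list are up to you; register it in DOWNLOADER_MIDDLEWARES just like the middleware above):

import random

class RotateUserAgentMiddleware:
    """Set a random User-Agent on every outgoing request."""

    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) Firefox/117.0",
        # Add more real user-agents
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)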


3. Enable AutoThrottle

AutoThrottle adjusts the crawl speed dynamically based on how quickly the server responds, slowing down when latency rises.

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

4. Respect robots.txt

Make sure this is enabled:

ROBOTSTXT_OBEY = True

5. Retry & Proxy Middleware

Enable retrying failed requests:

RETRY_ENABLED = True
RETRY_TIMES = 5

Use rotating proxies if needed.
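
If you go that route, the scrapy-rotating-proxies package mentioned earlier is configured roughly like this (the proxy addresses are placeholders; install with pip install scrapy-rotating-proxies):

ROTATING_PROXY_LIST = [
    'proxy1.example.com:8000',
    'proxy2.example.com:8031',
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}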


Example settings.py snippet

BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

ROBOTSTXT_OBEY = True

DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

RETRY_ENABLED = True
RETRY_TIMES = 5

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}


Scrape smart. Scrape safe. Scrape ethically. 🕵️‍♂️