# Web Scraping with Beautiful Soup
Beautiful Soup is a Python library for parsing HTML and XML documents. It's ideal for beginners and small-to-medium web scraping tasks where you want quick and readable code.
## Why Use Beautiful Soup?
- Simple and beginner-friendly
- Great for static pages with predictable structure
- Integrates easily with `requests`
- Supports multiple parsers (`html.parser`, `lxml`)
## Installation
Install Beautiful Soup along with the requests library:
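Both are available from PyPI; note that the package is installed as `beautifulsoup4` even though it is imported as `bs4`:

```shell
pip install beautifulsoup4 requests
```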
## Basic Example
Let's scrape quotes from http://quotes.toscrape.com:
```python
import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com/"
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')

for quote in soup.find_all('div', class_='quote'):
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]
    print(f"{text} - {author} | Tags: {tags}")
```
## How It Works
- `requests.get()` fetches the webpage content.
- `BeautifulSoup()` parses the HTML.
- `find()` and `find_all()` locate elements by tag and class.
- `.get_text()` extracts the inner text from an element.
## Key Features
| Feature | Description |
|---|---|
| Tag navigation | Traverse HTML tags as objects (`soup.div`) |
| Searching by attributes | `find()`, `find_all()`, `select()` |
| CSS selectors | Use `.select('div.quote span.text')` |
| Parser flexibility | Use `html.parser`, `lxml`, or `html5lib` |
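The CSS-selector route can be sketched like this; to keep the snippet self-contained, it parses a small inline HTML fragment shaped like the quotes page rather than fetching a live URL:

```python
from bs4 import BeautifulSoup

# A minimal fragment mimicking the quote markup used above
html = """
<div class="quote">
  <span class="text">Quote one.</span>
  <small class="author">Author A</small>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# .select() accepts any CSS selector and returns a list of matching tags
texts = [el.get_text() for el in soup.select("div.quote span.text")]
print(texts)  # ['Quote one.']
```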
## Example: Extract All Authors
```python
authors = soup.find_all('small', class_='author')
for author in set(a.get_text() for a in authors):
    print(author)
```
## Exporting Data
You can save data to a CSV file like this:
```python
import csv

with open('quotes.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Quote', 'Author', 'Tags'])
    for quote in soup.find_all('div', class_='quote'):
        text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]
        writer.writerow([text, author, ', '.join(tags)])
```
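To sanity-check the export, you can read the file back with `csv.DictReader`. The sketch below writes a tiny sample file first so it runs on its own; in practice you would read the `quotes.csv` produced above:

```python
import csv

# Tiny sample standing in for the quotes.csv written above
with open('quotes.csv', 'w', newline='') as f:
    csv.writer(f).writerows([
        ['Quote', 'Author', 'Tags'],
        ['A sample quote.', 'Author A', 'life, truth'],
    ])

# DictReader uses the header row as the dict keys
with open('quotes.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row['Author'], '->', row['Quote'])
```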
## When to Use Beautiful Soup
| Scenario | Use Beautiful Soup? |
|---|---|
| Parsing static HTML pages | ✅ Yes |
| Quick one-off scraping scripts | ✅ Yes |
| Pages with simple structure | ✅ Yes |
| Crawling multiple pages or domains | ❌ Use Scrapy |
| JavaScript-rendered content | ❌ Use Selenium |
## Ethics and Best Practices
- Respect the website's `robots.txt`
- Add delays to avoid overwhelming servers
- Avoid scraping personal/private data
- Review the site's terms of service
## Final Thoughts
Beautiful Soup is perfect for getting started with web scraping. It's powerful enough for most everyday scraping tasks and easy to pick up.
Happy scraping!
## Respecting robots.txt with Beautiful Soup
Unlike Scrapy, which has built-in support for handling robots.txt, Beautiful Soup is just an HTML parser and doesn't handle robots.txt for you. You must manually check whether the site allows scraping before you proceed.
Here's how you can do that in Python:
### Step 1: Use the Built-in robotparser
There is nothing to install for this step: Python's standard library already includes the `urllib.robotparser` module.
### Step 2: Check if Scraping Is Allowed
Here's an example that checks whether you're allowed to scrape a page:
```python
import urllib.robotparser
from urllib.parse import urlparse

def is_allowed(url, user_agent='*'):
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(base_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# Example usage
url = "http://quotes.toscrape.com/page/1/"
if is_allowed(url):
    print("Allowed to scrape")
else:
    print("Not allowed to scrape")
```
## Notes
- `robots.txt` is a voluntary standard, but it's widely respected by ethical scrapers.
- Some sites don't have a `robots.txt` at all; in that case, you may choose to proceed cautiously.
- Always identify yourself in your headers if doing heavy scraping.
## Proceeding When robots.txt Is Missing
If a site doesn't have a robots.txt, there are no explicit rules for scraping. That isn't the same as permission, but it also isn't an outright denial.
So when robots.txt is missing, the ethical and practical response is to:
- Proceed cautiously and respectfully
- Follow general good scraping practices (throttling, not scraping sensitive data, etc.)
- Ideally contact the site owner if scraping heavily
### How to Detect If robots.txt Is Missing
You can check if the file exists by making a request and seeing the HTTP status:
```python
import requests
from urllib.parse import urlparse
import urllib.robotparser

def is_allowed(url, user_agent='*'):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    try:
        response = requests.get(robots_url, timeout=5)
        if response.status_code == 200:
            rp = urllib.robotparser.RobotFileParser()
            rp.parse(response.text.splitlines())
            return rp.can_fetch(user_agent, url)
        elif response.status_code == 404:
            print("robots.txt not found - proceed cautiously.")
            return True  # You decide whether to allow this or not
        else:
            print(f"Unexpected response ({response.status_code}) for robots.txt - proceed with caution.")
            return False
    except requests.RequestException as e:
        print(f"Error checking robots.txt: {e}")
        return False
```
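`RobotFileParser` can also parse rules you already have in memory via `parse()`, which is handy for testing your allow/deny logic without any network access. A minimal sketch (the rules below are made up for illustration):

```python
import urllib.robotparser

# Hypothetical rules: block everything under /private/ for all agents
rules = """
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```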
### How to Proceed Cautiously
If robots.txt is not found (404), and you decide to scrape anyway, follow these best practices:
| Guideline | How to Implement in Python |
|---|---|
| Throttle requests | Use `time.sleep()` or `asyncio.sleep()` |
| Set custom user-agent | Use `headers` with `requests` |
| Avoid scraping sensitive data | Only target public, visible info |
| Don't scrape too many pages at once | Limit page depth and pagination |
| Cache data when possible | Store locally to reduce repeat scraping |
| Monitor for changes in access rules | Check robots.txt before each session |
For example, setting a custom user-agent and pausing between requests:

```python
import time
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"
}

response = requests.get("https://example.com", headers=headers)
time.sleep(2)  # wait 2 seconds between requests
```
### Optional: Log That You Proceeded Cautiously
This is useful for auditability in your scraper:
```python
if response.status_code == 404:
    print("robots.txt missing. Proceeding cautiously with delays and limited scraping.")
```
## Final Note
Just because robots.txt is missing doesn't mean you're free to scrape anything, especially:
- Login-required or private pages
- Sites with terms of service prohibiting bots
- Pages with copyright-protected content
When in doubt, limit what you collect and contact the site owner if it's a long-term or heavy-use project.