(Beginner-friendly, step-by-step)
Step 0 — What We’re Doing
We’ll install Scrapy on your own computer so you can:
- Crawl a list of websites you choose,
- Look for a specific phrase (and variants) in the page text,
- Export the results into a CSV for analysis.
Scrapy runs locally — you keep all the data, no cloud account needed.
Step 1 — Install Python
- Go to https://www.python.org/downloads/.
- Download and install the latest Python 3.x for your operating system.
- During install, check the box: “Add Python to PATH”.
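To confirm the install worked, you can ask Python itself for its version. This tiny snippet is just a sanity check, not part of the project; it should report a 3.x version:

```python
import sys

# Quick sanity check that a Python 3 interpreter is running.
major, minor = sys.version_info[:2]
print(f"Python {major}.{minor}")
assert major == 3, "Expected Python 3 — re-run the installer if this fails"
```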
Step 2 — Open a Terminal
- Windows: Press Win + S, type cmd (Command Prompt) or “PowerShell”.
- Mac: Open Terminal from Applications → Utilities.
- Linux: Open your system terminal.
Step 3 — Create a Folder for Your Project
mkdir phrase_scan
cd phrase_scan
Step 4 — Create a Virtual Environment
This keeps your Scrapy install separate from other Python tools.
python -m venv .venv
Activate it:
- Windows:
.venv\Scripts\activate
- Mac/Linux:
source .venv/bin/activate
Step 5 — Install Scrapy
pip install --upgrade pip
pip install scrapy readability-lxml
Step 6 — Start a Scrapy Project
scrapy startproject beliefscan
cd beliefscan
This creates the beliefscan project folder with the right structure.
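For reference, the generated project should look roughly like this (exact files can vary slightly between Scrapy versions):

```
beliefscan/
├── scrapy.cfg          # project config — run crawls from this folder
└── beliefscan/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```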
Step 7 — Create Your Spider
Make a file:
beliefscan/spiders/beliefs_spider.py
Paste this code:
import re

import scrapy
from parsel import Selector
from readability import Document

PHRASE = "fear of the lord"               # change to any phrase
NEAR = [r"fear of god", r"reverent awe"]  # optional variants (regex patterns)

LIKELY_PATHS = [
    "/beliefs", "/our-beliefs", "/what-we-believe", "/statement-of-faith",
    "/faith", "/doctrine", "/about/beliefs", "/about/faith", "/core-beliefs", "/we-believe",
]


class BeliefsSpider(scrapy.Spider):
    name = "beliefs"
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "DOWNLOAD_DELAY": 0.4,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "FEED_EXPORT_ENCODING": "utf-8",
    }

    def start_requests(self):
        # Read domains to check, skipping blank lines and "#" comments
        with open("seeds_domains.txt", encoding="utf-8") as f:
            domains = [l.strip() for l in f if l.strip() and not l.startswith("#")]
        for d in domains:
            base = d if d.startswith("http") else f"https://{d}"
            for p in LIKELY_PATHS:
                yield scrapy.Request(url=base.rstrip("/") + p, callback=self.parse_page)
            yield scrapy.Request(url=base, callback=self.parse_home)

    def parse_home(self, response):
        # Follow links from the home page whose URLs hint at a beliefs page
        hints = ("belief", "faith", "doctrine", "what-we-believe",
                 "statement", "our-beliefs", "core-beliefs")
        for a in response.css("a::attr(href)").getall():
            if any(h in a.lower() for h in hints):
                yield scrapy.Request(response.urljoin(a.split("#")[0]),
                                     callback=self.parse_page)

    def parse_page(self, response):
        try:
            # Extract the main article text with readability
            doc = Document(response.text)
            html = doc.summary(html_partial=True)
            sel = Selector(text=html)
            text = re.sub(r"\s+", " ", sel.xpath("string()").get() or "").strip()
        except Exception:
            # Fall back to the full <body> text
            text = re.sub(r"\s+", " ", response.xpath("string(//body)").get() or "").strip()
        title = (response.css("title::text").get() or "").strip()

        def hits(pattern, where):
            return len(re.findall(pattern, where, flags=re.I))

        exact = hits(re.escape(PHRASE), title) + hits(re.escape(PHRASE), text)
        near = sum(hits(p, title) + hits(p, text) for p in NEAR)
        headings = " ".join(response.css("h1::text, h2::text, h3::text").getall())
        heading = 1 if re.search(re.escape(PHRASE), headings, re.I) else 0
        early = 1 if re.search(re.escape(PHRASE), " ".join(text.split()[:300]), re.I) else 0

        # Emphasis Index (EI): 0–5 score from hit count plus placement bonuses
        total = exact + near
        if total == 0:
            ei = 0
        elif total == 1:
            ei = 1 + heading + early
        elif total <= 3:
            ei = 2 + heading + early
        elif total <= 6:
            ei = 3 + heading + early
        else:
            ei = 4 + heading + early
        ei = min(5, ei)

        yield {
            "domain": response.url.split("/")[2],
            "url": response.url,
            "title": title,
            "exact_hits": exact,
            "near_hits": near,
            "heading_flag": heading,
            "early_flag": early,
            "EI": ei,
        }
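If you want a feel for how the EI (Emphasis Index) score behaves before crawling anything, the tiered scoring from parse_page can be run on its own. This standalone sketch mirrors the same tiers; the function name score_ei is just for illustration:

```python
def score_ei(total_hits, heading_flag=0, early_flag=0):
    """Map a hit count plus placement bonuses to a 0-5 Emphasis Index."""
    if total_hits == 0:
        return 0
    if total_hits == 1:
        base = 1
    elif total_hits <= 3:
        base = 2
    elif total_hits <= 6:
        base = 3
    else:
        base = 4
    # Placement bonuses: +1 if the phrase appears in a heading,
    # +1 if it appears in the first ~300 words; capped at 5.
    return min(5, base + heading_flag + early_flag)

# A page with 2 hits and the phrase in a heading:
print(score_ei(2, heading_flag=1))                 # → 3
# A page saturated with hits, heading + early bonuses, capped:
print(score_ei(10, heading_flag=1, early_flag=1))  # → 5
```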
Step 8 — Add Your Site List
In the folder you run the crawl from (the project folder containing scrapy.cfg), create a file called:
seeds_domains.txt
Add one domain per line:
examplechurch.org
anotherchurch.com
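The spider reads this file with a small filter that drops blank lines and lines starting with #, so you can leave comments in your seed list. Here is that same parsing logic in isolation (the sample file content is made up):

```python
# Same filtering the spider applies to seeds_domains.txt:
# strip whitespace, skip blank lines and "#" comment lines.
raw = """\
examplechurch.org

# a comment line, ignored
anotherchurch.com
"""

domains = [l.strip() for l in raw.splitlines() if l.strip() and not l.startswith("#")]
print(domains)  # → ['examplechurch.org', 'anotherchurch.com']
```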
Step 9 — Run the Spider
In your terminal (inside the project folder):
scrapy crawl beliefs -O results.csv
The -O results.csv flag saves the output to a CSV file, overwriting it if it already exists (lowercase -o appends instead).
Step 10 — View Your Results
- Open results.csv in Excel or Google Sheets.
- Columns: domain, url, title, exact_hits, near_hits, heading_flag, early_flag, EI
- Sort by EI to see which sites emphasize the phrase most.
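If you'd rather inspect results in Python than in a spreadsheet, the standard csv module can sort rows by EI. This sketch assumes a results.csv with the columns above; the sample rows are invented:

```python
import csv
import io

# Stand-in for results.csv; a real file has the same header row.
sample = io.StringIO(
    "domain,url,title,exact_hits,near_hits,heading_flag,early_flag,EI\n"
    "examplechurch.org,https://examplechurch.org/beliefs,Beliefs,3,1,1,1,5\n"
    "anotherchurch.com,https://anotherchurch.com/faith,Faith,1,0,0,0,1\n"
)

rows = list(csv.DictReader(sample))
rows.sort(key=lambda r: int(r["EI"]), reverse=True)  # highest emphasis first
for r in rows:
    print(r["domain"], r["EI"])
```

To read a real file, replace the io.StringIO stand-in with open("results.csv", newline="", encoding="utf-8").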
Step 11 — Change the Phrase
In beliefs_spider.py:
PHRASE = "your new phrase"
NEAR = ["variant one", "variant two"]
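Note that the NEAR entries are regular expressions (the exact phrase gets escaped; the variants do not), and all matching is case-insensitive. A quick check with the re module shows how that plays out on a sample sentence (the text is invented):

```python
import re

PHRASE = "fear of the lord"
NEAR = [r"fear of god", r"reverent awe"]

text = "We teach the Fear of the Lord and a reverent awe of God."

# re.escape() makes any punctuation in the phrase match literally;
# flags=re.I gives case-insensitive matching, as in the spider.
exact = len(re.findall(re.escape(PHRASE), text, flags=re.I))
near = sum(len(re.findall(p, text, flags=re.I)) for p in NEAR)
print(exact, near)  # → 1 1  ("reverent awe" matches; "fear of god" does not)
```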
Save the file, run the crawl again.
Tips
- Keep your seeds list short to start (5–10 sites) while testing.
- You can increase DOWNLOAD_DELAY if sites block you.
- Scrapy obeys robots.txt by default, so it skips pages a site asks crawlers to avoid.