How to Use Scrapy to Scan Websites for a Phrase


(Beginner-friendly, step-by-step)


Step 0 — What We’re Doing

We’ll install Scrapy on your own computer so you can:

  • Crawl a list of websites you choose,
  • Look for a specific phrase (and variants) in the page text,
  • Export the results into a CSV for analysis.

Scrapy runs locally — you keep all the data, no cloud account needed.


Step 1 — Install Python

  1. Go to https://www.python.org/downloads/.
  2. Download and install the latest Python 3.x for your operating system.
  3. During install, check the box: “Add Python to PATH”.

Step 2 — Open a Terminal

  • Windows: press Win + S, then type “cmd” (Command Prompt) or “PowerShell” and press Enter.
  • Mac: Open Terminal from Applications → Utilities.
  • Linux: Open your system terminal.

Step 3 — Create a Folder for Your Project

mkdir phrase_scan
cd phrase_scan

Step 4 — Create a Virtual Environment

This keeps your Scrapy install separate from other Python tools.

python -m venv .venv

Activate it:

  • Windows: .venv\Scripts\activate
  • Mac/Linux: source .venv/bin/activate

Step 5 — Install Scrapy

pip install --upgrade pip
pip install scrapy readability-lxml

(readability-lxml is used in Step 7 to pull out each page’s main text, skipping menus and footers.)

Step 6 — Start a Scrapy Project

scrapy startproject beliefscan
cd beliefscan

This creates the beliefscan project folder with the right structure.


Step 7 — Create Your Spider

Make a file:

beliefscan/spiders/beliefs_spider.py

Paste this code:

import re
import scrapy
from readability import Document

PHRASE = "fear of the lord"               # change to any phrase
NEAR = [r"fear of god", r"reverent awe"]  # optional variants

LIKELY_PATHS = [
    "/beliefs", "/our-beliefs", "/what-we-believe", "/statement-of-faith",
    "/faith", "/doctrine", "/about/beliefs", "/about/faith", "/core-beliefs", "/we-believe"
]

class BeliefsSpider(scrapy.Spider):
    name = "beliefs"
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "DOWNLOAD_DELAY": 0.4,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "FEED_EXPORT_ENCODING": "utf-8"
    }

    def start_requests(self):
        # Read seed domains (one per line; blank lines and "#" comment lines are skipped)
        with open("seeds_domains.txt", encoding="utf-8") as f:
            domains = [line.strip() for line in f if line.strip() and not line.lstrip().startswith("#")]
        for d in domains:
            base = d if d.startswith("http") else f"https://{d}"
            # Try the likely belief-page paths directly...
            for p in LIKELY_PATHS:
                yield scrapy.Request(url=base.rstrip("/") + p, callback=self.parse_page)
            # ...and also scan the homepage for belief-related links
            yield scrapy.Request(url=base, callback=self.parse_home)

    def parse_home(self, response):
        hints = ("belief", "faith", "doctrine", "what-we-believe", "statement", "our-beliefs", "core-beliefs")
        for a in response.css("a::attr(href)").getall():
            if any(h in a.lower() for h in hints):
                yield scrapy.Request(response.urljoin(a.split("#")[0]), callback=self.parse_page)

    def parse_page(self, response):
        try:
            from parsel import Selector  # ships with Scrapy
            # Use readability to isolate the main content, dropping menus and footers
            html = Document(response.text).summary(html_partial=True)
            sel = Selector(text=html)
            text = re.sub(r"\s+", " ", sel.xpath("string()").get() or "").strip()
        except Exception:
            # Fall back to the full <body> text if readability fails on this page
            text = re.sub(r"\s+", " ", response.xpath("string(//body)").get() or "").strip()

        title = (response.css("title::text").get() or "").strip()

        def hits(pattern, where):
            return len(re.findall(pattern, where, flags=re.I))

        exact = hits(re.escape(PHRASE), title) + hits(re.escape(PHRASE), text)
        near = sum(hits(p, title) + hits(p, text) for p in NEAR)

        heading = 1 if re.search(re.escape(PHRASE), " ".join(response.css("h1::text, h2::text, h3::text").getall()), re.I) else 0
        early = 1 if re.search(re.escape(PHRASE), " ".join(text.split()[:300]), re.I) else 0

        total = exact + near
        if total == 0:
            ei = 0
        elif total == 1:
            ei = 1 + heading + early
        elif total <= 3:
            ei = 2 + heading + early
        elif total <= 6:
            ei = 3 + heading + early
        else:
            ei = 4 + heading + early
        ei = min(5, ei)

        yield {
            "domain": response.url.split("/")[2],
            "url": response.url,
            "title": title,
            "exact_hits": exact,
            "near_hits": near,
            "heading_flag": heading,
            "early_flag": early,
            "EI": ei
        }
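If you want to sanity-check the emphasis-index scoring before running a crawl, the bucket logic above can be reproduced as a tiny standalone function (the function name here is just for illustration, not part of the spider):

```python
def ei_score(exact: int, near: int, heading: int, early: int) -> int:
    """Mirror of the spider's emphasis-index buckets, capped at 5."""
    total = exact + near
    if total == 0:
        return 0
    if total == 1:
        base = 1
    elif total <= 3:
        base = 2
    elif total <= 6:
        base = 3
    else:
        base = 4
    return min(5, base + heading + early)

print(ei_score(0, 0, 0, 0))  # no mentions at all -> 0
print(ei_score(2, 1, 1, 1))  # 3 total hits, in a heading and early -> 4
```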

Step 8 — Add Your Site List

In the top-level project folder (the one containing scrapy.cfg, where you will run the crawl), create a file called:

seeds_domains.txt

Add one domain per line:

examplechurch.org
anotherchurch.com
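If you prefer the terminal, the same file can be created in one line (the domains here are placeholders, swap in your own):

```shell
printf 'examplechurch.org\nanotherchurch.com\n' > seeds_domains.txt
cat seeds_domains.txt
```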

Step 9 — Run the Spider

In your terminal (inside the project folder):

scrapy crawl beliefs -O results.csv
  • -O results.csv saves the output to a CSV file (capital -O overwrites an existing file; lowercase -o would append to it).

Step 10 — View Your Results

  • Open results.csv in Excel or Google Sheets.
  • Columns: domain, url, title, exact_hits, near_hits, heading_flag, early_flag, EI
  • Sort by EI to see which sites emphasize the phrase most.
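If you would rather skim the results in the terminal than open a spreadsheet, a short standard-library script can rank pages by EI (it assumes the results.csv produced in Step 9 is in the current folder):

```python
import csv

def top_pages(path, n=10):
    """Return the n highest-EI rows from the results CSV, as dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # EI is read from CSV as a string, so convert it for sorting
    return sorted(rows, key=lambda r: int(r["EI"]), reverse=True)[:n]

# Example, once results.csv exists:
# for row in top_pages("results.csv"):
#     print(row["EI"], row["domain"], row["url"])
```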

Step 11 — Change the Phrase

In beliefs_spider.py:

PHRASE = "your new phrase"
NEAR = ["variant one", "variant two"]

Save the file and run the crawl again.
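The NEAR entries are treated as regular expressions and matched case-insensitively, so you can quickly test a variant against sample text before committing to a full crawl:

```python
import re

NEAR = [r"fear of god", r"reverent awe"]   # same format as in the spider
sample = "We teach reverent awe and the Fear of God in all things."

for pattern in NEAR:
    hits = len(re.findall(pattern, sample, flags=re.I))
    print(pattern, "->", hits)  # each pattern matches once in this sample
```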


Tips

  • Keep your seeds list short to start (5–10 sites) while testing.
  • You can increase DOWNLOAD_DELAY if sites block you.
  • The spider sets ROBOTSTXT_OBEY to True, so it respects each site’s robots.txt rules.
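For example, if sites start throttling you, you could make the spider gentler by editing its custom_settings in beliefs_spider.py (the values below are illustrative, not tuned):

```python
custom_settings = {
    "ROBOTSTXT_OBEY": True,
    "DOWNLOAD_DELAY": 2.0,                # wait longer between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per site
    "AUTOTHROTTLE_ENABLED": True,         # back off automatically when a server slows down
    "FEED_EXPORT_ENCODING": "utf-8",
}
```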