How to Collect and Analyze Web Pages for Emphasis on Any Chosen Phrase


This guide shows you how to find a set of websites, locate specific content pages, and measure how strongly a chosen phrase is emphasized. The process works for any type of site, topic, or phrase.


1. Overview of the Process

  1. Define your target phrase (or group of phrases).
  2. Collect a list of relevant websites.
  3. Identify pages likely to contain your target phrase.
  4. Extract clean text from those pages.
  5. Scan and score the level of emphasis on your target phrase.
  6. Compile results for analysis.

2. Gathering Websites

A. Directories & Lists

Find directories, associations, or network lists related to your subject area. These often have official listings with website links.

B. Aggregator Sites

Use community-curated databases, resource lists, or public listings that aggregate many sites in one place.

C. Search Engine Queries

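Use Google, Bing, or a search API such as SerpAPI with queries that combine your topic and your chosen phrase. To restrict results to one top-level domain, append a site: filter to the query:

"your topic" "your phrase"
"your topic" "your phrase" site:.org
"your topic" "your phrase" site:.com
"your topic" "your phrase" site:.net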

You can also omit the phrase to find a broader list of sites, then scan them for matches later.
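
The query patterns above can be generated in bulk with a small helper. This is a minimal sketch: the function name build_queries and its parameters are illustrative, not part of any search API, and the resulting strings still need to be pasted into a search engine or passed to whichever search API you use.

```python
from typing import List, Optional, Sequence


def build_queries(
    topic: str,
    phrase: Optional[str] = None,
    tlds: Sequence[str] = ("org", "com", "net"),
) -> List[str]:
    """Return one quoted search query per top-level domain.

    If phrase is omitted, the queries match the topic only, which
    gives a broader list of sites to scan later.
    """
    parts = [f'"{topic}"']
    if phrase:
        parts.append(f'"{phrase}"')
    base = " ".join(parts)
    return [f"{base} site:.{tld}" for tld in tlds]


if __name__ == "__main__":
    for query in build_queries("community gardens", "soil health"):
        print(query)
```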


3. Locating the Right Page

Common approaches:

  • Try likely URL paths related to your topic.
  • Search the site through a search engine with the site: operator, replacing [domain.com] with the actual domain:
site:[domain.com] your phrase
  • Look for menu items or headings that relate to the content area you’re researching.
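
The first two approaches can be sketched in code. The path guesses in CANDIDATE_PATHS are hypothetical examples only; replace them with paths that suit your topic. The helper names are also illustrative.

```python
from typing import List, Sequence
from urllib.parse import urljoin

# Hypothetical path guesses; adjust these to your own subject area.
CANDIDATE_PATHS = ["about", "about-us", "mission", "services", "programs"]


def candidate_urls(base_url: str, paths: Sequence[str] = CANDIDATE_PATHS) -> List[str]:
    """Join each guessed path onto the site root."""
    root = base_url if base_url.endswith("/") else base_url + "/"
    return [urljoin(root, path) for path in paths]


def site_query(domain: str, phrase: str) -> str:
    """Build a site-restricted search query (phrase quoted for an exact match)."""
    return f'site:{domain} "{phrase}"'


if __name__ == "__main__":
    print(candidate_urls("https://example.org"))
    print(site_query("example.org", "soil health"))
```

Guessed URLs will often return 404, so treat them as candidates to check rather than as known pages.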

4. Extracting Text

  • Manual Method: Visit the page and copy text into a spreadsheet or document.
  • Automated Method:
    Use tools or scripts such as:
    • Python: httpx + readability-lxml
    • Playwright: For sites requiring JavaScript rendering
    • Scrapy: For large-scale crawling with filtering rules

When extracting:

  • Keep the main page title, section headings, and body text.
  • Remove navigation menus, sidebars, and footers.
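
The keep/remove rules above can be illustrated with Python's standard-library HTML parser. This is a rough stand-in for readability-lxml, chosen here only so the sketch runs without third-party packages; it is cruder than a real readability extractor. The class name, the SKIP_TAGS set, and the output shape are all assumptions. Fetching the page (for example with httpx or Playwright) is left out.

```python
from html.parser import HTMLParser

# Assumption: these tags hold boilerplate rather than main content.
SKIP_TAGS = {"nav", "aside", "footer", "script", "style", "noscript"}
HEADING_TAGS = {"h1", "h2", "h3"}


class MainTextExtractor(HTMLParser):
    """Collect the title, headings, and body text; drop boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []    # list of (tag, text) pairs
        self.body = []        # body text chunks outside headings
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._current = None  # "title" or a heading tag being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0 and (tag == "title" or tag in HEADING_TAGS):
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == self._current:
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            else:
                self.headings.append((tag, text))
            self._current = None

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._current:
            self._buffer.append(data)
        elif data.strip():
            self.body.append(" ".join(data.split()))


def extract(html: str) -> dict:
    """Return the page title, headings, and body text as a dict."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "headings": parser.headings,
        "body": " ".join(parser.body),
    }
```

Keeping headings as (tag, text) pairs preserves the heading level, which is needed later when scoring placement.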

5. Scoring Phrase Emphasis

Exact matches:

  • Count the number of times your phrase appears exactly as entered.

Near matches:

  • Include common variations, synonyms, or alternate phrasings.
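
Counting exact and near matches can be sketched with regular expressions. Two assumptions are made here that the rules above do not specify: matching is case-insensitive by default, and a match must fall on word boundaries. The variants list is supplied by you; the code does not generate synonyms.

```python
import re
from typing import Iterable, Tuple


def count_hits(
    text: str,
    phrase: str,
    variants: Iterable[str] = (),
    ignore_case: bool = True,  # assumption: case differences still count
) -> Tuple[int, int]:
    """Return (exact_hits, near_hits) for phrase and its variants."""
    flags = re.IGNORECASE if ignore_case else 0

    def count(needle: str) -> int:
        # Require non-word characters (or text edges) around the match.
        pattern = r"(?<!\w)" + re.escape(needle) + r"(?!\w)"
        return len(re.findall(pattern, text, flags=flags))

    exact = count(phrase)
    near = sum(count(variant) for variant in variants)
    return exact, near


if __name__ == "__main__":
    sample = "Soil health matters. Healthy soil feeds plants. We study soil health."
    print(count_hits(sample, "soil health", ["healthy soil"]))
```

If a variant contains the exact phrase (for example "soil health program"), the same text is counted in both totals, so choose variants that do not overlap with the phrase.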

Placement bonus:

  • In title or H1 (+2)
  • In H2/H3 (+1)
  • In first 300 words (+1)
  • In bullet points or lists (+1)
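
The placement bonus can be computed from the extracted parts of a page. This sketch assumes each bonus is awarded at most once per page, however many times the phrase appears in that position; the rules above do not say either way. The function name and argument shapes are illustrative.

```python
from typing import Iterable, Tuple


def placement_bonus(
    phrase: str,
    title: str = "",
    headings: Iterable[Tuple[str, str]] = (),  # (tag, text), e.g. ("h2", "Why it matters")
    body: str = "",
    list_items: Iterable[str] = (),
) -> int:
    """Sum the placement bonuses, each counted at most once (assumption)."""
    needle = phrase.lower()

    def has(text: str) -> bool:
        return needle in text.lower()

    headings = list(headings)
    h1_texts = [text for tag, text in headings if tag == "h1"]
    sub_texts = [text for tag, text in headings if tag in ("h2", "h3")]

    bonus = 0
    if has(title) or any(has(t) for t in h1_texts):
        bonus += 2  # in title or H1
    if any(has(t) for t in sub_texts):
        bonus += 1  # in H2/H3
    if has(" ".join(body.split()[:300])):
        bonus += 1  # in first 300 words
    if any(has(item) for item in list_items):
        bonus += 1  # in bullet points or lists
    return bonus
```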

Context bonus:

  • If the phrase is explained, expanded upon, or connected to the page’s main purpose.

Penalty:

  • If the phrase is mentioned only in passing without context.

Score scale (Emphasis Index, EI). Use the hit counts, bonuses, and penalty above as evidence when choosing the final score:

  • 0: No mention
  • 1: Passing mention only
  • 2: Mention + minimal context
  • 3: Mention + prominent placement
  • 4: Strong emphasis + multiple supporting elements
  • 5: Central focus of the content
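
One possible way to turn the evidence into an EI value is sketched below. The scale does not define numeric cut-offs, so every threshold here is an assumption, and the has_context and is_central_focus flags are human judgments that you supply. Treat it as a starting point for a consistent rubric, not a definitive mapping.

```python
def emphasis_index(
    hits: int,
    has_context: bool = False,       # judgment: phrase is explained or expanded upon
    placement: int = 0,              # placement bonus total
    is_central_focus: bool = False,  # judgment: the page is about the phrase
) -> int:
    """Map evidence to the 0-5 Emphasis Index using assumed thresholds."""
    if hits == 0:
        return 0  # no mention
    if is_central_focus:
        return 5  # central focus of the content
    if placement >= 2 and has_context:
        return 4  # assumption: "multiple supporting elements"
    if placement >= 1:
        return 3  # mention + prominent placement
    if has_context:
        return 2  # mention + minimal context
    return 1      # passing mention only
```

Whatever thresholds you settle on, apply the same ones to every page so that scores stay comparable across sites.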

6. Storing Your Results

Suggested spreadsheet columns:

| Domain | Topic | Page URL | Page Title | EI Score | Exact Hits | Near Hits | Supporting Context | Evidence Snippets |
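
For scripted workflows, the same columns can be written to a CSV file that opens in Google Sheets or Excel. This is a minimal sketch using the standard csv module; the function name write_results and the file name results.csv are illustrative.

```python
import csv

COLUMNS = [
    "Domain", "Topic", "Page URL", "Page Title", "EI Score",
    "Exact Hits", "Near Hits", "Supporting Context", "Evidence Snippets",
]


def write_results(rows, fileobj) -> None:
    """Write result dicts as CSV; missing columns are left blank."""
    writer = csv.DictWriter(fileobj, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow({column: row.get(column, "") for column in COLUMNS})


if __name__ == "__main__":
    example = [{
        "Domain": "example.org",
        "Topic": "community gardens",
        "Page URL": "https://example.org/mission",
        "Page Title": "Our Mission",
        "EI Score": 3,
        "Exact Hits": 2,
        "Near Hits": 1,
    }]
    # newline="" prevents blank lines between rows on Windows.
    with open("results.csv", "w", newline="", encoding="utf-8") as handle:
        write_results(example, handle)
```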


7. Legal & Ethical Notes

  • Respect robots.txt and site terms of use.
  • Use polite rate limits (1–4 requests/sec per site) for automated scripts.
  • Include an identifiable user-agent if crawling.
  • Allow site owners to opt out if requested.
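
The first three points can be sketched with the standard library. The USER_AGENT string and contact address are placeholders, and the helper names are illustrative. The robots.txt text is passed in as a string so the example runs offline; in practice you would download it from the site first.

```python
import time
from urllib.robotparser import RobotFileParser

# Placeholder: identify your crawler and give site owners a way to reach you.
USER_AGENT = "phrase-emphasis-research/0.1 (+mailto:you@example.org)"


def make_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Block so that calls to wait() are at least 1/max_per_sec apart."""

    def __init__(self, max_per_sec: float = 1.0):
        self.min_interval = 1.0 / max_per_sec
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


if __name__ == "__main__":
    robots = make_robots("User-agent: *\nDisallow: /private/\n")
    limiter = RateLimiter(max_per_sec=1.0)
    for url in ["https://example.org/about", "https://example.org/private/page"]:
        if robots.can_fetch(USER_AGENT, url):
            limiter.wait()
            print("would fetch", url)
        else:
            print("skipping (disallowed)", url)
```

Use one RateLimiter per site so that a slow limit on one domain does not hold up requests to another.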

8. Recommended Tools

  • For bulk crawling: Scrapy, Apify
  • For search seeding: SerpAPI, Bing Web Search API
  • For cleaning HTML: Readability (for example, readability-lxml in Python)
  • For storing results: Google Sheets, Excel, SQLite, or Postgres

9. Quick Start for Non-Coders

  1. Make your list of websites from directories or search results.
  2. Manually visit each site and find relevant pages.
  3. Copy the text into a spreadsheet.
  4. Search for your target phrase using your spreadsheet’s “find” function.
  5. Assign scores using the rules above.