This guide shows you how to find a set of websites, locate specific content pages, and measure how strongly a chosen phrase is emphasized. The process works for any type of site, topic, or phrase.
1. Overview of the Process
- Define your target phrase (or group of phrases).
- Collect a list of relevant websites.
- Identify pages likely to contain your target phrase.
- Extract clean text from those pages.
- Scan and score the level of emphasis on your target phrase.
- Compile results for analysis.
2. Gathering Websites
A. Directories & Lists
Find directories, associations, or network lists related to your subject area. These often have official listings with website links.
B. Aggregator Sites
Use community-curated databases, resource lists, or public listings that aggregate many sites in one place.
C. Search Engine Queries
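Use Google, Bing, or a search API such as SerpAPI with queries that combine your topic and your chosen phrase. Add a site: filter to restrict results to one top-level domain at a time:
"your topic" "your phrase" site:.org
"your topic" "your phrase" site:.com
"your topic" "your phrase" site:.net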
You can also omit the phrase to find a broader list of sites, then scan them for matches later.
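As a rough illustration, the query patterns above could be generated with a small helper. The function name build_queries and the sample topic and phrase are hypothetical, and the sketch only builds query strings; it does not call any search engine or API.

```python
def build_queries(topic, phrase=None, tlds=("org", "com", "net")):
    """Return one quoted search query per top-level-domain filter."""
    base = f'"{topic}"'
    if phrase:
        # Quote the phrase so the search engine looks for it as a unit.
        base += f' "{phrase}"'
    return [f"{base} site:.{tld}" for tld in tlds]


# Narrow search: topic plus phrase.
narrow = build_queries("community gardens", "soil health")
# Broad search: topic only, to be scanned for the phrase later.
broad = build_queries("community gardens")
```

Passing phrase=None gives the broader list of sites described above.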
3. Locating the Right Page
Common approaches:
- Try likely URL paths related to your topic.
- Search the site directly:
site:[domain.com] your phrase
- Look for menu items or headings that relate to the content area you’re researching.
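The first two approaches can be sketched in a few lines of Python. The path list below is purely hypothetical and should be replaced with paths that make sense for your topic; the helper names are also assumptions, not part of any library.

```python
from urllib.parse import urljoin

# Hypothetical examples of "likely" paths; adjust for your subject area.
CANDIDATE_PATHS = ["about", "services", "programs", "resources"]


def candidate_urls(domain, paths=CANDIDATE_PATHS):
    """Build full URLs to try for a domain, one per candidate path."""
    base = f"https://{domain.strip('/')}/"
    return [urljoin(base, path) for path in paths]


def site_query(domain, phrase):
    """Build a site-restricted search query for a single domain."""
    return f"site:{domain} {phrase}"


urls = candidate_urls("example.org")
query = site_query("example.org", "soil health")
```

Each candidate URL still has to be fetched and checked; many will not exist on a given site.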
4. Extracting Text
- Manual Method: Visit the page and copy text into a spreadsheet or document.
- Automated Method:
Use tools or scripts such as:
- Python: httpx + readability-lxml
- Playwright: For sites requiring JavaScript rendering
- Scrapy: For large-scale crawling with filtering rules
When extracting:
- Keep the main page title, section headings, and body text.
- Remove navigation menus, sidebars, and footers.
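The keep/remove rules above can be approximated with Python's standard library alone. This is a minimal sketch, not a replacement for a dedicated tool such as readability-lxml: it assumes the page marks navigation, sidebars, and footers with nav, aside, and footer tags, which many sites do not. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"nav", "aside", "footer", "script", "style"}
HEADING_TAGS = {"h1", "h2", "h3"}


class MainTextParser(HTMLParser):
    """Collect title, headings, and body text, ignoring SKIP_TAGS content."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.current = None   # "title", a heading tag, or None
        self.title = ""
        self.headings = []    # list of (tag, text) pairs
        self.body = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and (tag == "title" or tag in HEADING_TAGS):
            self.current = tag
            if tag in HEADING_TAGS:
                self.headings.append((tag, ""))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == self.current:
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        if self.current == "title":
            self.title += text
        elif self.current in HEADING_TAGS:
            tag, so_far = self.headings[-1]
            self.headings[-1] = (tag, (so_far + " " + text).strip())
        else:
            self.body.append(text)


def extract(html):
    """Return a dict with the page title, headings, and joined body text."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "headings": parser.headings,
            "body": " ".join(parser.body)}
```

Keeping headings separate from body text makes it easier to apply placement-based scoring later.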
5. Scoring Phrase Emphasis
Exact matches:
- Count the number of times your phrase appears exactly as entered.
Near matches:
- Include common variations, synonyms, or alternate phrasings.
Placement bonus:
- In title or H1 (+2)
- In H2/H3 (+1)
- In first 300 words (+1)
- In bullet points or lists (+1)
Context bonus:
- If the phrase is explained, expanded upon, or connected to the page’s main purpose.
Penalty:
- If the phrase is mentioned only in passing without context.
Score scale (Emphasis Index, EI):
- 0: No mention
- 1: Passing mention only
- 2: Mention + minimal context
- 3: Mention + prominent placement
- 4: Strong emphasis + multiple supporting elements
- 5: Central focus of the content
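A minimal scoring sketch follows. It rests on several assumptions that the rules above do not settle: matching is case-insensitive and on whole words, near matches come from a list you supply, and the mapping from placement bonus to EI (1 plus the bonus, capped at 5) is only one possible choice. The context bonus and the penalty have no numeric value above, so they are left to a human reviewer. All function names are hypothetical.

```python
import re


def count_hits(text, phrases):
    """Count whole-word, case-insensitive occurrences of any phrase."""
    return sum(
        len(re.findall(r"\b" + re.escape(p) + r"\b", text, flags=re.IGNORECASE))
        for p in phrases
    )


def placement_bonus(phrase, title, headings, body, list_items=()):
    """Apply the placement bonuses; headings is a list of (tag, text)."""
    def has(text):
        return count_hits(text, [phrase]) > 0

    bonus = 0
    if has(title) or any(tag == "h1" and has(t) for tag, t in headings):
        bonus += 2                                  # title or H1
    if any(tag in ("h2", "h3") and has(t) for tag, t in headings):
        bonus += 1                                  # H2/H3
    if has(" ".join(body.split()[:300])):
        bonus += 1                                  # first 300 words
    if any(has(item) for item in list_items):
        bonus += 1                                  # bullet points or lists
    return bonus


def score_page(phrase, near_phrases, title, headings, body, list_items=()):
    """Return hit counts, placement bonus, and a suggested EI (0 to 5)."""
    all_text = " ".join([title, *(t for _, t in headings), body, *list_items])
    exact = count_hits(all_text, [phrase])
    near = count_hits(all_text, near_phrases)
    bonus = placement_bonus(phrase, title, headings, body, list_items)
    # ASSUMPTION: EI is 0 with no hits, otherwise 1 + bonus, capped at 5.
    ei = 0 if exact + near == 0 else min(5, 1 + bonus)
    return {"exact_hits": exact, "near_hits": near,
            "placement_bonus": bonus, "ei": ei}


result = score_page(
    "soil health", ["healthy soil"],
    title="Soil Guide",
    headings=[("h1", "Soil health basics")],
    body="Healthy soil matters.",
)
```

Treat the returned EI as a starting suggestion and adjust it by hand where the phrase is clearly explained or clearly incidental. Near phrases that overlap the exact phrase will be counted twice, so choose them with care.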
6. Storing Your Results
Suggested spreadsheet columns:
| Domain | Topic | Page URL | Page Title | EI Score | Exact Hits | Near Hits | Supporting Context | Evidence Snippets |
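If you prefer a database to a spreadsheet, the same columns map directly onto a SQLite table. This is a sketch under assumptions: the table name results, the snake_case column names, and the choice of page_url as the primary key are all illustrative.

```python
import sqlite3

COLUMNS = ["domain", "topic", "page_url", "page_title", "ei_score",
           "exact_hits", "near_hits", "supporting_context", "evidence_snippets"]


def open_store(path=":memory:"):
    """Open (or create) the results database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               domain TEXT, topic TEXT, page_url TEXT PRIMARY KEY,
               page_title TEXT, ei_score INTEGER, exact_hits INTEGER,
               near_hits INTEGER, supporting_context TEXT,
               evidence_snippets TEXT)"""
    )
    return conn


def save_result(conn, row):
    """Insert one result dict; re-scoring the same URL overwrites the old row."""
    placeholders = ", ".join(":" + c for c in COLUMNS)
    conn.execute(
        f"INSERT OR REPLACE INTO results ({', '.join(COLUMNS)}) "
        f"VALUES ({placeholders})",
        row,
    )
    conn.commit()


conn = open_store()  # in-memory for the example; pass a file path to persist
save_result(conn, {
    "domain": "example.org", "topic": "community gardens",
    "page_url": "https://example.org/about", "page_title": "Soil Guide",
    "ei_score": 3, "exact_hits": 1, "near_hits": 1,
    "supporting_context": "Phrase appears in the main heading.",
    "evidence_snippets": "Soil health basics",
})
```

Each row dict must contain every column name listed in COLUMNS.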
7. Legal & Ethical Notes
- Respect robots.txt and site terms of use.
- Use polite rate limits (1–4 requests/sec) for automated scripts.
- Include an identifiable user-agent if crawling.
- Allow site owners to opt out if requested.
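The first three points can be sketched with the standard library. The user-agent string and contact address below are placeholders, the helper names are hypothetical, and the robots.txt text is passed in directly so the example runs without network access; in a real crawl you would fetch it from each site first.

```python
import time
from urllib.robotparser import RobotFileParser

# Placeholder identity; use a real project name and contact address.
USER_AGENT = "phrase-emphasis-survey/0.1 (+mailto:you@example.org)"


def robots_from_text(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Block so that calls to wait() happen at most max_per_sec times a second."""

    def __init__(self, max_per_sec=1.0):
        self.min_interval = 1.0 / max_per_sec
        self.last = None  # time of the previous request, if any

    def wait(self):
        if self.last is not None:
            delay = self.last + self.min_interval - time.monotonic()
            if delay > 0:
                time.sleep(delay)
        self.last = time.monotonic()


rules = robots_from_text("User-agent: *\nDisallow: /private/\n")
limiter = RateLimiter(max_per_sec=1.0)
allowed = rules.can_fetch(USER_AGENT, "https://example.org/about")
blocked = rules.can_fetch(USER_AGENT, "https://example.org/private/page")
```

Call limiter.wait() before every request to the same site, and skip any URL for which can_fetch returns False.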
8. Recommended Tools
- For bulk crawling: Scrapy, Apify
- For search seeding: SerpAPI, Bing Web Search API
- For cleaning HTML: Readability
- For storing results: Google Sheets, Excel, SQLite, or Postgres
9. Quick Start for Non-Coders
- Make your list of websites from directories or search results.
- Manually visit each site and find relevant pages.
- Copy the text into a spreadsheet.
- Search for your target phrase using your spreadsheet’s “find” function.
- Assign scores using the rules above.