(Beginner-friendly, copy-and-do format)
Step 0 — What We’re Doing
We’re going to use Apify, a cloud platform, to:
- Crawl a list of websites you give it,
- Follow only links that look relevant (e.g., “beliefs”, “what we believe”),
- Check for any phrase you choose (e.g., “Fear of the Lord”),
- Give you a CSV showing how often and how prominently the phrase is used.
No installation. No coding background needed.
Step 1 — Go to Apify
- Open this link in your browser: https://apify.com
- Click Sign Up (top-right).
- Create a free account (Google, GitHub, or email).
Step 2 — Create a New Actor
- Once you’re logged in, click Actors in the left menu.
- Click the + New button.
- Choose JavaScript (the default template).
- In the Name field, type:
phrase-emphasis-crawler
Step 3 — Replace the Code
- In the code editor that appears, select all the default code and delete it.
- Paste in this code:
import { Actor } from 'apify';
import { CheerioCrawler, Dataset, RequestQueue } from 'crawlee';
const PHRASE = 'fear of the lord'; // Change to any phrase
const NEAR = ['fear of god', 'reverent awe']; // Optional variants
const LIKELY = [
'belief', 'our-beliefs', 'what-we-believe', 'statement-of-faith',
'faith', 'doctrine', 'about', 'core-beliefs', 'we-believe'
];
function countOccurrences(text, needle) {
  if (!needle) return 0;
  // Escape regex metacharacters so the phrase is matched literally
  const safe = needle.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const matches = text.match(new RegExp(safe, 'gi'));
  return matches ? matches.length : 0;
}

await Actor.init();

const input = (await Actor.getInput()) || {
  startUrls: [{ url: 'https://example.org' }],
  sameDomainOnly: true,
  maxDepth: 2
};

const queue = await RequestQueue.open();
for (const s of input.startUrls || []) {
  // Track crawl depth on each request so we can stop at maxDepth
  await queue.addRequest({ url: s.url, userData: { depth: 0 } });
}

const crawler = new CheerioCrawler({
  requestQueue: queue,
  maxConcurrency: 10,
  async requestHandler({ request, $, enqueueLinks }) {
    const url = request.loadedUrl || request.url;
    const host = new URL(url).host;
    const depth = request.userData.depth ?? 0;
    const title = ($('title').text() || '').trim();
    const text = ($('body').text() || '').replace(/\s+/g, ' ').trim().toLowerCase();
    const headings = $('h1,h2,h3').text().toLowerCase();

    const exact = countOccurrences(text, PHRASE) + (title.toLowerCase().includes(PHRASE) ? 1 : 0);
    const near = (NEAR || []).reduce((n, p) => n + countOccurrences(text, p), 0);
    const headingFlag = headings.includes(PHRASE) ? 1 : 0;
    const earlyFlag = text.slice(0, 2000).includes(PHRASE) ? 1 : 0;

    // Base score buckets: 0 hits = 0, 1 hit = 1, 2-3 = 2, 4-6 = 3, 7+ = 4
    const hits = exact + near;
    const base = hits === 0 ? 0 : hits <= 1 ? 1 : hits <= 3 ? 2 : hits <= 6 ? 3 : 4;
    // Emphasis Index: base plus prominence bonuses, capped at 5
    const EI = Math.min(5, base + headingFlag + earlyFlag);

    await Dataset.pushData({
      host, url, title,
      exact_hits: exact, near_hits: near,
      heading_flag: headingFlag, early_flag: earlyFlag,
      EI
    });

    // Stop following links once the configured depth is reached
    if (depth >= (input.maxDepth ?? 2)) return;

    const patterns = LIKELY.map(k => new RegExp(k, 'i'));
    await enqueueLinks({
      strategy: 'same-domain',
      transformRequestFunction: req => {
        if (input.sameDomainOnly && new URL(req.url).host !== host) return null;
        if (!patterns.some(rx => rx.test(req.url))) return null;
        req.userData = { depth: depth + 1 };
        return req;
      },
    });
  },
});
await crawler.run();
await Actor.exit();
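If you want to sanity-check the scoring before running a crawl, the Emphasis Index (EI) calculation above can be reproduced as a small standalone function. The function name here is illustrative, not part of the Actor code:

```javascript
// Standalone sketch of the Emphasis Index (EI) used by the Actor above.
// Inputs: total phrase hits (exact + near), whether the phrase appears
// in a heading, and whether it appears in the first ~2000 characters.
function emphasisIndex(hits, inHeading, appearsEarly) {
  // Base score buckets: 0 hits = 0, 1 hit = 1, 2-3 = 2, 4-6 = 3, 7+ = 4
  const base = hits === 0 ? 0 : hits <= 1 ? 1 : hits <= 3 ? 2 : hits <= 6 ? 3 : 4;
  // Each prominence signal adds 1; the index is capped at 5
  return Math.min(5, base + (inHeading ? 1 : 0) + (appearsEarly ? 1 : 0));
}

console.log(emphasisIndex(0, false, false)); // 0 - phrase absent
console.log(emphasisIndex(2, false, false)); // 2 - a couple of mentions
console.log(emphasisIndex(5, true, true));   // 5 - frequent and prominent
console.log(emphasisIndex(10, true, true));  // 5 - capped at 5
```

So a page can only reach EI 5 by combining repeated mentions with at least one prominence signal (heading or early placement).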
- Click Save.
Step 4 — Provide Your Website List
- In the left panel, click Input.
- Paste JSON like this (change to your sites):
{
"startUrls": [
{"url": "https://examplechurch.org"},
{"url": "https://anotherchurch.com"}
],
"sameDomainOnly": true,
"maxDepth": 2
}
- startUrls: your starting websites.
- sameDomainOnly: stay on the same site.
- maxDepth: how many clicks deep to go (2 is enough for most sites).
Step 5 — Run the Actor
- Click Run (top-right).
- Wait for it to finish.
(You can watch the log scroll if you want.)
Step 6 — Get Your Results
- When it’s done, click the Dataset tab.
- Click Export.
- Choose CSV (or JSON if you prefer).
- Open the CSV in Excel or Google Sheets, then sort by the EI column to see the strongest emphasis first.
Step 7 — Change the Phrase Anytime
- In the code editor:
- In the code editor, change:
const PHRASE = 'your new phrase';
const NEAR = ['variant 1', 'variant 2'];
- Click Save → Run again.
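Because the counting helper in the Actor escapes regex metacharacters and matches case-insensitively, any phrase you set is matched literally, punctuation included. A quick standalone check (the function name here is illustrative):

```javascript
// Standalone copy of the Actor's phrase-counting logic for quick testing.
function countPhrase(text, needle) {
  if (!needle) return 0;
  // Escape regex metacharacters so the phrase is matched literally
  const safe = needle.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const matches = text.match(new RegExp(safe, 'gi'));
  return matches ? matches.length : 0;
}

console.log(countPhrase('The Fear of the Lord... fear of the lord!', 'fear of the lord')); // 2
console.log(countPhrase('What does "awe (reverent)" mean?', 'awe (reverent)'));            // 1
```

Note that matching is substring-based, so very short phrases can over-count (e.g., "art" matches inside "start").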
Step 8 — Good Practices
- Keep maxDepth low for speed and to avoid crawling entire sites unnecessarily.
- Add only the relevant NEAR variants to avoid false positives.
- Use descriptive start URLs: the homepage or a known "beliefs" page.
Going Further — Two Architectures
- Fastest to working demo: Custom GPT + Apify Actions.
- Cheapest at scale / most control (but slower to set up): Assistants API + your own Scrapy infrastructure.
TL;DR pick
- Quickest (hours, not days): Spin up an Apify Actor (we already wrote the code), then make a Custom GPT with a simple Action that:
1) starts the crawl, 2) polls status, 3) fetches the dataset, 4) summarizes EI.
No servers, no DevOps. You’re live today.
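Those three calls map directly onto Apify's REST API. A minimal sketch of the endpoint URLs the GPT Action would hit (the token, actor name, run ID, and dataset ID are placeholders you get from your Apify account and from the first response):

```javascript
// Sketch of the three Apify REST API calls a Custom GPT Action would make.
// APIFY_TOKEN and the actor name are placeholders - substitute your own.
const BASE = 'https://api.apify.com/v2';

// 1) Start the crawl (actor ID format: username~actor-name).
//    POST this URL with your input JSON as the request body.
const startUrl = (actorId, token) =>
  `${BASE}/acts/${actorId}/runs?token=${token}`;

// 2) Poll the run until its status field reports SUCCEEDED. GET this URL.
const statusUrl = (runId, token) =>
  `${BASE}/actor-runs/${runId}?token=${token}`;

// 3) Fetch the results (the run object includes defaultDatasetId). GET this URL.
const datasetUrl = (datasetId, token) =>
  `${BASE}/datasets/${datasetId}/items?format=json&token=${token}`;

console.log(startUrl('yourname~phrase-emphasis-crawler', 'APIFY_TOKEN'));
```

The GPT stays thin because all the crawling state lives in Apify; the Action only shuttles these three URLs and summarizes the returned EI rows.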
Why this is “most efficient” right now
- Zero infra & maintenance: Apify handles crawling, queues, retries, exports.
- Low cognitive load: Your GPT just calls 3 endpoints. No long prompts, no background jobs.
- Scales cleanly: Add seeds, schedule runs in Apify. The GPT stays thin.
- Cost visibility: Apify usage + tiny OpenAI tokens for the summary (not for the crawl).
When to choose the other route
- If you need on-prem data custody / custom scoring / pennies-per-million pages:
Go Assistants API + Scrapy on your box or cloud. More efficient per page long-term, but expect 1–3 days to productionize (CI, storage, monitoring).
One-sentence plan
Start with Apify + Custom GPT Action for instant capability; once the workflow and scoring are stable, clone the pipeline as Scrapy + Assistants if/when you want full control and lower marginal costs.