How to Use Apify to Scan Websites for a Phrase


(Beginner-friendly, copy-and-do format)


Step 0 — What We’re Doing

We’re going to use Apify, a cloud platform, to:

  • Crawl a list of websites you give it,
  • Follow only links that look relevant (e.g., “beliefs”, “what we believe”),
  • Check for any phrase you choose (e.g., “Fear of the Lord”),
  • Give you a CSV showing how often and how prominently the phrase is used.

No installation. No coding background needed.


Step 1 — Go to Apify

  1. Open this link in your browser:
    https://apify.com
  2. Click Sign Up (top-right).
  3. Create a free account (Google, GitHub, or email).

Step 2 — Create a New Actor

  1. Once you’re logged in, click Actors in the left menu.
  2. Click the + New button.
  3. Choose JavaScript (the default template).
  4. In the Name field, type: phrase-emphasis-crawler

Step 3 — Replace the Code

  1. In the code editor that appears, select all the default code and delete it.
  2. Paste in this code:
import { Actor } from 'apify';
import { CheerioCrawler, Dataset, RequestQueue } from 'crawlee';

const PHRASE = 'fear of the lord';           // Change to any phrase
const NEAR = ['fear of god', 'reverent awe']; // Optional variants

const LIKELY = [
  'belief', 'our-beliefs', 'what-we-believe', 'statement-of-faith',
  'faith', 'doctrine', 'about', 'core-beliefs', 'we-believe'
];

// Count case-insensitive occurrences of `needle` in `text`
// (regex metacharacters are escaped so the phrase is matched literally)
function countOccurrences(text, needle) {
  if (!needle) return 0;
  const safe = needle.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const matches = text.match(new RegExp(safe, 'gi'));
  return matches ? matches.length : 0;
}

await Actor.init();

const input = (await Actor.getInput()) || {
  startUrls: [{ url: 'https://example.org' }],
  sameDomainOnly: true,
  maxDepth: 2
};

const queue = await RequestQueue.open();
for (const s of input.startUrls || []) {
  await queue.addRequest({ url: s.url });
}

const crawler = new CheerioCrawler({
  requestQueue: queue,
  maxConcurrency: 10,
  async requestHandler({ request, $, body, enqueueLinks }) {
    const url = request.loadedUrl || request.url;
    const host = new URL(url).host;
    const title = ($('title').text() || '').trim();
    const text = ($('body').text() || '').replace(/\s+/g, ' ').trim().toLowerCase();
    const headings = $('h1,h2,h3').text().toLowerCase();

    const exact = countOccurrences(text, PHRASE) + (title.toLowerCase().includes(PHRASE) ? 1 : 0);
    const near = (NEAR || []).reduce((n, p) => n + countOccurrences(text, p), 0);
    const headingFlag = headings.includes(PHRASE) ? 1 : 0;
    const earlyFlag = text.slice(0, 2000).includes(PHRASE) ? 1 : 0;

    // Emphasis Index (EI): band total hits into 0–4, add 1 each for a heading
    // match and an early match (first ~2000 characters), then cap at 5
    const hits = exact + near;
    const base = hits === 0 ? 0 : hits <= 1 ? 1 : hits <= 3 ? 2 : hits <= 6 ? 3 : 4;
    const EI = Math.min(5, base + headingFlag + earlyFlag);

    await Dataset.pushData({
      host, url, title,
      exact_hits: exact, near_hits: near,
      heading_flag: headingFlag, early_flag: earlyFlag,
      EI
    });

    // Crawlee's enqueueLinks has no maxDepth option, so track depth in userData
    const depth = request.userData.depth ?? 0;
    if (depth < (input.maxDepth ?? 2)) {
      const patterns = LIKELY.map(k => new RegExp(k, 'i'));
      await enqueueLinks({
        strategy: input.sameDomainOnly ? 'same-domain' : 'all',
        transformRequestFunction: req => {
          if (!patterns.some(rx => rx.test(req.url))) return null;
          req.userData = { depth: depth + 1 };
          return req;
        },
      });
    }
  },
});

await crawler.run();
await Actor.exit();
  3. Click Save.
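If you want to sanity-check the counting logic before running anything in the cloud, the same regex approach can be tried in any Node.js REPL. This standalone snippet mirrors the actor's counting helper; the sample text is made up for illustration:

```javascript
// Standalone check of the phrase-counting logic used by the actor.
function countOccurrences(text, needle) {
  if (!needle) return 0;
  // Escape regex metacharacters so the phrase is matched literally
  const safe = needle.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const matches = text.match(new RegExp(safe, 'gi'));
  return matches ? matches.length : 0;
}

const sample = 'The fear of the Lord is the beginning of wisdom. ' +
  'We teach the fear of the Lord in every class.';
console.log(countOccurrences(sample, 'fear of the lord')); // → 2 (case-insensitive)
```

Note that matching is case-insensitive and literal: punctuation in the phrase is escaped, so it never acts as a regex wildcard.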

Step 4 — Provide Your Website List

  1. In the left panel, click Input.
  2. Paste JSON like this (change to your sites):
{
  "startUrls": [
    {"url": "https://examplechurch.org"},
    {"url": "https://anotherchurch.com"}
  ],
  "sameDomainOnly": true,
  "maxDepth": 2
}
  • startUrls: your starting websites.
  • sameDomainOnly: stay on the same site.
  • maxDepth: how many clicks deep to crawl (2 is enough for most sites).
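If you edit the JSON and are unsure it is still valid, Node.js can check it in a couple of lines before you paste it into Apify (a quick optional sanity check; the Apify console will also flag malformed input):

```javascript
// Paste your edited input between the backticks to confirm it is valid JSON.
const raw = `{
  "startUrls": [
    { "url": "https://examplechurch.org" },
    { "url": "https://anotherchurch.com" }
  ],
  "sameDomainOnly": true,
  "maxDepth": 2
}`;

const input = JSON.parse(raw); // throws with a clear message if the JSON is malformed
console.log(input.startUrls.length); // → 2
```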

Step 5 — Run the Actor

  1. Click Run (top-right).
  2. Wait for it to finish.
    (You can watch the log scroll if you want.)

Step 6 — Get Your Results

  1. When it’s done, click the Dataset tab.
  2. Click Export.
  3. Choose CSV (or JSON if you prefer).
  4. Open the CSV in Excel or Google Sheets — sort by the EI column to see the strongest emphasis first.
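If you would rather pull results programmatically than click Export, Apify's public API serves the same dataset over HTTP. A minimal sketch (the `datasetId` and `token` arguments are placeholders for your own values from the Apify console; this needs Node.js 18+ for the built-in fetch):

```javascript
// Sort rows by Emphasis Index, strongest first (same as sorting the CSV by EI).
function sortByEI(rows) {
  return [...rows].sort((a, b) => b.EI - a.EI);
}

// Fetch a run's dataset as JSON via the Apify API.
async function fetchResults(datasetId, token) {
  const url = `https://api.apify.com/v2/datasets/${datasetId}/items?format=json&token=${token}`;
  const res = await fetch(url);
  return sortByEI(await res.json());
}

// Usage: const rows = await fetchResults('YOUR_DATASET_ID', 'YOUR_TOKEN');
```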

Step 7 — Change the Phrase Anytime

  • In the code editor: const PHRASE = 'your new phrase'; const NEAR = ['variant 1', 'variant 2'];
  • Click Save → Run again.

Step 8 — Good Practices

  • Keep maxDepth low for speed and to avoid crawling entire sites unnecessarily.
  • Add only the relevant NEAR variants to avoid false positives.
  • Use descriptive start URLs — homepage or known “beliefs” page.

Fastest to a working demo: Custom GPT + Apify Actions.
Cheapest at scale / most control (but slower to set up): Assistants API + your own Scrapy infrastructure.

TL;DR pick

  • Quickest (hours, not days): Spin up an Apify Actor (we already wrote the code), then make a Custom GPT with a simple Action that:
    1) starts the crawl, 2) polls status, 3) fetches the dataset, 4) summarizes EI.
    No servers, no DevOps. You’re live today.
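Those three calls map directly onto Apify's REST API. A sketch of what the Action would wrap (the `actorId`, `runId`, `datasetId`, and `token` arguments are placeholders for your own values; this assumes Node.js 18+ for the built-in fetch):

```javascript
// Sketch of the three Apify API calls a Custom GPT Action would make.
const BASE = 'https://api.apify.com/v2';

// 1) Start the crawl (returns run metadata, including id and defaultDatasetId)
async function startRun(actorId, token, input) {
  const res = await fetch(`${BASE}/acts/${actorId}/runs?token=${token}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(input),
  });
  return (await res.json()).data;
}

// 2) Poll the run status until it reports 'SUCCEEDED' (or 'FAILED')
async function pollRun(runId, token) {
  const res = await fetch(`${BASE}/actor-runs/${runId}?token=${token}`);
  return (await res.json()).data.status;
}

// 3) Fetch the dataset items for the GPT to summarize
async function fetchDataset(datasetId, token) {
  const res = await fetch(`${BASE}/datasets/${datasetId}/items?token=${token}`);
  return res.json();
}
```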

Why this is “most efficient” right now

  • Zero infra & maintenance: Apify handles crawling, queues, retries, exports.
  • Low cognitive load: Your GPT just calls 3 endpoints. No long prompts, no background jobs.
  • Scales cleanly: Add seeds, schedule runs in Apify. The GPT stays thin.
  • Cost visibility: Apify usage + tiny OpenAI tokens for the summary (not for the crawl).

When to choose the other route

  • If you need on-prem data custody / custom scoring / pennies-per-million pages:
    Go Assistants API + Scrapy on your box or cloud. More efficient per page long-term, but expect 1–3 days to productionize (CI, storage, monitoring).

One-sentence plan

Start with Apify + Custom GPT Action for instant capability; once the workflow and scoring are stable, clone the pipeline as Scrapy + Assistants if/when you want full control and lower marginal costs.