How to Use Apify to Scan Websites for a Phrase


(Beginner-friendly, copy-and-do format)


Step 0 — What We’re Doing

We’re going to use Apify, a cloud platform, to:

  • Crawl a list of websites you give it,
  • Follow only links that look relevant (e.g., “beliefs”, “what we believe”),
  • Check for any phrase you choose (e.g., “Fear of the Lord”),
  • Give you a CSV showing how often and how prominently the phrase is used.

No installation. No coding background needed.


Step 1 — Go to Apify

  1. Open this link in your browser:
    https://apify.com
  2. Click Sign Up (top-right).
  3. Create a free account (Google, GitHub, or email).

Step 2 — Create a New Actor

  1. Once you’re logged in, click Actors in the left menu.
  2. Click the + New button.
  3. Choose JavaScript (the default template).
  4. In the Name field, type: phrase-emphasis-crawler

Step 3 — Replace the Code

  1. In the code editor that appears, select all the default code and delete it.
  2. Paste in this code:
import { Actor } from 'apify';
import { CheerioCrawler, Dataset, RequestQueue } from 'crawlee';

const PHRASE = 'fear of the lord';           // Change to any phrase
const NEAR = ['fear of god', 'reverent awe']; // Optional variants

const LIKELY = [
  'belief', 'our-beliefs', 'what-we-believe', 'statement-of-faith',
  'faith', 'doctrine', 'about', 'core-beliefs', 'we-believe'
];

// Count case-insensitive occurrences of `needle` in `text`
// (regex metacharacters are escaped so the phrase is matched literally)
function countOccurrences(text, needle) {
  if (!needle) return 0;
  const safe = needle.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const matches = text.match(new RegExp(safe, 'gi'));
  return matches ? matches.length : 0;
}

await Actor.init();

const input = (await Actor.getInput()) || {
  startUrls: [{ url: 'https://example.org' }],
  sameDomainOnly: true,
  maxDepth: 2
};

const queue = await RequestQueue.open();
for (const s of input.startUrls || []) {
  await queue.addRequest({ url: s.url });
}

const crawler = new CheerioCrawler({
  requestQueue: queue,
  maxConcurrency: 10,
  async requestHandler({ request, $, body, enqueueLinks }) {
    const url = request.loadedUrl || request.url;
    const host = new URL(url).host;
    const title = ($('title').text() || '').trim();
    const text = ($('body').text() || '').replace(/\s+/g, ' ').trim().toLowerCase();
    const headings = $('h1,h2,h3').text().toLowerCase();

    const exact = countOccurrences(text, PHRASE) + (title.toLowerCase().includes(PHRASE) ? 1 : 0);
    const near = (NEAR || []).reduce((n, p) => n + countOccurrences(text, p), 0);
    const headingFlag = headings.includes(PHRASE) ? 1 : 0;
    const earlyFlag = text.slice(0, 2000).includes(PHRASE) ? 1 : 0;

    // Emphasis Index (EI): band total hits into 0–4, add 1 each for a heading
    // match and an early match (first ~2000 characters), then cap at 5
    const hits = exact + near;
    const base = hits === 0 ? 0 : hits <= 1 ? 1 : hits <= 3 ? 2 : hits <= 6 ? 3 : 4;
    const EI = Math.min(5, base + headingFlag + earlyFlag);

    await Dataset.pushData({
      host, url, title,
      exact_hits: exact, near_hits: near,
      heading_flag: headingFlag, early_flag: earlyFlag,
      EI
    });

    // Crawlee's enqueueLinks has no maxDepth option, so track depth in userData
    const depth = request.userData.depth ?? 0;
    if (depth < (input.maxDepth ?? 2)) {
      const patterns = LIKELY.map(k => new RegExp(k, 'i'));
      await enqueueLinks({
        strategy: input.sameDomainOnly ? 'same-domain' : 'all',
        transformRequestFunction: req => {
          if (!patterns.some(rx => rx.test(req.url))) return null;
          req.userData = { depth: depth + 1 };
          return req;
        },
      });
    }
  },
});

await crawler.run();
await Actor.exit();
  3. Click Save.
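If you want to sanity-check the counting logic before running anything in the cloud, the same regex approach can be tried in any Node.js REPL. This standalone snippet mirrors the actor's counting helper; the sample text is made up for illustration:

```javascript
// Standalone check of the phrase-counting logic used by the actor.
function countOccurrences(text, needle) {
  if (!needle) return 0;
  // Escape regex metacharacters so the phrase is matched literally
  const safe = needle.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const matches = text.match(new RegExp(safe, 'gi'));
  return matches ? matches.length : 0;
}

const sample = 'The fear of the Lord is the beginning of wisdom. ' +
  'We teach the fear of the Lord in every class.';
console.log(countOccurrences(sample, 'fear of the lord')); // → 2 (case-insensitive)
```

Note that matching is case-insensitive and literal: punctuation in the phrase is escaped, so it never acts as a regex wildcard.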

Step 4 — Provide Your Website List

  1. In the left panel, click Input.
  2. Paste JSON like this (change to your sites):
{
  "startUrls": [
    {"url": "https://examplechurch.org"},
    {"url": "https://anotherchurch.com"}
  ],
  "sameDomainOnly": true,
  "maxDepth": 2
}
  • startUrls: your starting websites.
  • sameDomainOnly: stay on the same site.
  • maxDepth: how many clicks deep to crawl (2 is enough for most sites).
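If you edit the JSON and are unsure it is still valid, Node.js can check it in a couple of lines before you paste it into Apify (a quick optional sanity check; the Apify console will also flag malformed input):

```javascript
// Paste your edited input between the backticks to confirm it is valid JSON.
const raw = `{
  "startUrls": [
    { "url": "https://examplechurch.org" },
    { "url": "https://anotherchurch.com" }
  ],
  "sameDomainOnly": true,
  "maxDepth": 2
}`;

const input = JSON.parse(raw); // throws with a clear message if the JSON is malformed
console.log(input.startUrls.length); // → 2
```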

Step 5 — Run the Actor

  1. Click Run (top-right).
  2. Wait for it to finish.
    (You can watch the log scroll if you want.)

Step 6 — Get Your Results

  1. When it’s done, click the Dataset tab.
  2. Click Export.
  3. Choose CSV (or JSON if you prefer).
  4. Open the CSV in Excel or Google Sheets — sort by the EI column to see the strongest emphasis first.
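If you would rather pull results programmatically than click Export, Apify's public API serves the same dataset over HTTP. A minimal sketch (the `datasetId` and `token` arguments are placeholders for your own values from the Apify console; this needs Node.js 18+ for the built-in fetch):

```javascript
// Sort rows by Emphasis Index, strongest first (same as sorting the CSV by EI).
function sortByEI(rows) {
  return [...rows].sort((a, b) => b.EI - a.EI);
}

// Fetch a run's dataset as JSON via the Apify API.
async function fetchResults(datasetId, token) {
  const url = `https://api.apify.com/v2/datasets/${datasetId}/items?format=json&token=${token}`;
  const res = await fetch(url);
  return sortByEI(await res.json());
}

// Usage: const rows = await fetchResults('YOUR_DATASET_ID', 'YOUR_TOKEN');
```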

Step 7 — Change the Phrase Anytime

  • In the code editor: const PHRASE = 'your new phrase'; const NEAR = ['variant 1', 'variant 2'];
  • Click Save → Run again.

Step 8 — Good Practices

  • Keep maxDepth low for speed and to avoid crawling entire sites unnecessarily.
  • Add only the relevant NEAR variants to avoid false positives.
  • Use descriptive start URLs — homepage or known “beliefs” page.

Fastest to a working demo: Custom GPT + Apify Actions.
Cheapest at scale / most control (but slower to set up): Assistants API + your own Scrapy infrastructure.

TL;DR pick

  • Quickest (hours, not days): Spin up an Apify Actor (we already wrote the code), then make a Custom GPT with a simple Action that:
    1) starts the crawl, 2) polls status, 3) fetches the dataset, 4) summarizes EI.
    No servers, no DevOps. You’re live today.
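Those three calls map directly onto Apify's REST API. A sketch of what the Action would wrap (the `actorId`, `runId`, `datasetId`, and `token` arguments are placeholders for your own values; this assumes Node.js 18+ for the built-in fetch):

```javascript
// Sketch of the three Apify API calls a Custom GPT Action would make.
const BASE = 'https://api.apify.com/v2';

// 1) Start the crawl (returns run metadata, including id and defaultDatasetId)
async function startRun(actorId, token, input) {
  const res = await fetch(`${BASE}/acts/${actorId}/runs?token=${token}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(input),
  });
  return (await res.json()).data;
}

// 2) Poll the run status until it reports 'SUCCEEDED' (or 'FAILED')
async function pollRun(runId, token) {
  const res = await fetch(`${BASE}/actor-runs/${runId}?token=${token}`);
  return (await res.json()).data.status;
}

// 3) Fetch the dataset items for the GPT to summarize
async function fetchDataset(datasetId, token) {
  const res = await fetch(`${BASE}/datasets/${datasetId}/items?token=${token}`);
  return res.json();
}
```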

Why this is “most efficient” right now

  • Zero infra & maintenance: Apify handles crawling, queues, retries, exports.
  • Low cognitive load: Your GPT just calls 3 endpoints. No long prompts, no background jobs.
  • Scales cleanly: Add seeds, schedule runs in Apify. The GPT stays thin.
  • Cost visibility: Apify usage + tiny OpenAI tokens for the summary (not for the crawl).

When to choose the other route

  • If you need on-prem data custody / custom scoring / pennies-per-million pages:
    Go Assistants API + Scrapy on your box or cloud. More efficient per page long-term, but expect 1–3 days to productionize (CI, storage, monitoring).

One-sentence plan

Start with Apify + Custom GPT Action for instant capability; once the workflow and scoring are stable, clone the pipeline as Scrapy + Assistants if/when you want full control and lower marginal costs.