Custom GPT Crawler

Start your crawl, wait for it to finish, fetch results, and summarize—end to end.

Below you’ll get two drop-ins:

  1. OpenAPI schema (paste into the GPT Builder → Configure → Actions → Add Action).
  2. Instruction block (paste into the GPT’s Instructions so it knows when/how to call each action).

I’m using Apify’s REST API for: start run → check status (optionally wait) → export dataset. (Endpoints documented by Apify; GPT Actions setup documented by OpenAI.)


1) OpenAPI schema (ready to paste)

In GPT Builder: Configure → Actions → Add Action → Import from URL / Paste schema.

openapi: 3.1.0
info:
  title: Phrase Emphasis Crawler (Apify)
  version: "1.0.0"
  description: >
    Start an Apify Actor run with a list of start URLs and a phrase, poll for completion,
    and fetch the resulting dataset (CSV or JSON).
servers:
  - url: https://api.apify.com/v2
components:
  securitySchemes:
    ApifyToken:
      type: apiKey
      in: header
      name: Authorization
      description: Use format "Bearer {APIFY_API_TOKEN}"
  schemas:
    StartInput:
      type: object
      required: [actorId, input]
      properties:
        actorId:
          type: string
          description: Actor ID or "username~actor-name"
          example: your-user~phrase-emphasis-crawler
        input:
          type: object
          description: JSON payload passed as INPUT to the Actor (e.g., startUrls, sameDomainOnly, maxDepth)
          example:
            startUrls: [{ url: "https://examplechurch.org" }]
            sameDomainOnly: true
            maxDepth: 2
        options:
          type: object
          description: Optional run options (memory, timeout, build, etc.)
          example: { memory: 2048, timeoutSecs: 1800 }
    RunStatus:
      type: object
      properties:
        id: { type: string }
        status: { type: string, description: CREATED|RUNNING|SUCCEEDED|FAILED|ABORTED|TIMED-OUT }
        defaultDatasetId: { type: string }
        defaultKeyValueStoreId: { type: string }
        finishedAt: { type: ["string", "null"] }
    DatasetItems:
      type: array
      items:
        type: object
        additionalProperties: true
security:
  - ApifyToken: []
paths:
  /acts/{actorId}/runs:
    post:
      operationId: startRun
      summary: Start an Actor run (returns immediately)
      description: |
        Runs an Actor and returns the new run object. The POST body is passed
        to the Actor as INPUT. Use `getRun` (optionally with waitForFinish) to await completion.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: actorId
          required: true
          schema: { type: string }
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/StartInput" }
      responses:
        "201":
          description: Run created
          content:
            application/json:
              schema:
                type: object
                properties:
                  data: { $ref: "#/components/schemas/RunStatus" }
        "4XX":
          description: Client error
        "5XX":
          description: Server error

  /actor-runs/{runId}:
    get:
      operationId: getRun
      summary: Get run status (optionally wait for finish)
      description: |
        Returns run details. Use query param `waitForFinish` (0..120000) to block
        until the run finishes (ms). If omitted, returns current status immediately.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: runId
          required: true
          schema: { type: string }
        - in: query
          name: waitForFinish
          required: false
          schema: { type: integer, minimum: 0, maximum: 120000 }
      responses:
        "200":
          description: Run details
          content:
            application/json:
              schema:
                type: object
                properties:
                  data: { $ref: "#/components/schemas/RunStatus" }

  /datasets/{datasetId}/items:
    get:
      operationId: getDatasetItems
      summary: Fetch dataset items
      description: |
        Returns dataset items. Use `format=json` for JSON (default) or `format=csv` to get CSV text.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: datasetId
          required: true
          schema: { type: string }
        - in: query
          name: format
          required: false
          schema: { type: string, enum: [json, csv], default: json }
        - in: query
          name: clean
          required: false
          schema: { type: boolean, default: true }
      responses:
        "200":
          description: Items
          content:
            application/json:
              schema: { $ref: "#/components/schemas/DatasetItems" }
            text/csv:
              schema: { type: string }

Why these endpoints?

  • Start run: POST /v2/acts/{actorId}/runs starts an Actor.
  • Check run: GET /v2/actor-runs/{runId} supports waitForFinish so you can block instead of polling.
  • Get results: GET /v2/datasets/{datasetId}/items?format=csv|json exports your run’s dataset.

Auth in GPT Builder: set Authorization = Bearer ${APIFY_API_TOKEN} (store the token as a secret in the Action form).
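
If you want to sanity-check the flow outside the GPT Builder first, here is a minimal Python sketch of the same start → wait → fetch sequence. It assumes the requests package, an APIFY_TOKEN environment variable, and a placeholder Actor ID; adjust the input payload to whatever your Actor expects.

# apify_flow.py - minimal sketch: start a run, poll to a terminal status, fetch the dataset
import os
import time

import requests

BASE = "https://api.apify.com/v2"
HEADERS = {"Authorization": f"Bearer {os.environ['APIFY_TOKEN']}"}
ACTOR_ID = "your-user~phrase-emphasis-crawler"  # placeholder Actor ID

# 1) Start the run; the POST body is passed to the Actor as its INPUT.
run = requests.post(
    f"{BASE}/acts/{ACTOR_ID}/runs",
    headers=HEADERS,
    json={
        "startUrls": [{"url": "https://examplechurch.org"}],
        "sameDomainOnly": True,
        "maxDepth": 2,
    },
).json()["data"]

# 2) Poll until the run reaches a terminal state. (The waitForFinish query parameter
#    can cut down round trips; check Apify's docs for its exact limits before relying on it.)
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}
while run["status"] not in TERMINAL:
    time.sleep(10)
    run = requests.get(f"{BASE}/actor-runs/{run['id']}", headers=HEADERS).json()["data"]
    print("status:", run["status"])

# 3) On success, export the default dataset as JSON items.
if run["status"] == "SUCCEEDED":
    items = requests.get(
        f"{BASE}/datasets/{run['defaultDatasetId']}/items",
        headers=HEADERS,
        params={"format": "json", "clean": "true"},
    ).json()
    print(len(items), "items")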


2) Custom GPT “Instructions” (ready to paste)

In GPT Builder: Configure → Instructions. Replace or append the block below.

Role & Goal
You are an operations assistant that runs a web crawl via Apify, waits for it to finish, then returns a concise report with an Emphasis Index summary for the chosen phrase.

When to use Actions

  • When the user says “scan”, “crawl”, “run”, or provides URLs/phrase → call startRun.
  • Immediately after startRun, call getRun with waitForFinish=120000 (120s).
    • If status is not a terminal state, keep calling getRun with waitForFinish=120000 until status ∈ {SUCCEEDED, FAILED, ABORTED, TIMED-OUT}.
  • If SUCCEEDED and defaultDatasetId is present → call getDatasetItems (prefer format=json unless the user asks for CSV).
  • Then summarize: top hosts by EI, pages with highest EI, average EI, and any notable evidence fields if present.

Required inputs for startRun

  • actorId: the Apify Actor ID (e.g., username~phrase-emphasis-crawler).
  • input: at minimum startUrls (array of {url}), plus any flags (e.g., sameDomainOnly, maxDepth). If user only supplies a list of domains, convert them to URLs with https:// prefix.

Error handling

  • If startRun or getRun fails → report the error message and suggest reducing depth or seed size.
  • If final status is not SUCCEEDED → show the final status and logs link if available (from run object), and stop.
  • If dataset is empty → say so plainly and suggest adding NEAR variants or different LIKELY paths.

Output format to the user

  1. A short status line (run ID, status, elapsed time).
  2. A compact table (host, pages scanned, top 5 URLs by EI).
  3. A one-paragraph interpretation (what the EI distribution means).
  4. Offer to export CSV (call getDatasetItems?format=csv) if requested.

Behavioral rules

  • Never invent data. Only claim what the dataset contains.
  • Keep the report lean: avoid repeating raw items when there are many—summarize and provide counts.
  • If the user changes the phrase or seeds mid-flow, start a new run.
  • Be explicit with units (seconds, pages) and numbers.

How to fill the Action’s auth

  • In the Action editor, set Security → API Key in header with name Authorization.
  • Value: Bearer YOUR_APIFY_API_TOKEN. (Find your token in Apify Console → Integrations → API token.)

Where these instructions come from

  • GPT Actions setup & builder: OpenAI’s docs explain where to paste schema, how to set auth, and best practices for tool use.
  • Apify endpoints: Official Apify docs for starting runs, waiting for finish, and exporting dataset items.

Next, here’s the two-backend setup so your Custom GPT can run either Apify or your Scrapy stack, with clean switching.


1) Add a second Action: Scrapy Orchestrator API

Paste this OpenAPI schema as a new Action in the GPT Builder (Configure → Actions → Add Action).

openapi: 3.1.0
info:
  title: Scrapy Orchestrator
  version: "1.0.0"
  description: Start a Scrapy crawl on your infra, poll status, and fetch results (JSON or CSV).
servers:
  - url: https://scrapy.example.com/api
components:
  securitySchemes:
    Bearer:
      type: apiKey
      in: header
      name: Authorization
      description: Use "Bearer {YOUR_TOKEN}"
  schemas:
    StartJobInput:
      type: object
      required: [spider, seeds, phrase]
      properties:
        spider: { type: string, example: beliefs }
        phrase: { type: string, example: fear of the lord }
        near: { type: array, items: { type: string }, example: ["fear of god","reverent awe"] }
        seeds:
          type: object
          properties:
            domains: { type: array, items: { type: string }, example: ["examplechurch.org","anotherchurch.com"] }
            urls:    { type: array, items: { type: string }, example: ["https://example.org/about/beliefs"] }
        opts:
          type: object
          properties:
            depth: { type: integer, default: 2 }
            delay_ms: { type: integer, default: 400 }
            concurrency: { type: integer, default: 8 }
    JobStatus:
      type: object
      properties:
        id: { type: string }
        status: { type: string, description: QUEUED|RUNNING|SUCCEEDED|FAILED|ABORTED|TIMED_OUT }
        startedAt: { type: ["string", "null"] }
        finishedAt: { type: ["string", "null"] }
        items: { type: integer, description: number of result rows }
        resultsId: { type: string, description: dataset/result identifier }
security:
  - Bearer: []
paths:
  /jobs:
    post:
      operationId: startJob
      summary: Start a Scrapy job
      security: [{ Bearer: [] }]
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/StartJobInput" }
      responses:
        "201":
          description: Job created
          content:
            application/json:
              schema: { $ref: "#/components/schemas/JobStatus" }

  /jobs/{id}:
    get:
      operationId: getJob
      summary: Get job status
      security: [{ Bearer: [] }]
      parameters:
        - in: path
          name: id
          required: true
          schema: { type: string }
      responses:
        "200":
          description: Status
          content:
            application/json:
              schema: { $ref: "#/components/schemas/JobStatus" }

  /jobs/{id}/results:
    get:
      operationId: getResults
      summary: Fetch results
      description: Returns results as JSON (default) or CSV text.
      security: [{ Bearer: [] }]
      parameters:
        - in: path
          name: id
          required: true
          schema: { type: string }
        - in: query
          name: format
          required: false
          schema: { type: string, enum: [json, csv], default: json }
      responses:
        "200":
          description: Items
          content:
            application/json:
              schema:
                type: array
                items: { type: object, additionalProperties: true }
            text/csv:
              schema: { type: string }

Minimal server (if you need a stub)

If you don’t already have an API, a tiny FastAPI service can wrap your Scrapy runner:

# app.py (FastAPI stub)
from fastapi import FastAPI, Header, HTTPException, Response
from pydantic import BaseModel
import uuid

app = FastAPI()
DB = {}  # in-memory demo store: job id -> job record

class StartJobInput(BaseModel):
    spider: str
    phrase: str
    near: list[str] = []
    seeds: dict
    opts: dict | None = None

@app.post("/api/jobs")
def start_job(inp: StartJobInput, Authorization: str = Header(None)):
    if not Authorization or not Authorization.startswith("Bearer "):
        raise HTTPException(401, "Unauthorized")
    jid = str(uuid.uuid4())
    DB[jid] = {"id": jid, "status": "RUNNING", "items": 0, "resultsId": jid}
    # TODO: trigger your Scrapy command asynchronously (e.g., subprocess)
    # e.g., subprocess.Popen([...]) and later update DB[jid]["status"] = "SUCCEEDED"
    return DB[jid]

@app.get("/api/jobs/{jid}")
def get_job(jid: str, Authorization: str = Header(None)):
    if jid not in DB: raise HTTPException(404, "Not found")
    return DB[jid]

@app.get("/api/jobs/{jid}/results")
def get_results(jid: str, format: str = "json", Authorization: str = Header(None)):
    if jid not in DB: raise HTTPException(404, "Not found")
    # TODO: return real results; this is a stub
    data = [{"host":"example.org","url":"https://example.org/x","EI":3}]
    return data if format=="json" else "host,url,EI\nexample.org,https://example.org/x,3\n"

(Deploy behind HTTPS, require a real token, and wire the Scrapy output to a file/DB that this service streams back.)
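
To take the stub past demo rows, the TODOs above are mostly glue: launch the spider in a background thread or worker, write its feed to a file keyed by the job ID, and flip the status when the process exits. Here's a rough sketch of that glue, assuming Scrapy is installed in the same container/venv and your spider accepts the phrase and seeds as -a arguments; adapt the command line and feed options to your project.

# runner.py - rough sketch of wiring the stub to a real Scrapy project (not part of the stub above)
import json
import subprocess
import threading
from pathlib import Path

RESULTS_DIR = Path("results")
RESULTS_DIR.mkdir(exist_ok=True)

def launch_crawl(db: dict, jid: str, inp) -> None:
    """Run the spider in a background thread and update the in-memory job record."""
    out_file = RESULTS_DIR / f"{jid}.json"

    def _run():
        cmd = [
            "scrapy", "crawl", inp.spider,
            "-a", f"phrase={inp.phrase}",
            "-a", f"seeds={json.dumps(inp.seeds)}",
            "-O", str(out_file),  # overwrite JSON feed export (Scrapy >= 2.1)
        ]
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0 and out_file.exists():
            rows = json.loads(out_file.read_text() or "[]")
            db[jid].update(status="SUCCEEDED", items=len(rows))
        else:
            db[jid].update(status="FAILED")

    db[jid]["status"] = "RUNNING"
    threading.Thread(target=_run, daemon=True).start()

With something like this in place, start_job would call launch_crawl(DB, jid, inp) instead of marking the job finished, and get_results would read results/{jid}.json instead of returning the demo row.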


2) Update your GPT “Instructions” to choose Apify or Scrapy

Paste this into the same Custom GPT’s Instructions (append to what you already added):

Backend selection logic

  • If the user wants fastest setup, cloud runs, or provides an Apify actor ID → use Apify (startRun / getRun / getDatasetItems).
  • If the user explicitly says use our server/on-prem, or mentions Scrapy → use Scrapy Orchestrator (startJob / getJob / getResults).
  • If they don’t care, prefer Apify for speed; offer Scrapy as an alternative for data custody/cost control.

Flow (both backends)

  1. Start: call startRun (Apify) or startJob (Scrapy) with:
    • seeds (URLs/domains), phrase, optional near-variants, crawl constraints (depth, same-domain).
  2. Wait: poll getRun(waitForFinish=120000) repeatedly or getJob until terminal status.
  3. Fetch: if success, call getDatasetItems or getResults (prefer JSON unless CSV requested).
  4. Report: compute and present: top hosts by EI, top pages, average EI. Offer CSV export.

Input shaping

  • If user gives domains, prepend https:// to make URLs.
  • If phrase is missing, ask once for it.
  • Default depth = 2, same-domain only.

Error handling

  • Terminal non-success → show status + suggest lower depth/smaller batch.
  • Empty dataset → suggest adding near-variants or expanding seeds.

3) Example payloads

Apify → startRun

{
  "actorId": "your-user~phrase-emphasis-crawler",
  "input": {
    "startUrls": [{"url":"https://examplechurch.org"}, {"url":"https://anotherchurch.com"}],
    "sameDomainOnly": true,
    "maxDepth": 2
  }
}

Scrapy → startJob

{
  "spider": "beliefs",
  "phrase": "fear of the lord",
  "near": ["fear of god","reverent awe"],
  "seeds": { "domains": ["examplechurch.org","anotherchurch.com"] },
  "opts": { "depth": 2, "delay_ms": 400, "concurrency": 8 }
}
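
If you're building these payloads in code rather than by hand, a small helper can turn a bare list of domains into either shape (prepending https:// and applying the defaults above). The function names below are illustrative only, not part of either API:

# shape_inputs.py - illustrative helpers for turning bare domains into the two payload shapes
def normalize_urls(domains: list[str]) -> list[str]:
    """Prepend https:// to anything that isn't already an absolute URL."""
    return [d if d.startswith(("http://", "https://")) else f"https://{d}" for d in domains]

def apify_payload(domains: list[str], max_depth: int = 2) -> dict:
    """Request body for the startRun action (matches the StartInput schema above)."""
    return {
        "actorId": "your-user~phrase-emphasis-crawler",  # placeholder Actor ID
        "input": {
            "startUrls": [{"url": u} for u in normalize_urls(domains)],
            "sameDomainOnly": True,
            "maxDepth": max_depth,
        },
    }

def scrapy_payload(domains: list[str], phrase: str, near: list[str] | None = None) -> dict:
    """Request body for the startJob endpoint (matches StartJobInput)."""
    return {
        "spider": "beliefs",
        "phrase": phrase,
        "near": near or [],
        "seeds": {"domains": domains},
        "opts": {"depth": 2, "delay_ms": 400, "concurrency": 8},
    }

# Example usage
print(apify_payload(["examplechurch.org", "anotherchurch.com"]))
print(scrapy_payload(["examplechurch.org", "anotherchurch.com"], "fear of the lord"))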

That’s it. Your Custom GPT now has both levers:

  • Apify for instant cloud crawls,
  • Scrapy for your own infrastructure—same UX in chat.

Finally, here’s the step-by-step “file build sheet” you can post so anyone can recreate all the files without downloading a ZIP.

This assumes you’re making a folder for the Scrapy Orchestrator and optionally saving the Postman collection separately.


📂 Folder structure

integration_bundle/
├── Phrase_Emphasis_Apify_Scrapy.postman_collection.json
└── scrapy-orchestrator/
    ├── app.py
    ├── requirements.txt
    ├── Dockerfile
    └── README.md

1) Create the integration folder

Open a terminal or file explorer and create:

mkdir integration_bundle
cd integration_bundle

2) Create the Postman collection file

File path:

integration_bundle/Phrase_Emphasis_Apify_Scrapy.postman_collection.json

How to paste:

  1. Open your code editor (VS Code, Sublime, Notepad++).
  2. Create a new file, paste in the full JSON from the conversation message that started with:
{
  "info": {
    "name": "Phrase Emphasis – Apify + Scrapy Orchestrator",

…and ends with:

  ]
}
  3. Save the file exactly as Phrase_Emphasis_Apify_Scrapy.postman_collection.json.

3) Create the scrapy-orchestrator folder

mkdir scrapy-orchestrator
cd scrapy-orchestrator

4) Create app.py

File path:

integration_bundle/scrapy-orchestrator/app.py

How to paste:

  1. Create a new file named app.py.
  2. Paste the full FastAPI app code from the conversation that begins with:
# FastAPI Scrapy Orchestrator (stub)
# - Starts a "job" (placeholder)...

…and ends with the last return rows line.
  3. Save the file.


5) Create requirements.txt

File path:

integration_bundle/scrapy-orchestrator/requirements.txt

Paste this exactly:

fastapi==0.111.0
uvicorn[standard]==0.30.0
pydantic==2.8.2

Save the file.


6) Create Dockerfile

File path:

integration_bundle/scrapy-orchestrator/Dockerfile

Paste this exactly:

# Dockerfile for Scrapy Orchestrator (FastAPI)
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 8000
CMD ["uvicorn","app:app","--host","0.0.0.0","--port","8000"]

Save the file.


7) Create README.md

File path:

integration_bundle/scrapy-orchestrator/README.md

Paste the README text from the conversation that starts with:

# Scrapy Orchestrator (FastAPI stub)

This container exposes...

…and ends with the instructions about wiring to a real Scrapy project.


8) Verify file structure

You should have:

integration_bundle/
├── Phrase_Emphasis_Apify_Scrapy.postman_collection.json
└── scrapy-orchestrator/
    ├── app.py
    ├── requirements.txt
    ├── Dockerfile
    └── README.md

9) Next steps (running)

To run the Orchestrator locally with Docker:

cd integration_bundle/scrapy-orchestrator
docker build -t scrapy-orchestrator .
docker run -p 8000:8000 scrapy-orchestrator

To test with curl:

TOKEN=devtoken
BASE=http://localhost:8000/api
curl -s -X POST "$BASE/jobs" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"spider":"beliefs","phrase":"fear of the lord","near":["fear of god","reverent awe"],"seeds":{"domains":["examplechurch.org","anotherchurch.com"]},"opts":{"depth":2,"delay_ms":400,"concurrency":8}}'

What’s inside:

  • Phrase_Emphasis_Apify_Scrapy.postman_collection.json — Postman requests for Apify and your Scrapy Orchestrator (start, poll, fetch in JSON/CSV).
  • scrapy-orchestrator/ (dockerized FastAPI stub):
    • Dockerfile
    • requirements.txt
    • app.py (API: /api/jobs, /api/jobs/{id}, /api/jobs/{id}/results)
    • README.md with run/test instructions

Quick start: Orchestrator (Docker)

cd scrapy-orchestrator
docker build -t scrapy-orchestrator .
docker run -p 8000:8000 scrapy-orchestrator

Test with curl:

TOKEN=devtoken
BASE=http://localhost:8000/api

# 1) Start job
curl -s -X POST "$BASE/jobs" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{
    "spider": "beliefs",
    "phrase": "fear of the lord",
    "near": ["fear of god","reverent awe"],
    "seeds": {"domains": ["examplechurch.org","anotherchurch.com"]},
    "opts": {"depth": 2, "delay_ms": 400, "concurrency": 8}
  }'

# 2) Poll status (replace JOB_ID)
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/jobs/JOB_ID"

# 3) Fetch results
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/jobs/JOB_ID/results?format=json"
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/jobs/JOB_ID/results?format=csv"

The stub returns demo rows for now; to wire in your real Scrapy project, open app.py and replace the stubbed job handling with a subprocess/queue call to your spider, then load the real output in get_results.
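
Here's the same start → poll → fetch sequence against the orchestrator from Python, for parity with the Apify sketch earlier (assuming the requests package and the devtoken used above):

# orchestrator_flow.py - minimal sketch of start -> poll -> fetch against the stub API
import time

import requests

BASE = "http://localhost:8000/api"
HEADERS = {"Authorization": "Bearer devtoken"}  # the stub only checks the Bearer prefix

# 1) Start the job
job = requests.post(f"{BASE}/jobs", headers=HEADERS, json={
    "spider": "beliefs",
    "phrase": "fear of the lord",
    "near": ["fear of god", "reverent awe"],
    "seeds": {"domains": ["examplechurch.org", "anotherchurch.com"]},
    "opts": {"depth": 2, "delay_ms": 400, "concurrency": 8},
}).json()

# 2) Poll until a terminal status
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED_OUT"}
while job["status"] not in TERMINAL:
    time.sleep(5)
    job = requests.get(f"{BASE}/jobs/{job['id']}", headers=HEADERS).json()

# 3) Fetch results (JSON here; pass format=csv for CSV text)
rows = requests.get(f"{BASE}/jobs/{job['id']}/results", headers=HEADERS, params={"format": "json"}).json()
print(len(rows), "rows")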


Quick start: Apify (curl)

# 1) Start run (the request body is passed to the Actor as INPUT;
#    run options such as memory/timeout go in the query string)
curl -s -X POST "https://api.apify.com/v2/acts/your-user~phrase-emphasis-crawler/runs?memory=2048&timeout=1800" \
  -H "Authorization: Bearer $APIFY_TOKEN" -H "Content-Type: application/json" \
  -d '{
    "startUrls": [
      {"url":"https://examplechurch.org"},
      {"url":"https://anotherchurch.com"}
    ],
    "sameDomainOnly": true,
    "maxDepth": 2
  }'

# 2) Wait for finish (120s window)
curl -s "https://api.apify.com/v2/actor-runs/RUN_ID?waitForFinish=120000" \
  -H "Authorization: Bearer $APIFY_TOKEN"

# 3) Get dataset items
curl -s "https://api.apify.com/v2/datasets/DATASET_ID/items?format=json&clean=true" \
  -H "Authorization: Bearer $APIFY_TOKEN"

# or CSV
curl -s "https://api/apify.com/v2/datasets/DATASET_ID/items?format=csv&clean=true" \
  -H "Authorization: Bearer $APIFY_TOKEN"

Postman collection

Import Phrase_Emphasis_Apify_Scrapy.postman_collection.json, set these variables:

  • APIFY_TOKEN, APIFY_ACTOR_ID, RUN_ID, DATASET_ID
  • SCRAPY_BASE (e.g., http://localhost:8000/api), SCRAPY_TOKEN, JOB_ID

Sequence is prebuilt:

  1) Start → 2) Get status → 3) Get results (JSON or CSV).