Custom GPT Crawler

Start your crawl, wait for it to finish, fetch results, and summarize—end to end.

Below you’ll get two drop-ins:

  1. OpenAPI schema (paste into the GPT Builder → Configure → Actions → Add Action).
  2. Instruction block (paste into the GPT’s Instructions so it knows when/how to call each action).

I’m using Apify’s REST API for: start run → check status (optionally wait) → export dataset. (Endpoints documented by Apify; GPT Actions setup documented by OpenAI.)


1) OpenAPI schema (ready to paste)

In GPT Builder: Configure → Actions → Add Action → Import from URL / Paste schema.

openapi: 3.1.0
info:
  title: Phrase Emphasis Crawler (Apify)
  version: "1.0.0"
  description: >
    Start an Apify Actor run with a list of start URLs and a phrase, poll for completion,
    and fetch the resulting dataset (CSV or JSON).
servers:
  - url: https://api.apify.com/v2
components:
  securitySchemes:
    ApifyToken:
      type: apiKey
      in: header
      name: Authorization
      description: Use format "Bearer {APIFY_API_TOKEN}"
  schemas:
    StartInput:
      type: object
      required: [actorId, input]
      properties:
        actorId:
          type: string
          description: Actor ID or "username~actor-name"
          example: your-user~phrase-emphasis-crawler
        input:
          type: object
          description: JSON payload passed as INPUT to the Actor (e.g., startUrls, sameDomainOnly, maxDepth)
          example:
            startUrls: [{ url: "https://examplechurch.org" }]
            sameDomainOnly: true
            maxDepth: 2
        options:
          type: object
          description: Optional run options (memory, timeout, build, etc.)
          example: { memory: 2048, timeoutSecs: 1800 }
    RunStatus:
      type: object
      properties:
        id: { type: string }
        status: { type: string, description: CREATED|RUNNING|SUCCEEDED|FAILED|ABORTED|TIMED-OUT }
        defaultDatasetId: { type: string }
        defaultKeyValueStoreId: { type: string }
        finishedAt: { type: ["string", "null"] }
    DatasetItems:
      type: array
      items:
        type: object
        additionalProperties: true
security:
  - ApifyToken: []
paths:
  /acts/{actorId}/runs:
    post:
      operationId: startRun
      summary: Start an Actor run (returns immediately)
      description: |
        Runs an Actor and returns the new run object. The POST body is passed
        to the Actor as INPUT. Use `getRun` (optionally with waitForFinish) to await completion.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: actorId
          required: true
          schema: { type: string }
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/StartInput" }
      responses:
        "201":
          description: Run created
          content:
            application/json:
              schema:
                type: object
                properties:
                  data: { $ref: "#/components/schemas/RunStatus" }
        "4XX":
          description: Client error
        "5XX":
          description: Server error

  /actor-runs/{runId}:
    get:
      operationId: getRun
      summary: Get run status (optionally wait for finish)
      description: |
        Returns run details. Use query param `waitForFinish` (0..120000) to block
        until the run finishes (ms). If omitted, returns current status immediately.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: runId
          required: true
          schema: { type: string }
        - in: query
          name: waitForFinish
          required: false
          schema: { type: integer, minimum: 0, maximum: 120000 }
      responses:
        "200":
          description: Run details
          content:
            application/json:
              schema:
                type: object
                properties:
                  data: { $ref: "#/components/schemas/RunStatus" }

  /datasets/{datasetId}/items:
    get:
      operationId: getDatasetItems
      summary: Fetch dataset items
      description: |
        Returns dataset items. Use `format=json` for JSON (default) or `format=csv` to get CSV text.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: datasetId
          required: true
          schema: { type: string }
        - in: query
          name: format
          required: false
          schema: { type: string, enum: [json, csv], default: json }
        - in: query
          name: clean
          required: false
          schema: { type: boolean, default: true }
      responses:
        "200":
          description: Items
          content:
            application/json:
              schema: { $ref: "#/components/schemas/DatasetItems" }
            text/csv:
              schema: { type: string }

Why these endpoints?

  • Start run: POST /v2/acts/{actorId}/runs starts an Actor.
  • Check run: GET /v2/actor-runs/{runId} supports waitForFinish so you can block instead of polling.
  • Get results: GET /v2/datasets/{datasetId}/items?format=csv|json exports your run’s dataset.

Auth in GPT Builder: set Authorization = Bearer ${APIFY_API_TOKEN} (store the token as a secret in the Action form).
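
If you want to sanity-check the flow outside the GPT Builder first, here is a minimal Python sketch of the same start → wait → fetch sequence. It assumes the requests package, an APIFY_TOKEN environment variable, and a placeholder Actor ID; adjust the input payload to whatever your Actor expects.

# apify_flow.py - minimal sketch: start a run, poll to a terminal status, fetch the dataset
import os
import time

import requests

BASE = "https://api.apify.com/v2"
HEADERS = {"Authorization": f"Bearer {os.environ['APIFY_TOKEN']}"}
ACTOR_ID = "your-user~phrase-emphasis-crawler"  # placeholder Actor ID

# 1) Start the run; the POST body is passed to the Actor as its INPUT.
run = requests.post(
    f"{BASE}/acts/{ACTOR_ID}/runs",
    headers=HEADERS,
    json={
        "startUrls": [{"url": "https://examplechurch.org"}],
        "sameDomainOnly": True,
        "maxDepth": 2,
    },
).json()["data"]

# 2) Poll until the run reaches a terminal state. (The waitForFinish query parameter
#    can cut down round trips; check Apify's docs for its exact limits before relying on it.)
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}
while run["status"] not in TERMINAL:
    time.sleep(10)
    run = requests.get(f"{BASE}/actor-runs/{run['id']}", headers=HEADERS).json()["data"]
    print("status:", run["status"])

# 3) On success, export the default dataset as JSON items.
if run["status"] == "SUCCEEDED":
    items = requests.get(
        f"{BASE}/datasets/{run['defaultDatasetId']}/items",
        headers=HEADERS,
        params={"format": "json", "clean": "true"},
    ).json()
    print(len(items), "items")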


2) Custom GPT “Instructions” (ready to paste)

In GPT Builder: Configure → Instructions. Replace or append the block below.

Role & Goal
You are an operations assistant that runs a web crawl via Apify, waits for it to finish, then returns a concise report with an Emphasis Index summary for the chosen phrase.

When to use Actions

  • When the user says “scan”, “crawl”, “run”, or provides URLs/phrase → call startRun.
  • Immediately after startRun, call getRun with waitForFinish=120000 (120s).
    • If status is not a terminal state, keep calling getRun with waitForFinish=120000 until status ∈ {SUCCEEDED, FAILED, ABORTED, TIMED-OUT}.
  • If SUCCEEDED and defaultDatasetId is present → call getDatasetItems (prefer format=json unless the user asks for CSV).
  • Then summarize: top hosts by EI, pages with highest EI, average EI, and any notable evidence fields if present.

Required inputs for startRun

  • actorId: the Apify Actor ID (e.g., username~phrase-emphasis-crawler).
  • input: at minimum startUrls (array of {url}), plus any flags (e.g., sameDomainOnly, maxDepth). If user only supplies a list of domains, convert them to URLs with https:// prefix.

Error handling

  • If startRun or getRun fails → report the error message and suggest reducing depth or seed size.
  • If final status is not SUCCEEDED → show the final status and logs link if available (from run object), and stop.
  • If dataset is empty → say so plainly and suggest adding NEAR variants or different LIKELY paths.

Output format to the user

  1. A short status line (run ID, status, elapsed time).
  2. A compact table (host, pages scanned, top 5 URLs by EI).
  3. A one-paragraph interpretation (what the EI distribution means).
  4. Offer to export CSV (call getDatasetItems?format=csv) if requested.

Behavioral rules

  • Never invent data. Only claim what the dataset contains.
  • Keep the report lean: avoid repeating raw items when there are many—summarize and provide counts.
  • If the user changes the phrase or seeds mid-flow, start a new run.
  • Be explicit with units (seconds, pages) and numbers.

How to fill the Action’s auth

  • In the Action editor, set Security → API Key in header with name Authorization.
  • Value: Bearer YOUR_APIFY_API_TOKEN. (Find your token in Apify Console → Integrations → API token.)

Where these instructions come from

  • GPT Actions setup & builder: OpenAI’s docs explain where to paste schema, how to set auth, and best practices for tool use.
  • Apify endpoints: Official Apify docs for starting runs, waiting for finish, and exporting dataset items.

Next, here’s the two-backend setup so your Custom GPT can run either Apify or your Scrapy stack, with clean switching.


1) Add a second Action: Scrapy Orchestrator API

Paste this OpenAPI schema as a new Action in the GPT Builder (Configure → Actions → Add Action).

openapi: 3.1.0
info:
  title: Scrapy Orchestrator
  version: "1.0.0"
  description: Start a Scrapy crawl on your infra, poll status, and fetch results (JSON or CSV).
servers:
  - url: https://scrapy.example.com/api
components:
  securitySchemes:
    Bearer:
      type: apiKey
      in: header
      name: Authorization
      description: Use "Bearer {YOUR_TOKEN}"
  schemas:
    StartJobInput:
      type: object
      required: [spider, seeds, phrase]
      properties:
        spider: { type: string, example: beliefs }
        phrase: { type: string, example: fear of the lord }
        near: { type: array, items: { type: string }, example: ["fear of god","reverent awe"] }
        seeds:
          type: object
          properties:
            domains: { type: array, items: { type: string }, example: ["examplechurch.org","anotherchurch.com"] }
            urls:    { type: array, items: { type: string }, example: ["https://example.org/about/beliefs"] }
        opts:
          type: object
          properties:
            depth: { type: integer, default: 2 }
            delay_ms: { type: integer, default: 400 }
            concurrency: { type: integer, default: 8 }
    JobStatus:
      type: object
      properties:
        id: { type: string }
        status: { type: string, description: QUEUED|RUNNING|SUCCEEDED|FAILED|ABORTED|TIMED_OUT }
        startedAt: { type: ["string", "null"] }
        finishedAt: { type: ["string", "null"] }
        items: { type: integer, description: number of result rows }
        resultsId: { type: string, description: dataset/result identifier }
security:
  - Bearer: []
paths:
  /jobs:
    post:
      operationId: startJob
      summary: Start a Scrapy job
      security: [{ Bearer: [] }]
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/StartJobInput" }
      responses:
        "201":
          description: Job created
          content:
            application/json:
              schema: { $ref: "#/components/schemas/JobStatus" }

  /jobs/{id}:
    get:
      operationId: getJob
      summary: Get job status
      security: [{ Bearer: [] }]
      parameters:
        - in: path
          name: id
          required: true
          schema: { type: string }
      responses:
        "200":
          description: Status
          content:
            application/json:
              schema: { $ref: "#/components/schemas/JobStatus" }

  /jobs/{id}/results:
    get:
      operationId: getResults
      summary: Fetch results
      description: Returns results as JSON (default) or CSV text.
      security: [{ Bearer: [] }]
      parameters:
        - in: path
          name: id
          required: true
          schema: { type: string }
        - in: query
          name: format
          required: false
          schema: { type: string, enum: [json, csv], default: json }
      responses:
        "200":
          description: Items
          content:
            application/json:
              schema:
                type: array
                items: { type: object, additionalProperties: true }
            text/csv:
              schema: { type: string }

Minimal server (if you need a stub)

If you don’t already have an API, a tiny FastAPI service can wrap your Scrapy runner:

# app.py (FastAPI stub)
from fastapi import FastAPI, Header, HTTPException, Response
from pydantic import BaseModel
import uuid

app = FastAPI()
DB = {}  # in-memory demo store: job id -> job record

class StartJobInput(BaseModel):
    spider: str
    phrase: str
    near: list[str] = []
    seeds: dict
    opts: dict | None = None

@app.post("/api/jobs")
def start_job(inp: StartJobInput, Authorization: str = Header(None)):
    if not Authorization or not Authorization.startswith("Bearer "):
        raise HTTPException(401, "Unauthorized")
    jid = str(uuid.uuid4())
    DB[jid] = {"id": jid, "status": "RUNNING", "items": 0, "resultsId": jid}
    # TODO: trigger your Scrapy command asynchronously (e.g., subprocess)
    # e.g., subprocess.Popen([...]) and later update DB[jid]["status"] = "SUCCEEDED"
    return DB[jid]

@app.get("/api/jobs/{jid}")
def get_job(jid: str, Authorization: str = Header(None)):
    if jid not in DB: raise HTTPException(404, "Not found")
    return DB[jid]

@app.get("/api/jobs/{jid}/results")
def get_results(jid: str, format: str = "json", Authorization: str = Header(None)):
    if jid not in DB: raise HTTPException(404, "Not found")
    # TODO: return real results; this is a stub
    data = [{"host":"example.org","url":"https://example.org/x","EI":3}]
    return data if format=="json" else "host,url,EI\nexample.org,https://example.org/x,3\n"

(Deploy behind HTTPS, require a real token, and wire the Scrapy output to a file/DB that this service streams back.)
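
To take the stub past demo rows, the TODOs above are mostly glue: launch the spider in a background thread or worker, write its feed to a file keyed by the job ID, and flip the status when the process exits. Here's a rough sketch of that glue, assuming Scrapy is installed in the same container/venv and your spider accepts the phrase and seeds as -a arguments; adapt the command line and feed options to your project.

# runner.py - rough sketch of wiring the stub to a real Scrapy project (not part of the stub above)
import json
import subprocess
import threading
from pathlib import Path

RESULTS_DIR = Path("results")
RESULTS_DIR.mkdir(exist_ok=True)

def launch_crawl(db: dict, jid: str, inp) -> None:
    """Run the spider in a background thread and update the in-memory job record."""
    out_file = RESULTS_DIR / f"{jid}.json"

    def _run():
        cmd = [
            "scrapy", "crawl", inp.spider,
            "-a", f"phrase={inp.phrase}",
            "-a", f"seeds={json.dumps(inp.seeds)}",
            "-O", str(out_file),  # overwrite JSON feed export (Scrapy >= 2.1)
        ]
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0 and out_file.exists():
            rows = json.loads(out_file.read_text() or "[]")
            db[jid].update(status="SUCCEEDED", items=len(rows))
        else:
            db[jid].update(status="FAILED")

    db[jid]["status"] = "RUNNING"
    threading.Thread(target=_run, daemon=True).start()

With something like this in place, start_job would call launch_crawl(DB, jid, inp) instead of marking the job finished, and get_results would read results/{jid}.json instead of returning the demo row.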


2) Update your GPT “Instructions” to choose Apify or Scrapy

Paste this into the same Custom GPT’s Instructions (append to what you already added):

Backend selection logic

  • If the user wants fastest setup, cloud runs, or provides an Apify actor ID → use Apify (startRun / getRun / getDatasetItems).
  • If the user explicitly says use our server/on-prem, or mentions Scrapy → use Scrapy Orchestrator (startJob / getJob / getResults).
  • If they don’t care, prefer Apify for speed; offer Scrapy as an alternative for data custody/cost control.

Flow (both backends)

  1. Start: call startRun (Apify) or startJob (Scrapy) with:
    • seeds (URLs/domains), phrase, optional near-variants, crawl constraints (depth, same-domain).
  2. Wait: poll getRun(waitForFinish=120000) repeatedly or getJob until terminal status.
  3. Fetch: if success, call getDatasetItems or getResults (prefer JSON unless CSV requested).
  4. Report: compute and present: top hosts by EI, top pages, average EI. Offer CSV export.

Input shaping

  • If user gives domains, prepend https:// to make URLs.
  • If phrase is missing, ask once for it.
  • Default depth = 2, same-domain only.

Error handling

  • Terminal non-success → show status + suggest lower depth/smaller batch.
  • Empty dataset → suggest adding near-variants or expanding seeds.

3) Example payloads

Apify → startRun

{
  "actorId": "your-user~phrase-emphasis-crawler",
  "input": {
    "startUrls": [{"url":"https://examplechurch.org"}, {"url":"https://anotherchurch.com"}],
    "sameDomainOnly": true,
    "maxDepth": 2
  }
}

Scrapy → startJob

{
  "spider": "beliefs",
  "phrase": "fear of the lord",
  "near": ["fear of god","reverent awe"],
  "seeds": { "domains": ["examplechurch.org","anotherchurch.com"] },
  "opts": { "depth": 2, "delay_ms": 400, "concurrency": 8 }
}
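
If you're building these payloads in code rather than by hand, a small helper can turn a bare list of domains into either shape (prepending https:// and applying the defaults above). The function names below are illustrative only, not part of either API:

# shape_inputs.py - illustrative helpers for turning bare domains into the two payload shapes
def normalize_urls(domains: list[str]) -> list[str]:
    """Prepend https:// to anything that isn't already an absolute URL."""
    return [d if d.startswith(("http://", "https://")) else f"https://{d}" for d in domains]

def apify_payload(domains: list[str], max_depth: int = 2) -> dict:
    """Request body for the startRun action (matches the StartInput schema above)."""
    return {
        "actorId": "your-user~phrase-emphasis-crawler",  # placeholder Actor ID
        "input": {
            "startUrls": [{"url": u} for u in normalize_urls(domains)],
            "sameDomainOnly": True,
            "maxDepth": max_depth,
        },
    }

def scrapy_payload(domains: list[str], phrase: str, near: list[str] | None = None) -> dict:
    """Request body for the startJob endpoint (matches StartJobInput)."""
    return {
        "spider": "beliefs",
        "phrase": phrase,
        "near": near or [],
        "seeds": {"domains": domains},
        "opts": {"depth": 2, "delay_ms": 400, "concurrency": 8},
    }

# Example usage
print(apify_payload(["examplechurch.org", "anotherchurch.com"]))
print(scrapy_payload(["examplechurch.org", "anotherchurch.com"], "fear of the lord"))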

That’s it. Your Custom GPT now has both levers:

  • Apify for instant cloud crawls,
  • Scrapy for your own infrastructure—same UX in chat.

Finally, here’s the step-by-step “file build sheet” you can post so anyone can recreate all the files without downloading a ZIP.

This assumes you’re making a folder for the Scrapy Orchestrator and optionally saving the Postman collection separately.


📂 Folder structure

integration_bundle/
├── Phrase_Emphasis_Apify_Scrapy.postman_collection.json
└── scrapy-orchestrator/
    ├── app.py
    ├── requirements.txt
    ├── Dockerfile
    └── README.md

1) Create the integration folder

Open a terminal or file explorer and create:

mkdir integration_bundle
cd integration_bundle

2) Create the Postman collection file

File path:

integration_bundle/Phrase_Emphasis_Apify_Scrapy.postman_collection.json

How to paste:

  1. Open your code editor (VS Code, Sublime, Notepad++).
  2. Create a new file, paste in the full JSON from the conversation message that started with:
{
  "info": {
    "name": "Phrase Emphasis – Apify + Scrapy Orchestrator",

…and ends with:

  ]
}
  3. Save the file exactly as Phrase_Emphasis_Apify_Scrapy.postman_collection.json.

3) Create the scrapy-orchestrator folder

mkdir scrapy-orchestrator
cd scrapy-orchestrator

4) Create app.py

File path:

integration_bundle/scrapy-orchestrator/app.py

How to paste:

  1. Create a new file named app.py.
  2. Paste the full FastAPI app code from the conversation that begins with:
# FastAPI Scrapy Orchestrator (stub)
# - Starts a "job" (placeholder)...

…and ends with the last return rows line.
  3. Save the file.


5) Create requirements.txt

File path:

integration_bundle/scrapy-orchestrator/requirements.txt

Paste this exactly:

fastapi==0.111.0
uvicorn[standard]==0.30.0
pydantic==2.8.2

Save the file.


6) Create Dockerfile

File path:

integration_bundle/scrapy-orchestrator/Dockerfile

Paste this exactly:

# Dockerfile for Scrapy Orchestrator (FastAPI)
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 8000
CMD ["uvicorn","app:app","--host","0.0.0.0","--port","8000"]

Save the file.


7) Create README.md

File path:

integration_bundle/scrapy-orchestrator/README.md

Paste the README text from the conversation that starts with:

# Scrapy Orchestrator (FastAPI stub)

This container exposes...

…and ends with the instructions about wiring to a real Scrapy project.


8) Verify file structure

You should have:

integration_bundle/
├── Phrase_Emphasis_Apify_Scrapy.postman_collection.json
└── scrapy-orchestrator/
    ├── app.py
    ├── requirements.txt
    ├── Dockerfile
    └── README.md

9) Next steps (running)

To run the Orchestrator locally with Docker:

cd integration_bundle/scrapy-orchestrator
docker build -t scrapy-orchestrator .
docker run -p 8000:8000 scrapy-orchestrator

To test with curl:

TOKEN=devtoken
BASE=http://localhost:8000/api
curl -s -X POST "$BASE/jobs" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"spider":"beliefs","phrase":"fear of the lord","near":["fear of god","reverent awe"],"seeds":{"domains":["examplechurch.org","anotherchurch.com"]},"opts":{"depth":2,"delay_ms":400,"concurrency":8}}'

What’s inside:

  • Phrase_Emphasis_Apify_Scrapy.postman_collection.json — Postman requests for Apify and your Scrapy Orchestrator (start, poll, fetch in JSON/CSV).
  • scrapy-orchestrator/ (dockerized FastAPI stub):
    • Dockerfile
    • requirements.txt
    • app.py (API: /api/jobs, /api/jobs/{id}, /api/jobs/{id}/results)
    • README.md with run/test instructions

Quick start: Orchestrator (Docker)

cd scrapy-orchestrator
docker build -t scrapy-orchestrator .
docker run -p 8000:8000 scrapy-orchestrator

Test with curl:

TOKEN=devtoken
BASE=http://localhost:8000/api

# 1) Start job
curl -s -X POST "$BASE/jobs" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{
    "spider": "beliefs",
    "phrase": "fear of the lord",
    "near": ["fear of god","reverent awe"],
    "seeds": {"domains": ["examplechurch.org","anotherchurch.com"]},
    "opts": {"depth": 2, "delay_ms": 400, "concurrency": 8}
  }'

# 2) Poll status (replace JOB_ID)
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/jobs/JOB_ID"

# 3) Fetch results
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/jobs/JOB_ID/results?format=json"
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/jobs/JOB_ID/results?format=csv"

The stub returns demo rows for now; to wire in your real Scrapy project, open app.py and replace the stubbed job handling with a subprocess/queue call to your spider, then load the real output in get_results.
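
Here's the same start → poll → fetch sequence against the orchestrator from Python, for parity with the Apify sketch earlier (assuming the requests package and the devtoken used above):

# orchestrator_flow.py - minimal sketch of start -> poll -> fetch against the stub API
import time

import requests

BASE = "http://localhost:8000/api"
HEADERS = {"Authorization": "Bearer devtoken"}  # the stub only checks the Bearer prefix

# 1) Start the job
job = requests.post(f"{BASE}/jobs", headers=HEADERS, json={
    "spider": "beliefs",
    "phrase": "fear of the lord",
    "near": ["fear of god", "reverent awe"],
    "seeds": {"domains": ["examplechurch.org", "anotherchurch.com"]},
    "opts": {"depth": 2, "delay_ms": 400, "concurrency": 8},
}).json()

# 2) Poll until a terminal status
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED_OUT"}
while job["status"] not in TERMINAL:
    time.sleep(5)
    job = requests.get(f"{BASE}/jobs/{job['id']}", headers=HEADERS).json()

# 3) Fetch results (JSON here; pass format=csv for CSV text)
rows = requests.get(f"{BASE}/jobs/{job['id']}/results", headers=HEADERS, params={"format": "json"}).json()
print(len(rows), "rows")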


Quick start: Apify (curl)

# 1) Start run (the request body is passed to the Actor as INPUT;
#    run options such as memory/timeout go in the query string)
curl -s -X POST "https://api.apify.com/v2/acts/your-user~phrase-emphasis-crawler/runs?memory=2048&timeout=1800" \
  -H "Authorization: Bearer $APIFY_TOKEN" -H "Content-Type: application/json" \
  -d '{
    "startUrls": [
      {"url":"https://examplechurch.org"},
      {"url":"https://anotherchurch.com"}
    ],
    "sameDomainOnly": true,
    "maxDepth": 2
  }'

# 2) Wait for finish (120s window)
curl -s "https://api.apify.com/v2/actor-runs/RUN_ID?waitForFinish=120000" \
  -H "Authorization: Bearer $APIFY_TOKEN"

# 3) Get dataset items
curl -s "https://api.apify.com/v2/datasets/DATASET_ID/items?format=json&clean=true" \
  -H "Authorization: Bearer $APIFY_TOKEN"

# or CSV
curl -s "https://api/apify.com/v2/datasets/DATASET_ID/items?format=csv&clean=true" \
  -H "Authorization: Bearer $APIFY_TOKEN"

Postman collection

Import Phrase_Emphasis_Apify_Scrapy.postman_collection.json, set these variables:

  • APIFY_TOKEN, APIFY_ACTOR_ID, RUN_ID, DATASET_ID
  • SCRAPY_BASE (e.g., http://localhost:8000/api), SCRAPY_TOKEN, JOB_ID

Sequence is prebuilt:

  1) Start → 2) Get status → 3) Get results (JSON or CSV).