Start your crawl, wait for it to finish, fetch results, and summarize—end to end.
Below you’ll get two drop-ins:
- OpenAPI schema (paste into the GPT Builder → Configure → Actions → Add Action).
- Instruction block (paste into the GPT’s Instructions so it knows when/how to call each action).
I’m using Apify’s REST API for: start run → check status (optionally wait) → export dataset. (Endpoints documented by Apify; GPT Actions setup documented by OpenAI.)
1) OpenAPI schema (ready to paste)
In GPT Builder: Configure → Actions → Add Action → Import from URL / Paste schema.
openapi: 3.1.0
info:
  title: Phrase Emphasis Crawler (Apify)
  version: "1.0.0"
  description: >
    Start an Apify Actor run with a list of start URLs and a phrase, poll for completion,
    and fetch the resulting dataset (CSV or JSON).
servers:
  - url: https://api.apify.com/v2
components:
  securitySchemes:
    ApifyToken:
      type: apiKey
      in: header
      name: Authorization
      description: Use format "Bearer {APIFY_API_TOKEN}"
  schemas:
    StartInput:
      type: object
      description: >
        JSON payload passed directly to the Actor as INPUT (e.g., startUrls,
        sameDomainOnly, maxDepth). Run options such as memory and timeout are
        query parameters on the run endpoint, not part of this body.
      additionalProperties: true
      example:
        startUrls: [{ url: "https://examplechurch.org" }]
        sameDomainOnly: true
        maxDepth: 2
    RunStatus:
      type: object
      properties:
        id: { type: string }
        status:
          type: string
          description: "READY|RUNNING|SUCCEEDED|FAILED|ABORTING|ABORTED|TIMING-OUT|TIMED-OUT"
        defaultDatasetId: { type: string }
        defaultKeyValueStoreId: { type: string }
        finishedAt: { type: ["string", "null"] }
    DatasetItems:
      type: array
      items:
        type: object
        additionalProperties: true
security:
  - ApifyToken: []
paths:
  /acts/{actorId}/runs:
    post:
      operationId: startRun
      summary: Start an Actor run (returns immediately)
      description: |
        Runs an Actor and returns the new run object. The POST body is passed
        directly to the Actor as INPUT; run options such as memory and timeout
        are query parameters. Use `getRun` (optionally with waitForFinish) to
        await completion.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: actorId
          required: true
          description: Actor ID or "username~actor-name" (e.g., your-user~phrase-emphasis-crawler)
          schema: { type: string }
        - in: query
          name: memory
          required: false
          description: Memory limit in megabytes (e.g., 2048)
          schema: { type: integer }
        - in: query
          name: timeout
          required: false
          description: Run timeout in seconds (e.g., 1800)
          schema: { type: integer }
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/StartInput" }
      responses:
        "201":
          description: Run created
          content:
            application/json:
              schema:
                type: object
                properties:
                  data: { $ref: "#/components/schemas/RunStatus" }
        "4XX":
          description: Client error
        "5XX":
          description: Server error
  /actor-runs/{runId}:
    get:
      operationId: getRun
      summary: Get run status (optionally wait for finish)
      description: |
        Returns run details. Use query param `waitForFinish` (seconds, 0-60) to
        block until the run finishes. If omitted, the current status is returned
        immediately; call again if the run has not yet reached a terminal state.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: runId
          required: true
          schema: { type: string }
        - in: query
          name: waitForFinish
          required: false
          schema: { type: integer, minimum: 0, maximum: 60 }
      responses:
        "200":
          description: Run details
          content:
            application/json:
              schema:
                type: object
                properties:
                  data: { $ref: "#/components/schemas/RunStatus" }
  /datasets/{datasetId}/items:
    get:
      operationId: getDatasetItems
      summary: Fetch dataset items
      description: |
        Returns dataset items. Use `format=json` for JSON (default) or `format=csv` to get CSV text.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: datasetId
          required: true
          schema: { type: string }
        - in: query
          name: format
          required: false
          schema: { type: string, enum: [json, csv], default: json }
        - in: query
          name: clean
          required: false
          schema: { type: boolean, default: true }
      responses:
        "200":
          description: Items
          content:
            application/json:
              schema: { $ref: "#/components/schemas/DatasetItems" }
            text/csv:
              schema: { type: string }
Why these endpoints?
- Start run: `POST /v2/acts/{actorId}/runs` starts an Actor.
- Check run: `GET /v2/actor-runs/{runId}` supports `waitForFinish` so you can block instead of polling.
- Get results: `GET /v2/datasets/{datasetId}/items?format=csv|json` exports your run’s dataset.
Auth in GPT Builder: set Authorization = Bearer ${APIFY_API_TOKEN} (store the token as a secret in the Action form).
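For orientation, here is the same start → wait → fetch loop as a minimal Python sketch (not part of the Action itself). It assumes the `requests` library, an `APIFY_API_TOKEN` environment variable, and a placeholder Actor ID; the input fields mirror the StartInput example above.

import os
import requests

API = "https://api.apify.com/v2"
HEADERS = {"Authorization": f"Bearer {os.environ['APIFY_API_TOKEN']}"}
ACTOR = "your-user~phrase-emphasis-crawler"  # placeholder Actor ID

# 1) Start the run; the POST body is the Actor INPUT itself.
run = requests.post(
    f"{API}/acts/{ACTOR}/runs",
    headers=HEADERS,
    json={
        "startUrls": [{"url": "https://examplechurch.org"}],
        "sameDomainOnly": True,
        "maxDepth": 2,
    },
).json()["data"]

# 2) Block in up-to-60-second windows until a terminal status.
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}
while run["status"] not in TERMINAL:
    run = requests.get(
        f"{API}/actor-runs/{run['id']}",
        headers=HEADERS,
        params={"waitForFinish": 60},
    ).json()["data"]

# 3) Export the dataset if the run succeeded.
if run["status"] == "SUCCEEDED":
    items = requests.get(
        f"{API}/datasets/{run['defaultDatasetId']}/items",
        headers=HEADERS,
        params={"format": "json", "clean": "true"},
    ).json()
    print(f"Fetched {len(items)} items")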
2) Custom GPT “Instructions” (ready to paste)
In GPT Builder: Configure → Instructions. Replace or append the block below.
Role & Goal
You are an operations assistant that runs a web crawl via Apify, waits for it to finish, then returns a concise report with an Emphasis Index summary for the chosen phrase.
When to use Actions
- When the user says “scan”, “crawl”, “run”, or provides URLs/phrase → call `startRun`.
- Immediately after `startRun`, call `getRun` with `waitForFinish=60` (60 s, the API maximum).
- If the status is not terminal, keep calling `getRun` with `waitForFinish=60` until status ∈ {SUCCEEDED, FAILED, ABORTED, TIMED-OUT}.
- If SUCCEEDED and `defaultDatasetId` is present → call `getDatasetItems` (prefer `format=json` unless the user asks for CSV).
- Then summarize: top hosts by EI, pages with highest EI, average EI, and any notable evidence fields if present.
Required inputs for startRun
- `actorId` (path parameter): the Apify Actor ID (e.g., `username~phrase-emphasis-crawler`).
- Request body (the Actor INPUT): at minimum `startUrls` (an array of `{url}` objects), plus any flags (e.g., `sameDomainOnly`, `maxDepth`). If the user only supplies a list of domains, convert them to URLs with an `https://` prefix.
Error handling
- If `startRun` or `getRun` fails → report the error message and suggest reducing depth or seed size.
- If the final status is not SUCCEEDED → show the final status and a logs link if available (from the run object), then stop.
- If the dataset is empty → say so plainly and suggest adding NEAR variants or different LIKELY paths.
Output format to the user
- A short status line (run ID, status, elapsed time).
- A compact table (host, pages scanned, top 5 URLs by EI).
- A one-paragraph interpretation (what the EI distribution means).
- Offer to export CSV (call `getDatasetItems` with `format=csv`) if requested.
Behavioral rules
- Never invent data. Only claim what the dataset contains.
- Keep the report lean: avoid repeating raw items when there are many—summarize and provide counts.
- If the user changes the phrase or seeds mid-flow, start a new run.
- Be explicit with units (seconds, pages) and numbers.
How to fill the Action’s auth
- In the Action editor, set Security → API Key in header with name `Authorization`.
- Value: `Bearer YOUR_APIFY_API_TOKEN`. (Find your token in Apify Console → Integrations → API token.)
Where these instructions come from
- GPT Actions setup & builder: OpenAI’s docs explain where to paste schema, how to set auth, and best practices for tool use.
- Apify endpoints: Official Apify docs for starting runs, waiting for finish, and exporting dataset items.
Perfect—here’s the two-backend setup so your Custom GPT can run either Apify or your Scrapy stack, with clean switching.
1) Add a second Action: Scrapy Orchestrator API
Paste this OpenAPI schema as a new Action in the GPT Builder (Configure → Actions → Add Action).
openapi: 3.1.0
info:
  title: Scrapy Orchestrator
  version: "1.0.0"
  description: Start a Scrapy crawl on your infra, poll status, and fetch results (JSON or CSV).
servers:
  - url: https://scrapy.example.com/api
components:
  securitySchemes:
    Bearer:
      type: apiKey
      in: header
      name: Authorization
      description: Use "Bearer {YOUR_TOKEN}"
  schemas:
    StartJobInput:
      type: object
      required: [spider, seeds, phrase]
      properties:
        spider: { type: string, example: beliefs }
        phrase: { type: string, example: fear of the lord }
        near: { type: array, items: { type: string }, example: ["fear of god", "reverent awe"] }
        seeds:
          type: object
          properties:
            domains: { type: array, items: { type: string }, example: ["examplechurch.org", "anotherchurch.com"] }
            urls: { type: array, items: { type: string }, example: ["https://example.org/about/beliefs"] }
        opts:
          type: object
          properties:
            depth: { type: integer, default: 2 }
            delay_ms: { type: integer, default: 400 }
            concurrency: { type: integer, default: 8 }
    JobStatus:
      type: object
      properties:
        id: { type: string }
        status:
          type: string
          description: "QUEUED|RUNNING|SUCCEEDED|FAILED|ABORTED|TIMED_OUT"
        startedAt: { type: ["string", "null"] }
        finishedAt: { type: ["string", "null"] }
        items: { type: integer, description: number of result rows }
        resultsId: { type: string, description: dataset/result identifier }
security:
  - Bearer: []
paths:
  /jobs:
    post:
      operationId: startJob
      summary: Start a Scrapy job
      security: [{ Bearer: [] }]
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/StartJobInput" }
      responses:
        "201":
          description: Job created
          content:
            application/json:
              schema: { $ref: "#/components/schemas/JobStatus" }
  /jobs/{id}:
    get:
      operationId: getJob
      summary: Get job status
      security: [{ Bearer: [] }]
      parameters:
        - in: path
          name: id
          required: true
          schema: { type: string }
      responses:
        "200":
          description: Status
          content:
            application/json:
              schema: { $ref: "#/components/schemas/JobStatus" }
  /jobs/{id}/results:
    get:
      operationId: getResults
      summary: Fetch results
      description: Returns results as JSON (default) or CSV text.
      security: [{ Bearer: [] }]
      parameters:
        - in: path
          name: id
          required: true
          schema: { type: string }
        - in: query
          name: format
          required: false
          schema: { type: string, enum: [json, csv], default: json }
      responses:
        "200":
          description: Items
          content:
            application/json:
              schema:
                type: array
                items: { type: object, additionalProperties: true }
            text/csv:
              schema: { type: string }
Minimal server (if you need a stub)
If you don’t already have an API, a tiny FastAPI service can wrap your Scrapy runner:
# app.py (FastAPI stub)
from fastapi import FastAPI, Header, HTTPException, Response
from pydantic import BaseModel
import uuid

app = FastAPI()
DB = {}  # in-memory demo store: job id -> job record

class StartJobInput(BaseModel):
    spider: str
    phrase: str
    near: list[str] = []
    seeds: dict
    opts: dict | None = None

@app.post("/api/jobs")
def start_job(inp: StartJobInput, Authorization: str = Header(None)):
    if not Authorization or not Authorization.startswith("Bearer "):
        raise HTTPException(401, "Unauthorized")
    jid = str(uuid.uuid4())
    DB[jid] = {"id": jid, "status": "RUNNING", "items": 0, "resultsId": jid}
    # TODO: trigger your Scrapy command asynchronously (e.g., subprocess)
    # e.g., subprocess.Popen([...]) and later update DB[jid]["status"] = "SUCCEEDED"
    return DB[jid]

@app.get("/api/jobs/{jid}")
def get_job(jid: str, Authorization: str = Header(None)):
    if jid not in DB:
        raise HTTPException(404, "Not found")
    return DB[jid]

@app.get("/api/jobs/{jid}/results")
def get_results(jid: str, format: str = "json", Authorization: str = Header(None)):
    if jid not in DB:
        raise HTTPException(404, "Not found")
    # TODO: return real results; this is a stub
    data = [{"host": "example.org", "url": "https://example.org/x", "EI": 3}]
    if format == "json":
        return data
    # Return real CSV (not a JSON-encoded string) so clients get text/csv
    csv_text = "host,url,EI\nexample.org,https://example.org/x,3\n"
    return Response(content=csv_text, media_type="text/csv")
(Deploy behind HTTPS; put a real token; wire Scrapy output to a file/DB and stream it back here.)
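One hedged way to fill in those TODOs: a background thread that shells out to `scrapy crawl` and flips the in-memory job record when the process exits. The spider name (`beliefs`), its `-a` arguments, and the output path are assumptions about your Scrapy project, not a fixed contract.

# Hypothetical wiring for the stub's TODOs; assumes a Scrapy project with a
# "beliefs" spider that accepts -a phrase=... and -a seeds=... arguments.
import json
import subprocess
import threading

def launch_spider(jid: str, db: dict, phrase: str, seeds: list[str], depth: int = 2) -> None:
    out_path = f"/tmp/{jid}.json"
    cmd = [
        "scrapy", "crawl", "beliefs",
        "-a", f"phrase={phrase}",
        "-a", f"seeds={','.join(seeds)}",
        "-s", f"DEPTH_LIMIT={depth}",
        "-O", out_path,  # overwrite the JSON output feed (Scrapy >= 2.1)
    ]

    def worker() -> None:
        # Run the crawl to completion, then move the job to a terminal state.
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            with open(out_path) as f:
                rows = json.load(f)
            db[jid].update(status="SUCCEEDED", items=len(rows))
        else:
            db[jid].update(status="FAILED")

    threading.Thread(target=worker, daemon=True).start()

With this in place, `start_job` would call `launch_spider(jid, DB, inp.phrase, inp.seeds.get("domains", []))` instead of the TODO, and `get_results` would read `/tmp/{jid}.json` back instead of returning the demo row.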
2) Update your GPT “Instructions” to choose Apify or Scrapy
Paste this into the same Custom GPT’s Instructions (append to what you already added):
Backend selection logic
- If the user wants the fastest setup, cloud runs, or provides an Apify Actor ID → use Apify (`startRun` / `getRun` / `getDatasetItems`).
- If the user explicitly says to use our server/on-prem, or mentions Scrapy → use the Scrapy Orchestrator (`startJob` / `getJob` / `getResults`).
- If they don’t care, prefer Apify for speed; offer Scrapy as an alternative for data custody/cost control.
Flow (both backends)
- Start: call `startRun` (Apify) or `startJob` (Scrapy) with seeds (URLs/domains), the phrase, optional near-variants, and crawl constraints (depth, same-domain).
- Wait: poll `getRun` (with `waitForFinish=60`) or `getJob` repeatedly until a terminal status.
- Fetch: on success, call `getDatasetItems` or `getResults` (prefer JSON unless CSV is requested).
- Report: compute and present top hosts by EI, top pages, and average EI, as in the sketch below. Offer CSV export.
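To make the Report step concrete, here is a small aggregation sketch. The `host`, `url`, and `EI` field names match the stub's demo row and are assumptions about your dataset schema; adjust them to whatever your Actor or spider actually emits.

from collections import defaultdict

def summarize(items: list[dict], top_n: int = 5) -> dict:
    """Aggregate crawl rows into report fields: top hosts, top pages, average EI."""
    by_host = defaultdict(float)
    for row in items:
        by_host[row["host"]] += row["EI"]
    top_hosts = sorted(by_host.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    top_pages = sorted(items, key=lambda r: r["EI"], reverse=True)[:top_n]
    avg_ei = sum(r["EI"] for r in items) / len(items) if items else 0.0
    return {
        "top_hosts": top_hosts,
        "top_pages": [(r["url"], r["EI"]) for r in top_pages],
        "avg_EI": round(avg_ei, 2),
    }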
Input shaping
- If the user gives bare domains, prepend `https://` to make URLs (see the helper sketch after this list).
- If the phrase is missing, ask once for it.
- Default depth = 2, same-domain only.
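A minimal sketch of that shaping rule, covering both backends (the function name is illustrative):

def shape_seeds(raw: list[str]) -> dict:
    """Normalize user-supplied seeds: bare domains get an https:// prefix."""
    urls = [s if s.startswith(("http://", "https://")) else f"https://{s}" for s in raw]
    return {
        "apify": [{"url": u} for u in urls],  # becomes startUrls in the startRun body
        "scrapy": {"urls": urls},             # becomes seeds.urls in the startJob body
    }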
Error handling
- Terminal non-success → show status + suggest lower depth/smaller batch.
- Empty dataset → suggest adding near-variants or expanding seeds.
3) Example payloads
Apify → startRun (the POST body is the Actor INPUT; actorId goes in the URL path, run options in the query string)
{
  "startUrls": [
    {"url": "https://examplechurch.org"},
    {"url": "https://anotherchurch.com"}
  ],
  "sameDomainOnly": true,
  "maxDepth": 2
}
Scrapy → startJob
{
  "spider": "beliefs",
  "phrase": "fear of the lord",
  "near": ["fear of god", "reverent awe"],
  "seeds": { "domains": ["examplechurch.org", "anotherchurch.com"] },
  "opts": { "depth": 2, "delay_ms": 400, "concurrency": 8 }
}
That’s it. Your Custom GPT now has both levers:
- Apify for instant cloud crawls,
- Scrapy for your own infrastructure—same UX in chat.
Alright — here’s the step-by-step “file build sheet” you can post so anyone can recreate all the files without needing to download a ZIP.
This assumes you’re making a folder for the Scrapy Orchestrator and optionally saving the Postman collection separately.
📂 Folder structure
integration_bundle/
├── Phrase_Emphasis_Apify_Scrapy.postman_collection.json
└── scrapy-orchestrator/
    ├── app.py
    ├── requirements.txt
    ├── Dockerfile
    └── README.md
1) Create the integration folder
Open a terminal or file explorer and create:
mkdir integration_bundle
cd integration_bundle
2) Create the Postman collection file
File path:
integration_bundle/Phrase_Emphasis_Apify_Scrapy.postman_collection.json
How to paste:
1. Open your code editor (VS Code, Sublime, Notepad++).
2. Create a new file and paste in the full JSON from the conversation message that starts with:
{
  "info": {
    "name": "Phrase Emphasis – Apify + Scrapy Orchestrator",
…and ends with:
  ]
}
3. Save the file exactly as `Phrase_Emphasis_Apify_Scrapy.postman_collection.json`.
3) Create the scrapy-orchestrator folder
mkdir scrapy-orchestrator
cd scrapy-orchestrator
4) Create app.py
File path:
integration_bundle/scrapy-orchestrator/app.py
How to paste:
1. Create a new file named `app.py`.
2. Paste the full FastAPI app code from the conversation that begins with:
# FastAPI Scrapy Orchestrator (stub)
# - Starts a "job" (placeholder)...
…and ends with the last `return rows` line.
3. Save the file.
5) Create requirements.txt
File path:
integration_bundle/scrapy-orchestrator/requirements.txt
Paste this exactly:
fastapi==0.111.0
uvicorn[standard]==0.30.0
pydantic==2.8.2
Save the file.
6) Create Dockerfile
File path:
integration_bundle/scrapy-orchestrator/Dockerfile
Paste this exactly:
# Dockerfile for Scrapy Orchestrator (FastAPI)
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 8000
CMD ["uvicorn","app:app","--host","0.0.0.0","--port","8000"]
Save the file.
7) Create README.md
File path:
integration_bundle/scrapy-orchestrator/README.md
Paste the README text from the conversation that starts with:
# Scrapy Orchestrator (FastAPI stub)
This container exposes...
…and ends with the instructions about wiring to a real Scrapy project.
8) Verify file structure
You should have:
integration_bundle/
├── Phrase_Emphasis_Apify_Scrapy.postman_collection.json
└── scrapy-orchestrator/
    ├── app.py
    ├── requirements.txt
    ├── Dockerfile
    └── README.md
9) Next steps (running)
To run the Orchestrator locally with Docker:
cd integration_bundle/scrapy-orchestrator
docker build -t scrapy-orchestrator .
docker run -p 8000:8000 scrapy-orchestrator
To test with curl:
TOKEN=devtoken
BASE=http://localhost:8000/api
curl -s -X POST "$BASE/jobs" \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-d '{"spider":"beliefs","phrase":"fear of the lord","near":["fear of god","reverent awe"],"seeds":{"domains":["examplechurch.org","anotherchurch.com"]},"opts":{"depth":2,"delay_ms":400,"concurrency":8}}'
What’s inside:
- `Phrase_Emphasis_Apify_Scrapy.postman_collection.json` — Postman requests for Apify and your Scrapy Orchestrator (start, poll, fetch in JSON/CSV).
- `scrapy-orchestrator/` (dockerized FastAPI stub):
  - `Dockerfile`
  - `requirements.txt`
  - `app.py` (API: `/api/jobs`, `/api/jobs/{id}`, `/api/jobs/{id}/results`)
  - `README.md` with run/test instructions
Quick start: Orchestrator (Docker)
cd scrapy-orchestrator
docker build -t scrapy-orchestrator .
docker run -p 8000:8000 scrapy-orchestrator
Test with curl:
TOKEN=devtoken
BASE=http://localhost:8000/api
# 1) Start job
curl -s -X POST "$BASE/jobs" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{
    "spider": "beliefs",
    "phrase": "fear of the lord",
    "near": ["fear of god","reverent awe"],
    "seeds": {"domains": ["examplechurch.org","anotherchurch.com"]},
    "opts": {"depth": 2, "delay_ms": 400, "concurrency": 8}
  }'
# 2) Poll status (replace JOB_ID)
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/jobs/JOB_ID"
# 3) Fetch results
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/jobs/JOB_ID/results?format=json"
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/jobs/JOB_ID/results?format=csv"
The stub returns demo rows now; to wire in your real Scrapy project, open app.py and replace the job-launch TODO (or your fake_run()) with a subprocess/queue call to your spider (for example, the launch_spider sketch above), then load the real output in get_results.
Quick start: Apify (curl)
# 1) Start run (body = the Actor INPUT; run options go in the query string)
curl -s -X POST "https://api.apify.com/v2/acts/your-user~phrase-emphasis-crawler/runs?memory=2048&timeout=1800" \
  -H "Authorization: Bearer $APIFY_TOKEN" -H "Content-Type: application/json" \
  -d '{
    "startUrls": [
      {"url":"https://examplechurch.org"},
      {"url":"https://anotherchurch.com"}
    ],
    "sameDomainOnly": true,
    "maxDepth": 2
  }'
# 2) Wait for finish (up to 60 s per call; repeat until a terminal status)
curl -s "https://api.apify.com/v2/actor-runs/RUN_ID?waitForFinish=60" \
  -H "Authorization: Bearer $APIFY_TOKEN"
# 3) Get dataset items
curl -s "https://api.apify.com/v2/datasets/DATASET_ID/items?format=json&clean=true" \
  -H "Authorization: Bearer $APIFY_TOKEN"
# or CSV
curl -s "https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv&clean=true" \
  -H "Authorization: Bearer $APIFY_TOKEN"
Postman collection
Import `Phrase_Emphasis_Apify_Scrapy.postman_collection.json`, then set these variables:
- `APIFY_TOKEN`, `APIFY_ACTOR_ID`, `RUN_ID`, `DATASET_ID`
- `SCRAPY_BASE` (e.g., `http://localhost:8000/api`), `SCRAPY_TOKEN`, `JOB_ID`
The sequence is prebuilt: 1) Start → 2) Get status → 3) Get results (JSON or CSV).