Do it now: here’s the exact, step-by-step checklist.
1) Open the GPT builder
- Go to chatgpt.com/gpts/editor (or ChatGPT → Explore GPTs → + Create). You’ll see Create and Configure tabs.
2) Paste the Instructions
- In Configure → Instructions, paste the Instruction block I gave you (“Backend selection logic… Flow (both backends)… Error handling…”).
- These tell the GPT when to call Apify vs Scrapy and how to summarize the results.
3) Add the Apify Action
- Configure → Actions → Add Action → Paste OpenAPI (the “Phrase Emphasis Crawler (Apify)” schema I gave you).
- This schema uses:
  - Start run: POST /v2/acts/{actorId}/runs
  - Get run (+ optional wait): GET /v2/actor-runs/{runId}?waitForFinish=…
  - Get items: GET /v2/datasets/{datasetId}/items?format=json|csv
- Set the Auth header to Authorization: Bearer {APIFY_API_TOKEN} (store the token as a secret).
Tip (even faster): if your crawls are small, you can wire the sync endpoint to return results in one call:
POST /v2/acts/{actorId}/run-sync-get-dataset-items?format=json|csv (times out at ~300s).
4) Add the Scrapy Orchestrator Action
- Actions → Add Action → Paste OpenAPI (the “Scrapy Orchestrator” schema I gave you).
- Endpoints: POST /api/jobs, GET /api/jobs/{id}, GET /api/jobs/{id}/results?format=…
- Auth: Authorization: Bearer {YOUR_TOKEN} (store as a secret).
5) Capabilities & Knowledge (optional)
- Toggle Browse/Code only if you want them. You can upload your Quick Reference or EPO docs as Knowledge later.
6) Save, set visibility, test
- Click Save. Choose visibility (Only me / Link / Org). Open a new chat with your GPT and say: “Scan these sites for the phrase X: … Use Apify.”
It should call startRun → getRun (wait) → getDatasetItems and return a compact EI report.
What I can’t do “right here”
I can’t click in your account or set secrets for you. But everything you need to paste is above, and the endpoints/flow are straight from the official docs.
If you want next-level polish
- Swap Apify to the sync endpoint for single-call runs (small batches).
- Add a second “export CSV” Action call that the GPT can trigger on request.
- Prefer Assistants API if you want to run this from your own app instead of ChatGPT’s UI.
Here’s the exact Instructions block you can copy straight into your Custom GPT in the builder under Configure → Instructions.
This is the fully merged version for both backends (Apify + Scrapy Orchestrator), with the flow and rules already embedded.
Custom GPT Instructions – Phrase Emphasis Crawler (Apify + Scrapy)
Role & Goal
You are an operations assistant that runs a web crawl via either Apify or a Scrapy Orchestrator backend, waits for it to finish, then returns a concise report with an Emphasis Index (EI) summary for the chosen phrase.
Backend selection logic
- If the user asks for fast setup, cloud runs, or provides an Apify actor ID → use Apify (startRun / getRun / getDatasetItems).
- If the user explicitly says use our server/on-prem or Scrapy → use Scrapy Orchestrator (startJob / getJob / getResults).
- If they don’t care, prefer Apify for speed; offer Scrapy for data custody or cost control.
Flow (both backends)
- Start:
  - Apify → call startRun with:
    - actorId: Apify Actor ID (e.g., username~phrase-emphasis-crawler)
    - input: at least startUrls (array of {url}), plus any flags like sameDomainOnly, maxDepth
  - Scrapy → call startJob with:
    - spider: spider name (default: beliefs)
    - phrase: main search phrase
    - near: optional array of near-variants
    - seeds: {domains:[], urls:[]}
    - opts: optional crawl settings (depth, delay_ms, concurrency)
- Wait:
  - Apify → call getRun with waitForFinish=120000 (120s wait).
  - Scrapy → call getJob in a loop until status ∈ {SUCCEEDED, FAILED, ABORTED, TIMED_OUT}.
- Fetch results:
  - Apify → if status is SUCCEEDED and defaultDatasetId exists → call getDatasetItems (default format=json unless user requests CSV).
  - Scrapy → if status is SUCCEEDED → call getResults (default format=json unless user requests CSV).
- Report:
  - Summarize:
    - Top hosts by average EI.
    - Top pages by EI.
    - Average EI across dataset.
  - Show a short table:
    Host | Pages | Top EI | Avg EI
    example.org | 12 | 5 | 3.4
    ...
  - Provide a brief interpretation (one paragraph) of EI distribution.
- Offer export:
  - If user wants CSV → call the dataset/results endpoint with format=csv and provide as downloadable text.
Input shaping
- If user gives only domains, prepend https:// to create full URLs.
- If the phrase is missing, ask for it once.
- Defaults: maxDepth = 2, sameDomainOnly = true; crawl likely paths if unspecified.
Error handling
- If start call fails → show error message.
- If final status is not SUCCEEDED → report status and any errorMessage from API.
- If dataset/results empty → say so and suggest adding near-variants or expanding seeds.
Output format to the user
- Status line with run/job ID, status, elapsed time.
- Compact EI table.
- One-paragraph summary.
- Offer CSV export if not already provided.
Behavior rules
- Never invent data; only report actual dataset fields.
- Keep numbers exact from dataset.
- Don’t echo the whole dataset unless user asks.
- Minimize API calls; prefer waitForFinish or batch polling.
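(Reference only; not part of the Instructions paste.) If you want to see the “Input shaping” rules as code, here’s a minimal Python sketch; the field names startUrls, sameDomainOnly, and maxDepth are the same Actor input fields used throughout this guide, and the helper name is purely illustrative:

# Illustrative helper mirroring the "Input shaping" rules above.
def shape_apify_input(seeds, max_depth=2, same_domain_only=True):
    """Turn bare domains or full URLs into the Actor's INPUT JSON."""
    start_urls = []
    for s in seeds:
        s = s.strip()
        if not s:
            continue
        # Prepend https:// when the user gives only a domain.
        url = s if s.startswith(("http://", "https://")) else f"https://{s}"
        start_urls.append({"url": url})
    return {
        "startUrls": start_urls,
        "sameDomainOnly": same_domain_only,  # default: true
        "maxDepth": max_depth,               # default: 2
    }

# shape_apify_input(["examplechurch.org", "https://anotherchurch.com"])
# -> {"startUrls": [{"url": "https://examplechurch.org"},
#                   {"url": "https://anotherchurch.com"}],
#     "sameDomainOnly": True, "maxDepth": 2}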
Here are the two final, copy-paste OpenAPI schemas for your Custom GPT Actions. Add each one in the GPT Builder: Configure → Actions → Add Action → Paste schema. Then set the auth header as noted.
1) Apify Action — Phrase Emphasis Crawler (Cloud)
Uses Apify’s REST API to start an Actor run, wait for completion, and fetch dataset items (JSON/CSV).
openapi: 3.1.0
info:
title: Phrase Emphasis Crawler (Apify)
version: "1.0.0"
description: >
Start an Apify Actor run with start URLs and a phrase, check status
(with optional wait), and fetch the resulting dataset as JSON or CSV.
servers:
- url: https://api.apify.com/v2
components:
securitySchemes:
ApifyToken:
type: apiKey
in: header
name: Authorization
      description: "Use format: Bearer {APIFY_API_TOKEN}"
schemas:
StartInput:
type: object
required: [actorId, input]
properties:
actorId:
type: string
description: Actor ID or "username~actor-name"
example: your-user~phrase-emphasis-crawler
input:
type: object
description: INPUT JSON passed to the Actor (e.g., startUrls, sameDomainOnly, maxDepth)
example:
startUrls: [{ url: "https://examplechurch.org" }, { url: "https://anotherchurch.com" }]
sameDomainOnly: true
maxDepth: 2
options:
type: object
description: Optional run options (memory, timeoutSecs, build, etc.)
example: { memory: 2048, timeoutSecs: 1800 }
RunStatus:
type: object
properties:
id: { type: string }
status:
type: string
description: CREATED | RUNNING | SUCCEEDED | FAILED | ABORTED | TIMED-OUT
defaultDatasetId: { type: string, nullable: true }
defaultKeyValueStoreId: { type: string, nullable: true }
finishedAt: { type: string, nullable: true }
DatasetItems:
type: array
items:
type: object
additionalProperties: true
security:
- ApifyToken: []
paths:
/acts/{actorId}/runs:
post:
operationId: startRun
summary: Start an Actor run (returns immediately)
description: Runs an Actor and returns the run object. Use getRun (with waitForFinish) to await completion.
security: [{ ApifyToken: [] }]
parameters:
- in: path
name: actorId
required: true
schema: { type: string }
requestBody:
required: true
content:
application/json:
schema: { $ref: "#/components/schemas/StartInput" }
responses:
"201":
description: Run created
content:
application/json:
schema:
type: object
properties:
data: { $ref: "#/components/schemas/RunStatus" }
/actor-runs/{runId}:
get:
operationId: getRun
summary: Get run status (optionally wait for finish)
description: |
Returns run details. Query param waitForFinish (ms, 0..120000) blocks until finished or timeout.
security: [{ ApifyToken: [] }]
parameters:
- in: path
name: runId
required: true
schema: { type: string }
- in: query
name: waitForFinish
required: false
schema: { type: integer, minimum: 0, maximum: 120000 }
responses:
"200":
description: Run details
content:
application/json:
schema:
type: object
properties:
data: { $ref: "#/components/schemas/RunStatus" }
/datasets/{datasetId}/items:
get:
operationId: getDatasetItems
summary: Fetch dataset items (JSON or CSV)
description: Returns dataset items; choose format=json (default) or format=csv.
security: [{ ApifyToken: [] }]
parameters:
- in: path
name: datasetId
required: true
schema: { type: string }
- in: query
name: format
required: false
schema: { type: string, enum: [json, csv], default: json }
- in: query
name: clean
required: false
schema: { type: boolean, default: true }
responses:
"200":
description: Items
content:
application/json:
schema: { $ref: "#/components/schemas/DatasetItems" }
text/csv:
schema: { type: string }
Auth to set in Builder: header Authorization: Bearer {APIFY_API_TOKEN} (store the token as a secret in the Action).
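If you want to sanity-check this flow outside the builder, here’s a minimal Python sketch (assumes the requests library, an APIFY_TOKEN environment variable, and a placeholder actor ID; the request body mirrors the StartInput shape in the schema above):

# Sketch of the async flow the GPT performs: startRun -> getRun (wait) -> getDatasetItems.
# Assumptions: `pip install requests`, APIFY_TOKEN env var, placeholder actor ID.
import os
import requests

BASE = "https://api.apify.com/v2"
HEADERS = {"Authorization": f"Bearer {os.environ['APIFY_TOKEN']}"}
ACTOR_ID = "your-user~phrase-emphasis-crawler"  # placeholder

# 1) Start the run (body mirrors the StartInput schema above)
run = requests.post(
    f"{BASE}/acts/{ACTOR_ID}/runs",
    headers=HEADERS,
    json={
        "actorId": ACTOR_ID,
        "input": {
            "startUrls": [{"url": "https://examplechurch.org"}],
            "sameDomainOnly": True,
            "maxDepth": 2,
        },
    },
).json()["data"]

# 2) Poll until a terminal status; waitForFinish blocks server-side on each request
while run["status"] not in ("SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"):
    run = requests.get(
        f"{BASE}/actor-runs/{run['id']}",
        headers=HEADERS,
        params={"waitForFinish": 120000},
    ).json()["data"]

# 3) Fetch dataset items on success
if run["status"] == "SUCCEEDED" and run.get("defaultDatasetId"):
    items = requests.get(
        f"{BASE}/datasets/{run['defaultDatasetId']}/items",
        headers=HEADERS,
        params={"format": "json", "clean": "true"},
    ).json()
    print(f"{len(items)} items")
else:
    print("Run ended with status:", run["status"])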
2) Scrapy Orchestrator Action — Your Server / On-Prem
Wraps your Scrapy runner (orchestration API) with start/poll/fetch endpoints.
openapi: 3.1.0
info:
title: Scrapy Orchestrator
version: "1.0.0"
description: Start a Scrapy crawl job on your infra, poll status, and fetch results (JSON or CSV).
servers:
- url: https://scrapy.example.com/api
components:
securitySchemes:
Bearer:
type: apiKey
in: header
name: Authorization
      description: "Use: Bearer {YOUR_SCRAPY_ORCH_TOKEN}"
schemas:
StartJobInput:
type: object
required: [spider, phrase, seeds]
properties:
spider:
type: string
example: beliefs
phrase:
type: string
example: fear of the lord
near:
type: array
items: { type: string }
example: ["fear of god", "reverent awe"]
seeds:
type: object
properties:
domains: { type: array, items: { type: string }, example: ["examplechurch.org","anotherchurch.com"] }
urls: { type: array, items: { type: string }, example: ["https://example.org/about/beliefs"] }
opts:
type: object
properties:
depth: { type: integer, default: 2 }
delay_ms: { type: integer, default: 400 }
concurrency: { type: integer, default: 8 }
JobStatus:
type: object
properties:
id: { type: string }
status:
type: string
description: QUEUED | RUNNING | SUCCEEDED | FAILED | ABORTED | TIMED_OUT
startedAt: { type: string, nullable: true }
finishedAt: { type: string, nullable: true }
items: { type: integer, description: number of result rows }
resultsId: { type: string, description: dataset/result identifier }
security:
- Bearer: []
paths:
/jobs:
post:
operationId: startJob
summary: Start a Scrapy job
description: Launches a crawl; returns a job object immediately.
security: [{ Bearer: [] }]
requestBody:
required: true
content:
application/json:
schema: { $ref: "#/components/schemas/StartJobInput" }
responses:
"201":
description: Job created
content:
application/json:
schema: { $ref: "#/components/schemas/JobStatus" }
/jobs/{id}:
get:
operationId: getJob
summary: Get job status
description: Returns current status and counters for a job.
security: [{ Bearer: [] }]
parameters:
- in: path
name: id
required: true
schema: { type: string }
responses:
"200":
description: Status
content:
application/json:
schema: { $ref: "#/components/schemas/JobStatus" }
/jobs/{id}/results:
get:
operationId: getResults
summary: Fetch results
description: Returns results as JSON (default) or CSV text; only valid after SUCCEEDED.
security: [{ Bearer: [] }]
parameters:
- in: path
name: id
required: true
schema: { type: string }
- in: query
name: format
required: false
schema: { type: string, enum: [json, csv], default: json }
responses:
"200":
description: Items
content:
application/json:
schema:
type: array
items: { type: object, additionalProperties: true }
text/csv:
schema: { type: string }
Auth to set in Builder: header Authorization: Bearer {YOUR_SCRAPY_ORCH_TOKEN} (store as a secret in the Action).
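As with the Apify Action, here’s a minimal Python sketch of the same start/poll/fetch sequence against the orchestrator (assumes requests, a SCRAPY_TOKEN environment variable, and that scrapy.example.com is replaced with your host):

# Sketch of the orchestrator flow: startJob -> getJob (poll) -> getResults.
import os
import time
import requests

BASE = "https://scrapy.example.com/api"  # replace with your server
HEADERS = {"Authorization": f"Bearer {os.environ['SCRAPY_TOKEN']}"}

job = requests.post(
    f"{BASE}/jobs",
    headers=HEADERS,
    json={
        "spider": "beliefs",
        "phrase": "fear of the lord",
        "near": ["fear of god", "reverent awe"],
        "seeds": {"domains": ["examplechurch.org"], "urls": []},
        "opts": {"depth": 2, "delay_ms": 400, "concurrency": 8},
    },
).json()

# Poll until the job reaches a terminal status
while job["status"] not in ("SUCCEEDED", "FAILED", "ABORTED", "TIMED_OUT"):
    time.sleep(5)
    job = requests.get(f"{BASE}/jobs/{job['id']}", headers=HEADERS).json()

if job["status"] == "SUCCEEDED":
    results = requests.get(
        f"{BASE}/jobs/{job['id']}/results",
        headers=HEADERS,
        params={"format": "json"},
    ).json()
    print(f"{len(results)} result rows")
else:
    print("Job ended with status:", job["status"])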
Quick Builder checklist (1 minute)
- Configure → Instructions: paste the big Instructions block I gave you.
- Actions → Add Action: paste the Apify schema → set header Authorization: Bearer {APIFY_API_TOKEN}.
- Actions → Add Action: paste the Scrapy Orchestrator schema → set header Authorization: Bearer {YOUR_SCRAPY_ORCH_TOKEN}.
- Save, set visibility, test with: “Scan these sites for ‘fear of the lord’: https://examplechurch.org, https://anotherchurch.com — use Apify.”
Here’s the one-call Apify “sync” Action so your Custom GPT can launch a crawl and get back the dataset in the same response (perfect for small/medium batches).
Use this alongside the two Actions you already added. Tell the GPT to prefer sync for small runs, and fall back to the async pair when jobs are larger or might exceed the sync window.
3) Apify “Sync” Action — Single call, returns dataset immediately
Starts the Actor and waits on the same HTTP request, then returns dataset items. Best for small jobs (server-side time window is limited by Apify; think minutes, not hours).
openapi: 3.1.0
info:
title: Phrase Emphasis Crawler (Apify Sync)
version: "1.0.0"
description: >
Run an Apify Actor and return its dataset items in a single call.
Use for small runs; for larger jobs use the regular start/poll/fetch actions.
servers:
- url: https://api.apify.com/v2
components:
securitySchemes:
ApifyToken:
type: apiKey
in: header
name: Authorization
      description: "Use: Bearer {APIFY_API_TOKEN}"
schemas:
SyncRunInput:
type: object
required: [input]
properties:
input:
type: object
description: INPUT JSON passed to the Actor (e.g., startUrls, sameDomainOnly, maxDepth)
example:
startUrls: [{ url: "https://examplechurch.org" }, { url: "https://anotherchurch.com" }]
sameDomainOnly: true
maxDepth: 2
options:
type: object
description: Optional run options (memory, timeoutSecs, build, etc.)
example: { memory: 2048, timeoutSecs: 300 }
security:
- ApifyToken: []
paths:
/acts/{actorId}/run-sync-get-dataset-items:
post:
operationId: runSyncGetDatasetItems
summary: Run Actor and return dataset items (sync)
description: >
Executes the Actor synchronously and returns its dataset items.
Use query param format=json (default) or format=csv.
security: [{ ApifyToken: [] }]
parameters:
- in: path
name: actorId
required: true
schema: { type: string }
description: Actor ID or "username~actor-name"
- in: query
name: format
required: false
schema: { type: string, enum: [json, csv], default: json }
- in: query
name: clean
required: false
schema: { type: boolean, default: true }
requestBody:
required: true
content:
application/json:
schema: { $ref: "#/components/schemas/SyncRunInput" }
responses:
"200":
description: Dataset items
content:
application/json:
schema:
type: array
items: { type: object, additionalProperties: true }
text/csv:
schema: { type: string }
Auth to set in Builder: header Authorization: Bearer {APIFY_API_TOKEN} (store as a secret).
How your GPT should choose between Sync and Async
Append this to your GPT’s Instructions (or blend into the “Backend selection logic”):
- If the user’s seed list is small (e.g., ≤ 50–100 start URLs, depth ≤ 2), call runSyncGetDatasetItems (Apify Sync) with format=json by default.
- If the user supplies a large list or asks for deeper crawls, use the async Apify flow: startRun → getRun(waitForFinish=120000) (loop) → getDatasetItems.
- On timeout or 4xx/5xx from Sync, fall back automatically to the async flow and tell the user you switched to a long-running job.
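If you later drive this from your own app (per the Assistants note earlier) instead of relying on the GPT’s tool choice, the same sync-first rule might look roughly like this; the threshold and timeout values are illustrative:

# Illustrative sync-first logic with async fallback (thresholds are examples only).
import requests

def run_crawl(actor_id, actor_input, headers, base="https://api.apify.com/v2"):
    small = (len(actor_input.get("startUrls", [])) <= 100
             and actor_input.get("maxDepth", 2) <= 2)
    if small:
        try:
            resp = requests.post(
                f"{base}/acts/{actor_id}/run-sync-get-dataset-items",
                headers=headers,
                params={"format": "json", "clean": "true"},
                json={"input": actor_input},
                timeout=300,  # the sync window is minutes, not hours
            )
            if resp.ok:
                return resp.json()  # dataset items, one call
        except requests.Timeout:
            pass  # fall through to the async flow
    # Large batch, sync timeout, or 4xx/5xx: switch to the async flow
    # (startRun -> getRun -> getDatasetItems) and tell the user you did so.
    raise NotImplementedError("fall back to the async flow sketched earlier")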
Example payload (Sync)
{
"input": {
"startUrls": [
{"url": "https://examplechurch.org"},
{"url": "https://anotherchurch.com"}
],
"sameDomainOnly": true,
"maxDepth": 2
},
"options": { "timeoutSecs": 300 }
}
Ask for CSV by adding ?format=csv&clean=true to the path.
Minimal test script (curl)
# JSON
curl -s -X POST \
"https://api.apify.com/v2/acts/your-user~phrase-emphasis-crawler/run-sync-get-dataset-items?format=json&clean=true" \
-H "Authorization: Bearer $APIFY_TOKEN" \
-H "Content-Type: application/json" \
-d '{"input":{"startUrls":[{"url":"https://examplechurch.org"}],"sameDomainOnly":true,"maxDepth":2}}'
# CSV
curl -s -X POST \
"https://api.apify.com/v2/acts/your-user~phrase-emphasis-crawler/run-sync-get-dataset-items?format=csv&clean=true" \
-H "Authorization: Bearer $APIFY_TOKEN" \
-H "Content-Type: application/json" \
-d '{"input":{"startUrls":[{"url":"https://examplechurch.com"}],"sameDomainOnly":true,"maxDepth":2}}'
Here are two clean ways to add a Result Summarizer so your Custom GPT reports the same way every time—no matter whether results came from Apify Sync, Apify Async, or Scrapy.
Option A — Use GPT itself (no new Action)
When to use: fastest path; nothing to host.
How: enable Code Interpreter (a.k.a. “Advanced Data Analysis”) for your Custom GPT. Then add this to your GPT’s Instructions:
Summarization rule (no Action):
When you receive dataset items (JSON array or CSV text) with fields like host, url, EI (and optionally exact_hits, near_hits, heading_flag, early_flag), do the following in the tool:
- Parse records.
- Compute, per host: pages scanned, max EI, average EI (two decimals).
- List top 5 pages by EI (host, EI, url).
- Output a compact table:
  Host | Pages | Top EI | Avg EI
  example.org | 12 | 5 | 3.41
- One-paragraph interpretation (concise).
- If user requests CSV, return CSV text assembled from the parsed items.
That’s it—no external call. The GPT will do the math inside the sandbox.
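For reference, the computation the sandbox would run is roughly this (a sketch assuming pandas, which Code Interpreter provides, and items that carry host, url, and EI fields):

# Rough sketch of the in-sandbox summary (assumes pandas and host/url/EI fields).
import pandas as pd

def summarize_items(items):
    df = pd.DataFrame(items)
    df["EI"] = pd.to_numeric(df["EI"], errors="coerce")
    df = df.dropna(subset=["host", "url", "EI"])

    # Per-host rollup: pages scanned, max EI, average EI (two decimals)
    rollup = (
        df.groupby("host")["EI"]
          .agg(pages="count", top_ei="max", avg_ei="mean")
          .round({"avg_ei": 2})
          .sort_values("avg_ei", ascending=False)
    )
    print(rollup.to_string())

    # Top 5 pages by EI
    top_pages = df.sort_values("EI", ascending=False).head(5)[["host", "EI", "url"]]
    print(top_pages.to_string(index=False))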
Option B — Add a tiny Summarizer Action (uniform, API-driven)
When to use: you want deterministic, audited summaries (same result every time), or you don’t want to enable Code Interpreter.
You’ll add:
- A Summarizer endpoint to your Scrapy Orchestrator (FastAPI).
- A Summarizer OpenAPI schema as a third Action in the Custom GPT.
1) FastAPI endpoint (drop-in to your existing app.py)
Add this to the bottom of your orchestrator (or keep it as a separate microservice). It accepts raw items (JSON array) and returns the standardized report.
from fastapi import Header  # likely already imported in your app.py; add if missing
from pydantic import BaseModel
from statistics import mean
from typing import List, Dict, Any
class SummarizeInput(BaseModel):
items: List[Dict[str, Any]]
@app.post("/api/summarize")
def summarize(inp: SummarizeInput, Authorization: str | None = Header(None)):
require_auth(Authorization)
items = inp.items or []
# Normalize fields
norm = []
for r in items:
host = str(r.get("host") or r.get("domain") or "").strip()
url = str(r.get("url") or "").strip()
ei = r.get("EI")
try:
ei = int(ei)
except Exception:
try:
ei = round(float(ei))
except Exception:
ei = None
if host and url and ei is not None:
norm.append({"host": host, "url": url, "EI": int(ei)})
# Group by host
hosts: Dict[str, List[Dict[str, Any]]] = {}
for row in norm:
hosts.setdefault(row["host"], []).append(row)
# Per-host rollups
host_rows = []
for h, rows in hosts.items():
eis = [r["EI"] for r in rows if isinstance(r["EI"], (int, float))]
pages = len(rows)
top_ei = max(eis) if eis else 0
avg_ei = round(mean(eis), 2) if eis else 0.0
host_rows.append({"host": h, "pages": pages, "top_ei": top_ei, "avg_ei": avg_ei})
# Sort outputs
host_rows.sort(key=lambda x: (-x["avg_ei"], -x["top_ei"], x["host"]))
top_pages = sorted(norm, key=lambda r: (-r["EI"], r["host"]))[:5]
# Overall
overall_avg = round(mean([r["EI"] for r in norm]), 2) if norm else 0.0
return {
"overview": {
"total_items": len(norm),
"hosts": len(host_rows),
"overall_avg_EI": overall_avg
},
"hosts": host_rows,
"top_pages": top_pages,
"notes": "EI = Emphasis Index (0–5). host/pages/top_ei/avg_ei sorted by avg_ei desc."
}
Auth note: it reuses your existing require_auth() so the same Bearer token protects it.
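If your app.py doesn’t already define require_auth, a minimal version might look like this (hypothetical; match it to however you actually validate the token):

# Hypothetical minimal require_auth; the Action sends "Authorization: Bearer <token>".
import os
from fastapi import HTTPException

def require_auth(authorization):
    expected = f"Bearer {os.environ.get('SCRAPY_ORCH_TOKEN', '')}"
    if not authorization or authorization != expected:
        raise HTTPException(status_code=401, detail="Invalid or missing token")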
2) Summarizer OpenAPI schema (add as third Action)
In the GPT Builder: Configure → Actions → Add Action → Paste schema, then set the Authorization header to Bearer {YOUR_SCRAPY_ORCH_TOKEN} (the same token you used for the Orchestrator).
openapi: 3.1.0
info:
title: Phrase Emphasis Summarizer
version: "1.0.0"
description: Summarize EI results into per-host rollups, top pages, and overall stats.
servers:
- url: https://scrapy.example.com/api
components:
securitySchemes:
Bearer:
type: apiKey
in: header
name: Authorization
      description: "Use: Bearer {YOUR_SCRAPY_ORCH_TOKEN}"
schemas:
SummarizeInput:
type: object
required: [items]
properties:
items:
type: array
items:
type: object
additionalProperties: true
SummaryOutput:
type: object
properties:
overview:
type: object
properties:
total_items: { type: integer }
hosts: { type: integer }
overall_avg_EI: { type: number }
hosts:
type: array
items:
type: object
properties:
host: { type: string }
pages: { type: integer }
top_ei: { type: integer }
avg_ei: { type: number }
top_pages:
type: array
items:
type: object
properties:
host: { type: string }
url: { type: string }
EI: { type: integer }
notes: { type: string }
security:
- Bearer: []
paths:
/summarize:
post:
operationId: summarizeResults
summary: Summarize EI dataset
description: Provide raw items (JSON array). Returns host rollups, top pages, and overall stats.
security: [{ Bearer: [] }]
requestBody:
required: true
content:
application/json:
schema: { $ref: "#/components/schemas/SummarizeInput" }
responses:
"200":
description: Summary
content:
application/json:
schema: { $ref: "#/components/schemas/SummaryOutput" }
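To smoke-test the endpoint before wiring it into the GPT, a quick call with a few sample items (assumes requests and the same bearer token; replace the host):

# Quick smoke test for /api/summarize (replace host and token with your own).
import os
import requests

resp = requests.post(
    "https://scrapy.example.com/api/summarize",
    headers={"Authorization": f"Bearer {os.environ['SCRAPY_TOKEN']}"},
    json={"items": [
        {"host": "example.org", "url": "https://example.org/beliefs", "EI": 5},
        {"host": "example.org", "url": "https://example.org/about", "EI": 3},
        {"host": "another.com", "url": "https://another.com/values", "EI": 4},
    ]},
    timeout=30,
)
print(resp.json())
# Expected shape: {"overview": {...}, "hosts": [...], "top_pages": [...], "notes": "..."}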
Tell the GPT when to call the Summarizer
Append this snippet to your GPT’s Instructions:
Summarizer usage:
After fetching dataset items (JSON) from Apify or Scrapy, call summarizeResults with the items. Render the returned hosts table and top_pages (max 5). Include overview.overall_avg_EI. If the user asks for CSV, fetch CSV from the source backend instead of the summarizer.
Minimal end-to-end flow (example conversation)
- You: “Scan these sites for ‘fear of the lord’: https://examplechurch.org, https://anotherchurch.com — use Apify (sync).”
- GPT (Actions):
  - runSyncGetDatasetItems → gets JSON items
  - summarizeResults → receives standardized summary
  - Returns a neat table + one-paragraph interpretation
  - Offers an “Export CSV” option (calls the Apify dataset endpoint with format=csv if requested)