Custom GPT with Apify + Scrapy backends

Here’s the exact, do-it-now checklist.

1) Open the GPT builder

  • Go to chatgpt.com/gpts/editor (or ChatGPT → Explore GPTs → + Create). You’ll see Create and Configure tabs.

2) Paste the Instructions

  • In Configure → Instructions, paste the Instruction block I gave you (“Backend selection logic… Flow (both backends)… Error handling…”).
  • These tell the GPT when to call Apify vs Scrapy and how to summarize the results.

3) Add the Apify Action

  • Configure → Actions → Add Action → Paste OpenAPI (the “Phrase Emphasis Crawler (Apify)” schema I gave you).
    • This schema uses:
      • Start run: POST /v2/acts/{actorId}/runs
      • Get run (+ optional wait): GET /v2/actor-runs/{runId}?waitForFinish=…
      • Get items: GET /v2/datasets/{datasetId}/items?format=json|csv
    • Set Auth header to Authorization: Bearer {APIFY_API_TOKEN} (store as a secret).

Tip (even faster): if your crawls are small, you can wire the sync endpoint to return results in one call:
POST /v2/acts/{actorId}/run-sync-get-dataset-items?format=json|csv (times out at ~300 s).

4) Add the Scrapy Orchestrator Action

  • Actions → Add Action → Paste OpenAPI (the “Scrapy Orchestrator” schema I gave you).
    • Endpoints: POST /api/jobs, GET /api/jobs/{id}, GET /api/jobs/{id}/results?format=…
    • Auth: Authorization: Bearer {YOUR_TOKEN} (secret).

5) Capabilities & Knowledge (optional)

  • Toggle Browse/Code only if you want them. You can upload your Quick Reference or EPO docs as Knowledge later.

6) Save, set visibility, test

  • Click Save. Choose visibility (Only me / Link / Org). Open a new chat with your GPT and say: “Scan these sites for the phrase X: … Use Apify.”
    It should call startRun → getRun (wait) → getDatasetItems and return a compact EI report.

What I can’t do “right here”

I can’t click in your account or set secrets for you. But everything you need to paste is above, and the endpoints/flow are straight from the official docs.

If you want next-level polish

  • Swap Apify to the sync endpoint for single-call runs (small batches).
  • Add a second “export CSV” action call the GPT can trigger on request.
  • Prefer Assistants API if you want to run this from your own app instead of ChatGPT’s UI.

Here’s the exact Instructions block you can copy straight into your Custom GPT in the builder under Configure → Instructions.
This is the fully merged version for both backends (Apify + Scrapy Orchestrator), with the flow and rules already embedded.


Custom GPT Instructions – Phrase Emphasis Crawler (Apify + Scrapy)

Role & Goal
You are an operations assistant that runs a web crawl via either Apify or a Scrapy Orchestrator backend, waits for it to finish, then returns a concise report with an Emphasis Index (EI) summary for the chosen phrase.


Backend selection logic

  • If the user asks for fast setup, cloud runs, or provides an Apify actor ID → use Apify (startRun / getRun / getDatasetItems).
  • If the user explicitly says use our server/on-prem or Scrapy → use Scrapy Orchestrator (startJob / getJob / getResults).
  • If they don’t care, prefer Apify for speed; offer Scrapy for data custody or cost control.

Flow (both backends)

  1. Start:
    • Apify → call startRun with:
      • actorId: Apify Actor ID (e.g., username~phrase-emphasis-crawler)
      • input: at least startUrls (array of {url}), plus any flags like sameDomainOnly, maxDepth
    • Scrapy → call startJob with:
      • spider: spider name (default: beliefs)
      • phrase: main search phrase
      • near: optional array of near-variants
      • seeds: {domains:[], urls:[]}
      • opts: optional crawl settings (depth, delay_ms, concurrency)
  2. Wait:
    • Apify → call getRun with waitForFinish=60 (Apify caps the wait at 60 seconds per request; repeat the call until the run finishes).
    • Scrapy → call getJob in a loop until status ∈ {SUCCEEDED, FAILED, ABORTED, TIMED_OUT}.
  3. Fetch results:
    • Apify → if status is SUCCEEDED and defaultDatasetId exists → call getDatasetItems (default format=json unless user requests CSV).
    • Scrapy → if status is SUCCEEDED → call getResults (default format=json unless user requests CSV).
  4. Report:
    • Summarize:
      • Top hosts by average EI.
      • Top pages by EI.
      • Average EI across dataset.
    • Show a short table with columns: Host | Pages | Top EI | Avg EI (e.g., example.org | 12 | 5 | 3.4).
    • Provide a brief interpretation (one paragraph) of EI distribution.
  5. Offer export:
    • If user wants CSV → call dataset/results endpoint with format=csv and provide as downloadable text.

Input shaping

  • If user gives only domains, prepend https:// to create full URLs.
  • If phrase is missing, ask for it once.
  • Default:
    • maxDepth = 2
    • sameDomainOnly = true
    • Crawl likely paths if unspecified.

Error handling

  • If start call fails → show error message.
  • If final status is not SUCCEEDED → report status and any errorMessage from API.
  • If dataset/results empty → say so and suggest adding near-variants or expanding seeds.

Output format to the user

  1. Status line with run/job ID, status, elapsed time.
  2. Compact EI table.
  3. One-paragraph summary.
  4. Offer CSV export if not already provided.

Behavior rules

  • Never invent data; only report actual dataset fields.
  • Keep numbers exact from dataset.
  • Don’t echo the whole dataset unless user asks.
  • Minimize API calls — prefer waitForFinish or batch polling.

Here are the two final, copy-paste OpenAPI schemas for your Custom GPT Actions. Add each one in the GPT Builder: Configure → Actions → Add Action → Paste schema. Then set the auth header as noted.


1) Apify Action — Phrase Emphasis Crawler (Cloud)

Uses Apify’s REST API to start an Actor run, wait for completion, and fetch dataset items (JSON/CSV).

openapi: 3.1.0
info:
  title: Phrase Emphasis Crawler (Apify)
  version: "1.0.0"
  description: >
    Start an Apify Actor run with start URLs and a phrase, check status
    (with optional wait), and fetch the resulting dataset as JSON or CSV.
servers:
  - url: https://api.apify.com/v2
components:
  securitySchemes:
    ApifyToken:
      type: apiKey
      in: header
      name: Authorization
      description: "Use format: Bearer {APIFY_API_TOKEN}"
  schemas:
    StartInput:
      type: object
      required: [actorId, input]
      properties:
        actorId:
          type: string
          description: Actor ID or "username~actor-name"
          example: your-user~phrase-emphasis-crawler
        input:
          type: object
          description: INPUT JSON passed to the Actor (e.g., startUrls, sameDomainOnly, maxDepth)
          example:
            startUrls: [{ url: "https://examplechurch.org" }, { url: "https://anotherchurch.com" }]
            sameDomainOnly: true
            maxDepth: 2
        options:
          type: object
          description: Optional run options (memory, timeoutSecs, build, etc.)
          example: { memory: 2048, timeoutSecs: 1800 }
    RunStatus:
      type: object
      properties:
        id: { type: string }
        status:
          type: string
          description: CREATED | RUNNING | SUCCEEDED | FAILED | ABORTED | TIMED-OUT
        defaultDatasetId: { type: string, nullable: true }
        defaultKeyValueStoreId: { type: string, nullable: true }
        finishedAt: { type: string, nullable: true }
    DatasetItems:
      type: array
      items:
        type: object
        additionalProperties: true
security:
  - ApifyToken: []
paths:
  /acts/{actorId}/runs:
    post:
      operationId: startRun
      summary: Start an Actor run (returns immediately)
      description: Runs an Actor and returns the run object. Use getRun (with waitForFinish) to await completion.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: actorId
          required: true
          schema: { type: string }
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/StartInput" }
      responses:
        "201":
          description: Run created
          content:
            application/json:
              schema:
                type: object
                properties:
                  data: { $ref: "#/components/schemas/RunStatus" }

  /actor-runs/{runId}:
    get:
      operationId: getRun
      summary: Get run status (optionally wait for finish)
      description: |
        Returns run details. Query param waitForFinish (seconds, 0..60) blocks until the run finishes or the wait elapses.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: runId
          required: true
          schema: { type: string }
        - in: query
          name: waitForFinish
          required: false
          schema: { type: integer, minimum: 0, maximum: 60 }
      responses:
        "200":
          description: Run details
          content:
            application/json:
              schema:
                type: object
                properties:
                  data: { $ref: "#/components/schemas/RunStatus" }

  /datasets/{datasetId}/items:
    get:
      operationId: getDatasetItems
      summary: Fetch dataset items (JSON or CSV)
      description: Returns dataset items; choose format=json (default) or format=csv.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: datasetId
          required: true
          schema: { type: string }
        - in: query
          name: format
          required: false
          schema: { type: string, enum: [json, csv], default: json }
        - in: query
          name: clean
          required: false
          schema: { type: boolean, default: true }
      responses:
        "200":
          description: Items
          content:
            application/json:
              schema: { $ref: "#/components/schemas/DatasetItems" }
            text/csv:
              schema: { type: string }

Auth to set in Builder:
Header Authorization: Bearer {APIFY_API_TOKEN} (store the token as a secret in the Action).
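
If you want to sanity-check the wait-and-fetch half of this flow outside ChatGPT, here is a minimal Python sketch (assumptions: the requests library, an APIFY_TOKEN environment variable, and a run ID you already have from the Action or the Apify console):

import os
import time
import requests

API = "https://api.apify.com/v2"
HEADERS = {"Authorization": f"Bearer {os.environ['APIFY_TOKEN']}"}
run_id = "YOUR_RUN_ID"  # placeholder: startRun returns this as data.id

# Poll the run; waitForFinish blocks up to 60 s per request, so loop until a terminal status.
while True:
    run = requests.get(f"{API}/actor-runs/{run_id}",
                       params={"waitForFinish": 60}, headers=HEADERS, timeout=90).json()["data"]
    if run["status"] in {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}:
        break
    time.sleep(2)

# Fetch the dataset items only if the run succeeded.
if run["status"] == "SUCCEEDED":
    items = requests.get(f"{API}/datasets/{run['defaultDatasetId']}/items",
                         params={"format": "json", "clean": "true"},
                         headers=HEADERS, timeout=60).json()
    print(len(items), "items")
else:
    print("Run ended with status:", run["status"])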


2) Scrapy Orchestrator Action — Your Server / On-Prem

Wraps your Scrapy runner (orchestration API) with start/poll/fetch endpoints.

openapi: 3.1.0
info:
  title: Scrapy Orchestrator
  version: "1.0.0"
  description: Start a Scrapy crawl job on your infra, poll status, and fetch results (JSON or CSV).
servers:
  - url: https://scrapy.example.com/api
components:
  securitySchemes:
    Bearer:
      type: apiKey
      in: header
      name: Authorization
      description: "Use: Bearer {YOUR_SCRAPY_ORCH_TOKEN}"
  schemas:
    StartJobInput:
      type: object
      required: [spider, phrase, seeds]
      properties:
        spider:
          type: string
          example: beliefs
        phrase:
          type: string
          example: fear of the lord
        near:
          type: array
          items: { type: string }
          example: ["fear of god", "reverent awe"]
        seeds:
          type: object
          properties:
            domains: { type: array, items: { type: string }, example: ["examplechurch.org","anotherchurch.com"] }
            urls:    { type: array, items: { type: string }, example: ["https://example.org/about/beliefs"] }
        opts:
          type: object
          properties:
            depth: { type: integer, default: 2 }
            delay_ms: { type: integer, default: 400 }
            concurrency: { type: integer, default: 8 }
    JobStatus:
      type: object
      properties:
        id: { type: string }
        status:
          type: string
          description: QUEUED | RUNNING | SUCCEEDED | FAILED | ABORTED | TIMED_OUT
        startedAt: { type: string, nullable: true }
        finishedAt: { type: string, nullable: true }
        items: { type: integer, description: number of result rows }
        resultsId: { type: string, description: dataset/result identifier }
security:
  - Bearer: []
paths:
  /jobs:
    post:
      operationId: startJob
      summary: Start a Scrapy job
      description: Launches a crawl; returns a job object immediately.
      security: [{ Bearer: [] }]
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/StartJobInput" }
      responses:
        "201":
          description: Job created
          content:
            application/json:
              schema: { $ref: "#/components/schemas/JobStatus" }

  /jobs/{id}:
    get:
      operationId: getJob
      summary: Get job status
      description: Returns current status and counters for a job.
      security: [{ Bearer: [] }]
      parameters:
        - in: path
          name: id
          required: true
          schema: { type: string }
      responses:
        "200":
          description: Status
          content:
            application/json:
              schema: { $ref: "#/components/schemas/JobStatus" }

  /jobs/{id}/results:
    get:
      operationId: getResults
      summary: Fetch results
      description: Returns results as JSON (default) or CSV text; only valid after SUCCEEDED.
      security: [{ Bearer: [] }]
      parameters:
        - in: path
          name: id
          required: true
          schema: { type: string }
        - in: query
          name: format
          required: false
          schema: { type: string, enum: [json, csv], default: json }
      responses:
        "200":
          description: Items
          content:
            application/json:
              schema:
                type: array
                items: { type: object, additionalProperties: true }
            text/csv:
              schema: { type: string }

Auth to set in Builder:
Header Authorization: Bearer {YOUR_SCRAPY_ORCH_TOKEN} (store as a secret in the Action).
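
For a quick end-to-end check of the orchestrator outside ChatGPT, here is a minimal Python sketch of the same startJob → getJob → getResults sequence the GPT will drive (the base URL, token variable, and seed values are placeholders; the endpoints follow the schema above):

import os
import time
import requests

BASE = "https://scrapy.example.com/api"  # placeholder: your orchestrator's base URL
HEADERS = {"Authorization": f"Bearer {os.environ['SCRAPY_ORCH_TOKEN']}"}

# startJob: launch a crawl (body mirrors StartJobInput above).
job = requests.post(f"{BASE}/jobs", headers=HEADERS, timeout=30, json={
    "spider": "beliefs",
    "phrase": "fear of the lord",
    "seeds": {"domains": ["examplechurch.org"], "urls": []},
    "opts": {"depth": 2},
}).json()

# getJob: poll until the job reaches a terminal status.
while job["status"] not in {"SUCCEEDED", "FAILED", "ABORTED", "TIMED_OUT"}:
    time.sleep(5)
    job = requests.get(f"{BASE}/jobs/{job['id']}", headers=HEADERS, timeout=30).json()

# getResults: fetch items as JSON once the job has succeeded.
if job["status"] == "SUCCEEDED":
    items = requests.get(f"{BASE}/jobs/{job['id']}/results",
                         params={"format": "json"}, headers=HEADERS, timeout=60).json()
    print(len(items), "items")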


Quick Builder checklist (1 minute)

  1. Configure → Instructions: paste the big Instructions block I gave you.
  2. Actions → Add Action: paste Apify schema → set header Authorization: Bearer {APIFY_API_TOKEN}.
  3. Actions → Add Action: paste Scrapy Orchestrator schema → set header Authorization: Bearer {YOUR_SCRAPY_ORCH_TOKEN}.
  4. Save, set visibility, test with: “Scan these sites for ‘fear of the lord’: https://examplechurch.org, https://anotherchurch.com — use Apify.”

Here’s the one-call Apify “sync” Action so your Custom GPT can launch a crawl and get back the dataset in the same response (perfect for small/medium batches).

Use this alongside the two Actions you already added. Tell the GPT to prefer sync for small runs, and fall back to the async pair when jobs are larger or might exceed the sync window.


3) Apify “Sync” Action — Single call, returns dataset immediately

Starts the Actor and waits on the same HTTP request, then returns dataset items. Best for small jobs (server-side time window is limited by Apify; think minutes, not hours).

openapi: 3.1.0
info:
  title: Phrase Emphasis Crawler (Apify Sync)
  version: "1.0.0"
  description: >
    Run an Apify Actor and return its dataset items in a single call.
    Use for small runs; for larger jobs use the regular start/poll/fetch actions.
servers:
  - url: https://api.apify.com/v2
components:
  securitySchemes:
    ApifyToken:
      type: apiKey
      in: header
      name: Authorization
      description: "Use: Bearer {APIFY_API_TOKEN}"
  schemas:
    SyncRunInput:
      type: object
      required: [input]
      properties:
        input:
          type: object
          description: INPUT JSON passed to the Actor (e.g., startUrls, sameDomainOnly, maxDepth)
          example:
            startUrls: [{ url: "https://examplechurch.org" }, { url: "https://anotherchurch.com" }]
            sameDomainOnly: true
            maxDepth: 2
        options:
          type: object
          description: Optional run options (memory, timeoutSecs, build, etc.)
          example: { memory: 2048, timeoutSecs: 300 }
security:
  - ApifyToken: []
paths:
  /acts/{actorId}/run-sync-get-dataset-items:
    post:
      operationId: runSyncGetDatasetItems
      summary: Run Actor and return dataset items (sync)
      description: >
        Executes the Actor synchronously and returns its dataset items.
        Use query param format=json (default) or format=csv.
      security: [{ ApifyToken: [] }]
      parameters:
        - in: path
          name: actorId
          required: true
          schema: { type: string }
          description: Actor ID or "username~actor-name"
        - in: query
          name: format
          required: false
          schema: { type: string, enum: [json, csv], default: json }
        - in: query
          name: clean
          required: false
          schema: { type: boolean, default: true }
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/SyncRunInput" }
      responses:
        "200":
          description: Dataset items
          content:
            application/json:
              schema:
                type: array
                items: { type: object, additionalProperties: true }
            text/csv:
              schema: { type: string }

Auth to set in Builder:
Header Authorization: Bearer {APIFY_API_TOKEN} (store as a secret).


How your GPT should choose between Sync vs Async

Append this to your GPT’s Instructions (or blend into the “Backend selection logic”):

  • If the user’s seed list is small (e.g., ≤ 50–100 start URLs, depth ≤ 2), call runSyncGetDatasetItems (Apify Sync) with format=json by default.
  • If the user supplies a large list or asks for deeper crawls, use the async Apify flow: startRun → getRun (waitForFinish=60, loop) → getDatasetItems.
  • On timeout or 4xx/5xx from Sync, fall back automatically to the async flow and tell the user you switched to a long-running job.
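
Outside the Instructions block, the same thresholds and fallback can be expressed as a small Python sketch (illustrative only; run_sync and run_async stand in for whatever client code drives the two flows):

from typing import Callable, Dict, List

def pick_apify_flow(start_urls: List[str], max_depth: int, sync_limit: int = 50) -> str:
    """Mirror the GPT rule: small, shallow crawls go sync; everything else goes async."""
    return "sync" if len(start_urls) <= sync_limit and max_depth <= 2 else "async"

def run_with_fallback(run_sync: Callable[[], List[Dict]],
                      run_async: Callable[[], List[Dict]]) -> List[Dict]:
    """Try the one-call sync run first; on timeout or an HTTP error, fall back to async."""
    try:
        return run_sync()
    except Exception:
        return run_async()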

Example payload (Sync)

{
  "input": {
    "startUrls": [
      {"url": "https://examplechurch.org"},
      {"url": "https://anotherchurch.com"}
    ],
    "sameDomainOnly": true,
    "maxDepth": 2
  },
  "options": { "timeoutSecs": 300 }
}

Ask for CSV by adding ?format=csv&clean=true to the path.


Minimal test script (curl)

# JSON
curl -s -X POST \
  "https://api.apify.com/v2/acts/your-user~phrase-emphasis-crawler/run-sync-get-dataset-items?format=json&clean=true" \
  -H "Authorization: Bearer $APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"input":{"startUrls":[{"url":"https://examplechurch.org"}],"sameDomainOnly":true,"maxDepth":2}}'

# CSV
curl -s -X POST \
  "https://api.apify.com/v2/acts/your-user~phrase-emphasis-crawler/run-sync-get-dataset-items?format=csv&clean=true" \
  -H "Authorization: Bearer $APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"input":{"startUrls":[{"url":"https://examplechurch.com"}],"sameDomainOnly":true,"maxDepth":2}}'

Here are two clean ways to add a Result Summarizer so your Custom GPT reports the same way every time—no matter whether results came from Apify Sync, Apify Async, or Scrapy.


Option A — Use GPT itself (no new Action)

When to use: fastest path; nothing to host.
How: enable Code Interpreter (a.k.a. “Advanced Data Analysis”) for your Custom GPT. Then add this to your GPT’s Instructions:

Summarization rule (no Action):
When you receive dataset items (JSON array or CSV text) with fields like host, url, EI (and optionally exact_hits, near_hits, heading_flag, early_flag), do the following in the tool:

  1. Parse records.
  2. Compute, per host: pages scanned, max EI, average EI (two decimals).
  3. List top 5 pages by EI (host, EI, url).
  4. Output a compact table with columns: Host | Pages | Top EI | Avg EI (e.g., example.org | 12 | 5 | 3.41).
  5. One-paragraph interpretation (concise).
  6. If user requests CSV, return CSV text assembled from the parsed items.

That’s it—no external call. The GPT will do the math inside the sandbox.
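
For reference, the rollup the sandbox would compute looks roughly like this (a sketch using pandas; the sample items stand in for whatever the crawl actually returned):

import pandas as pd

# Sample items in the shape the crawlers emit (host, url, EI).
items = [
    {"host": "example.org", "url": "https://example.org/about/beliefs", "EI": 5},
    {"host": "example.org", "url": "https://example.org/values", "EI": 3},
    {"host": "another.com", "url": "https://another.com/statement-of-faith", "EI": 4},
]
df = pd.DataFrame(items)

# Per-host rollup: pages scanned, max EI, average EI (two decimals).
by_host = (
    df.groupby("host")["EI"]
      .agg(pages="count", top_ei="max", avg_ei="mean")
      .round({"avg_ei": 2})
      .sort_values("avg_ei", ascending=False)
      .reset_index()
)

# Top 5 pages by EI, then the overall average.
top_pages = df.sort_values("EI", ascending=False).head(5)[["host", "EI", "url"]]
print(by_host.to_string(index=False))
print(top_pages.to_string(index=False))
print(f"Overall average EI: {df['EI'].mean():.2f}")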


Option B — Add a tiny Summarizer Action (uniform, API-driven)

When to use: you want deterministic, audited summaries (same result every time), or you don’t want to enable Code Interpreter.

You’ll add:

  1. A Summarizer endpoint to your Scrapy Orchestrator (FastAPI).
  2. A Summarizer OpenAPI schema as a third Action in the Custom GPT.

1) FastAPI endpoint (drop-in to your existing app.py)

Add this to the bottom of your orchestrator (or keep it as a separate microservice). It accepts raw items (JSON array) and returns the standardized report.

from fastapi import Header  # used below; app and require_auth come from your existing app.py
from pydantic import BaseModel
from statistics import mean
from typing import List, Dict, Any

class SummarizeInput(BaseModel):
    items: List[Dict[str, Any]]

@app.post("/api/summarize")
def summarize(inp: SummarizeInput, Authorization: str | None = Header(None)):
    require_auth(Authorization)
    items = inp.items or []

    # Normalize fields
    norm = []
    for r in items:
        host = str(r.get("host") or r.get("domain") or "").strip()
        url = str(r.get("url") or "").strip()
        ei  = r.get("EI")
        try:
            ei = int(ei)
        except Exception:
            try:
                ei = round(float(ei))
            except Exception:
                ei = None
        if host and url and ei is not None:
            norm.append({"host": host, "url": url, "EI": int(ei)})

    # Group by host
    hosts: Dict[str, List[Dict[str, Any]]] = {}
    for row in norm:
        hosts.setdefault(row["host"], []).append(row)

    # Per-host rollups
    host_rows = []
    for h, rows in hosts.items():
        eis = [r["EI"] for r in rows if isinstance(r["EI"], (int, float))]
        pages = len(rows)
        top_ei = max(eis) if eis else 0
        avg_ei = round(mean(eis), 2) if eis else 0.0
        host_rows.append({"host": h, "pages": pages, "top_ei": top_ei, "avg_ei": avg_ei})

    # Sort outputs
    host_rows.sort(key=lambda x: (-x["avg_ei"], -x["top_ei"], x["host"]))
    top_pages = sorted(norm, key=lambda r: (-r["EI"], r["host"]))[:5]

    # Overall
    overall_avg = round(mean([r["EI"] for r in norm]), 2) if norm else 0.0

    return {
        "overview": {
            "total_items": len(norm),
            "hosts": len(host_rows),
            "overall_avg_EI": overall_avg
        },
        "hosts": host_rows,
        "top_pages": top_pages,
        "notes": "EI = Emphasis Index (0–5). host/pages/top_ei/avg_ei sorted by avg_ei desc."
    }

Auth note: it reuses your existing require_auth() so the same Bearer token protects it.
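
A quick way to exercise the endpoint once it is deployed (a sketch assuming the requests library; the base URL and token environment variable are placeholders):

import os
import requests

BASE = "https://scrapy.example.com"  # placeholder: your orchestrator host
TOKEN = os.environ["SCRAPY_ORCH_TOKEN"]

payload = {
    "items": [
        {"host": "example.org", "url": "https://example.org/about/beliefs", "EI": 5},
        {"host": "example.org", "url": "https://example.org/values", "EI": 3},
    ]
}
resp = requests.post(f"{BASE}/api/summarize", json=payload,
                     headers={"Authorization": f"Bearer {TOKEN}"}, timeout=30)
resp.raise_for_status()
summary = resp.json()
print(summary["overview"], summary["hosts"])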


2) Summarizer OpenAPI schema (add as third Action)

In the GPT Builder: Configure → Actions → Add Action → Paste schema, then set the Authorization header to Bearer {YOUR_SCRAPY_ORCH_TOKEN} (the same token you used for the Orchestrator).

openapi: 3.1.0
info:
  title: Phrase Emphasis Summarizer
  version: "1.0.0"
  description: Summarize EI results into per-host rollups, top pages, and overall stats.
servers:
  - url: https://scrapy.example.com/api
components:
  securitySchemes:
    Bearer:
      type: apiKey
      in: header
      name: Authorization
      description: "Use: Bearer {YOUR_SCRAPY_ORCH_TOKEN}"
  schemas:
    SummarizeInput:
      type: object
      required: [items]
      properties:
        items:
          type: array
          items:
            type: object
            additionalProperties: true
    SummaryOutput:
      type: object
      properties:
        overview:
          type: object
          properties:
            total_items: { type: integer }
            hosts: { type: integer }
            overall_avg_EI: { type: number }
        hosts:
          type: array
          items:
            type: object
            properties:
              host: { type: string }
              pages: { type: integer }
              top_ei: { type: integer }
              avg_ei: { type: number }
        top_pages:
          type: array
          items:
            type: object
            properties:
              host: { type: string }
              url: { type: string }
              EI: { type: integer }
        notes: { type: string }
security:
  - Bearer: []
paths:
  /summarize:
    post:
      operationId: summarizeResults
      summary: Summarize EI dataset
      description: Provide raw items (JSON array). Returns host rollups, top pages, and overall stats.
      security: [{ Bearer: [] }]
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/SummarizeInput" }
      responses:
        "200":
          description: Summary
          content:
            application/json:
              schema: { $ref: "#/components/schemas/SummaryOutput" }

Tell the GPT when to call the Summarizer

Append this snippet to your GPT’s Instructions:

Summarizer usage:
After fetching dataset items (JSON) from Apify or Scrapy, call summarizeResults with the items. Render the returned hosts table and top_pages (max 5). Include overview.overall_avg_EI. If the user asks for CSV, fetch CSV from the source backend instead of the summarizer.


Minimal end-to-end flow (example conversation)

  • You: “Scan these sites for ‘fear of the lord’: https://examplechurch.org, https://anotherchurch.com — use Apify (sync).”
  • GPT (Actions):
    1. runSyncGetDatasetItems → gets JSON items
    2. summarizeResults → receives standardized summary
    3. Returns a neat table + one-paragraph interpretation
    4. Offers “Export CSV” button (calls Apify dataset endpoint with format=csv if requested)
