
Extractify Docs

AI-powered website data extraction with schema-validated output. Provide a JSON Schema, optional instructions, and get structured JSON back.

Quick start

Install:

npm install @teichai/extractify

Minimal usage:

import { ExtractifyClient } from "@teichai/extractify";

const client = new ExtractifyClient({
  model: "openai/gpt-oss-20b",
  // apiKey: process.env.API_KEY, // optional if set in env
});

const result = await client.extractFromUrl({
  url: "https://example.com",
  schema: {
    type: "object",
    properties: {
      title: { type: "string" },
      price: { type: "number" },
    },
    required: ["title", "price"],
    additionalProperties: false,
  },
  instructions: "Extract the product title and numeric price.",
  // validate: false, // optional, defaults to true
});

console.log(JSON.stringify(result, null, 2));

Client options

Signature:

new ExtractifyClient(options: {
  baseUrl?: string;
  apiKey?: string;
  model: string;
  fetcher?: typeof fetch;
  headers?: Record<string, string>;
  flaresolverrUrl?: string;
})

Defaults:

Option reference:

Option           Type                    Purpose
baseUrl          string                  OpenAI-compatible base URL
apiKey           string                  API key (optional if API_KEY is set in the environment)
model            string                  Default model name
fetcher          typeof fetch            Custom fetch implementation
headers          Record<string, string>  Extra headers for AI requests
flaresolverrUrl  string                  FlareSolverr URL to use for all requests
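
As an illustration of the fetcher option, the sketch below wraps a fetch implementation so that each request is logged. The withLogging helper is hypothetical and not part of Extractify; it assumes only that the client calls the supplied fetcher the same way it would call the global fetch.

```typescript
// Hypothetical helper (not part of Extractify): wraps any fetch-compatible
// function and reports the status, URL, and duration of each call.
function withLogging(base: typeof fetch, log: (line: string) => void): typeof fetch {
  return async (input, init) => {
    const started = Date.now();
    const response = await base(input, init);
    // The first argument may be a string, a URL, or a Request object.
    const target = input instanceof Request ? input.url : String(input);
    log(`${response.status} ${target} (${Date.now() - started} ms)`);
    return response;
  };
}
```

Under that assumption, it could be passed to the client as fetcher: withLogging(fetch, console.log).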

Extraction options

Signature:

extractFromUrl(params: {
  url: string;
  schema: Record<string, unknown>;
  instructions: string;
  allowedStatusCodes?: number[];
  model?: string;
  temperature?: number;
  validate?: boolean;
  flaresolverrUrl?: string;
  flaresolverrTimeoutMs?: number;
})

Behavior notes:

Request flow:

  1. Fetch the page HTML (directly or via FlareSolverr).
  2. Convert HTML to markdown (scripts/styles removed).
  3. Send a system prompt with your schema and a user prompt with instructions + markdown.
  4. Parse the XML response into JSON, normalize to the schema shape, then validate.
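
Steps 2 and 3 of this flow can be pictured with a small sketch. It is not Extractify's implementation: stripScriptsAndStyles and buildMessages are hypothetical helpers, the regex cleanup is a crude stand-in for real HTML-to-markdown conversion, and the prompt wording is invented for illustration.

```typescript
// Message shape used by OpenAI-compatible chat APIs.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Step 2 (simplified): drop <script> and <style> blocks, strip the remaining
// tags, and collapse whitespace. A real converter would emit markdown instead.
function stripScriptsAndStyles(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Step 3 (simplified): the schema goes into the system prompt; the
// instructions and page text go into the user prompt.
function buildMessages(
  schema: Record<string, unknown>,
  instructions: string,
  pageText: string,
): ChatMessage[] {
  return [
    { role: "system", content: `Return data matching this JSON Schema:\n${JSON.stringify(schema)}` },
    { role: "user", content: `${instructions}\n\n${pageText}` },
  ];
}
```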

Schema validation

By default, Extractify parses the model's XML response into JSON, normalizes it to the schema shape, and validates the result against your JSON Schema. If the result does not match the schema, extractFromUrl throws an error.

Disable validation:

await client.extractFromUrl({
  url,
  schema,
  instructions,
  validate: false,
});

Number handling rules:

Array output rules:

<result>
  <products>
    <item><title>Example</title><price>10</price></item>
    <item><title>Example 2</title><price>20</price></item>
  </products>
</result>
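
To make the XML above concrete, here is a rough sketch of how such a response might map onto a JSON array, assuming a schema with a products array of { title, price } objects. The parseProducts helper is hypothetical and handles only this flat shape; it is not Extractify's parser, and the string-to-number conversion shown is an assumption.

```typescript
interface Product {
  title: string;
  price: number;
}

// Hypothetical helper: reads each <item> and its <title>/<price> children.
function parseProducts(xml: string): Product[] {
  const products: Product[] = [];
  const itemPattern = /<item>([\s\S]*?)<\/item>/g;
  let match: RegExpExecArray | null;
  while ((match = itemPattern.exec(xml)) !== null) {
    const body = match[1];
    const title = /<title>([\s\S]*?)<\/title>/.exec(body)?.[1] ?? "";
    const priceText = /<price>([\s\S]*?)<\/price>/.exec(body)?.[1] ?? "";
    // Assumed coercion: the schema declares price as a number,
    // so the element text is converted from string to number.
    products.push({ title: title.trim(), price: Number(priceText) });
  }
  return products;
}

const sampleXml = `<result>
  <products>
    <item><title>Example</title><price>10</price></item>
    <item><title>Example 2</title><price>20</price></item>
  </products>
</result>`;

console.log(JSON.stringify({ products: parseProducts(sampleXml) }));
// {"products":[{"title":"Example","price":10},{"title":"Example 2","price":20}]}
```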

FlareSolverr

Use FlareSolverr for pages with bot protection.

Client-wide:

const client = new ExtractifyClient({
  model: "openai/gpt-oss-20b",
  flaresolverrUrl: "http://localhost:8191",
});

Per request:

await client.extractFromUrl({
  url: "https://example.com",
  schema,
  instructions,
  flaresolverrUrl: "http://localhost:8191",
  flaresolverrTimeoutMs: 90000,
});
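
One possible pattern combines the two: fetch directly first and go through FlareSolverr only when the direct attempt throws. The sketch below is an assumption-laden illustration, not a documented feature. extractWithFallback is a hypothetical helper, and the Extractor and ExtractParams types are declared locally (mirroring a subset of the extractFromUrl signature above) so that the sketch does not rely on the package's exports.

```typescript
// Local structural types, mirroring a subset of the extractFromUrl parameters.
interface ExtractParams {
  url: string;
  schema: Record<string, unknown>;
  instructions: string;
  flaresolverrUrl?: string;
  flaresolverrTimeoutMs?: number;
}

interface Extractor {
  extractFromUrl(params: ExtractParams): Promise<unknown>;
}

// Hypothetical helper: try a direct request, then retry via FlareSolverr.
// Note that it retries on any error, including schema validation failures.
async function extractWithFallback(
  client: Extractor,
  params: ExtractParams,
  flaresolverrUrl: string,
): Promise<unknown> {
  try {
    return await client.extractFromUrl(params);
  } catch {
    return client.extractFromUrl({
      ...params,
      flaresolverrUrl,
      flaresolverrTimeoutMs: 90000,
    });
  }
}
```

Called as extractWithFallback(client, { url, schema, instructions }, "http://localhost:8191"), this keeps FlareSolverr out of the path for pages that load without it.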

Examples

Run an example:

npm run build
node examples/basic.mjs

Example files: