AI-powered website data extraction with schema-validated output. Provide a JSON Schema, optional instructions, and get structured JSON back.
Install:
npm install @teichai/extractify
Minimal usage:
import { ExtractifyClient } from "@teichai/extractify";

const client = new ExtractifyClient({
  model: "openai/gpt-oss-20b",
  // apiKey: process.env.API_KEY, // optional if set in env
});

const result = await client.extractFromUrl({
  url: "https://example.com",
  schema: {
    type: "object",
    properties: {
      title: { type: "string" },
      price: { type: "number" },
    },
    required: ["title", "price"],
    additionalProperties: false,
  },
  instructions: "Extract the product title and numeric price.",
  // validate: false, // optional, defaults to true
});

console.log(JSON.stringify(result, null, 2));
Signature:
new ExtractifyClient(options: {
  baseUrl?: string;
  apiKey?: string;
  model: string;
  fetcher?: typeof fetch;
  headers?: Record<string, string>;
  flaresolverrUrl?: string;
})
Defaults:
- `apiKey` falls back to `process.env.API_KEY`
- `baseUrl` defaults to `https://openrouter.ai/api/v1`

Option reference:
| Option | Type | Purpose |
|---|---|---|
| `baseUrl` | `string` | OpenAI-compatible base URL |
| `apiKey` | `string` | API key (or `API_KEY`) |
| `model` | `string` | Default model name |
| `fetcher` | `typeof fetch` | Custom fetch implementation |
| `headers` | `Record<string, string>` | Extra headers for AI requests |
| `flaresolverrUrl` | `string` | Use FlareSolverr for all requests |
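
The defaults listed above can be sketched as a small pure function. This is an illustration of the documented fallback order only, not the library's actual code: `resolveOptions`, `ResolvedOptions`, and the `env` parameter are hypothetical names introduced here.

```typescript
// Hypothetical helper: mirrors the documented defaults, not Extractify's internals.
interface ClientOptions {
  baseUrl?: string;
  apiKey?: string;
  model: string;
}

interface ResolvedOptions {
  baseUrl: string;
  apiKey: string | undefined;
  model: string;
}

// Default documented in this README.
const DEFAULT_BASE_URL = "https://openrouter.ai/api/v1";

function resolveOptions(
  options: ClientOptions,
  env: Record<string, string | undefined>
): ResolvedOptions {
  return {
    // An explicit baseUrl wins; otherwise use the documented default.
    baseUrl: options.baseUrl ?? DEFAULT_BASE_URL,
    // An explicit apiKey wins; otherwise fall back to the API_KEY variable.
    apiKey: options.apiKey ?? env["API_KEY"],
    model: options.model,
  };
}
```

In a Node script, `process.env` would be the natural value to pass as `env`. Passing it in as a parameter keeps the sketch easy to test.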
Signature:
extractFromUrl(params: {
  url: string;
  schema: Record<string, unknown>;
  instructions: string;
  allowedStatusCodes?: number[];
  model?: string;
  temperature?: number;
  validate?: boolean;
  flaresolverrUrl?: string;
  flaresolverrTimeoutMs?: number;
})
Behavior notes:
- `allowedStatusCodes` overrides the default 2xx-only check.
- `validate` defaults to `true` and throws on schema mismatch.
- `model` overrides the client-level model per call.
- `flaresolverrUrl` at the call level overrides the client-wide value.
- `flaresolverrTimeoutMs` is passed to FlareSolverr as `maxTimeout`.

Request flow:
Extractify validates the parsed XML against your JSON Schema by default. If the
response does not match, extractFromUrl throws an error.
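
Because a schema mismatch surfaces as a thrown error, callers may want to turn it into a value. The wrapper below is a minimal sketch under stated assumptions: `Extractor`, `ExtractParams`, `ExtractOutcome`, and `tryExtract` are hypothetical names, and `Extractor` only models the one method used here. The README does not document a dedicated error class, so the sketch cannot tell a validation failure apart from a network failure.

```typescript
// Minimal structural types for illustration; the real client has more surface.
interface ExtractParams {
  url: string;
  schema: Record<string, unknown>;
  instructions: string;
}

interface Extractor {
  extractFromUrl(params: ExtractParams): Promise<unknown>;
}

type ExtractOutcome =
  | { ok: true; data: unknown }
  | { ok: false; message: string };

// Hypothetical wrapper: converts any thrown error into a result value.
async function tryExtract(
  client: Extractor,
  params: ExtractParams
): Promise<ExtractOutcome> {
  try {
    const data = await client.extractFromUrl(params);
    return { ok: true, data };
  } catch (err) {
    // Keep only the message; non-Error throwables are stringified.
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, message };
  }
}
```

An `ExtractifyClient` instance should fit the `Extractor` shape, assuming its `extractFromUrl` returns a promise as the minimal usage example suggests.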
Disable validation:
await client.extractFromUrl({
  url,
  schema,
  instructions,
  validate: false,
});
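
The per-call behavior described in the notes above (call-level `model` and `flaresolverrUrl` taking precedence, `validate` defaulting to `true`, `allowedStatusCodes` overriding the 2xx check) can be sketched as two pure functions. This is not the library's code: `resolveCallSettings` and `isStatusAllowed` are hypothetical, and the sketch assumes "overrides" means the supplied list replaces the 2xx check rather than extending it.

```typescript
interface ClientDefaults {
  model: string;
  flaresolverrUrl?: string;
}

interface CallParams {
  model?: string;
  validate?: boolean;
  flaresolverrUrl?: string;
}

interface EffectiveSettings {
  model: string;
  validate: boolean;
  flaresolverrUrl: string | undefined;
}

// Hypothetical: call-level values win over client-level ones.
function resolveCallSettings(
  defaults: ClientDefaults,
  params: CallParams
): EffectiveSettings {
  return {
    model: params.model ?? defaults.model,
    // Validation is on unless the call explicitly turns it off.
    validate: params.validate ?? true,
    flaresolverrUrl: params.flaresolverrUrl ?? defaults.flaresolverrUrl,
  };
}

// Hypothetical: assumes a supplied list replaces the default 2xx-only check.
function isStatusAllowed(status: number, allowedStatusCodes?: number[]): boolean {
  if (allowedStatusCodes !== undefined) {
    return allowedStatusCodes.indexOf(status) !== -1;
  }
  return status >= 200 && status < 300;
}
```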
Number handling rules:
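- For fields typed `number` or `integer`, the model must return digits only.
- Values such as `"$19.99"` or `"19.99USD"` will fail validation.
- Return a plain numeric value (for example `"19.99"`), not mixed text.

Array output rules:

- Arrays are returned as repeated `<item>` elements inside the field tag.

<result>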
<products>
<item><title>Example</title><price>10</price></item>
<item><title>Example 2</title><price>20</price></item>
</products>
</result>
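
For reference, a schema that would pair with the XML above might look like the sketch below, together with a small check that mirrors the number handling rules. Both are illustrative assumptions: `productsSchema` is inferred from the field names in the XML, and `isPlainNumber` is a hypothetical helper whose exact rule (optional leading minus, optional decimal part) is a guess at what "digits only" permits, not Extractify's own validator.

```typescript
// Assumed schema for the XML above: "products" is an array of objects.
const productsSchema = {
  type: "object",
  properties: {
    products: {
      type: "array",
      items: {
        type: "object",
        properties: {
          title: { type: "string" },
          price: { type: "number" },
        },
        required: ["title", "price"],
        additionalProperties: false,
      },
    },
  },
  required: ["products"],
  additionalProperties: false,
};

// Hypothetical pre-check: digits with an optional minus sign and decimal part,
// and nothing else (no currency symbols, no unit suffixes).
function isPlainNumber(text: string): boolean {
  return /^-?\d+(\.\d+)?$/.test(text.trim());
}
```

Under this check, `"19.99"` and `"10"` are accepted while `"$19.99"` and `"19.99USD"` are rejected, matching the rules listed earlier.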
Use FlareSolverr for pages with bot protection.
Client-wide:
const client = new ExtractifyClient({
  model: "openai/gpt-oss-20b",
  flaresolverrUrl: "http://localhost:8191",
});
Per request:
await client.extractFromUrl({
  url: "https://example.com",
  schema,
  instructions,
  flaresolverrUrl: "http://localhost:8191",
  flaresolverrTimeoutMs: 90000,
});
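
To make the `flaresolverrTimeoutMs` to `maxTimeout` mapping concrete, here is a sketch of the JSON body a FlareSolverr `request.get` command takes. `buildFlareSolverrRequest` is a hypothetical helper. The README does not show how Extractify builds this request, including whether it appends FlareSolverr's `/v1` path to `flaresolverrUrl`, so treat this as an illustration of the FlareSolverr side only.

```typescript
// Shape of a FlareSolverr "request.get" command body.
interface FlareSolverrGetRequest {
  cmd: "request.get";
  url: string;
  maxTimeout?: number; // milliseconds
}

// Hypothetical helper: maps the per-call timeout onto FlareSolverr's maxTimeout.
function buildFlareSolverrRequest(
  targetUrl: string,
  flaresolverrTimeoutMs?: number
): FlareSolverrGetRequest {
  const body: FlareSolverrGetRequest = { cmd: "request.get", url: targetUrl };
  // Leave maxTimeout out when unset so FlareSolverr applies its own default.
  if (flaresolverrTimeoutMs !== undefined) {
    body.maxTimeout = flaresolverrTimeoutMs;
  }
  return body;
}
```

With the values from the per-request example, the body would carry `url: "https://example.com"` and `maxTimeout: 90000`.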
Run an example:
npm run build
node examples/basic.mjs
Example files:
- examples/basic.mjs
- examples/allowed-status.mjs
- examples/flaresolverr_huggingface_models.mjs
- examples/flaresolverr_wikipedia_article.mjs
- examples/flaresolverr_wikipedia_homepage.mjs