Web Crawler Node

Technical Specifications

The Web Crawler Node provides a mechanism to extract content from specified public web URLs programmatically. This document details its technical specifications, including purpose, how it works, configuration schema, input/output data formats, and error handling.

Purpose and How the Node Works

The Web Crawler Node is designed to fetch and parse the content of a given URL, returning it in both plain text and Markdown formats.

  • Input Resolution: Primarily takes a url and an optional instruction as direct configuration.

  • Request/Processing:

    • The node issues an HTTP GET request to the specified URL.

    • It then parses the HTML response to extract the main textual content.

    • This extracted content is then formatted into both plain text and Markdown (a sketch of this flow appears after this list).

  • Execution Model: Blocking – the node waits for the webpage to be fetched and its content extracted and parsed.

  • Response Handling:

    • Success: Returns extractedContent (plain text) and markdown.

    • Failure: Returns an error message and statusCode indicating the nature of the failure.
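
The flow described above can be sketched as follows. This is a minimal illustration, not the node's actual implementation: the htmlToText and htmlToMarkdown helpers are hypothetical stand-ins for whatever extraction logic the runtime uses, and global fetch (Node 18+) is assumed.

TypeScript

// Minimal sketch of the crawl flow described above (illustrative only).
// htmlToText and htmlToMarkdown are hypothetical stand-ins for the runtime's
// real extraction/conversion logic; global fetch (Node 18+) is assumed.
declare function htmlToText(html: string, instruction?: string): string;
declare function htmlToMarkdown(html: string, instruction?: string): string;

interface CrawlResult {
  extractedContent: string | null;
  markdown: string | null;
  statusCode: number;
  error: string | null;
}

async function crawl(url: string, instruction?: string): Promise<CrawlResult> {
  // 1. Validate the URL up front; malformed input maps to a 400-style result.
  let parsed: URL;
  try {
    parsed = new URL(url);
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      throw new Error("not HTTP/HTTPS");
    }
  } catch {
    return {
      extractedContent: null,
      markdown: null,
      statusCode: 400,
      error: `Malformed URL: '${url}' is not a valid HTTP/HTTPS address.`,
    };
  }

  try {
    // 2. Blocking GET request: the node waits for the full response.
    const response = await fetch(parsed.toString());
    if (!response.ok) {
      return { extractedContent: null, markdown: null, statusCode: 500, error: `HTTP error: ${response.status}` };
    }

    // 3. Parse the HTML and produce both output formats.
    const html = await response.text();
    return {
      extractedContent: htmlToText(html, instruction),
      markdown: htmlToMarkdown(html, instruction),
      statusCode: 200,
      error: null,
    };
  } catch (err) {
    // Network failures (DNS, connection refused, timeout) surface as 500.
    return { extractedContent: null, markdown: null, statusCode: 500, error: `Network error: ${(err as Error).message}` };
  }
}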

Configuration Schema

The Web Crawler Node has a straightforward configuration:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The absolute URL of the webpage to crawl. |
| instruction | string | Optional | An instruction to guide the content extraction (e.g., "Extract only the main article text"). |
| name | string | Optional | Display name for this node instance. |
| description | string | Optional | Longer description shown in the workflow designer. |
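
For reference, the configuration can be written as the following TypeScript type. This is an illustrative sketch derived from the table above, not a type exported by the runtime.

TypeScript

// Illustrative shape of the Web Crawler Node configuration (not an official runtime type).
interface WebCrawlerConfig {
  /** Absolute HTTP/HTTPS URL of the webpage to crawl. */
  url: string;
  /** Optional guidance for extraction, e.g. "Extract only the main article text". */
  instruction?: string;
  /** Optional display name for this node instance. */
  name?: string;
  /** Optional longer description shown in the workflow designer. */
  description?: string;
}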

Output Schema

The node's output port (crawledContent) will always conform to the following schema:

| Field | Type | Always present | Description |
| --- | --- | --- | --- |
| extractedContent | string | Yes | The plain text content extracted from the webpage. |
| markdown | string | Yes | The Markdown-formatted content extracted from the webpage. |
| statusCode | number | Yes | HTTP-like status code indicating the result of the operation. |
| error | string | No | Error message if processing failed; null on success. |
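
Expressed as a TypeScript type, the output looks roughly as follows. This is an illustrative sketch of the schema above, not a type exported by the runtime.

TypeScript

// Illustrative shape of the crawledContent output (not an official runtime type).
interface WebCrawlerOutput {
  /** Plain text content; null when the crawl fails. */
  extractedContent: string | null;
  /** Markdown-formatted content; null when the crawl fails. */
  markdown: string | null;
  /** HTTP-like status code: 200 on success, 400 or 500 on failure (see details below). */
  statusCode: number;
  /** Error message when processing failed; null on success. */
  error: string | null;
}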

statusCode Details:

  • 200: Successfully extracted data from the webpage.

  • 400: Bad Request. The url is missing or malformed, or the instruction is invalid.

  • 500: Internal Server Error. The target could not be reached, returned an HTTP error, or its response could not be parsed for content extraction.

Examples

Success Example

Configuration:

JSON

{
  "url": "https://www.example.com/news/article-123",
  "instruction": "Extract the main article text and headline."
}

Output:

JSON

{
  "extractedContent": "Headline: Latest News. This is the main body of the article...",
  "markdown": "# Latest News\n\nThis is the main body of the article...",
  "statusCode": 200,
  "error": null
}

Failure Example (Malformed URL)

Configuration:

JSON

{
  "url": "invalid-url",
  "instruction": "Extract content."
}

Output:

JSON

{
  "extractedContent": null,
  "markdown": null,
  "statusCode": 400,
  "error": "Malformed URL: 'invalid-url' is not a valid HTTP/HTTPS address."
}

Single-Node Test API

For testing a node in isolation (e.g., via the UI "Test" button or a dedicated API), the following endpoint is used:

  • Path: skill-runtime/workflows/nodes/WebCrawler/execute

  • Method: POST

  • Purpose: Execute one node in isolation.

  • Request Body:

JSON

{
  "config": {
    "url": "https://www.example.com",
    "instruction": "Get all text"
  },
  "input": {}
}
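
As an illustration, the endpoint can be called from TypeScript as shown below. The base URL and any authentication headers are assumptions; substitute whatever your deployment requires.

TypeScript

// Illustrative call to the single-node test endpoint.
// BASE_URL is a hypothetical host; add auth headers if your deployment requires them.
const BASE_URL = "https://your-runtime.example.com";

async function testWebCrawlerNode(): Promise<void> {
  const response = await fetch(`${BASE_URL}/skill-runtime/workflows/nodes/WebCrawler/execute`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      config: {
        url: "https://www.example.com",
        instruction: "Get all text",
      },
      input: {},
    }),
  });

  // The response body is expected to follow the output schema above
  // (extractedContent, markdown, statusCode, error).
  const result = await response.json();
  console.log(result.statusCode, result.error ?? "ok");
}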

Error Handling

| Error Message | statusCode | Cause |
| --- | --- | --- |
| Malformed URL | 400 | The provided URL is not a valid HTTP/HTTPS address. |
| Network error | 500 | The target URL could not be reached (DNS, connection refused, timeout). |
| HTTP error: | 500 | The target server returned an HTTP error (e.g., 404, 500). |
| Failed to parse or extract data from response | 500 | Internal error during content extraction from the webpage. |
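
Downstream nodes or callers can branch on statusCode and error to decide how to proceed. A minimal sketch, assuming the illustrative WebCrawlerOutput type from the Output Schema section:

TypeScript

// Minimal sketch of consuming the node's output downstream (illustrative only).
function handleCrawlResult(result: WebCrawlerOutput): string {
  if (result.statusCode === 200 && result.error === null) {
    // Prefer the Markdown form when feeding content to a later formatting step.
    return result.markdown ?? result.extractedContent ?? "";
  }
  if (result.statusCode === 400) {
    // Configuration problem (e.g., malformed URL): fix the config rather than retrying.
    throw new Error(`Web Crawler configuration error: ${result.error}`);
  }
  // 500: network failure, upstream HTTP error, or extraction failure;
  // retrying may help for transient network issues.
  throw new Error(`Web Crawler runtime error: ${result.error}`);
}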

Security Notes

  • Public URLs Only: The Web Crawler Node is designed for public web content. It does not support authentication, so do not use it to crawl private or authenticated resources.

  • No Sensitive Data in URL: Avoid placing sensitive information directly in the URL, as it may be logged.

  • robots.txt Compliance: While the node does not explicitly enforce robots.txt rules, it is the user's responsibility to ensure that crawling the specified URL complies with the target website's policies.

  • Logging: URLs and instructions may be logged for debugging and auditing purposes.
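
Because the node does not check robots.txt itself, callers who want to be conservative can run their own check before configuring a crawl. The sketch below is a deliberately naive check of Disallow rules in the User-agent: * group, not a compliant robots.txt parser.

TypeScript

// Naive robots.txt pre-check (illustrative only; not a compliant robots.txt parser).
// Only Disallow rules in the "User-agent: *" group are considered.
async function isLikelyAllowed(targetUrl: string): Promise<boolean> {
  const url = new URL(targetUrl);
  const response = await fetch(`${url.origin}/robots.txt`);
  if (!response.ok) return true; // No readable robots.txt: assume allowed.

  const lines = (await response.text()).split("\n").map((line) => line.trim());
  let inWildcardGroup = false;
  for (const line of lines) {
    if (/^user-agent:/i.test(line)) {
      inWildcardGroup = line.slice("user-agent:".length).trim() === "*";
    } else if (inWildcardGroup && /^disallow:/i.test(line)) {
      const prefix = line.slice("disallow:".length).trim();
      if (prefix && url.pathname.startsWith(prefix)) return false;
    }
  }
  return true;
}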
