# Web Crawler Node: Technical Specifications
The Web Crawler Node programmatically extracts content from specified public web URLs. This document details its purpose, behavior, configuration schema, input/output data formats, and error handling.
## Purpose and How the Node Works
The Web Crawler Node is designed to fetch and parse the content of a given URL, returning it in both plain text and Markdown formats.
- **Input Resolution:** Takes a `url` and an optional `instruction` as direct configuration.
- **Request/Processing** (see the sketch after this list):
  - The node issues an HTTP GET request to the specified URL.
  - It parses the HTML response to extract the main textual content.
  - The extracted content is formatted into both plain text and Markdown.
- **Execution Model:** Blocking; the node waits for the webpage to be fetched and its content extracted and parsed.
- **Response Handling:**
  - Success: Returns `extractedContent` (plain text) and `markdown`.
  - Failure: Returns an `error` message and a `statusCode` indicating the nature of the failure.
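To make the flow concrete, here is a minimal sketch of the fetch-and-extract steps in Python, assuming `requests` and `BeautifulSoup`. It illustrates the processing model only; it is not the node's actual implementation, and the `crawl` helper is hypothetical:

```python
from typing import Optional

import requests
from bs4 import BeautifulSoup

def crawl(url: str, instruction: Optional[str] = None) -> dict:
    """Hypothetical sketch of the node's fetch-and-extract flow."""
    if not url.startswith(("http://", "https://")):
        # Malformed URLs map to statusCode 400 (see Error Handling).
        return {"extractedContent": None, "markdown": None, "statusCode": 400,
                "error": f"Malformed URL: '{url}' is not a valid HTTP/HTTPS address."}

    try:
        # Blocking HTTP GET, mirroring the node's execution model.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Network and HTTP errors map to statusCode 500 (see Error Handling).
        return {"extractedContent": None, "markdown": None,
                "statusCode": 500, "error": str(exc)}

    # Parse the HTML response and pull out the textual content.
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator="\n", strip=True)

    # A real extractor would honor `instruction` and emit true Markdown;
    # reusing the plain text for both fields is a simplification here.
    return {"extractedContent": text, "markdown": text,
            "statusCode": 200, "error": None}
```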
## Configuration Schema
The Web Crawler Node has a straightforward configuration:
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string | ✅ | The absolute URL of the webpage to crawl. |
| `instruction` | string | Optional | An instruction to guide the content extraction (e.g., "Extract only the main article text"). |
| `name` | string | Optional | Display name for this node instance. |
| `description` | string | Optional | Longer description shown in the workflow designer. |
## Output Schema
The node's output port (`crawledContent`) always conforms to the following schema:
| Field | Type | Always | Description |
| --- | --- | --- | --- |
| `extractedContent` | string | ✅ | The plain text content extracted from the webpage; null on failure. |
| `markdown` | string | ✅ | The Markdown-formatted content extracted from the webpage; null on failure. |
| `statusCode` | number | ✅ | HTTP-like status code indicating the result of the operation. |
| `error` | string | No | Error message if processing failed; null on success. |
**statusCode Details:**

- `200`: Successfully extracted data from the webpage.
- `400`: Bad Request. Indicates a malformed `url` or missing/invalid `instruction`.
- `500`: Internal Server Error. Indicates a failure to parse or extract data from the HTTP response.
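Downstream consumers can treat the output as a small typed record. The following Python `TypedDict` is an illustrative model of the schema above, not an official client type:

```python
from typing import Optional, TypedDict

class CrawledContent(TypedDict):
    """Illustrative model of the crawledContent output port."""
    extractedContent: Optional[str]  # plain text; null on failure
    markdown: Optional[str]          # Markdown rendering; null on failure
    statusCode: int                  # 200, 400, or 500 (see details above)
    error: Optional[str]             # failure message; None on success
```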
## Examples
### Success Example

**Configuration:**

```json
{
  "url": "https://www.example.com/news/article-123",
  "instruction": "Extract the main article text and headline."
}
```

**Output:**

```json
{
  "extractedContent": "Headline: Latest News. This is the main body of the article...",
  "markdown": "# Latest News\n\nThis is the main body of the article...",
  "statusCode": 200,
  "error": null
}
```
### Failure Example (Malformed URL)

**Configuration:**

```json
{
  "url": "invalid-url",
  "instruction": "Extract content."
}
```

**Output:**

```json
{
  "extractedContent": null,
  "markdown": null,
  "statusCode": 400,
  "error": "Malformed URL: 'invalid-url' is not a valid HTTP/HTTPS address."
}
```
## Single-Node Test API
For testing a node in isolation (e.g., via the UI "Test" button or a dedicated API), the following endpoint is used:
- **Path:** `skill-runtime/workflows/nodes/WebCrawler/execute`
- **Method:** `POST`
- **Purpose:** Execute one node in isolation.

**Request Body:**

```json
{
  "config": {
    "url": "https://www.example.com",
    "instruction": "Get all text"
  },
  "input": {}
}
```
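For example, the endpoint can be exercised with any HTTP client; the sketch below uses Python's `requests`, and the base URL is a placeholder for your deployment:

```python
import requests

# Placeholder; substitute your deployment's host.
BASE_URL = "https://your-deployment.example.com"

payload = {
    "config": {
        "url": "https://www.example.com",
        "instruction": "Get all text",
    },
    "input": {},
}

# POST the node configuration to the single-node execution endpoint.
response = requests.post(
    f"{BASE_URL}/skill-runtime/workflows/nodes/WebCrawler/execute",
    json=payload,
    timeout=60,
)
result = response.json()
print(result["statusCode"], result.get("error"))
```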
## Error Handling
| Error | statusCode | Description |
| --- | --- | --- |
| Malformed URL | 400 | The provided URL is not a valid HTTP/HTTPS address. |
| Network error | 500 | The target URL could not be reached (DNS failure, connection refused, timeout). |
| HTTP error | 500 | The target server returned an HTTP error (e.g., 404, 500). |
| Failed to parse or extract data from response | 500 | Internal error during content extraction from the webpage. |
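A consuming workflow can branch on `statusCode` to separate configuration mistakes from failures that may be transient. A minimal sketch, assuming the output shape documented above:

```python
def handle_crawl_result(result: dict) -> str:
    """Branch on the node's statusCode (illustrative, not a prescribed pattern)."""
    if result["statusCode"] == 200:
        return result["extractedContent"]
    if result["statusCode"] == 400:
        # Bad configuration (e.g., malformed URL): fix the config; retrying won't help.
        raise ValueError(result["error"])
    # 500 covers network, HTTP, and extraction failures, which may be transient.
    raise RuntimeError(result["error"])
```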
## Security Notes
- **Public URLs Only:** The Web Crawler Node is designed for public web content. Do not attempt to crawl private or authenticated resources; authentication is not a feature of this node.
- **No Sensitive Data in URLs:** Avoid placing sensitive information directly in the URL, as it may be logged.
- **robots.txt Compliance:** The node does not enforce robots.txt rules; it is the user's responsibility to ensure that crawling the specified URL complies with the target website's policies (see the sketch after this list).
- **Logging:** URLs and instructions may be logged for debugging and auditing purposes.
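For instance, a caller could check robots.txt before invoking the node using Python's standard-library `urllib.robotparser`. This pre-check is a suggestion for complying with site policies, not something the node performs:

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    """Return True if the site's robots.txt permits crawling the given URL."""
    parser = RobotFileParser()
    parser.set_url(urljoin(url, "/robots.txt"))
    parser.read()  # fetch and parse the site's robots.txt
    return parser.can_fetch(user_agent, url)

# Only hand the URL to the Web Crawler Node if this returns True.
print(allowed_by_robots("https://www.example.com/news/article-123"))
```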