Speech Recognition Node

Audio Processing Technical Deep Dive

The Speech Recognition Node performs automatic speech-to-text transcription on a provided audio file. This document details its technical specifications, including purpose, how it works, configuration schema, input/output data formats, and error handling.

Purpose and How the Node Works

The Speech Recognition Node converts audio input into textual form, enabling downstream tasks like text analysis, classification, or storage.

  • Input Resolution: Primarily takes a downloadUrl for the audio file. It can use $input (previous node output) or $secret (vault secrets) to construct this URL or inject authentication headers if needed.

    • Example Reference: https://example.com/audio/$input.filePath

  • Request/Processing:

    • The node fetches the audio file from the provided downloadUrl.

    • It sends the audio stream to an underlying speech recognition service.

    • It receives the transcribed text from the service.

  • Execution Model: Blocking – the node waits for the transcription to complete before the workflow proceeds.

  • Response Handling:

    • Success: Returns a JSON object containing the transcription.

    • Failure: Returns null in the transcription field and includes the appropriate statusCode and an error message.
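The fetch, transcribe, and respond steps above can be sketched as a single blocking function. This is a minimal illustration, not the node's actual implementation: `fetch_audio`, `transcribe`, and `UnsupportedFormatError` are hypothetical names, and the two callables stand in for the HTTP download and the underlying speech recognition service.

```python
from typing import Callable, Optional, TypedDict


class SpeechRecognitionResult(TypedDict):
    transcription: Optional[str]
    statusCode: int
    error: Optional[str]


class UnsupportedFormatError(Exception):
    """Hypothetical error: the downloaded file is not recognizable audio."""


def run_speech_recognition(
    download_url: str,
    fetch_audio: Callable[[str], bytes],  # assumed: downloads the file
    transcribe: Callable[[bytes], str],   # assumed: calls the speech service
) -> SpeechRecognitionResult:
    """Blocking flow: returns only once a result (or failure) is known."""
    try:
        audio = fetch_audio(download_url)
    except OSError:
        # Network and URL errors such as urllib.error.URLError subclass OSError.
        return {
            "transcription": None,
            "statusCode": 400,
            "error": f"Invalid or inaccessible URL: '{download_url}' could not be resolved.",
        }
    try:
        text = transcribe(audio)
    except UnsupportedFormatError as exc:
        return {
            "transcription": None,
            "statusCode": 422,
            "error": f"Unsupported file format: {exc}",
        }
    except Exception as exc:  # any other service-side failure
        return {
            "transcription": None,
            "statusCode": 500,
            "error": f"Internal transcription error: {exc}",
        }
    return {"transcription": text, "statusCode": 200, "error": None}
```

Passing the downloader and the transcriber in as callables keeps the sketch independent of any particular HTTP client or speech provider.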

Configuration Schema

The Speech Recognition Node has a straightforward configuration:

Field       | Type   | Required | Description
------------|--------|----------|----------------------------------------------------
downloadUrl | string | Required | The direct URL to the audio file for transcription.
name        | string | Optional | Display name for this node instance.
description | string | Optional | Description documenting this node's purpose.

Note: The node itself requires no authentication. If the file at downloadUrl is protected, supply the credentials through headers populated from $secret.
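As an illustration of how $input and $secret references might be expanded before the download, here is a minimal sketch. The `$input.filePath` form comes from the example reference in this document. The `$secret.<name>` form, the `resolve_references` function, and the `audioToken` secret are assumptions made for the example, not the platform's documented resolver.

```python
import re
from typing import Any, Mapping

# Matches references such as $input.filePath or $secret.audioToken.
_REFERENCE = re.compile(r"\$(input|secret)\.([A-Za-z_][A-Za-z0-9_]*)")


def resolve_references(
    template: str,
    input_data: Mapping[str, Any],
    secrets: Mapping[str, str],
) -> str:
    """Replace each $input.<key> / $secret.<key> with its value."""
    sources = {"input": input_data, "secret": secrets}

    def substitute(match: "re.Match[str]") -> str:
        scope, key = match.group(1), match.group(2)
        if key not in sources[scope]:
            raise KeyError(f"${scope}.{key} is not defined")
        return str(sources[scope][key])

    return _REFERENCE.sub(substitute, template)


# Build the URL from the previous node's output.
url = resolve_references(
    "https://example.com/audio/$input.filePath",
    {"filePath": "meeting.wav"},
    {},
)

# Build an authentication header from a vault secret (hypothetical name).
headers = {
    "Authorization": resolve_references(
        "Bearer $secret.audioToken", {}, {"audioToken": "example-token"}
    )
}
```

Failing loudly on an undefined reference avoids silently requesting a URL that still contains a literal `$input...` placeholder.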

Output Schema

The node's output port (speechRecognitionResult) conforms to the following schema:

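Field         | Type           | Always | Description
--------------|----------------|--------|------------------------------------------------
transcription | string         | Yes    | Contains the transcribed text (null on failure).
statusCode    | number         | Yes    | HTTP-style status code for the outcome.
error         | string or null | No     | Error message on failure; null on success.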
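A downstream consumer may want to check a result against this schema before using it. The sketch below is one possible check under stated assumptions: `validate_result` is a hypothetical helper, and the success and failure rules are inferred from the schema and the examples in this document rather than from a published validator.

```python
from typing import Any, List, Mapping


def validate_result(result: Mapping[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means no problems found."""
    problems: List[str] = []

    status = result.get("statusCode")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(status, int) or isinstance(status, bool):
        problems.append("statusCode must be a number")

    transcription = result.get("transcription")
    # error is read with .get() because the schema does not mark it as always present.
    error = result.get("error")

    if status == 200:
        if not isinstance(transcription, str):
            problems.append("transcription must be a string when statusCode is 200")
        if error is not None:
            problems.append("error must be null on success")
    else:
        if transcription is not None:
            problems.append("transcription must be null on failure")
        if not isinstance(error, str):
            problems.append("error must be a message string on failure")

    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to report every schema violation at once.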

statusCode Details:

  • 200: Success – Transcription generated successfully.

  • 400: Invalid or inaccessible URL – The provided downloadUrl could not be resolved or accessed.

  • 422: Unsupported file format – The file at downloadUrl is not a recognized audio format.

  • 500: Internal transcription error – A service-side failure occurred during transcription.

Examples

Success Example

Configuration (JSON):

{
  "downloadUrl": "https://example.com/audio-file.wav",
  "name": "Meeting Transcription",
  "description": "Transcribes the weekly team meeting audio."
}

Output (JSON):

{
  "transcription": "This is the transcript of the audio file.",
  "statusCode": 200,
  "error": null
}

Failure Example (Invalid URL)

Configuration (JSON):

{
  "downloadUrl": "https://example.com/invalid-audio.wav"
}

Output (JSON):

{
  "transcription": null,
  "statusCode": 400,
  "error": "Invalid or inaccessible URL: 'https://example.com/invalid-audio.wav' could not be resolved."
}

Single-Node Test API

For testing a node in isolation (e.g., via the UI "Test" button or a dedicated API), the following endpoint is used:

  • Path: skill-runtime/workflows/nodes/SpeechRecognition/execute

  • Method: POST

  • Purpose: Execute one node in isolation.

  • Request Body (JSON):

{
  "config": {
    "downloadUrl": "https://example.com/audio.wav"
  },
  "input": {}
}
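A request to this endpoint could be assembled as in the sketch below, using only the Python standard library. The path, method, and body shape come from this section. The base URL and the Content-Type header are assumptions, and any authentication the API may require is not covered here.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

NODE_PATH = "skill-runtime/workflows/nodes/SpeechRecognition/execute"


def build_test_request(
    base_url: str,  # assumed: the host serving the workflow runtime
    download_url: str,
    input_data: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build the POST request that executes the node in isolation."""
    body = {
        "config": {"downloadUrl": download_url},
        "input": input_data or {},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/" + NODE_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


# Hypothetical host, used for illustration only.
request = build_test_request(
    "https://workflows.example.com", "https://example.com/audio.wav"
)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```

Building the request separately from sending it lets the URL and body be inspected or logged before any network call is made.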

Error Handling

Error Scenario               | statusCode | Notes
-----------------------------|------------|---------------------------------------------
Invalid or inaccessible URL  | 400        | downloadUrl doesn't resolve or is forbidden.
Unsupported file format      | 422        | Non-audio file or unsupported audio codec.
Internal transcription error | 500        | Service-side failure.
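Downstream code can turn these status codes into ordinary control flow. The following is a small sketch of one way to do that: `TranscriptionError` and `unwrap_transcription` are hypothetical names, and the status descriptions are copied from this document.

```python
from typing import Any, Mapping

# Status meanings as listed in this document.
STATUS_MEANINGS = {
    200: "Success",
    400: "Invalid or inaccessible URL",
    422: "Unsupported file format",
    500: "Internal transcription error",
}


class TranscriptionError(Exception):
    """Hypothetical exception carrying the node's statusCode."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(message)
        self.status_code = status_code


def unwrap_transcription(result: Mapping[str, Any]) -> str:
    """Return the transcription on success; raise TranscriptionError otherwise."""
    status = result["statusCode"]
    if status == 200:
        return result["transcription"]
    meaning = STATUS_MEANINGS.get(status, "Unknown status")
    raise TranscriptionError(status, f"{meaning}: {result.get('error')}")
```

A caller can then catch `TranscriptionError` and branch on `status_code`, for example to report a bad URL (400) differently from a service-side failure (500).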

Security Notes

  • Secured Audio Files: If the audio file requires signed or secured access (e.g., from a private S3 bucket), use $secret to inject necessary headers or tokens into the downloadUrl resolution process.

  • Sensitive URL Logging: Do not log or persist sensitive downloadUrl values directly in workflow configurations or logs.

  • Transcription Content Redaction: Implement logging policies to redact transcription contents if they are flagged as sensitive, to prevent PII leakage in logs.

  • External Service Integration: Be aware that the underlying speech recognition service may be an external third-party provider. Ensure compliance with data privacy regulations regarding data sent to such services.
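The logging guidance above could be applied with helpers along these lines. This is a sketch under assumptions: `redact_url` and `loggable_result` are hypothetical helpers, the `[REDACTED]` marker is arbitrary, and whether a transcription counts as sensitive is decided by the caller.

```python
from typing import Any, Dict
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Drop credentials, query string, and fragment, which may carry tokens."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname excludes any user:password@ prefix
    if parts.port:
        host += f":{parts.port}"
    query = "[REDACTED]" if parts.query else ""
    return urlunsplit((parts.scheme, host, parts.path, query, ""))


def loggable_result(result: Dict[str, Any], sensitive: bool) -> Dict[str, Any]:
    """Return a copy of the node output that is safer to write to logs."""
    safe = dict(result)
    text = safe.get("transcription")
    if sensitive and text is not None:
        # Keep only the length so logs still show that text was produced.
        safe["transcription"] = f"[REDACTED {len(text)} chars]"
    return safe
```

The error field can also echo the download URL, as in the failure example in this document, so it may need the same treatment before logging.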
