API Reference


onnx_asr

A lightweight Python package for Automatic Speech Recognition using ONNX models.

Modules:

adapters: ASR adapter classes.
asr: Base ASR classes.
cli: CLI for speech recognition from WAV files.
loader: Loader for ASR models.
models: ASR and VAD model implementations.
onnx: Helpers for ONNX.
preprocessors: ASR preprocessor implementations.
utils: Utils for ASR.
vad: Base VAD classes.

Functions:

load_model: Load ASR model.
load_vad: Load VAD model.

load_model

load_model(model: str | ModelNames | ModelTypes, path: str | Path | None = None, *, quantization: str | None = None, sess_options: SessionOptions | None = None, providers: Sequence[str | tuple[str, dict[Any, Any]]] | None = None, provider_options: Sequence[dict[Any, Any]] | None = None, cpu_preprocessing: bool | None = None, asr_config: OnnxSessionOptions | None = None, preprocessor_config: PreprocessorRuntimeConfig | None = None, resampler_config: OnnxSessionOptions | None = None) -> TextResultsAsrAdapter

Load ASR model.

Parameters:

model (str | ModelNames | ModelTypes, required):
    Model name or type (download from Hugging Face is supported if the full model name is provided):

    GigaAM v2 (`gigaam-v2-ctc` | `gigaam-v2-rnnt`)
    GigaAM v3 (`gigaam-v3-ctc` | `gigaam-v3-rnnt` |
               `gigaam-v3-e2e-ctc` | `gigaam-v3-e2e-rnnt`)
    Kaldi Transducer (`kaldi-rnnt`)
    NeMo Conformer (`nemo-conformer-ctc` | `nemo-conformer-rnnt` | `nemo-conformer-tdt` |
                    `nemo-conformer-aed`)
    NeMo FastConformer Hybrid Large Ru P&C (`nemo-fastconformer-ru-ctc` |
                                            `nemo-fastconformer-ru-rnnt`)
    NeMo Parakeet 0.6B En (`nemo-parakeet-ctc-0.6b` | `nemo-parakeet-rnnt-0.6b` |
                           `nemo-parakeet-tdt-0.6b-v2`)
    NeMo Parakeet 0.6B Multilingual (`nemo-parakeet-tdt-0.6b-v3`)
    NeMo Canary (`nemo-canary-1b-v2`)
    T-One (`t-one-ctc` | `t-tech/t-one`)
    Vosk (`vosk` | `alphacep/vosk-model-ru` | `alphacep/vosk-model-small-ru`)
    Whisper Base exported with onnxruntime (`whisper-ort` | `whisper-base-ort`)
    Whisper from onnx-community (`whisper` | `onnx-community/whisper-large-v3-turbo` |
                                 `onnx-community/*whisper*`)

path (str | Path | None, default None): Path to directory with model files.
quantization (str | None, default None): Model quantization (`None` | `int8` | ...).
sess_options (SessionOptions | None, default None): Default SessionOptions for onnxruntime.
providers (Sequence[str | tuple[str, dict[Any, Any]]] | None, default None): Default providers for onnxruntime.
provider_options (Sequence[dict[Any, Any]] | None, default None): Default provider_options for onnxruntime.
cpu_preprocessing (bool | None, default None): Deprecated and ignored; use `preprocessor_config` and `resampler_config` instead.
asr_config (OnnxSessionOptions | None, default None): ASR ONNX config.
preprocessor_config (PreprocessorRuntimeConfig | None, default None): Preprocessor ONNX and concurrency config.
resampler_config (OnnxSessionOptions | None, default None): Resampler ONNX config.

Returns:

TextResultsAsrAdapter: ASR model class.

Raises:

ModelLoadingError: Model loading error (onnx-asr specific).

Source code in src/onnx_asr/loader.py
def load_model(
    model: str | ModelNames | ModelTypes,
    path: str | Path | None = None,
    *,
    quantization: str | None = None,
    sess_options: rt.SessionOptions | None = None,
    providers: Sequence[str | tuple[str, dict[Any, Any]]] | None = None,
    provider_options: Sequence[dict[Any, Any]] | None = None,
    cpu_preprocessing: bool | None = None,
    asr_config: OnnxSessionOptions | None = None,
    preprocessor_config: PreprocessorRuntimeConfig | None = None,
    resampler_config: OnnxSessionOptions | None = None,
) -> TextResultsAsrAdapter:
    """Load ASR model.

    Args:
        model: Model name or type (download from Hugging Face supported if full model name is provided):

                GigaAM v2 (`gigaam-v2-ctc` | `gigaam-v2-rnnt`)
                GigaAM v3 (`gigaam-v3-ctc` | `gigaam-v3-rnnt` |
                           `gigaam-v3-e2e-ctc` | `gigaam-v3-e2e-rnnt`)
                Kaldi Transducer (`kaldi-rnnt`)
                NeMo Conformer (`nemo-conformer-ctc` | `nemo-conformer-rnnt` | `nemo-conformer-tdt` |
                                `nemo-conformer-aed`)
                NeMo FastConformer Hybrid Large Ru P&C (`nemo-fastconformer-ru-ctc` |
                                                        `nemo-fastconformer-ru-rnnt`)
                NeMo Parakeet 0.6B En (`nemo-parakeet-ctc-0.6b` | `nemo-parakeet-rnnt-0.6b` |
                                       `nemo-parakeet-tdt-0.6b-v2`)
                NeMo Parakeet 0.6B Multilingual (`nemo-parakeet-tdt-0.6b-v3`)
                NeMo Canary (`nemo-canary-1b-v2`)
                T-One (`t-one-ctc` | `t-tech/t-one`)
                Vosk (`vosk` | `alphacep/vosk-model-ru` | `alphacep/vosk-model-small-ru`)
                Whisper Base exported with onnxruntime (`whisper-ort` | `whisper-base-ort`)
                Whisper from onnx-community (`whisper` | `onnx-community/whisper-large-v3-turbo` |
                                             `onnx-community/*whisper*`)
        path: Path to directory with model files.
        quantization: Model quantization (`None` | `int8` | ... ).
        sess_options: Default SessionOptions for onnxruntime.
        providers: Default providers for onnxruntime.
        provider_options: Default provider_options for onnxruntime.
        cpu_preprocessing: Deprecated and ignored, use `preprocessor_config` and `resampler_config` instead.
        asr_config: ASR ONNX config.
        preprocessor_config: Preprocessor ONNX and concurrency config.
        resampler_config: Resampler ONNX config.

    Returns:
        ASR model class.

    Raises:
        utils.ModelLoadingError: Model loading error (onnx-asr specific).

    """
    if cpu_preprocessing is not None:
        warnings.warn(
            "The cpu_preprocessing argument is deprecated and ignored (use preprocessor_config and resampler_config).",
            stacklevel=2,
        )

    loader = AsrLoader(model, path)

    default_onnx_config: OnnxSessionOptions = {
        "sess_options": sess_options,
        "providers": providers or rt.get_available_providers(),
        "provider_options": provider_options,
    }

    if asr_config is None:
        asr_config = update_onnx_providers(default_onnx_config, excluded_providers=loader.get_excluded_providers())

    if preprocessor_config is None:
        preprocessor_config = {
            **update_onnx_providers(
                default_onnx_config,
                new_options={"TensorrtExecutionProvider": {"trt_fp16_enable": False, "trt_int8_enable": False}},
                excluded_providers=OnnxPreprocessor._get_excluded_providers(),
            ),
            "max_concurrent_workers": 1,
        }

    if resampler_config is None:
        resampler_config = update_onnx_providers(
            default_onnx_config, excluded_providers=Resampler._get_excluded_providers()
        )

    return loader.create_model(asr_config, preprocessor_config, resampler_config, quantization=quantization)
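
A minimal usage sketch ("example.wav" is an illustrative path to a PCM WAV file):

import onnx_asr

# Load a supported model by name (downloaded from Hugging Face on first use)
model = onnx_asr.load_model("gigaam-v2-ctc")

# Recognize a single WAV file and print the recognized text
print(model.recognize("example.wav"))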

load_vad

load_vad(model: VadNames = 'silero', path: str | Path | None = None, *, quantization: str | None = None, sess_options: SessionOptions | None = None, providers: Sequence[str | tuple[str, dict[Any, Any]]] | None = None, provider_options: Sequence[dict[Any, Any]] | None = None) -> Vad

Load VAD model.

Parameters:

model (VadNames, default 'silero'): VAD model name (supports download from Hugging Face).
path (str | Path | None, default None): Path to directory with model files.
quantization (str | None, default None): Model quantization (`None` | `int8` | ...).
sess_options (SessionOptions | None, default None): Optional SessionOptions for onnxruntime.
providers (Sequence[str | tuple[str, dict[Any, Any]]] | None, default None): Optional providers for onnxruntime.
provider_options (Sequence[dict[Any, Any]] | None, default None): Optional provider_options for onnxruntime.

Returns:

Vad: VAD model class.

Raises:

ModelLoadingError: Model loading error (onnx-asr specific).

Source code in src/onnx_asr/loader.py
def load_vad(
    model: VadNames = "silero",
    path: str | Path | None = None,
    *,
    quantization: str | None = None,
    sess_options: rt.SessionOptions | None = None,
    providers: Sequence[str | tuple[str, dict[Any, Any]]] | None = None,
    provider_options: Sequence[dict[Any, Any]] | None = None,
) -> Vad:
    """Load VAD model.

    Args:
        model: VAD model name (supports download from Hugging Face).
        path: Path to directory with model files.
        quantization: Model quantization (`None` | `int8` | ... ).
        sess_options: Optional SessionOptions for onnxruntime.
        providers: Optional providers for onnxruntime.
        provider_options: Optional provider_options for onnxruntime.

    Returns:
        VAD model class.

    Raises:
        utils.ModelLoadingError: Model loading error (onnx-asr specific).

    """
    loader = VadLoader(model, path)

    onnx_options = update_onnx_providers(
        {"providers": rt.get_available_providers()}, excluded_providers=loader.get_excluded_providers()
    ) | {
        "sess_options": sess_options,
        "providers": providers,
        "provider_options": provider_options,
    }

    return loader.create_model(onnx_options, quantization=quantization)
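
A sketch of combining load_vad with an ASR model for long recordings ("long_audio.wav" is an illustrative path):

import onnx_asr

model = onnx_asr.load_model("gigaam-v2-ctc")
vad = onnx_asr.load_vad("silero")

# recognize() on a VAD adapter yields SegmentResult objects
for segment in model.with_vad(vad).recognize("long_audio.wav"):
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")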

ModelNames module-attribute

ModelNames = Literal['gigaam-v2-ctc', 'gigaam-v2-rnnt', 'gigaam-v3-ctc', 'gigaam-v3-rnnt', 'gigaam-v3-e2e-ctc', 'gigaam-v3-e2e-rnnt', 'nemo-fastconformer-ru-ctc', 'nemo-fastconformer-ru-rnnt', 'nemo-parakeet-ctc-0.6b', 'nemo-parakeet-rnnt-0.6b', 'nemo-parakeet-tdt-0.6b-v2', 'nemo-parakeet-tdt-0.6b-v3', 'nemo-canary-1b-v2', 'alphacep/vosk-model-ru', 'alphacep/vosk-model-small-ru', 't-tech/t-one', 'whisper-base']

Supported ASR model names (can be automatically downloaded from Hugging Face).

ModelTypes module-attribute

ModelTypes = Literal['kaldi-rnnt', 'nemo-conformer-ctc', 'nemo-conformer-rnnt', 'nemo-conformer-tdt', 'nemo-conformer-aed', 't-one-ctc', 'vosk', 'whisper-ort', 'whisper']

Supported ASR model types.
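
Model types are used when loading a locally exported model from a directory. A sketch, assuming "models/my-conformer" is an illustrative directory containing the exported ONNX files:

import onnx_asr

model = onnx_asr.load_model("nemo-conformer-ctc", "models/my-conformer")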

VadNames module-attribute

VadNames = Literal['silero']

Supported VAD model names (can be automatically downloaded from Hugging Face).

OnnxSessionOptions typed-dict

OnnxSessionOptions(*, sess_options: SessionOptions | None = ..., providers: Sequence[str | tuple[str, dict[Any, Any]]] | None = ..., provider_options: Sequence[dict[Any, Any]] | None = ...)

Bases: TypedDict

Options for onnxruntime InferenceSession.

Parameters:

sess_options (SessionOptions | None): ONNX Session options.
providers (Sequence[str | tuple[str, dict[Any, Any]]] | None): ONNX providers.
provider_options (Sequence[dict[Any, Any]] | None): ONNX provider options.
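
Since OnnxSessionOptions is a TypedDict, a plain dict can be passed. A sketch that prefers CUDA with a CPU fallback for the ASR session (assuming omitted keys are allowed and the CUDA provider is available in the installed onnxruntime build):

import onnx_asr

asr_config = {"providers": ["CUDAExecutionProvider", "CPUExecutionProvider"]}
model = onnx_asr.load_model("gigaam-v2-ctc", asr_config=asr_config)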

PreprocessorRuntimeConfig

PreprocessorRuntimeConfig(*, sess_options: SessionOptions | None = ..., providers: Sequence[str | tuple[str, dict[Any, Any]]] | None = ..., provider_options: Sequence[dict[Any, Any]] | None = ...)

Bases: OnnxSessionOptions

Preprocessor runtime config.

Parameters:

sess_options (SessionOptions | None): ONNX Session options.
providers (Sequence[str | tuple[str, dict[Any, Any]]] | None): ONNX providers.
provider_options (Sequence[dict[Any, Any]] | None): ONNX provider options.

Attributes:

max_concurrent_workers (int | None): Max parallel preprocessing threads (None: auto; 1: no parallel processing).

max_concurrent_workers instance-attribute

max_concurrent_workers: int | None

Max parallel preprocessing threads (None: auto; 1: no parallel processing).
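
A sketch that keeps preprocessing on CPU and disables parallel workers (mirrors the default config built by load_model; omitted keys are assumed optional):

import onnx_asr

preprocessor_config = {
    "providers": ["CPUExecutionProvider"],
    "max_concurrent_workers": 1,
}
model = onnx_asr.load_model("gigaam-v2-ctc", preprocessor_config=preprocessor_config)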

TensorRtOptions

Options for onnxruntime TensorRT providers.

Methods:

add_profile: Add TensorRT profile options.
get_provider_names: Get TensorRT provider names.
is_fp16_enabled: Check whether the TensorRT provider uses fp16 precision.

Attributes:

profile_max_shapes (dict[str, int]): Maximal values for model input shapes.
profile_min_shapes (dict[str, int]): Minimal values for model input shapes.
profile_opt_shapes (dict[str, int]): Optimal values for model input shapes.

profile_max_shapes class-attribute

profile_max_shapes: dict[str, int] = {'batch': 16, 'waveform_len_ms': 30000}

Maximal values for model input shapes.

profile_min_shapes class-attribute

profile_min_shapes: dict[str, int] = {'batch': 1, 'waveform_len_ms': 50}

Minimal values for model input shapes.

profile_opt_shapes class-attribute

profile_opt_shapes: dict[str, int] = {'batch': 1, 'waveform_len_ms': 20000}

Optimal values for model input shapes.

add_profile classmethod

add_profile(onnx_options: OnnxSessionOptions, transform_shapes: Callable[..., str]) -> OnnxSessionOptions

Add TensorRT profile options.

Source code in src/onnx_asr/onnx.py
@classmethod
def add_profile(cls, onnx_options: OnnxSessionOptions, transform_shapes: Callable[..., str]) -> OnnxSessionOptions:
    """Add TensorRT profile options."""
    return update_onnx_providers(
        onnx_options,
        default_options={
            "TensorrtExecutionProvider": cls._generate_profile("trt_profile", transform_shapes),
            "NvTensorRtRtxExecutionProvider": cls._generate_profile("nv_profile", transform_shapes),
        },
    )

get_provider_names staticmethod

get_provider_names() -> list[str]

Get TensorRT provider names.

Source code in src/onnx_asr/onnx.py
@staticmethod
def get_provider_names() -> list[str]:
    """Get TensorRT provider names."""
    return ["TensorrtExecutionProvider", "NvTensorRtRtxExecutionProvider"]

is_fp16_enabled staticmethod

is_fp16_enabled(onnx_options: OnnxSessionOptions) -> bool

Check whether the TensorRT provider uses fp16 precision.

Source code in src/onnx_asr/onnx.py
@staticmethod
def is_fp16_enabled(onnx_options: OnnxSessionOptions) -> bool:
    """Check if TensorRT provider use fp16 precision."""
    return bool(
        _merge_onnx_provider_options(onnx_options)
        .get("TensorrtExecutionProvider", {})
        .get("trt_fp16_enable", False)
    )
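
A sketch of checking the flag (assuming TensorRtOptions is importable from onnx_asr.onnx and that (name, options) provider tuples are merged by the helper):

from onnx_asr.onnx import TensorRtOptions

opts = {"providers": [("TensorrtExecutionProvider", {"trt_fp16_enable": True})]}
print(TensorRtOptions.is_fp16_enabled(opts))  # True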

ModelLoadingError

Bases: Exception

Model loading error.


adapters

ASR adapter classes.

Classes:

AsrAdapter: Base ASR adapter class.
RecognizeOptions: Options for ASR recognition.
SegmentResultsAsrAdapter: ASR with VAD adapter (text results).
TextResultsAsrAdapter: ASR adapter (text results).
TimestampedResultsAsrAdapter: ASR adapter (timestamped results).
TimestampedSegmentResultsAsrAdapter: ASR with VAD adapter (timestamped results).
VadOptions: Options for VAD.

AsrAdapter

AsrAdapter(asr: Asr, resampler: Resampler)

Bases: ABC, Generic[R]

Base ASR adapter class.

Create ASR adapter.

Methods:

recognize: Recognize speech (single or batch).
with_vad: Create ASR adapter with VAD.

Source code in src/onnx_asr/adapters.py
def __init__(self, asr: Asr, resampler: Resampler):
    """Create ASR adapter."""
    self.asr = asr
    self.resampler = resampler

recognize

recognize(waveform: str | Path | NDArray[float32], *, sample_rate: SampleRates = 16000, **kwargs: Unpack[RecognizeOptions]) -> R
recognize(waveform: list[str | Path | NDArray[float32]], *, sample_rate: SampleRates = 16000, **kwargs: Unpack[RecognizeOptions]) -> list[R]
recognize(waveform: str | Path | NDArray[float32] | list[str | Path | NDArray[float32]], *, sample_rate: SampleRates = 16000, language: str | None = ..., target_language: str | None = ..., pnc: Literal['pnc', 'nopnc'] | bool = ...) -> R | list[R]

Recognize speech (single or batch).

Parameters:

waveform (str | Path | NDArray[float32] | list[str | Path | NDArray[float32]], required): Path to a wav file (only PCM_U8, PCM_16, PCM_24 and PCM_32 formats are supported) or a Numpy array with a PCM waveform. A list of file paths or Numpy arrays is also supported for batch recognition.
sample_rate (SampleRates, default 16000): Sample rate for Numpy arrays in waveform.
language (str | None): Speech language (Whisper and Canary models only).
target_language (str | None): Output language (Canary models only).
pnc (Literal['pnc', 'nopnc'] | bool): Output punctuation and capitalization (Canary models only).

Returns:

R | list[R]: Speech recognition results (a single result, or a list for batch recognition).

Raises:

AudioLoadingError: Audio loading error (onnx-asr specific).
FileNotFoundError: File not found error.
wave.Error: WAV file reading error.
OSError: Other IO errors.

Source code in src/onnx_asr/adapters.py
def recognize(
    self,
    waveform: str | Path | npt.NDArray[np.float32] | list[str | Path | npt.NDArray[np.float32]],
    *,
    sample_rate: SampleRates = 16_000,
    **kwargs: Unpack[RecognizeOptions],
) -> R | list[R]:
    """Recognize speech (single or batch).

    Args:
        waveform: Path to wav file (only PCM_U8, PCM_16, PCM_24 and PCM_32 formats are supported)
                  or Numpy array with PCM waveform.
                  A list of file paths or numpy arrays is also supported for batch recognition.
        sample_rate: Sample rate for Numpy arrays in waveform.
        **kwargs: ASR options.

    Returns:
        Speech recognition results (single or list for batch recognition).

    Raises:
        utils.AudioLoadingError: Audio loading error (onnx-asr specific).
        FileNotFoundError: File not found error.
        wave.Error: WAV file reading error.
        OSError: Other IO errors.

    """
    if isinstance(waveform, list) and not waveform:
        return []

    waveform_batch = waveform if isinstance(waveform, list) else [waveform]
    result = self._recognize_batch(*self.resampler(*read_wav_files(waveform_batch, sample_rate)), **kwargs)

    if isinstance(waveform, list):
        return list(result)
    return next(result)
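
A sketch of single vs. batch calls (file names are illustrative; a list in gives a list out):

import numpy as np
import onnx_asr

model = onnx_asr.load_model("gigaam-v2-ctc")

# Single file -> single result
text = model.recognize("a.wav")

# Batch of files -> list of results
texts = model.recognize(["a.wav", "b.wav"])

# Numpy waveform at 8 kHz (resampled internally); here one second of silence
waveform = np.zeros(8000, dtype=np.float32)
text = model.recognize(waveform, sample_rate=8000)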

with_vad

with_vad(vad: Vad, **kwargs: Unpack[VadOptions]) -> SegmentResultsAsrAdapter

Create ASR adapter with VAD.

Parameters:

vad (Vad, required): VAD model.
batch_size (int): Number of parallel processed segments.
threshold (float): Speech detection threshold.
neg_threshold (float): Non-speech detection threshold.
min_speech_duration_ms (float): Minimum speech segment duration in milliseconds.
max_speech_duration_s (float): Maximum speech segment duration in seconds.
min_silence_duration_ms (float): Minimum silence duration in milliseconds to split speech segments.
speech_pad_ms (float): Padding for speech segments in milliseconds.

Returns:

SegmentResultsAsrAdapter: ASR with VAD adapter (text results).

Source code in src/onnx_asr/adapters.py
def with_vad(self, vad: Vad, **kwargs: Unpack[VadOptions]) -> SegmentResultsAsrAdapter:
    """Create ASR adapter with VAD.

    Args:
        vad: VAD model.
        **kwargs: VAD options.

    Returns:
        ASR with VAD adapter (text results).

    """
    return SegmentResultsAsrAdapter(self.asr, vad, self.resampler, **kwargs)

RecognizeOptions typed-dict

RecognizeOptions(*, language: str | None = ..., target_language: str | None = ..., pnc: Literal['pnc', 'nopnc'] | bool = ...)

Bases: TypedDict

Options for ASR recognition.

Parameters:

language (str | None): Speech language (Whisper and Canary models only).
target_language (str | None): Output language (Canary models only).
pnc (Literal['pnc', 'nopnc'] | bool): Output punctuation and capitalization (Canary models only).
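
These options are passed as keyword arguments to recognize. A sketch with a Canary model (language codes and path are illustrative):

import onnx_asr

model = onnx_asr.load_model("nemo-canary-1b-v2")
text = model.recognize("audio.wav", language="en", target_language="de", pnc=True)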

SegmentResultsAsrAdapter

SegmentResultsAsrAdapter(asr: Asr, vad: Vad, resampler: Resampler, *, batch_size: int = ..., threshold: float = ..., neg_threshold: float = ..., min_speech_duration_ms: float = ..., max_speech_duration_s: float = ..., min_silence_duration_ms: float = ..., speech_pad_ms: float = ...)

Bases: AsrAdapter[Iterator[SegmentResult]]

ASR with VAD adapter (text results).

Create ASR adapter.

Parameters:

batch_size (int): Number of parallel processed segments.
threshold (float): Speech detection threshold.
neg_threshold (float): Non-speech detection threshold.
min_speech_duration_ms (float): Minimum speech segment duration in milliseconds.
max_speech_duration_s (float): Maximum speech segment duration in seconds.
min_silence_duration_ms (float): Minimum silence duration in milliseconds to split speech segments.
speech_pad_ms (float): Padding for speech segments in milliseconds.

Methods:

recognize: Recognize speech (single or batch).
with_timestamps: ASR with VAD adapter (timestamped results).
with_vad: Create ASR adapter with VAD.

Source code in src/onnx_asr/adapters.py
def __init__(self, asr: Asr, vad: Vad, resampler: Resampler, **kwargs: Unpack[VadOptions]):
    """Create ASR adapter."""
    super().__init__(asr, resampler)
    self.vad = vad
    self._vadargs = kwargs

recognize

Recognize speech (single or batch). Inherited from AsrAdapter; see AsrAdapter.recognize above for the full signature, parameters, and source.

with_timestamps

with_timestamps() -> TimestampedSegmentResultsAsrAdapter

ASR with VAD adapter (timestamped results).

Source code in src/onnx_asr/adapters.py
def with_timestamps(self) -> TimestampedSegmentResultsAsrAdapter:
    """ASR with VAD adapter (timestamped results)."""
    return TimestampedSegmentResultsAsrAdapter(self.asr, self.vad, self.resampler, **self._vadargs)

with_vad

Create ASR adapter with VAD. Inherited from AsrAdapter; see AsrAdapter.with_vad above for the full signature, parameters, and source.

TextResultsAsrAdapter

TextResultsAsrAdapter(asr: Asr, resampler: Resampler)

Bases: AsrAdapter[str]

ASR adapter (text results).

Create ASR adapter.

Methods:

recognize: Recognize speech (single or batch).
with_timestamps: ASR adapter (timestamped results).
with_vad: Create ASR adapter with VAD.

Source code in src/onnx_asr/adapters.py
def __init__(self, asr: Asr, resampler: Resampler):
    """Create ASR adapter."""
    self.asr = asr
    self.resampler = resampler

recognize

Recognize speech (single or batch). Inherited from AsrAdapter; see AsrAdapter.recognize above for the full signature, parameters, and source.

with_timestamps

with_timestamps() -> TimestampedResultsAsrAdapter

ASR adapter (timestamped results).

Source code in src/onnx_asr/adapters.py
def with_timestamps(self) -> TimestampedResultsAsrAdapter:
    """ASR adapter (timestamped results)."""
    return TimestampedResultsAsrAdapter(self.asr, self.resampler)
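
A sketch of getting timestamped results ("audio.wav" is an illustrative path):

import onnx_asr

model = onnx_asr.load_model("gigaam-v2-ctc")
result = model.with_timestamps().recognize("audio.wav")
print(result.text)
print(result.tokens, result.timestamps)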

with_vad

Create ASR adapter with VAD. Inherited from AsrAdapter; see AsrAdapter.with_vad above for the full signature, parameters, and source.

TimestampedResultsAsrAdapter

TimestampedResultsAsrAdapter(asr: Asr, resampler: Resampler)

Bases: AsrAdapter[TimestampedResult]

ASR adapter (timestamped results).

Create ASR adapter.

Methods:

recognize: Recognize speech (single or batch).
with_vad: Create ASR adapter with VAD.

Source code in src/onnx_asr/adapters.py
def __init__(self, asr: Asr, resampler: Resampler):
    """Create ASR adapter."""
    self.asr = asr
    self.resampler = resampler

recognize

Recognize speech (single or batch). Inherited from AsrAdapter; see AsrAdapter.recognize above for the full signature, parameters, and source.

with_vad

Create ASR adapter with VAD. Inherited from AsrAdapter; see AsrAdapter.with_vad above for the full signature, parameters, and source.

TimestampedSegmentResultsAsrAdapter

TimestampedSegmentResultsAsrAdapter(asr: Asr, vad: Vad, resampler: Resampler, *, batch_size: int = ..., threshold: float = ..., neg_threshold: float = ..., min_speech_duration_ms: float = ..., max_speech_duration_s: float = ..., min_silence_duration_ms: float = ..., speech_pad_ms: float = ...)

Bases: AsrAdapter[Iterator[TimestampedSegmentResult]]

ASR with VAD adapter (timestamped results).

Create ASR adapter.

Parameters:

batch_size (int): Number of parallel processed segments.
threshold (float): Speech detection threshold.
neg_threshold (float): Non-speech detection threshold.
min_speech_duration_ms (float): Minimum speech segment duration in milliseconds.
max_speech_duration_s (float): Maximum speech segment duration in seconds.
min_silence_duration_ms (float): Minimum silence duration in milliseconds to split speech segments.
speech_pad_ms (float): Padding for speech segments in milliseconds.

Methods:

recognize: Recognize speech (single or batch).
with_vad: Create ASR adapter with VAD.

Source code in src/onnx_asr/adapters.py
def __init__(self, asr: Asr, vad: Vad, resampler: Resampler, **kwargs: Unpack[VadOptions]):
    """Create ASR adapter."""
    super().__init__(asr, resampler)
    self.vad = vad
    self._vadargs = kwargs

recognize

Recognize speech (single or batch). Inherited from AsrAdapter; see AsrAdapter.recognize above for the full signature, parameters, and source.

with_vad

Create ASR adapter with VAD. Inherited from AsrAdapter; see AsrAdapter.with_vad above for the full signature, parameters, and source.

VadOptions typed-dict

Bases: TypedDict

Options for VAD.

Parameters:

batch_size (int): Number of parallel processed segments.
threshold (float): Speech detection threshold.
neg_threshold (float): Non-speech detection threshold.
min_speech_duration_ms (float): Minimum speech segment duration in milliseconds.
max_speech_duration_s (float): Maximum speech segment duration in seconds.
min_silence_duration_ms (float): Minimum silence duration in milliseconds to split speech segments.
speech_pad_ms (float): Padding for speech segments in milliseconds.
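
These options are passed as keyword arguments to with_vad. A sketch with illustrative values:

import onnx_asr

model = onnx_asr.load_model("gigaam-v2-ctc")
vad = onnx_asr.load_vad("silero")
adapter = model.with_vad(vad, threshold=0.5, min_silence_duration_ms=300, speech_pad_ms=100)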

TimestampedResult dataclass

TimestampedResult(text: str, timestamps: list[float] | None = None, tokens: list[str] | None = None, logprobs: list[float] | None = None)

Timestamped recognition result.

Attributes:

logprobs (list[float] | None): Tokens logprob list.
text (str): Recognized text.
timestamps (list[float] | None): Tokens timestamp list.
tokens (list[str] | None): Tokens list.

logprobs class-attribute instance-attribute

logprobs: list[float] | None = None

Tokens logprob list.

text instance-attribute

text: str

Recognized text.

timestamps class-attribute instance-attribute

timestamps: list[float] | None = None

Tokens timestamp list.

tokens class-attribute instance-attribute

tokens: list[str] | None = None

Tokens list.

SegmentResult dataclass

SegmentResult(start: float, end: float, text: str)

Segment recognition result.

Attributes:

end (float): Segment end time.
start (float): Segment start time.
text (str): Segment recognized text.

end instance-attribute

end: float

Segment end time.

start instance-attribute

start: float

Segment start time.

text instance-attribute

text: str

Segment recognized text.

TimestampedSegmentResult dataclass

TimestampedSegmentResult(start: float, end: float, text: str, timestamps: list[float] | None = None, tokens: list[str] | None = None, logprobs: list[float] | None = None)

Bases: TimestampedResult, SegmentResult

Timestamped segment recognition result.

Attributes:

end (float): Segment end time.
logprobs (list[float] | None): Tokens logprob list.
start (float): Segment start time.
text (str): Recognized text.
timestamps (list[float] | None): Tokens timestamp list.
tokens (list[str] | None): Tokens list.

end instance-attribute

end: float

Segment end time.

logprobs class-attribute instance-attribute

logprobs: list[float] | None = None

Tokens logprob list.

start instance-attribute

start: float

Segment start time.

text instance-attribute

text: str

Recognized text.

timestamps class-attribute instance-attribute

timestamps: list[float] | None = None

Tokens timestamp list.

tokens class-attribute instance-attribute

tokens: list[str] | None = None

Tokens list.

SampleRates module-attribute

SampleRates = Literal[8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000]

Supported sample rates.

AudioLoadingError

Bases: ValueError

Audio loading error.