# API Reference

## onnx_asr

A lightweight Python package for Automatic Speech Recognition using ONNX models.
Modules:

| Name | Description |
|---|---|
| `adapters` | ASR adapter classes. |
| `asr` | Base ASR classes. |
| `cli` | CLI for speech recognition from WAV files. |
| `loader` | Loader for ASR models. |
| `models` | ASR and VAD model implementations. |
| `onnx` | Helpers for ONNX. |
| `preprocessors` | ASR preprocessor implementations. |
| `utils` | Utils for ASR. |
| `vad` | Base VAD classes. |
Functions:

| Name | Description |
|---|---|
| `load_model` | Load ASR model. |
| `load_vad` | Load VAD model. |
### load_model

```python
load_model(
    model: str | ModelNames | ModelTypes,
    path: str | Path | None = None,
    *,
    quantization: str | None = None,
    sess_options: SessionOptions | None = None,
    providers: Sequence[str | tuple[str, dict[Any, Any]]] | None = None,
    provider_options: Sequence[dict[Any, Any]] | None = None,
    cpu_preprocessing: bool | None = None,
    asr_config: OnnxSessionOptions | None = None,
    preprocessor_config: PreprocessorRuntimeConfig | None = None,
    resampler_config: OnnxSessionOptions | None = None,
) -> TextResultsAsrAdapter
```

Load ASR model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str \| ModelNames \| ModelTypes` | Model name or type (download from Hugging Face is supported if a full model name is provided). | *required* |
| `path` | `str \| Path \| None` | Path to directory with model files. | `None` |
| `quantization` | `str \| None` | Model quantization. | `None` |
| `sess_options` | `SessionOptions \| None` | Default `SessionOptions` for onnxruntime. | `None` |
| `providers` | `Sequence[str \| tuple[str, dict[Any, Any]]] \| None` | Default providers for onnxruntime. | `None` |
| `provider_options` | `Sequence[dict[Any, Any]] \| None` | Default `provider_options` for onnxruntime. | `None` |
| `cpu_preprocessing` | `bool \| None` | Deprecated and ignored; use `preprocessor_config` instead. | `None` |
| `asr_config` | `OnnxSessionOptions \| None` | ASR ONNX config. | `None` |
| `preprocessor_config` | `PreprocessorRuntimeConfig \| None` | Preprocessor ONNX and concurrency config. | `None` |
| `resampler_config` | `OnnxSessionOptions \| None` | Resampler ONNX config. | `None` |

Returns:

| Type | Description |
|---|---|
| `TextResultsAsrAdapter` | ASR model class. |

Raises:

| Type | Description |
|---|---|
| `ModelLoadingError` | Model loading error (onnx-asr specific). |

Source code in `src/onnx_asr/loader.py`
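A minimal usage sketch (an illustration, not part of the generated reference): it assumes onnx-asr is installed and the model can be downloaded from Hugging Face. `"whisper-base"` is one of the names listed under `ModelNames` below; `"test.wav"` is a placeholder file path.

```python
# Hedged quickstart sketch: requires the onnx-asr package and network access
# to download the model on first use.
import onnx_asr

model = onnx_asr.load_model("whisper-base")

# A single file path returns one string; a list returns a list of strings.
text = model.recognize("test.wav")
texts = model.recognize(["a.wav", "b.wav"])
```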
### load_vad

```python
load_vad(
    model: VadNames = 'silero',
    path: str | Path | None = None,
    *,
    quantization: str | None = None,
    sess_options: SessionOptions | None = None,
    providers: Sequence[str | tuple[str, dict[Any, Any]]] | None = None,
    provider_options: Sequence[dict[Any, Any]] | None = None,
) -> Vad
```

Load VAD model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `VadNames` | VAD model name (supports download from Hugging Face). | `'silero'` |
| `path` | `str \| Path \| None` | Path to directory with model files. | `None` |
| `quantization` | `str \| None` | Model quantization. | `None` |
| `sess_options` | `SessionOptions \| None` | Optional `SessionOptions` for onnxruntime. | `None` |
| `providers` | `Sequence[str \| tuple[str, dict[Any, Any]]] \| None` | Optional providers for onnxruntime. | `None` |
| `provider_options` | `Sequence[dict[Any, Any]] \| None` | Optional `provider_options` for onnxruntime. | `None` |

Returns:

| Type | Description |
|---|---|
| `Vad` | VAD model class. |

Raises:

| Type | Description |
|---|---|
| `ModelLoadingError` | Model loading error (onnx-asr specific). |

Source code in `src/onnx_asr/loader.py`
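Combining both loaders, VAD-gated recognition can be sketched as follows (assumes onnx-asr is installed and models are downloadable; `"whisper-base"` and `"long_recording.wav"` are placeholders):

```python
# Hedged sketch: split a long recording into speech segments with VAD,
# then recognize each segment.
import onnx_asr

model = onnx_asr.load_model("whisper-base")
vad = onnx_asr.load_vad("silero")

# with_vad() wraps the adapter; recognize() then yields per-segment results.
for segment in model.with_vad(vad).recognize("long_recording.wav"):
    print(segment.start, segment.end, segment.text)
```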
### ModelNames
*module-attribute*

```python
ModelNames = Literal[
    "gigaam-v2-ctc", "gigaam-v2-rnnt",
    "gigaam-v3-ctc", "gigaam-v3-rnnt",
    "gigaam-v3-e2e-ctc", "gigaam-v3-e2e-rnnt",
    "nemo-fastconformer-ru-ctc", "nemo-fastconformer-ru-rnnt",
    "nemo-parakeet-ctc-0.6b", "nemo-parakeet-rnnt-0.6b",
    "nemo-parakeet-tdt-0.6b-v2", "nemo-parakeet-tdt-0.6b-v3",
    "nemo-canary-1b-v2",
    "alphacep/vosk-model-ru", "alphacep/vosk-model-small-ru",
    "t-tech/t-one",
    "whisper-base",
]
```

Supported ASR model names (can be automatically downloaded from Hugging Face).
### ModelTypes
*module-attribute*

```python
ModelTypes = Literal[
    "kaldi-rnnt",
    "nemo-conformer-ctc", "nemo-conformer-rnnt",
    "nemo-conformer-tdt", "nemo-conformer-aed",
    "t-one-ctc", "vosk", "whisper-ort", "whisper",
]
```

Supported ASR model types.
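The distinction matters when calling `load_model`: a name downloads a published model, while a type loads your own exported ONNX files from a local directory. A hedged sketch (assumes onnx-asr is installed; `"/path/to/model"` is a placeholder):

```python
# Hedged sketch of the two loading modes.
import onnx_asr

# By model name: downloaded from Hugging Face automatically.
by_name = onnx_asr.load_model("gigaam-v2-ctc")

# By model type: load a locally exported model of that architecture.
by_type = onnx_asr.load_model("nemo-conformer-ctc", "/path/to/model")
```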
### VadNames
*module-attribute*

```python
VadNames = Literal["silero"]
```

Supported VAD model names (can be automatically downloaded from Hugging Face).
### OnnxSessionOptions
*typed-dict*

```python
OnnxSessionOptions(*, sess_options: SessionOptions | None = ..., providers: Sequence[str | tuple[str, dict[Any, Any]]] | None = ..., provider_options: Sequence[dict[Any, Any]] | None = ...)
```

Bases: `TypedDict`

Options for onnxruntime InferenceSession.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sess_options` | `SessionOptions \| None` | ONNX session options. | `...` |
| `providers` | `Sequence[str \| tuple[str, dict[Any, Any]]] \| None` | ONNX providers. | `...` |
| `provider_options` | `Sequence[dict[Any, Any]] \| None` | ONNX provider options. | `...` |
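Since `OnnxSessionOptions` is a `TypedDict`, at runtime it is a plain dict. A sketch of a GPU-oriented config (the provider names are standard onnxruntime identifiers; whether they are available depends on your onnxruntime build):

```python
# OnnxSessionOptions as a plain dict; passed e.g. as
# load_model(..., asr_config=asr_config).
asr_config = {
    "providers": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    # One options dict per provider, in the same order.
    "provider_options": [{"device_id": 0}, {}],
}
```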
### PreprocessorRuntimeConfig

```python
PreprocessorRuntimeConfig(*, sess_options: SessionOptions | None = ..., providers: Sequence[str | tuple[str, dict[Any, Any]]] | None = ..., provider_options: Sequence[dict[Any, Any]] | None = ...)
```

Bases: `OnnxSessionOptions`

Preprocessor runtime config.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sess_options` | `SessionOptions \| None` | ONNX session options. | `...` |
| `providers` | `Sequence[str \| tuple[str, dict[Any, Any]]] \| None` | ONNX providers. | `...` |
| `provider_options` | `Sequence[dict[Any, Any]] \| None` | ONNX provider options. | `...` |

Attributes:

| Name | Type | Description |
|---|---|---|
| `max_concurrent_workers` | `int \| None` | Max parallel preprocessing threads (`None` = auto, `1` = no parallel processing). |
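Like its base, `PreprocessorRuntimeConfig` is a plain dict at runtime; the extra key controls preprocessing concurrency. A sketch:

```python
# Preprocessor config as a plain dict; passed e.g. as
# load_model(..., preprocessor_config=preprocessor_config).
preprocessor_config = {
    "providers": ["CPUExecutionProvider"],
    "max_concurrent_workers": 1,  # 1 disables parallel preprocessing; None = auto
}
```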
### TensorRtOptions

Options for onnxruntime TensorRT providers.

Methods:

| Name | Description |
|---|---|
| `add_profile` | Add TensorRT profile options. |
| `get_provider_names` | Get TensorRT provider names. |
| `is_fp16_enabled` | Check if the TensorRT provider uses fp16 precision. |

Attributes:

| Name | Type | Description |
|---|---|---|
| `profile_max_shapes` | `dict[str, int]` | Maximal value for model input shapes. |
| `profile_min_shapes` | `dict[str, int]` | Minimal value for model input shapes. |
| `profile_opt_shapes` | `dict[str, int]` | Optimal value for model input shapes. |

#### profile_max_shapes
*class-attribute*

Maximal value for model input shapes.

#### profile_min_shapes
*class-attribute*

Minimal value for model input shapes.

#### profile_opt_shapes
*class-attribute*

Optimal value for model input shapes.

#### add_profile
*classmethod*

```python
add_profile(onnx_options: OnnxSessionOptions, transform_shapes: Callable[..., str]) -> OnnxSessionOptions
```

Add TensorRT profile options.

Source code in `src/onnx_asr/onnx.py`

#### get_provider_names
*staticmethod*

Get TensorRT provider names.

Source code in `src/onnx_asr/onnx.py`

#### is_fp16_enabled
*staticmethod*

```python
is_fp16_enabled(onnx_options: OnnxSessionOptions) -> bool
```

Check if the TensorRT provider uses fp16 precision.

Source code in `src/onnx_asr/onnx.py`
## adapters

ASR adapter classes.

Classes:

| Name | Description |
|---|---|
| `AsrAdapter` | Base ASR adapter class. |
| `RecognizeOptions` | Options for ASR recognition. |
| `SegmentResultsAsrAdapter` | ASR with VAD adapter (text results). |
| `TextResultsAsrAdapter` | ASR adapter (text results). |
| `TimestampedResultsAsrAdapter` | ASR adapter (timestamped results). |
| `TimestampedSegmentResultsAsrAdapter` | ASR with VAD adapter (timestamped results). |
| `VadOptions` | Options for VAD. |
### AsrAdapter

```python
AsrAdapter(asr: Asr, resampler: Resampler)
```

Base ASR adapter class.

Create ASR adapter.

Methods:

| Name | Description |
|---|---|
| `recognize` | Recognize speech (single or batch). |
| `with_vad` | Create ASR adapter with VAD. |

Source code in `src/onnx_asr/adapters.py`
#### recognize

```python
recognize(waveform: str | Path | NDArray[float32], *, sample_rate: SampleRates = 16000, **kwargs: Unpack[RecognizeOptions]) -> R
recognize(waveform: list[str | Path | NDArray[float32]], *, sample_rate: SampleRates = 16000, **kwargs: Unpack[RecognizeOptions]) -> list[R]
recognize(waveform: str | Path | NDArray[float32] | list[str | Path | NDArray[float32]], *, sample_rate: SampleRates = 16000, language: str | None = ..., target_language: str | None = ..., pnc: Literal['pnc', 'nopnc'] | bool = ...) -> R | list[R]
```

Recognize speech (single or batch).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `waveform` | `str \| Path \| NDArray[float32] \| list[str \| Path \| NDArray[float32]]` | Path to a WAV file (only PCM_U8, PCM_16, PCM_24 and PCM_32 formats are supported) or a numpy array with a PCM waveform. A list of file paths or numpy arrays for batch recognition is also supported. | *required* |
| `sample_rate` | `SampleRates` | Sample rate for numpy arrays in `waveform`. | `16000` |
| `language` | `str \| None` | Speech language (only for Whisper and Canary models). | `...` |
| `target_language` | `str \| None` | Output language (only for Canary models). | `...` |
| `pnc` | `Literal['pnc', 'nopnc'] \| bool` | Output punctuation and capitalization (only for Canary models). | `...` |

Returns:

| Type | Description |
|---|---|
| `R \| list[R]` | Speech recognition results (single, or a list for batch recognition). |

Raises:

| Type | Description |
|---|---|
| `AudioLoadingError` | Audio loading error (onnx-asr specific). |
| `FileNotFoundError` | File not found error. |
| `Error` | WAV file reading error. |
| `OSError` | Other IO errors. |

Source code in `src/onnx_asr/adapters.py`
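Besides file paths, `recognize` accepts a raw PCM waveform as a `float32` numpy array. The sketch below only builds such an array (numpy only, so it runs without onnx-asr); the actual `model.recognize(...)` calls are shown as comments:

```python
# Build one second of mono float32 PCM at a supported sample rate.
import numpy as np

sample_rate = 8000  # must be one of the values in SampleRates
waveform = np.zeros(sample_rate, dtype=np.float32)  # 1 s of silence

# With a loaded model (not shown here), this would be passed as:
# model.recognize(waveform, sample_rate=8000)              # single array
# model.recognize([waveform, waveform], sample_rate=8000)  # batch
```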
#### with_vad

```python
with_vad(vad: Vad, *, batch_size: int = ..., threshold: float = ..., neg_threshold: float = ..., min_speech_duration_ms: float = ..., max_speech_duration_s: float = ..., min_silence_duration_ms: float = ..., speech_pad_ms: float = ...) -> SegmentResultsAsrAdapter
```

Create ASR adapter with VAD.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vad` | `Vad` | VAD model. | *required* |
| `batch_size` | `int` | Number of segments processed in parallel. | `...` |
| `threshold` | `float` | Speech detection threshold. | `...` |
| `neg_threshold` | `float` | Non-speech detection threshold. | `...` |
| `min_speech_duration_ms` | `float` | Minimum speech segment duration in milliseconds. | `...` |
| `max_speech_duration_s` | `float` | Maximum speech segment duration in seconds. | `...` |
| `min_silence_duration_ms` | `float` | Minimum silence duration in milliseconds to split speech segments. | `...` |
| `speech_pad_ms` | `float` | Padding for speech segments in milliseconds. | `...` |

Returns:

| Type | Description |
|---|---|
| `SegmentResultsAsrAdapter` | ASR with VAD adapter (text results). |

Source code in `src/onnx_asr/adapters.py`
### RecognizeOptions
*typed-dict*

```python
RecognizeOptions(*, language: str | None = ..., target_language: str | None = ..., pnc: Literal['pnc', 'nopnc'] | bool = ...)
```

Bases: `TypedDict`

Options for ASR recognition.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `language` | `str \| None` | Speech language (only for Whisper and Canary models). | `...` |
| `target_language` | `str \| None` | Output language (only for Canary models). | `...` |
| `pnc` | `Literal['pnc', 'nopnc'] \| bool` | Output punctuation and capitalization (only for Canary models). | `...` |
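As a `TypedDict`, `RecognizeOptions` is a plain dict at runtime and can be splatted into `recognize`. A sketch of Canary-style options (the language codes are illustrative):

```python
# RecognizeOptions as a plain dict; passed e.g. as
# model.recognize("test.wav", **options).
options = {
    "language": "en",         # speech language (Whisper and Canary only)
    "target_language": "de",  # output language (Canary only)
    "pnc": True,              # punctuation and capitalization (Canary only)
}
```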
### SegmentResultsAsrAdapter

```python
SegmentResultsAsrAdapter(asr: Asr, vad: Vad, resampler: Resampler, *, batch_size: int = ..., threshold: float = ..., neg_threshold: float = ..., min_speech_duration_ms: float = ..., max_speech_duration_s: float = ..., min_silence_duration_ms: float = ..., speech_pad_ms: float = ...)
```

Bases: `AsrAdapter[Iterator[SegmentResult]]`

ASR with VAD adapter (text results).

Create ASR adapter.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | `int` | Number of segments processed in parallel. | `...` |
| `threshold` | `float` | Speech detection threshold. | `...` |
| `neg_threshold` | `float` | Non-speech detection threshold. | `...` |
| `min_speech_duration_ms` | `float` | Minimum speech segment duration in milliseconds. | `...` |
| `max_speech_duration_s` | `float` | Maximum speech segment duration in seconds. | `...` |
| `min_silence_duration_ms` | `float` | Minimum silence duration in milliseconds to split speech segments. | `...` |
| `speech_pad_ms` | `float` | Padding for speech segments in milliseconds. | `...` |

Methods:

| Name | Description |
|---|---|
| `recognize` | Recognize speech (single or batch). |
| `with_timestamps` | ASR with VAD adapter (timestamped results). |
| `with_vad` | Create ASR adapter with VAD. |

Source code in `src/onnx_asr/adapters.py`
#### recognize

Recognize speech (single or batch). Inherited from `AsrAdapter`; see `AsrAdapter.recognize` above for the full signature, parameters, and raised exceptions.
#### with_timestamps

```python
with_timestamps() -> TimestampedSegmentResultsAsrAdapter
```

ASR with VAD adapter (timestamped results).

Source code in `src/onnx_asr/adapters.py`
#### with_vad

Create ASR adapter with VAD. Inherited from `AsrAdapter`; see `AsrAdapter.with_vad` above for the full signature and parameters.
### TextResultsAsrAdapter

```python
TextResultsAsrAdapter(asr: Asr, resampler: Resampler)
```

Bases: `AsrAdapter[str]`

ASR adapter (text results).

Create ASR adapter.

Methods:

| Name | Description |
|---|---|
| `recognize` | Recognize speech (single or batch). |
| `with_timestamps` | ASR adapter (timestamped results). |
| `with_vad` | Create ASR adapter with VAD. |

Source code in `src/onnx_asr/adapters.py`
#### recognize

Recognize speech (single or batch). Inherited from `AsrAdapter`; see `AsrAdapter.recognize` above for the full signature, parameters, and raised exceptions.
#### with_timestamps

```python
with_timestamps() -> TimestampedResultsAsrAdapter
```

ASR adapter (timestamped results).

Source code in `src/onnx_asr/adapters.py`
#### with_vad

Create ASR adapter with VAD. Inherited from `AsrAdapter`; see `AsrAdapter.with_vad` above for the full signature and parameters.
### TimestampedResultsAsrAdapter

```python
TimestampedResultsAsrAdapter(asr: Asr, resampler: Resampler)
```

Bases: `AsrAdapter[TimestampedResult]`

ASR adapter (timestamped results).

Create ASR adapter.

Methods:

| Name | Description |
|---|---|
| `recognize` | Recognize speech (single or batch). |
| `with_vad` | Create ASR adapter with VAD. |

Source code in `src/onnx_asr/adapters.py`
#### recognize

Recognize speech (single or batch). Inherited from `AsrAdapter`; see `AsrAdapter.recognize` above for the full signature, parameters, and raised exceptions.
#### with_vad

Create ASR adapter with VAD. Inherited from `AsrAdapter`; see `AsrAdapter.with_vad` above for the full signature and parameters.
### TimestampedSegmentResultsAsrAdapter

```python
TimestampedSegmentResultsAsrAdapter(asr: Asr, vad: Vad, resampler: Resampler, *, batch_size: int = ..., threshold: float = ..., neg_threshold: float = ..., min_speech_duration_ms: float = ..., max_speech_duration_s: float = ..., min_silence_duration_ms: float = ..., speech_pad_ms: float = ...)
```

Bases: `AsrAdapter[Iterator[TimestampedSegmentResult]]`

ASR with VAD adapter (timestamped results).

Create ASR adapter.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | `int` | Number of segments processed in parallel. | `...` |
| `threshold` | `float` | Speech detection threshold. | `...` |
| `neg_threshold` | `float` | Non-speech detection threshold. | `...` |
| `min_speech_duration_ms` | `float` | Minimum speech segment duration in milliseconds. | `...` |
| `max_speech_duration_s` | `float` | Maximum speech segment duration in seconds. | `...` |
| `min_silence_duration_ms` | `float` | Minimum silence duration in milliseconds to split speech segments. | `...` |
| `speech_pad_ms` | `float` | Padding for speech segments in milliseconds. | `...` |

Methods:

| Name | Description |
|---|---|
| `recognize` | Recognize speech (single or batch). |
| `with_vad` | Create ASR adapter with VAD. |

Source code in `src/onnx_asr/adapters.py`
#### recognize

Recognize speech (single or batch). Inherited from `AsrAdapter`; see `AsrAdapter.recognize` above for the full signature, parameters, and raised exceptions.
#### with_vad

Create ASR adapter with VAD. Inherited from `AsrAdapter`; see `AsrAdapter.with_vad` above for the full signature and parameters.
### VadOptions
*typed-dict*

```python
VadOptions(*, batch_size: int = ..., threshold: float = ..., neg_threshold: float = ..., min_speech_duration_ms: float = ..., max_speech_duration_s: float = ..., min_silence_duration_ms: float = ..., speech_pad_ms: float = ...)
```

Bases: `TypedDict`

Options for VAD.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | `int` | Number of segments processed in parallel. | `...` |
| `threshold` | `float` | Speech detection threshold. | `...` |
| `neg_threshold` | `float` | Non-speech detection threshold. | `...` |
| `min_speech_duration_ms` | `float` | Minimum speech segment duration in milliseconds. | `...` |
| `max_speech_duration_s` | `float` | Maximum speech segment duration in seconds. | `...` |
| `min_silence_duration_ms` | `float` | Minimum silence duration in milliseconds to split speech segments. | `...` |
| `speech_pad_ms` | `float` | Padding for speech segments in milliseconds. | `...` |
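`VadOptions` is likewise a plain dict at runtime and can be splatted into `with_vad`. A sketch with illustrative values (these are not the library's defaults):

```python
# VadOptions as a plain dict; passed e.g. as model.with_vad(vad, **vad_options).
vad_options = {
    "batch_size": 4,                 # segments recognized in parallel
    "threshold": 0.5,                # speech detection threshold
    "min_speech_duration_ms": 250,   # drop very short speech bursts
    "max_speech_duration_s": 30.0,   # split overly long segments
    "min_silence_duration_ms": 100,  # silence needed to end a segment
    "speech_pad_ms": 30,             # padding added around each segment
}
```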
### TimestampedResult
*dataclass*

```python
TimestampedResult(text: str, timestamps: list[float] | None = None, tokens: list[str] | None = None, logprobs: list[float] | None = None)
```

Timestamped recognition result.

Attributes:

| Name | Type | Description |
|---|---|---|
| `logprobs` | `list[float] \| None` | Token logprob list. |
| `text` | `str` | Recognized text. |
| `timestamps` | `list[float] \| None` | Token timestamp list. |
| `tokens` | `list[str] \| None` | Token list. |
### SegmentResult
*dataclass*
### TimestampedSegmentResult
*dataclass*

```python
TimestampedSegmentResult(start: float, end: float, text: str, timestamps: list[float] | None = None, tokens: list[str] | None = None, logprobs: list[float] | None = None)
```

Bases: `TimestampedResult`, `SegmentResult`

Timestamped segment recognition result.

Attributes:

| Name | Type | Description |
|---|---|---|
| `end` | `float` | Segment end time. |
| `logprobs` | `list[float] \| None` | Token logprob list. |
| `start` | `float` | Segment start time. |
| `text` | `str` | Recognized text. |
| `timestamps` | `list[float] \| None` | Token timestamp list. |
| `tokens` | `list[str] \| None` | Token list. |
### SampleRates
*module-attribute*

```python
SampleRates = Literal[8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000]
```

Supported sample rates.
### AudioLoadingError

Bases: `ValueError`

Audio loading error.