ONNX ASR¶
onnx-asr is a Python package for Automatic Speech Recognition using ONNX models. It's a lightweight, fast, and easy-to-use pure Python package with minimal dependencies (no PyTorch, Transformers, or FFmpeg required).
Key features of onnx-asr include:
- Supports many modern ASR models
- Runs on a wide range of devices, from small IoT/edge devices to servers with powerful GPUs (benchmarks)
- Works on Windows, Linux, and macOS on x86 and Arm CPUs, with support for CUDA, TensorRT, CoreML, DirectML, ROCm, and WebGPU
- Supports NumPy versions from 1.22 to 2.4+ and Python versions from 3.10 to 3.14
- Loads models from Hugging Face or local directories, including quantized versions
- Accepts WAV files or NumPy arrays, with built-in file reading and resampling
- Supports custom models (see the Conversion Guide for instructions)
- Supports batch processing
- Supports long-form recognition using VAD (Voice Activity Detection)
- Can return token-level timestamps and log probabilities
- Provides a fully typed and well-documented Python API
- Provides a simple command-line interface (CLI)
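As a sketch of the NumPy input path mentioned above: the snippet below builds a one-second 16 kHz mono waveform with NumPy; the commented `recognize` call (with the `sample_rate` keyword, as described in the package docs) is how such an array would be passed to a loaded model. The model name and keyword usage follow the quickstart below, but treat the exact call as an assumption, not a guarantee.

```python
import numpy as np

# Build one second of 16 kHz mono audio as float32 (a 440 Hz sine here).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
waveform = (0.1 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

# The array can then be passed directly to a loaded model, e.g.
# (assuming the `sample_rate` keyword from the package docs):
#   model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
#   print(model.recognize(waveform, sample_rate=sample_rate))
print(waveform.shape, waveform.dtype)
```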
Note
Supports Parakeet v2 (En) / v3 (Multilingual), Canary v1/v2 (Multilingual) and GigaAM v2/v3 (Ru) models!
Warning
onnxruntime 1.24.1 has known compatibility issues with onnx-asr. Please use newer (or older) versions!
Quickstart¶
Install onnx-asr:
```shell
pip install onnx-asr[cpu,hub]
```
Load a model and recognize a WAV file:
```python
import onnx_asr

# Load the Parakeet TDT v3 model from Hugging Face (may take a few minutes)
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")

# Recognize speech and print the result
result = model.recognize("test.wav")
print(result)
```
Warning
The maximum audio length for most models is 20-30 seconds. For longer audio, VAD can be used.
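If VAD is not used, one simple (if cruder) alternative is to split the waveform into fixed-size windows yourself before recognition. The sketch below assumes a hypothetical 20-second limit and a 16 kHz sample rate; `split_into_chunks` is a helper name invented here, not part of the package.

```python
import numpy as np

# Split a long waveform into fixed-size windows (hypothetical 20 s limit),
# as a simple alternative to VAD-based segmentation.
def split_into_chunks(waveform, sample_rate=16000, max_seconds=20):
    chunk = sample_rate * max_seconds
    return [waveform[i:i + chunk] for i in range(0, len(waveform), chunk)]

audio = np.zeros(16000 * 65, dtype=np.float32)  # 65 s of silence
chunks = split_into_chunks(audio)
print([len(c) / 16000 for c in chunks])  # → [20.0, 20.0, 20.0, 5.0]
```

Note that naive fixed-size splitting can cut words in half at chunk boundaries, which is exactly the problem VAD-based segmentation avoids.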
For more examples, see the Usage Guide.
See the Installation Guide for detailed installation instructions.
Supported Model Architectures¶
The package supports the following modern ASR model architectures (see the supported model names for the full list of models and comparison with original implementations):
- NVIDIA NeMo Conformer/FastConformer/Parakeet/Canary (with CTC, RNN-T, TDT and Transformer decoders)
- GigaChat GigaAM v2/v3 (with CTC and RNN-T decoders, including E2E versions)
- Kaldi Icefall Zipformer (with stateless RNN-T decoder) including Alpha Cephei Vosk 0.52+
- T-Tech T-one (with CTC decoder, no streaming support yet)
- OpenAI Whisper
When these models are saved in ONNX format, usually only the encoder and decoder are exported. Running them therefore requires the corresponding preprocessing and decoding steps, which the package implements for all supported models:
- Log-mel spectrogram preprocessors
- Greedy search decoding
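To illustrate the decoding side, here is a minimal greedy CTC decoding sketch in NumPy: pick the highest-scoring token per frame, collapse consecutive repeats, and drop the blank token. The blank id of 0 and the toy logit matrix are assumptions for the example; the package's actual decoders cover RNN-T, TDT, and Transformer decoding as well.

```python
import numpy as np

# Minimal greedy CTC decoding: best token per frame, collapse repeats,
# drop the blank (assumed id 0 here).
def greedy_ctc_decode(logits, blank_id=0):
    ids = logits.argmax(axis=-1)
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(int(i))
        prev = i
    return out

# Toy 6-frame, 4-class logit matrix.
logits = np.array([
    [9, 0, 0, 0],  # blank
    [0, 9, 0, 0],  # token 1
    [0, 9, 0, 0],  # token 1 (repeat, collapsed)
    [9, 0, 0, 0],  # blank
    [0, 0, 9, 0],  # token 2
    [0, 0, 0, 9],  # token 3
])
print(greedy_ctc_decode(logits))  # → [1, 2, 3]
```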
Benchmarks¶
Inverse Real-Time Factor (RTFx): the ratio of audio duration to processing time. RTFx > 1 means the audio is processed faster than real time; higher values indicate better performance.
| Model | 9800X3D CPU (RTFx) | Cortex A53 CPU (RTFx) | T4 CUDA (RTFx) | RTX 5070 Ti TensorRT (RTFx) |
|---|---|---|---|---|
| NeMo Parakeet v2/v3 | 36 | 1.0 | 57 | 320 |
| NeMo Canary v2 | 8 | N/A | 21 | 36 |
| GigaAM v3 CTC | 59 | 1.6 | 84 | 1370 |
| GigaAM v3 RNN-T | 43 | 1.5 | 40 | 130 |
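As a quick sanity check on the metric, RTFx follows directly from its definition above; the durations in this snippet are illustrative, not taken from the benchmark runs.

```python
# RTFx = audio duration / processing time; values above 1 mean
# faster-than-real-time processing.
def rtfx(audio_seconds, processing_seconds):
    return audio_seconds / processing_seconds

print(rtfx(60.0, 1.0))    # → 60.0 (a 60 s clip processed in 1 s)
print(rtfx(60.0, 120.0))  # → 0.5  (slower than real time)
```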
See the Benchmarks page for detailed performance benchmarks.
Troubleshooting / FAQ¶
See the Troubleshooting Guide for common issues and solutions.
For more help, check the GitHub Issues or open a new one.