Comparison with Original Implementations
Packages with original implementations:
- `gigaam` for GigaAM models (GitHub)
- `nemo-toolkit` for NeMo models (GitHub)
- `openai-whisper` for Whisper models (GitHub)
- `sherpa-onnx` for Vosk models (GitHub, docs)
- `T-one` for the T-Tech T-one model (GitHub)
Test Hardware
- CPU tests were run on a laptop with an Intel i7-7700HQ processor.
- GPU tests were run in Google Colab on an NVIDIA T4.
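
The tables on this page report speed as RTFx. Assuming the usual meaning (inverse real-time factor: seconds of audio processed per second of wall-clock time, so higher is faster), a minimal sketch of the calculation might look like the following. The helper names `rtfx` and `timed` are illustrative and not part of any package listed above; this page does not describe how its timings were actually collected.

```python
import time


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, transcribing 120 seconds of audio in 10 seconds gives `rtfx(120.0, 10.0) == 12.0`.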
Russian ASR Models
Russian ASR models were evaluated on a test subset of the Russian LibriSpeech dataset.
| Model | Package / decoding | CER | WER | RTFx (CPU) | RTFx (GPU) |
|---|---|---|---|---|---|
| GigaAM v2 CTC | default | 1.06% | 5.23% | 7.2 | 44.2 |
| GigaAM v2 CTC | onnx-asr | 1.06% | 5.23% | 11.6 | 197.0 |
| GigaAM v2 RNN-T | default | 1.10% | 5.22% | 5.5 | 23.3 |
| GigaAM v2 RNN-T | onnx-asr | 1.10% | 5.22% | 10.7 | 84.1 |
| GigaAM v3 CTC | default | 0.98% | 4.72% | 12.2 | 73.3 |
| GigaAM v3 CTC | onnx-asr | 0.98% | 4.72% | 14.5 | 223.1 |
| GigaAM v3 RNN-T | default | 0.93% | 4.39% | 8.2 | 41.6 |
| GigaAM v3 RNN-T | onnx-asr | 0.93% | 4.39% | 13.3 | 92.1 |
| GigaAM v3 E2E CTC | default | 1.50% | 7.10% | N/A | 178.0 |
| GigaAM v3 E2E CTC | onnx-asr | 1.56% | 7.80% | N/A | 222.8 |
| GigaAM v3 E2E RNN-T | default | 1.61% | 6.94% | N/A | 47.6 |
| GigaAM v3 E2E RNN-T | onnx-asr | 1.67% | 7.60% | N/A | 98.5 |
| Nemo FastConformer CTC | default | 3.11% | 13.12% | 29.1 | 143.0 |
| Nemo FastConformer CTC | onnx-asr | 3.13% | 13.10% | 45.8 | 484.7 |
| Nemo FastConformer RNN-T | default | 2.63% | 11.62% | 17.4 | 111.6 |
| Nemo FastConformer RNN-T | onnx-asr | 2.62% | 11.57% | 27.2 | 119.4 |
| Nemo Parakeet TDT 0.6B V3 | default | 2.34% | 10.95% | 5.6 | 75.4 |
| Nemo Parakeet TDT 0.6B V3 | onnx-asr | 2.38% | 10.95% | 9.7 | 97.3 |
| Nemo Canary 1B V2 | default | 4.89% | 20.00% | N/A | 14.0 |
| Nemo Canary 1B V2 | onnx-asr | 5.00% | 20.03% | N/A | 18.6 |
| T-Tech T-one | default | 1.28% | 6.56% | 11.9 | N/A |
| T-Tech T-one | onnx-asr | 1.28% | 6.57% | 11.7 | 40.6 |
| Vosk 0.52 small | greedy_search | 3.64% | 14.53% | 48.2 | 71.4 |
| Vosk 0.52 small | modified_beam_search | 3.50% | 14.25% | 29.0 | 24.7 |
| Vosk 0.52 small | onnx-asr | 3.64% | 14.53% | 45.5 | 115.0 |
| Vosk 0.54 | greedy_search | 2.21% | 9.89% | 34.8 | 64.2 |
| Vosk 0.54 | modified_beam_search | 2.21% | 9.85% | 23.9 | 24 |
| Vosk 0.54 | onnx-asr | 2.21% | 9.89% | 33.6 | 97.6 |
| Whisper base | default | 10.61% | 38.89% | 5.4 | 17.3 |
| Whisper base | onnx-asr* | 10.64% | 38.33% | 6.6 | 58.0 |
| Whisper large-v3-turbo | default | 2.96% | 10.27% | N/A | 13.6 |
| Whisper large-v3-turbo | onnx-asr** | 2.63% | 10.13% | N/A | 19.5 |
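
The CER and WER columns are character and word error rates. This page does not say which tool or text normalization produced them, so the following is only a generic sketch. It assumes plain Levenshtein distance, whitespace tokenization for words, spaces counted as characters, and corpus-level aggregation (total errors divided by total reference length). The function names are illustrative.

```python
from typing import Callable, Iterable, Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(pairs: Iterable[Tuple[str, str]],
               tokenize: Callable[[str], Sequence]) -> float:
    """Total edit errors divided by total reference length over all pairs."""
    errors = 0
    total = 0
    for ref, hyp in pairs:
        ref_tokens = tokenize(ref)
        errors += edit_distance(ref_tokens, tokenize(hyp))
        total += len(ref_tokens)
    return errors / total


def wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Word error rate, assuming whitespace tokenization."""
    return error_rate(pairs, str.split)


def cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Character error rate, assuming every character (including spaces) counts."""
    return error_rate(pairs, list)
```

With the single pair `("the cat sat", "the cat sit")`, one of three words and one of eleven characters differ, so `wer` returns 1/3 and `cer` returns 1/11. Real evaluations usually normalize case and punctuation first, which can shift both numbers noticeably.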
English ASR Models
English ASR models were evaluated on a test subset of the VoxPopuli dataset.
| Model | Package / decoding | CER | WER | RTFx (CPU) | RTFx (GPU) |
|---|---|---|---|---|---|
| Nemo Parakeet CTC 0.6B | default | 4.09% | 7.20% | 8.3 | 107.7 |
| Nemo Parakeet CTC 0.6B | onnx-asr | 4.10% | 7.22% | 11.5 | 154.7 |
| Nemo Parakeet RNN-T 0.6B | default | 3.64% | 6.32% | 6.7 | 85.0 |
| Nemo Parakeet RNN-T 0.6B | onnx-asr | 3.64% | 6.33% | 8.7 | 69.7 |
| Nemo Parakeet TDT 0.6B V2 | default | 3.88% | 6.52% | 6.5 | 87.6 |
| Nemo Parakeet TDT 0.6B V2 | onnx-asr | 3.87% | 6.52% | 10.5 | 116.7 |
| Nemo Parakeet TDT 0.6B V3 | default | 3.97% | 6.76% | 6.1 | 90.0 |
| Nemo Parakeet TDT 0.6B V3 | onnx-asr | 3.97% | 6.75% | 9.5 | 106.2 |
| Nemo Canary 1B V2 | default | 4.62% | 7.42% | N/A | 17.5 |
| Nemo Canary 1B V2 | onnx-asr | 4.67% | 7.47% | N/A | 22.1 |
| Whisper base | default | 7.81% | 13.24% | 8.4 | 27.7 |
| Whisper base | onnx-asr* | 7.52% | 12.76% | 9.2 | 92.2 |
| Whisper large-v3-turbo | default | 6.85% | 11.16% | N/A | 20.4 |
| Whisper large-v3-turbo | onnx-asr** | 10.31% | 14.65% | N/A | 29.2 |
Notes
- \* `whisper-ort` model.
- \*\* `whisper` model with `fp16` precision.
- All other models were run with the default precision: `fp32` on CPU, and `fp32` or `fp16` (some of the original models) on GPU.