# Speech Classification
This directory contains example scripts to train speech classification and voice activity detection models. There are two types of VAD models: Frame-VAD and Segment-VAD.
## Frame-VAD
The frame-level VAD model predicts, for each frame of audio, whether it contains speech. For example, with the default config file (`../conf/marblenet/marblenet_3x2x64_20ms.yaml`), the model outputs a speech probability for each 20 ms frame.
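For orientation, here is a minimal sketch (not part of the example scripts) of how per-frame probabilities map to binary speech decisions and timestamps; the 20 ms frame shift matches the default config, and the 0.5 threshold is only an illustrative value:
```python
import numpy as np

frame_shift = 0.02  # seconds per frame, matching marblenet_3x2x64_20ms.yaml
threshold = 0.5     # illustrative value; tune on a validation set

speech_probs = np.array([0.1, 0.2, 0.9, 0.8, 0.7, 0.3])    # per-frame model output
decisions = (speech_probs >= threshold).astype(int)         # 1 = speech, 0 = non-speech
frame_starts = np.arange(len(speech_probs)) * frame_shift   # start time of each frame

for t, p, d in zip(frame_starts, speech_probs, decisions):
    print(f"{t:.2f}s  prob={p:.2f}  speech={d}")
```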
### Training
```sh
python speech_to_frame_label.py \
--config-path=<path to directory of configs, e.g. "../conf/marblenet"> \
--config-name=<name of config without .yaml, e.g. "marblenet_3x2x64_20ms"> \
model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
trainer.devices=-1 \
trainer.accelerator="gpu" \
strategy="ddp" \
trainer.max_epochs=100
```
The input must be a JSON manifest file, where each line is a Python dictionary. The fields `"audio_filepath"`, `"offset"`, `"duration"`, and `"label"` are required. An example of a manifest file:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
```
For example, a 1-second audio file needs 50 frame labels in its manifest entry, e.g. "0 0 0 0 1 1 0 1 .... 0 1".
Shorter label strings are also supported to keep manifest files small: you can provide the `label` at a 40 ms frame resolution, and the model will repeat each label to cover the corresponding 20 ms frames.
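If your ground truth is a list of speech segments rather than frame labels, a small helper can generate the label string for a manifest entry. The `segments_to_frame_labels` function below is a hypothetical sketch, not part of NeMo; the 20 ms frame shift and the "frame start inside a segment" rule are assumptions chosen to match the default config:
```python
import json

def segments_to_frame_labels(segments, duration, frame_shift=0.02):
    # Hypothetical helper: mark a frame "1" if its start time falls inside
    # any (start, end) speech segment, given in seconds.
    num_frames = int(round(duration / frame_shift))
    labels = []
    for i in range(num_frames):
        t = i * frame_shift  # start time of frame i
        labels.append("1" if any(s <= t < e for s, e in segments) else "0")
    return " ".join(labels)

entry = {
    "audio_filepath": "/path/to/audio_file1.wav",
    "offset": 0,
    "duration": 1.0,
    "label": segments_to_frame_labels([(0.2, 0.5), (0.7, 0.9)], duration=1.0),
}
print(json.dumps(entry))  # one line of the training manifest
```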
### Inference
```sh
python frame_vad_infer.py \
--config-path="../conf/vad" --config-name="frame_vad_infer_postprocess" \
dataset=<Path of manifest file containing evaluation data. Audio files should have unique names>
```
The manifest json file should have the following format (each line is a Python dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```
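As a rough sketch (not part of the scripts here), such an inference manifest can be generated from a directory of WAV files. This assumes the `soundfile` package is available for reading durations; the paths are placeholders:
```python
import glob
import json

import soundfile as sf  # assumed available; any library that reads audio durations works

with open("frame_vad_manifest.json", "w") as fout:
    for path in sorted(glob.glob("/path/to/wavs/*.wav")):  # placeholder directory
        entry = {
            "audio_filepath": path,
            "offset": 0,
            "duration": sf.info(path).duration,  # full-file duration in seconds
        }
        fout.write(json.dumps(entry) + "\n")
```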
#### Evaluation
If you want to evaluate the model's AUROC and DER performance, you need to set `evaluate: True` in the config YAML (e.g., `../conf/vad/frame_vad_infer_postprocess.yaml`) and also provide ground truth, either as label strings:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
```
or RTTM files:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}
```
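When `evaluate: True` is set, the script reports these metrics itself; the snippet below is only a hedged illustration of how frame-level AUROC is computed from aligned scores and labels, using `scikit-learn`:
```python
# Illustration only: frame-level AUROC from aligned scores and ground-truth labels.
from sklearn.metrics import roc_auc_score

ground_truth = [0, 1, 0, 0, 0, 1, 1, 1, 0]                    # labels from the manifest
speech_probs = [0.1, 0.8, 0.3, 0.2, 0.4, 0.9, 0.7, 0.6, 0.2]  # per-frame model scores

print("AUROC:", roc_auc_score(ground_truth, speech_probs))
```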
## Segment-VAD
Segment-level VAD predicts a single label for each segment of audio (0.63 s per segment by default).
### Training
```sh
python speech_to_label.py \
--config-path=<path to dir of configs, e.g. "../conf/marblenet"> \
--config-name=<name of config without .yaml, e.g., "marblenet_3x2x64"> \
model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
trainer.devices=-1 \
trainer.accelerator="gpu" \
strategy="ddp" \
trainer.max_epochs=100
```
The input must be a JSON manifest file, where each line is a Python dictionary. The fields `"audio_filepath"`, `"offset"`, `"duration"`, and `"label"` are required. An example of a manifest file:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 0.63, "label": "0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 0.63, "label": "1"}
```
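To produce such entries from a longer recording, you could slice the audio into fixed 0.63 s windows and label each window by its overlap with known speech segments. The helper below (`make_segment_entries`) and its 50% overlap rule are illustrative assumptions, not part of the example scripts:
```python
import json

def make_segment_entries(audio_filepath, total_duration, speech_segments,
                         window=0.63, min_overlap=0.5):
    # Hypothetical helper: one manifest line per `window`-second slice,
    # labeled "1" if at least `min_overlap` of the slice overlaps speech.
    entries = []
    for i in range(int(total_duration // window)):
        start = i * window
        overlap = sum(max(0.0, min(start + window, e) - max(start, s))
                      for s, e in speech_segments)
        entries.append({
            "audio_filepath": audio_filepath,
            "offset": round(start, 2),
            "duration": window,
            "label": "1" if overlap >= min_overlap * window else "0",
        })
    return entries

for entry in make_segment_entries("/path/to/audio_file1.wav", 3.2, [(0.7, 2.0)]):
    print(json.dumps(entry))
```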
### Inference
```sh
python vad_infer.py \
--config-path="../conf/vad" \
--config-name="vad_inference_postprocessing.yaml"
dataset=<Path of json file of evaluation data. Audio files should have unique names>
```
The manifest json file should have the following format (each line is a Python dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```
## Visualization
To visualize the VAD outputs, you can use the `nemo.collections.asr.parts.utils.vad_utils.plot_sample_from_rttm` function, which takes an audio file and an RTTM file as input and plots the audio waveform together with the VAD labels. Since the VAD inference script outputs a JSON manifest (`manifest_vad_out.json` by default), you can create a Jupyter notebook with the following script and fill in the paths from that output manifest:
```python
from nemo.collections.asr.parts.utils.vad_utils import plot_sample_from_rttm

plot_sample_from_rttm(
    audio_file="/path/to/audio_file.wav",
    rttm_file="/path/to/rttm_file.rttm",
    offset=0.0,
    duration=1000,
    save_path="vad_pred.png",
)
```