# Speech Classification

This directory contains example scripts to train speech classification and voice activity detection models. There are two types of VAD models: Frame-VAD and Segment-VAD.

## Frame-VAD

The frame-level VAD model predicts for each frame of the audio whether it has speech or not. For example, with the default config file (`../conf/marblenet/marblenet_3x2x64_20ms.yaml`), the model provides a probability for each frame of 20ms length.

### Training
```sh
python speech_to_frame_label.py \
    --config-path=<path to directory of configs, e.g. "../conf/marblenet"> \
    --config-name=<name of config without .yaml, e.g. "marblenet_3x2x64_20ms"> \
    model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
    model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    strategy="ddp" \
    trainer.max_epochs=100
```

The input must be a JSON manifest file, where each line is a JSON dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000,  "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000,  "label": "0 0 0 1 1 1 1 0 0"}
```
For example, a 1-second audio file requires 50 frame labels in the manifest entry, e.g. "0 0 0 0 1 1 0 1 .... 0 1".
However, shorter label strings are also supported to keep manifest files small. For example, you can prepare the `label` at a 40ms frame length, and the model will automatically repeat each label to cover the corresponding 20ms frames.
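
If your annotations are given as speech segments rather than per-frame labels, you need to rasterize them into a frame-label string yourself. Below is a minimal sketch of that conversion, assuming 20ms frames; the segment times, audio path, and output manifest name are placeholders for this illustration.
```python
import json

# Hypothetical ground-truth speech segments (start, end) in seconds for a 10-second file.
speech_segments = [(1.2, 3.4), (5.0, 7.75)]
frame_length = 0.02  # 20ms frames, matching marblenet_3x2x64_20ms.yaml
duration = 10.0

# Rasterize the segments into one 0/1 label per 20ms frame.
labels = ["0"] * int(round(duration / frame_length))
for start, end in speech_segments:
    first = int(round(start / frame_length))
    last = min(int(round(end / frame_length)), len(labels))
    for i in range(first, last):
        labels[i] = "1"

# Append one training entry to a hypothetical manifest file.
entry = {
    "audio_filepath": "/path/to/audio_file1.wav",
    "offset": 0,
    "duration": duration,
    "label": " ".join(labels),
}
with open("train_manifest.json", "a") as f:
    f.write(json.dumps(entry) + "\n")
```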


### Inference
```sh
python frame_vad_infer.py \
    --config-path="../conf/vad" --config-name="frame_vad_infer_postprocess" \
    dataset=<Path of manifest file containing evaluation data. Audio files should have unique names>
```

The manifest JSON file should have the following format (each line is a JSON dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}  
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}  
```
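
One way to build this manifest is to scan a directory of audio files and query their durations, for example with the `soundfile` package. The sketch below is an illustration only; the input directory and output manifest paths are placeholders.
```python
import glob
import json

import soundfile as sf

audio_dir = "/path/to/eval_audio"     # placeholder input directory
manifest_path = "eval_manifest.json"  # placeholder output manifest

with open(manifest_path, "w") as fout:
    for audio_filepath in sorted(glob.glob(f"{audio_dir}/*.wav")):
        entry = {
            "audio_filepath": audio_filepath,
            "offset": 0,
            "duration": sf.info(audio_filepath).duration,  # duration in seconds
        }
        fout.write(json.dumps(entry) + "\n")
```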

#### Evaluation
If you want to evaluate the model's AUROC and DER performance, you need to set `evaluate: True` in the config YAML (e.g., `../conf/vad/frame_vad_infer_postprocess.yaml`), and also provide the ground truth as label strings:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
```
or RTTM files:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}
```
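
If you have ground-truth speech segments but no RTTM files yet, you can write them in the standard RTTM format (`SPEAKER <file-id> <channel> <onset> <duration> <NA> <NA> <speaker> <NA> <NA>`). The sketch below uses invented segment times and paths, and simply labels every segment as `speech`.
```python
# Hypothetical ground-truth speech segments (start, end) in seconds.
speech_segments = [(1.2, 3.4), (5.0, 7.75)]
file_id = "audio_file1"  # typically the audio file's base name without extension

with open("/path/to/rttm_file1.rttm", "w") as f:
    for start, end in speech_segments:
        # RTTM columns: type, file-id, channel, onset, duration, then placeholders and the label.
        f.write(f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} <NA> <NA> speech <NA> <NA>\n")
```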


## Segment-VAD

Segment-level VAD predicts a single label for each segment of audio (e.g., 0.63s by default).

### Training
```sh
python speech_to_label.py \
    --config-path=<path to dir of configs, e.g. "../conf/marblenet"> \
    --config-name=<name of config without .yaml, e.g., "marblenet_3x2x64"> \
    model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
    model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    strategy="ddp" \
    trainer.max_epochs=100
```

The input must be a JSON manifest file, where each line is a JSON dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 0.63,  "label": "0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 0.63,  "label": "1"}
```
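
To create such entries from a longer annotated recording, you can slide a 0.63s window over the audio and assign each window a 0/1 label based on its overlap with the ground-truth speech segments. A minimal sketch, with invented segment times and paths:
```python
import json

# Hypothetical ground-truth speech segments (start, end) in seconds for a 10-second recording.
speech_segments = [(1.2, 3.4), (5.0, 7.75)]
total_duration = 10.0
window = 0.63  # segment length expected by the default segment-VAD config

def has_speech(start, end):
    """Return True if the [start, end) window overlaps any ground-truth speech segment."""
    return any(s < end and e > start for s, e in speech_segments)

with open("segment_train_manifest.json", "w") as fout:
    offset = 0.0
    while offset + window <= total_duration:
        entry = {
            "audio_filepath": "/path/to/audio_file1.wav",
            "offset": round(offset, 2),
            "duration": window,
            "label": "1" if has_speech(offset, offset + window) else "0",
        }
        fout.write(json.dumps(entry) + "\n")
        offset += window
```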


### Inference
```sh
python vad_infer.py \
    --config-path="../conf/vad" \
    --config-name="vad_inference_postprocessing.yaml"
    dataset=<Path of json file of evaluation data. Audio files should have unique names>
```
The manifest JSON file should have the following format (each line is a JSON dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}  
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}  
```


## Visualization

To visualize the VAD outputs, you can use the `nemo.collections.asr.parts.utils.vad_utils.plot_sample_from_rttm` function, which takes an audio file and an RTTM file as input and plots the audio waveform together with the VAD labels. Since the VAD inference script outputs a JSON manifest (`manifest_vad_out.json` by default), you can run the following snippet in a Jupyter notebook and fill in the paths from that output manifest:
```python
from nemo.collections.asr.parts.utils.vad_utils import plot_sample_from_rttm

plot_sample_from_rttm(
    audio_file="/path/to/audio_file.wav",
    rttm_file="/path/to/rttm_file.rttm",
    offset=0.0,
    duration=1000,
    save_path="vad_pred.png"
)
```
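
To plot every file from the inference output in one go, you can iterate over the output manifest. The exact field names in `manifest_vad_out.json` depend on your NeMo version, so the keys used below (`audio_filepath`, `rttm_filepath`) are assumptions; inspect the manifest and adjust them if needed.
```python
import json

from nemo.collections.asr.parts.utils.vad_utils import plot_sample_from_rttm

with open("manifest_vad_out.json") as f:
    for line in f:
        entry = json.loads(line)
        plot_sample_from_rttm(
            audio_file=entry["audio_filepath"],  # assumed key name
            rttm_file=entry["rttm_filepath"],    # assumed key name
            offset=0.0,
            duration=1000,
            save_path=entry["audio_filepath"].rsplit("/", 1)[-1] + ".png",
        )
```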