# Speech Classification

This directory contains example scripts to train speech classification and voice activity detection (VAD) models.

There are two types of VAD models: Frame-VAD and Segment-VAD.

## Frame-VAD

The frame-level VAD model predicts, for each frame of the audio, whether it contains speech. For example, with the default config file (`../conf/marblenet/marblenet_3x2x64_20ms.yaml`), the model outputs a speech probability for each frame of 20ms length.

### Training

```sh
python speech_to_frame_label.py \
    --config-path=<path to directory of configs, e.g. "../conf/marblenet"> \
    --config-name=<config name without .yaml, e.g. "marblenet_3x2x64_20ms"> \
    model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
    model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    trainer.strategy="ddp" \
    trainer.max_epochs=100
```

The input manifest must be a manifest json file, where each line is a Python dictionary. The fields `["audio_filepath", "offset", "duration", "label"]` are required. An example of a manifest file is:

```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
```

The `label` string contains one binary label per frame. For example, for a 1s audio file you need 50 frame labels in the manifest entry, like "0 0 0 0 1 1 0 1 ... 0 1". However, labels at a coarser resolution are also supported to keep manifest files smaller: for example, you can prepare the `label` at 40ms frame length, and the model will properly repeat each label to cover the corresponding 20ms frames.
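If your speech annotations are stored as time segments (e.g., parsed from RTTM files), you can convert them into the frame-label strings shown above. Below is a minimal sketch; the helper `segments_to_frame_labels` is hypothetical (not part of NeMo), and it assumes 20ms frames as in the default config:

```python
def segments_to_frame_labels(segments, duration_sec, frame_len_sec=0.02):
    """Return a space-separated 0/1 label string, one label per frame.

    `segments` is a list of (start_sec, end_sec) speech intervals.
    """
    num_frames = int(round(duration_sec / frame_len_sec))
    labels = []
    for i in range(num_frames):
        # Mark a frame as speech if its center falls inside any speech segment.
        center = (i + 0.5) * frame_len_sec
        labels.append("1" if any(s <= center < e for s, e in segments) else "0")
    return " ".join(labels)


# Example: a 1s file with speech between 0.3s and 0.5s yields 50 labels,
# with frames 15-24 set to "1".
print(segments_to_frame_labels([(0.3, 0.5)], duration_sec=1.0))
```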
### Inference

```sh
python frame_vad_infer.py \
    --config-path="../conf/vad" \
    --config-name="frame_vad_infer_postprocess" \
    dataset=<path to manifest json file of input audio>
```

The manifest json file should have the following format (each line is a Python dictionary):

```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```

#### Evaluation

If you want to evaluate the model's AUROC and DER performance, you need to set `evaluate: True` in the config yaml (e.g., `../conf/vad/frame_vad_infer_postprocess.yaml`), and also provide the groundtruth, either as label strings:

```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
```

or as RTTM files:

```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}
```

## Segment-VAD

Segment-level VAD predicts a single label for each segment of audio (0.63s by default).

### Training

```sh
python speech_to_label.py \
    --config-path=<path to directory of configs, e.g. "../conf/marblenet"> \
    --config-name=<config name without .yaml, e.g. "marblenet_3x2x64"> \
    model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
    model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    trainer.strategy="ddp" \
    trainer.max_epochs=100
```

The input manifest must be a manifest json file, where each line is a Python dictionary. The fields `["audio_filepath", "offset", "duration", "label"]` are required. An example of a manifest file is:

```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 0.63, "label": "0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 0.63, "label": "1"}
```

### Inference

```sh
python vad_infer.py \
    --config-path="../conf/vad" \
    --config-name="vad_inference_postprocessing.yaml" \
    dataset=<path to manifest json file of input audio>
```

The manifest json file should have the following format (each line is a Python dictionary):

```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```

## Visualization

To visualize the VAD outputs, you can use the `nemo.collections.asr.parts.utils.vad_utils.plot_sample_from_rttm` function, which takes an audio file and an RTTM file as input, and plots the audio waveform together with the VAD labels. Since the VAD inference script outputs a json manifest `manifest_vad_out.json` by default, you can create a Jupyter Notebook with the following script and fill in the paths using the output manifest:

```python
from nemo.collections.asr.parts.utils.vad_utils import plot_sample_from_rttm

plot_sample_from_rttm(
    audio_file="/path/to/audio_file.wav",
    rttm_file="/path/to/rttm_file.rttm",
    offset=0.0,
    duration=1000,
    save_path="vad_pred.png",
)
```
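To plot every file from an inference run rather than a single sample, you could loop over the output manifest. This is a minimal sketch, assuming each line of `manifest_vad_out.json` is a JSON dictionary containing `audio_filepath` and `rttm_filepath` fields; check your actual output manifest and adjust the field names if they differ:

```python
import json

from nemo.collections.asr.parts.utils.vad_utils import plot_sample_from_rttm

# Assumption: each manifest line holds "audio_filepath" and "rttm_filepath";
# "offset" and "duration" fall back to defaults when absent.
with open("manifest_vad_out.json") as f:
    for i, line in enumerate(f):
        entry = json.loads(line)
        plot_sample_from_rttm(
            audio_file=entry["audio_filepath"],
            rttm_file=entry["rttm_filepath"],
            offset=entry.get("offset", 0.0),
            duration=entry.get("duration", 1000),
            save_path=f"vad_pred_{i}.png",
        )
```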