Speech Classification
This directory contains example scripts to train speech classification and voice activity detection models. There are two types of VAD models: Frame-VAD and Segment-VAD.
Frame-VAD
The frame-level VAD model predicts, for each frame of audio, whether it contains speech. For example, with the default config file (../conf/marblenet/marblenet_3x2x64_20ms.yaml), the model outputs a speech probability for each 20ms frame.
Training
python speech_to_frame_label.py \
    --config-path=<path to directory of configs, e.g. "../conf/marblenet"> \
    --config-name=<name of config without .yaml, e.g. "marblenet_3x2x64_20ms"> \
    model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
    model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    trainer.strategy="ddp" \
    trainer.max_epochs=100
The input must be a JSON manifest file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example manifest file:
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
For example, for a 1-second audio file you need 50 frame labels in the manifest entry, such as "0 0 0 0 1 1 0 1 .... 0 1".
However, shorter label strings are also supported to keep manifest files small. For example, you can prepare labels at a 40ms frame resolution, and the model will repeat each label appropriately to cover its 20ms frames.
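If your ground truth comes as speech segments (start and end times in seconds) rather than per-frame labels, a small script can convert them into the label string expected by the manifest. The sketch below is a minimal, illustrative example assuming 20ms frames as in the default config; the segments_to_frame_labels helper is not part of NeMo.

import json

def segments_to_frame_labels(segments, duration, frame_len=0.02):
    """Convert (start, end) speech segments in seconds to a space-separated 0/1 frame label string."""
    num_frames = int(round(duration / frame_len))
    labels = [0] * num_frames
    for start, end in segments:
        first = int(start // frame_len)
        last = min(num_frames, int(round(end / frame_len)))
        for i in range(first, last):
            labels[i] = 1
    return " ".join(str(x) for x in labels)

# Example: a 1s file with speech from 0.2s to 0.5s yields 50 frame labels.
entry = {
    "audio_filepath": "/path/to/audio_file1.wav",
    "offset": 0,
    "duration": 1.0,
    "label": segments_to_frame_labels([(0.2, 0.5)], duration=1.0),
}
print(json.dumps(entry))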
Inference
python frame_vad_infer.py \
    --config-path="../conf/vad" \
    --config-name="frame_vad_infer_postprocess" \
    dataset=<Path of json file of evaluation data. Audio files should have unique names>
The manifest json file should have the following format (each line is a Python dictionary):
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
Evaluation
If you want to evaluate the model's AUROC and DER performance, set evaluate: True in the config yaml (e.g., ../conf/vad/frame_vad_infer_postprocess.yaml), and provide ground truth either as label strings:
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
or as RTTM files:
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}
Segment-VAD
The segment-level VAD model predicts a single label for each segment of audio (0.63s long by default).
Training
python speech_to_label.py \
    --config-path=<path to dir of configs, e.g. "../conf/marblenet"> \
    --config-name=<name of config without .yaml, e.g. "marblenet_3x2x64"> \
    model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
    model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    trainer.strategy="ddp" \
    trainer.max_epochs=100
The input must be a JSON manifest file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example manifest file:
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 0.63, "label": "0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 0.63, "label": "1"}
Inference
python vad_infer.py \
    --config-path="../conf/vad" \
    --config-name="vad_inference_postprocessing" \
    dataset=<Path of json file of evaluation data. Audio files should have unique names>
The manifest json file should have the following format (each line is a Python dictionary):
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
Visualization
To visualize the VAD outputs, you can use the nemo.collections.asr.parts.utils.vad_utils.plot_sample_from_rttm function, which takes an audio file and an RTTM file as input and plots the audio waveform together with the VAD labels. Since the VAD inference script writes a JSON manifest (manifest_vad_out.json by default), you can create a Jupyter notebook with the following script and fill in the paths from that output manifest:
from nemo.collections.asr.parts.utils.vad_utils import plot_sample_from_rttm

# Plot the audio waveform together with the speech segments from the RTTM file.
plot_sample_from_rttm(
    audio_file="/path/to/audio_file.wav",  # audio file to visualize
    rttm_file="/path/to/rttm_file.rttm",   # RTTM file with the (predicted) speech segments
    offset=0.0,                            # start time in seconds within the audio file
    duration=1000,                         # length in seconds of the region to plot
    save_path="vad_pred.png"               # where to save the resulting figure
)
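To plot every file from an inference run, you can loop over the output manifest. This is a sketch only: it assumes each line of manifest_vad_out.json contains "audio_filepath" and "rttm_filepath" fields, so inspect your output manifest first and adjust the keys if they differ.

import json
from nemo.collections.asr.parts.utils.vad_utils import plot_sample_from_rttm

with open("manifest_vad_out.json") as f:
    for i, line in enumerate(f):
        entry = json.loads(line)  # field names assumed; verify against your output manifest
        plot_sample_from_rttm(
            audio_file=entry["audio_filepath"],
            rttm_file=entry["rttm_filepath"],
            offset=0.0,
            duration=entry.get("duration", 1000),
            save_path=f"vad_pred_{i}.png",
        )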