mnhatdaous committed
Commit 421543d · 1 parent: 1c43d7b

Add Hugging Face Space configuration with Docker support

Files changed (5)
  1. .dockerignore +45 -0
  2. Dockerfile +30 -0
  3. README_HF.md +53 -0
  4. app.py +127 -0
  5. requirements-hf.txt +14 -0
.dockerignore ADDED
@@ -0,0 +1,45 @@
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ env/
+ venv/
+ .venv/
+ pip-log.txt
+ .tox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.log
+ .git/
+ .mypy_cache/
+ .pytest_cache/
+ .hypothesis/
+ .DS_Store
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # Large model files (download separately)
+ *.pt
+ *.pth
+ *.bin
+ *.safetensors
+ *.ckpt
+
+ # Dataset files
+ *.wav
+ *.mp3
+ *.flac
+ *.parquet
+
+ # Logs and temporary files
+ logs/
+ wandb/
+ tmp/
+ temp/
Dockerfile ADDED
@@ -0,0 +1,30 @@
+ FROM python:3.10-slim
+
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     ffmpeg \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first for better caching
+ COPY requirements-hf.txt ./requirements.txt
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy the entire project
+ COPY . .
+
+ # Set environment variables
+ ENV PYTHONPATH=/app
+ ENV GRADIO_SERVER_NAME=0.0.0.0
+ ENV GRADIO_SERVER_PORT=7860
+
+ # Expose the port
+ EXPOSE 7860
+
+ # Run the application
+ CMD ["python", "app.py"]
README_HF.md ADDED
@@ -0,0 +1,53 @@
+ ---
+ title: Learnable Speech
+ emoji: 🎤
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker
+ pinned: false
+ license: apache-2.0
+ app_port: 7860
+ ---
+
+ # Learnable-Speech: High-Quality 24kHz Speech Synthesis
+
+ An unofficial implementation of improvements to CosyVoice, featuring a learnable encoder and a DAC-VAE.
+
+ ## Demo
+
+ This Space provides a demo interface for the Learnable-Speech model. It currently ships a placeholder implementation. To serve the actual trained model, you would need to:
+
+ 1. Train the model using the provided training pipeline
+ 2. Upload the trained checkpoints
+ 3. Replace the placeholder inference code with actual model loading and inference (see the sketch below)
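+
+ A minimal sketch of what step 3 could look like. The checkpoint path and the `synthesize` method are hypothetical stand-ins for whatever API the trained pipeline ends up exposing; only the return contract, `(sample_rate, float32 waveform)` for `gr.Audio(type="numpy")`, comes from `app.py`:
+
+ ```python
+ import numpy as np
+ import torch
+
+ def load_model(path="checkpoints/stage2.pt"):
+     # Hypothetical checkpoint path -- adjust to the real training outputs.
+     return torch.load(path, map_location="cpu")
+
+ def synthesize_speech(text, speaker_id=0):
+     """Drop-in replacement for the placeholder in app.py: same signature,
+     same (sample_rate, np.float32 waveform) return shape for gr.Audio."""
+     model = load_model()
+     wav = model.synthesize(text, speaker_id=speaker_id)  # hypothetical method
+     return (24000, np.asarray(wav, dtype=np.float32))
+ ```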
+
+ ## Features
+
+ - **24kHz audio support**: High-quality audio generation at a 24kHz sampling rate
+ - **Flow-matching AE**: Flow-matching training for autoencoders
+ - **Immiscible assignment**: Supports immiscible noise assignment when adding noise during training
+ - **Contrastive flow matching**: Supports contrastive flow-matching training
+
+ ## Architecture
+
+ ### Stage 1: Audio to Discrete Tokens
+
+ Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
+
+ ### Stage 2: Discrete Tokens to Continuous Latent Space
+
+ Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE). The stub sketch below shows how the two stages compose.
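+
+ The following is illustrative only: the classes are stubs standing in for the real Stage 1 and Stage 2 models, and the token rate and latent dimension are assumptions, not values from this repository:
+
+ ```python
+ import numpy as np
+
+ class FSQTokenizerStub:
+     """Stage 1 stand-in: raw audio -> discrete FSQ token ids."""
+     def encode(self, wav: np.ndarray) -> np.ndarray:
+         # Assumed ~50 tokens/sec at 24kHz; illustrative only.
+         return np.zeros(len(wav) // 480, dtype=np.int64)
+
+ class FlowModelStub:
+     """Stage 2 stand-in: discrete tokens -> continuous VAE latents."""
+     def tokens_to_latents(self, tokens: np.ndarray) -> np.ndarray:
+         # Latent dimension of 64 is an illustrative placeholder.
+         return np.zeros((len(tokens), 64), dtype=np.float32)
+
+ wav = np.zeros(24000, dtype=np.float32)               # one second of 24kHz audio
+ tokens = FSQTokenizerStub().encode(wav)               # Stage 1
+ latents = FlowModelStub().tokens_to_latents(tokens)   # Stage 2; a DAC-VAE decoder
+                                                       # would map latents back to audio
+ ```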
+
+ ## Links
+
+ - [GitHub Repository](https://github.com/primepake/learnable-speech)
+ - [Technical Paper](https://arxiv.org/pdf/2505.07916)
+ - [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)
+
+ ## Usage
+
+ 1. Enter text in the text box
+ 2. Select a speaker ID (0-10)
+ 3. Click "Generate Speech" to synthesize audio
+
+ **Note**: This is currently a placeholder demo. The actual model requires training first.
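+
+ The Space can also be driven programmatically with `gradio_client` (installed alongside Gradio). The Space ID below is a placeholder; the endpoint name assumes Gradio's default of deriving it from the handler function's name (`synthesize_speech` in `app.py`):
+
+ ```python
+ from gradio_client import Client
+
+ # Placeholder Space ID -- substitute the actual owner/space name.
+ client = Client("your-username/learnable-speech")
+
+ # Arguments mirror the UI: (text, speaker_id).
+ result = client.predict(
+     "Hello from the API!",
+     3,
+     api_name="/synthesize_speech",  # assumed default endpoint name
+ )
+ print(result)  # local filepath of the downloaded audio
+ ```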
app.py ADDED
@@ -0,0 +1,127 @@
+ import gradio as gr
+ import numpy as np
+
+
+ def synthesize_speech(text, speaker_id=0):
+     """
+     Placeholder function for speech synthesis.
+     Replace this with actual model inference once you have trained models.
+     """
+     if not text.strip():
+         return None
+
+     # This is a placeholder - replace with actual model inference
+     sample_rate = 24000
+     duration = max(1.0, len(text) * 0.08)  # rough length estimate from text
+     samples = int(sample_rate * duration)
+
+     # Generate a simple sine wave as placeholder audio
+     t = np.linspace(0, duration, samples)
+     frequency = 440 + (speaker_id * 50)  # vary frequency by speaker
+
+     # Create a more interesting waveform: decaying fundamental,
+     # quieter second harmonic, and a little noise
+     audio = (
+         0.3 * np.sin(2 * np.pi * frequency * t) * np.exp(-t / (duration * 0.8)) +
+         0.1 * np.sin(2 * np.pi * frequency * 2 * t) * np.exp(-t / duration) +
+         0.05 * np.random.randn(samples)
+     )
+
+     # Apply fade in/out to avoid clicks at the boundaries
+     fade_samples = int(0.1 * sample_rate)
+     audio[:fade_samples] *= np.linspace(0, 1, fade_samples)
+     audio[-fade_samples:] *= np.linspace(1, 0, fade_samples)
+
+     return (sample_rate, audio.astype(np.float32))
+
+
+ def create_demo():
+     with gr.Blocks(title="Learnable-Speech Demo", theme=gr.themes.Soft()) as demo:
+         gr.Markdown(
+             """
+             # 🎤 Learnable-Speech: High-Quality 24kHz Speech Synthesis
+
+             An unofficial implementation of improvements to CosyVoice, featuring a learnable encoder and a DAC-VAE.
+
+             > **Note**: This is a demo interface. To use the actual model, you need to train it first using the provided training pipeline.
+             """
+         )
+
+         with gr.Row():
+             with gr.Column():
+                 text_input = gr.Textbox(
+                     label="Text to synthesize",
+                     placeholder="Enter text here...",
+                     lines=3,
+                     value="Hello, this is a demo of Learnable-Speech synthesis."
+                 )
+
+                 with gr.Row():
+                     speaker_slider = gr.Slider(
+                         minimum=0,
+                         maximum=10,
+                         value=0,
+                         step=1,
+                         label="Speaker ID"
+                     )
+
+                 generate_btn = gr.Button("🎵 Generate Speech", variant="primary", size="lg")
+
+             with gr.Column():
+                 audio_output = gr.Audio(
+                     label="Generated Speech",
+                     type="numpy"
+                 )
+
+         with gr.Accordion("📋 Project Information", open=False):
+             gr.Markdown(
+                 """
+                 ### Key Features
+                 - **24kHz audio support**: High-quality audio generation at a 24kHz sampling rate
+                 - **Flow-matching AE**: Flow-matching training for autoencoders
+                 - **Immiscible assignment**: Supports immiscible noise assignment when adding noise during training
+                 - **Contrastive flow matching**: Supports contrastive flow-matching training
+
+                 ### Architecture
+                 **Stage 1**: Audio to Discrete Tokens - converts raw audio into discrete representations using FSQ (S3Tokenizer)
+
+                 **Stage 2**: Discrete Tokens to Continuous Latent Space - maps discrete tokens to a continuous latent space using a VAE
+
+                 ### Training Pipeline
+                 1. Extract discrete tokens using the trained FSQ S3Tokenizer
+                 2. Generate continuous latent representations using the trained DAC-VAE
+                 3. Train Stage 1: BPE tokens → discrete FSQ tokens
+                 4. Train Stage 2: discrete FSQ tokens → DAC-VAE continuous latent space
+
+                 ### Links
+                 - [GitHub Repository](https://github.com/primepake/learnable-speech)
+                 - [Technical Paper](https://arxiv.org/pdf/2505.07916)
+                 """
+             )
+
+         # Example inputs
+         gr.Examples(
+             examples=[
+                 ["Hello everyone! I am here to tell you that Learnable-Speech is amazing!", 0],
+                 ["The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle.", 1],
+                 ["We propose Learnable-Speech, a new approach to neural text-to-speech synthesis.", 2],
+                 ["This implementation uses flow matching for high-quality 24kHz audio generation.", 3],
+             ],
+             inputs=[text_input, speaker_slider],
+             outputs=audio_output,
+             fn=synthesize_speech,
+             cache_examples=False,
+         )
+
+         generate_btn.click(
+             fn=synthesize_speech,
+             inputs=[text_input, speaker_slider],
+             outputs=audio_output
+         )
+
+     return demo
+
+
+ if __name__ == "__main__":
+     demo = create_demo()
+     demo.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False
+     )
requirements-hf.txt ADDED
@@ -0,0 +1,14 @@
+ gradio==4.44.0
+ torch==2.1.0
+ torchaudio==2.1.0
+ numpy==1.24.3
+ soundfile==0.12.1
+ librosa==0.10.1
+ transformers==4.36.0
+ omegaconf==2.3.0
+ hydra-core==1.3.2
+
+ # Optional: add these if you need the full training pipeline
+ # deepspeed==0.12.6
+ # tensorboard==2.14.0
+ # matplotlib==3.7.2