AI - ComfyUI: using VibeVoice TTS (text-to-speech) to convert text to speech

2026-04-23 06:34


refer to: https://www.doubao.com/thread/w8081b1904380241d

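1. Open ComfyUI.

2. Click the extension manager button in the top-right corner.

3. Search for "tts". The second result from the left in the first row should be "vibevoice comfyui"; toggle its switch on.

(Alternatively, download the GitHub repo manually: git clone https://github.com/Enemyx-net/VibeVoice-ComfyUI

and save it under ComfyUI's custom_nodes directory; in my case that is F:\000_comfyui_files\custom_nodes )

4. cd into that folder.

5. pip install -r requirements.txt

6. Restart ComfyUI.

7. Click Templates on the left -> in the window that pops up, scroll to the bottom -> Extensions -> vibe voice comfyui -> single speaker

The corresponding workflow then appears.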
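The manual install route (clone, then install the requirements) can be sketched as a small script. This is only an illustration: `build_install_commands` and `run_install` are hypothetical helper names, and the script assumes that `git` is on the PATH and that the Python interpreter running it is the same one ComfyUI uses (portable builds often ship their own interpreter, in which case pass its path as `python_exe`).

```python
import subprocess
import sys
from pathlib import Path

REPO_URL = "https://github.com/Enemyx-net/VibeVoice-ComfyUI"


def build_install_commands(comfy_root, python_exe=sys.executable):
    """Return (command, working_directory) pairs for the manual install."""
    custom_nodes = Path(comfy_root) / "custom_nodes"
    # "git clone" creates a folder named after the last URL segment.
    node_dir = custom_nodes / "VibeVoice-ComfyUI"
    return [
        (["git", "clone", REPO_URL], custom_nodes),
        ([python_exe, "-m", "pip", "install", "-r", "requirements.txt"], node_dir),
    ]


def run_install(comfy_root, python_exe=sys.executable):
    """Run the commands in order, stopping at the first failure."""
    for command, cwd in build_install_commands(comfy_root, python_exe):
        subprocess.run(command, cwd=cwd, check=True)
```

With my layout the call would be `run_install(r"F:\000_comfyui_files")`; ComfyUI still has to be restarted afterwards.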

8. Note the two comment-style boxes on the left of the workflow. They contain text:

8.1 download models

8.2 download tokenizers

In other words, this directory layout must exist:

models/vibevoice/

-> VibeVoice-1.5b / ... holds every file from https://huggingface.co/microsoft/VibeVoice-1.5B/tree/main (download all of them, including the various JSON files, otherwise the model will not work)

-> VibeVoice-Large / ... holds every file from https://hf-mirror.com/vibevoice/VibeVoice-7B/tree/main

-> tokenizer holds every file from https://huggingface.co/Qwen/Qwen2.5-1.5B/tree/main

( 7B is the "Large" model )
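Instead of clicking through every file on the web pages, the three folders could be filled with the `huggingface_hub` package. The following is a sketch under assumptions, not something I ran for this post: it assumes `huggingface_hub` is installed, takes the repo ids from the links above, and uses hypothetical helper names (`download_plan`, `download_all`). Passing `endpoint="https://hf-mirror.com"` routes the download through the mirror.

```python
import os
from pathlib import Path

# Folder under models/vibevoice -> repo id, taken from the links above.
TARGETS = {
    "VibeVoice-1.5b": "microsoft/VibeVoice-1.5B",
    "VibeVoice-Large": "vibevoice/VibeVoice-7B",
    "tokenizer": "Qwen/Qwen2.5-1.5B",
}


def download_plan(comfy_root):
    """Return (repo_id, local_folder) pairs without downloading anything."""
    base = Path(comfy_root) / "models" / "vibevoice"
    return [(repo_id, base / folder) for folder, repo_id in TARGETS.items()]


def download_all(comfy_root, endpoint=None):
    """Download every file of each repo into its folder."""
    if endpoint:
        # Read by huggingface_hub at import time, so this only takes effect
        # if the package has not been imported yet in this process.
        os.environ["HF_ENDPOINT"] = endpoint
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in download_plan(comfy_root):
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
```

Note that this pulls whole repositories, so the tokenizer folder also receives the Qwen model weights, which matches the "download everything" approach described above but takes extra disk space.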

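9. Close ComfyUI.

10. Right-click My Computer -> Properties -> Environment Variables, and add a variable named HF_ENDPOINT with the value https://hf-mirror.com

11. Start ComfyUI again and reopen the workflow. In the workflow, disable (bypass) the "Load Audio" node. (It is there so that the user can supply their own voice.)

12. Run it.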

Loading checkpoint shards: 100%|██████████| 3/3 [00:02<00:00, 1.39it/s]
[VibeVoice] Model loaded in 2.34 seconds
[VibeVoice] Loading VibeVoice processor...
[VibeVoice] Found Qwen tokenizer in: F:\000_comfyui_files\models\vibevoice\tokenizer
[VibeVoice] Found complete tokenizer at: F:\000_comfyui_files\models\vibevoice\tokenizer
[VibeVoice] Standard from_pretrained failed: expected str, bytes or os.PathLike object, not NoneType
[VibeVoice] Trying with allow remote files...
[VibeVoice] Processing text segment 1 (10 words)
[VibeVoice] Starting audio generation with 20 diffusion steps...
[VibeVoice] Generating audio with 20 diffusion steps...
[VibeVoice] Note: Progress bar shows max possible tokens, not actual needed (~30 estimated)
[VibeVoice] The generation will stop automatically when audio is complete
[VibeVoice] Concatenating 1 audio segments (including pauses)...
[VibeVoice] Successfully generated audio with 1 segments
[VibeVoice] Model and processor memory freed successfully
Prompt executed in 21.59 seconds
FETCH ComfyRegistry Data [DONE]
[ComfyUI-Manager] default cache updated: https://api.comfy.org/nodes
FETCH DATA from: c:\ComfyUI\user\__manager\cache\1514988643_custom-node-list.json [DONE]
[ComfyUI-Manager] All startup tasks have been completed.
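The log shows the node finding the tokenizer under models\vibevoice\tokenizer. If a run fails before reaching that point, a quick check of the folder layout may help. The sketch below is a rough check only: `find_layout_problems` is a hypothetical helper, and "contains at least one JSON file" is my stand-in for "the files were downloaded", not the node's real validation logic.

```python
from pathlib import Path

EXPECTED_FOLDERS = ("VibeVoice-1.5b", "VibeVoice-Large", "tokenizer")


def find_layout_problems(comfy_root, folders=EXPECTED_FOLDERS):
    """Return a list of human-readable problems; an empty list means OK."""
    base = Path(comfy_root) / "models" / "vibevoice"
    problems = []
    for name in folders:
        folder = base / name
        if not folder.is_dir():
            problems.append(f"missing folder: {folder}")
        elif not any(folder.glob("*.json")):
            # The config/tokenizer JSON files are easy to forget.
            problems.append(f"no JSON files in: {folder}")
    return problems
```

For example, `find_layout_problems(r"F:\000_comfyui_files")` should return an empty list once all three folders are populated; pass a shorter `folders` tuple if only one model is installed.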

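Usage:

1. An mp3 voice reference; optional.

2. seed: determines the voice timbre.

diffusion_steps: lower values give more conservative output; higher values give more "exploratory" output (background music and the like started to appear). Default: 20.

cfg_scale: the higher it is, the more fixed the background sound becomes (it turned into noise); the lower it is, the more background music and singing there is. Default: 1.35.

From my trials, a setting that seems fairly good is: steps 15, seed 4, cfg_scale 1.45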
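Since finding a usable voice comes down to trying combinations, it can help to list the combinations up front and tick them off. The sketch below only enumerates settings; it does not talk to ComfyUI, and `settings_grid` is a hypothetical helper. Each value still has to be typed into the node by hand.

```python
from itertools import product


def settings_grid(seeds, steps=(20,), cfg_scales=(1.35,)):
    """Return every seed/diffusion_steps/cfg_scale combination as a dict."""
    return [
        {"seed": seed, "diffusion_steps": step, "cfg_scale": cfg}
        for seed, step, cfg in product(seeds, steps, cfg_scales)
    ]


if __name__ == "__main__":
    # Seeds 1..5 against the default values and the ones I ended up with.
    for row in settings_grid(range(1, 6), steps=(15, 20), cfg_scales=(1.35, 1.45)):
        print(row)
```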

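I have decided not to use it.

1. Not convenient: there is no way to choose the voice's gender; all you can do is change the seed and hope.

2. The background sound cannot be removed.

3. The quality is not high. Odd, unexplained pauses appear mid-sentence.

4. It cannot convey emotion; the reading is very mechanical.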
