AI - ComfyUI: using VibeVoice TTS (text-to-speech) to convert text into speech

refer to: https://www.doubao.com/thread/w8081b1904380241d

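1. Open ComfyUI.

2. Click "Manage Extensions" (管理扩展功能) in the top-right corner.

3. Search for tts. The second item from the left in the first row is "vibevoice comfyui"; toggle its switch on.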

(Alternatively, download the GitHub repo manually:  git clone https://github.com/Enemyx-net/VibeVoice-ComfyUI

and save it under ComfyUI's custom_nodes directory. Mine, for example, is F:\000_comfyui_files\custom_nodes )

4. cd into the node's folder

5. pip install -r requirements.txt 

6. Restart ComfyUI
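
The manual route in steps 3 to 5 can be sketched as a small Python script. This is only a sketch under assumptions: the function names are made up for illustration, the cloned folder is assumed to be named VibeVoice-ComfyUI (git's default for this URL), and sys.executable is assumed to be the same Python interpreter that ComfyUI runs with. If ComfyUI uses its own embedded Python, run the script with that interpreter instead.

```python
import subprocess
import sys
from pathlib import Path

REPO_URL = "https://github.com/Enemyx-net/VibeVoice-ComfyUI"


def install_commands(custom_nodes_dir):
    """Return (command, working directory) pairs for the manual install."""
    custom_nodes = Path(custom_nodes_dir)
    # Assumption: git clones into a folder named after the repository.
    node_dir = custom_nodes / "VibeVoice-ComfyUI"
    return [
        (["git", "clone", REPO_URL], custom_nodes),
        # Assumption: sys.executable is the interpreter ComfyUI uses.
        ([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"], node_dir),
    ]


def run_install(custom_nodes_dir):
    """Run the clone and the dependency install, stopping on the first error."""
    for command, cwd in install_commands(custom_nodes_dir):
        subprocess.run(command, cwd=cwd, check=True)
```

Calling run_install(r"F:\000_comfyui_files\custom_nodes") would perform both steps; ComfyUI still has to be restarted afterwards.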

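7. Click Templates (模板) on the left -> in the window that pops up, scroll to the bottom -> Extensions (扩展) -> vibe voice comfyui -> single speaker

The corresponding workflow then appears.

8. Note the two note-style boxes on the left of the workflow. They contain text:

8.1 download models

8.2 download tokenizers

In other words, the following directory layout must exist: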

models/vibevoice/

    ->   VibeVoice-1.5b / ...   holds all files from https://huggingface.co/microsoft/VibeVoice-1.5B/tree/main (download every file, including the various JSON files, otherwise the model cannot be used)

    ->   VibeVoice-Large / ...   holds all files from https://hf-mirror.com/vibevoice/VibeVoice-7B/tree/main

    ->   tokenizer   holds all files from https://huggingface.co/Qwen/Qwen2.5-1.5B/tree/main

( 7B is the "Large" model )
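
Before restarting, the layout above can be checked with a few lines of Python. This is a rough sketch: the function name is made up for illustration, and it only checks that each of the three folders exists and is not empty, not that every required file is actually there.

```python
from pathlib import Path

# Folder names taken from the layout described above.
EXPECTED_SUBDIRS = ("VibeVoice-1.5b", "VibeVoice-Large", "tokenizer")


def missing_model_dirs(models_root):
    """Return the expected folders that are absent or empty under models/vibevoice/."""
    base = Path(models_root) / "vibevoice"
    missing = []
    for name in EXPECTED_SUBDIRS:
        folder = base / name
        # A folder counts as missing if it does not exist or contains nothing.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing
```

For example, missing_model_dirs(r"F:\000_comfyui_files\models") returns an empty list when all three folders are populated.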

9. Close ComfyUI

10. Right-click My Computer -> Properties -> Environment Variables, and add a variable named HF_ENDPOINT with the value https://hf-mirror.com
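
For anyone who prefers scripting the downloads over clicking through the web pages, here is a hedged sketch that combines the mirror setting with the three repositories listed in step 8. It assumes the huggingface_hub package is installed (it is not part of this post's setup), and the names DOWNLOADS, download_plan and download_all are made up for illustration. HF_ENDPOINT is set before the import because, as far as I know, huggingface_hub reads it when the library is first imported.

```python
import os

# Set the mirror before huggingface_hub is imported; keep any value already set.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

from pathlib import Path

# Local folder name -> Hugging Face repository, as listed in step 8.
DOWNLOADS = {
    "VibeVoice-1.5b": "microsoft/VibeVoice-1.5B",
    "VibeVoice-Large": "vibevoice/VibeVoice-7B",
    "tokenizer": "Qwen/Qwen2.5-1.5B",
}


def download_plan(models_root):
    """Return (repo_id, target folder) pairs under models/vibevoice/."""
    base = Path(models_root) / "vibevoice"
    return [(repo_id, base / folder) for folder, repo_id in DOWNLOADS.items()]


def download_all(models_root):
    """Download every file of each repository into its target folder."""
    # Imported lazily so the plan can be inspected without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id, target in download_plan(models_root):
        snapshot_download(repo_id=repo_id, local_dir=str(target))
```

The environment variable set this way only applies to the script itself; ComfyUI still needs the system-wide variable from step 10.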

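11. Restart ComfyUI and reopen the workflow. In the workflow, disable the "Load Audio" (加载音频) node. (That node exists so that users can supply their own voice sample.)

12. Run. The console output looks like this: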

Loading checkpoint shards: 100%|██████████| 3/3 [00:02<00:00, 1.39it/s]
[VibeVoice] Model loaded in 2.34 seconds
[VibeVoice] Loading VibeVoice processor...
[VibeVoice] Found Qwen tokenizer in: F:\000_comfyui_files\models\vibevoice\tokenizer
[VibeVoice] Found complete tokenizer at: F:\000_comfyui_files\models\vibevoice\tokenizer
[VibeVoice] Standard from_pretrained failed: expected str, bytes or os.PathLike object, not NoneType
[VibeVoice] Trying with allow remote files...
[VibeVoice] Processing text segment 1 (10 words)
[VibeVoice] Starting audio generation with 20 diffusion steps...
[VibeVoice] Generating audio with 20 diffusion steps...
[VibeVoice] Note: Progress bar shows max possible tokens, not actual needed (~30 estimated)
[VibeVoice] The generation will stop automatically when audio is complete
[VibeVoice] Concatenating 1 audio segments (including pauses)...
[VibeVoice] Successfully generated audio with 1 segments
[VibeVoice] Model and processor memory freed successfully
Prompt executed in 21.59 seconds
FETCH ComfyRegistry Data [DONE]
[ComfyUI-Manager] default cache updated: https://api.comfy.org/nodes
FETCH DATA from: c:\ComfyUI\user\__manager\cache\1514988643_custom-node-list.json [DONE]
[ComfyUI-Manager] All startup tasks have been completed.
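
To compare runs, the two timing lines in the log above can be pulled out with a regular expression. A small sketch, assuming the log wording stays exactly as shown; the function name and dictionary keys are made up for illustration.

```python
import re


def parse_timings(log_text):
    """Extract model load time and total prompt time (in seconds) from a log."""
    patterns = {
        "model_load_seconds": r"Model loaded in ([\d.]+) seconds",
        "prompt_seconds": r"Prompt executed in ([\d.]+) seconds",
    }
    timings = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log_text)
        if match:
            timings[key] = float(match.group(1))
    return timings


sample = """[VibeVoice] Model loaded in 2.34 seconds
Prompt executed in 21.59 seconds"""
print(parse_timings(sample))
# {'model_load_seconds': 2.34, 'prompt_seconds': 21.59}
```

In this run, loading the model accounts for only a small part of the total time; most of it goes to generation.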

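Usage:

1. An MP3 voice reference; optional.

2. seed: determines the voice timbre.

diffusion_steps: smaller values give more converged output; larger values give more divergent output (background music and the like start to appear). Default: 20.

cfg_scale: the larger the value, the more static the background sound becomes (it turns into noise); the smaller the value, the more background music and singing appear. Default: 1.35.

A combination that seemed reasonably good in my tests: steps 15, seed 4, cfg_scale 1.45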
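
Since the seed has to be found by trial and error, it can help to write down the combinations to try before typing them into the node by hand. A minimal sketch: it only enumerates settings and does not talk to ComfyUI, and the function name and the value ranges are arbitrary choices for illustration.

```python
from itertools import product

# The combination that seemed reasonably good in the tests above.
FAVOURITE = {"seed": 4, "diffusion_steps": 15, "cfg_scale": 1.45}


def candidate_settings(seeds, steps_values, cfg_values):
    """Return every combination of seed, diffusion_steps and cfg_scale."""
    return [
        {"seed": seed, "diffusion_steps": steps, "cfg_scale": cfg}
        for seed, steps, cfg in product(seeds, steps_values, cfg_values)
    ]


grid = candidate_settings(
    seeds=range(1, 6),
    steps_values=[15, 20],
    cfg_values=[1.35, 1.45],
)
print(len(grid))  # 5 seeds * 2 step values * 2 cfg values = 20
```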

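In the end I decided not to use it:

1. It is awkward to use: the speaker's gender cannot be chosen, so the only option is to change the seed and hope for the best.

2. The background sound cannot be removed.

3. The quality is not high; pauses appear in odd places in the middle of sentences.

4. It cannot convey emotion; the result is a very mechanical reading.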
