[AI] Notes on Using Whisper, a Speech-to-Subtitle Model

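Preface

Recently, a group member recommended a plugin called "小电视空降助手", which was later integrated into "bilibili-linux".

While using it, I clearly noticed that for new or niche videos usually nobody has submitted segments yet, so the sponsored parts cannot be skipped.

So I wondered whether an AI model could do the detection instead.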

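Approach

Start with the blunt approach: have an AI look at the video content and pick out the ads.

A video consists of picture and audio. Detecting ads from the picture does not seem practical; even if it can be done, it would probably be expensive.

So go through the audio instead: feed the speech to an AI, have it identify the time points of the ads, and then skip them automatically during playback.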
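
The playback side of that idea can be sketched as follows. This is a minimal illustration, not the code used by the plugin or by bilibili-linux: `AdSegment` and `skip_target` are hypothetical names, and the sketch assumes the detected ads arrive as start/end pairs in seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdSegment:
    """A sponsored span in seconds (hypothetical structure)."""
    start: float
    end: float


def skip_target(position: float, segments: list[AdSegment]) -> Optional[float]:
    """Return the time to seek to if `position` is inside an ad, else None."""
    target = position
    # Sort by start so back-to-back or overlapping ads are skipped in one jump.
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.start <= target < seg.end:
            target = seg.end
    return target if target > position else None


ads = [AdSegment(60.0, 90.0), AdSegment(90.0, 105.0)]
print(skip_target(30.0, ads))  # not inside an ad
print(skip_target(61.5, ads))  # inside the first of two adjacent ads
```

A player would call something like `skip_target` on each time update and seek whenever it returns a value.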

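In general, AI models lean toward particular strengths: for example, Claude is good at programming, Gemini keeps track of long context well...

So ad detection has to be split into two steps:

1. Use a speech model to convert the audio into subtitles;
2. Use a chat model to find the ads in the subtitles, with a prompt that makes the AI return a specific format.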
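
Step 2 might look roughly like the sketch below. The prompt wording, the JSON reply shape (a list of objects with `start` and `end`), and the names `build_prompt` and `parse_ad_segments` are all assumptions made for illustration; the actual prompt and format are not given here. No model is called in this sketch, it only prepares the request text and parses a reply.

```python
import json
import re


def build_prompt(subtitles: list[tuple[float, float, str]]) -> str:
    """Render timestamped subtitle lines and ask for a JSON-only reply."""
    lines = [f"[{start:.2f} -> {end:.2f}] {text}" for start, end, text in subtitles]
    return (
        "The following are timestamped subtitles of a video. "
        "Find the sponsored (advertisement) segments and reply with JSON only, "
        'in the form [{"start": <seconds>, "end": <seconds>}]. '
        "Reply with [] if there are none.\n\n" + "\n".join(lines)
    )


def parse_ad_segments(reply: str) -> list[tuple[float, float]]:
    """Extract (start, end) pairs from a reply, tolerating surrounding text."""
    match = re.search(r"\[.*\]", reply, re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    segments = []
    for item in items:
        if not isinstance(item, dict):
            continue
        try:
            start, end = float(item["start"]), float(item["end"])
        except (KeyError, TypeError, ValueError):
            continue
        # Drop entries whose end does not come after their start.
        if end > start:
            segments.append((start, end))
    return segments


reply = 'Sure:\n[{"start": 60, "end": 90.5}, {"start": 10, "end": 5}]'
print(parse_ad_segments(reply))
```

Parsing defensively matters because a chat model may wrap the JSON in extra text or return malformed entries even when asked not to.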

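Those familiar with Bilibili may quickly notice that the first step can sometimes be skipped: some videos already come with official AI-generated subtitles, so they need no model transcription.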

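Practice

The second step was very simple: on a group member's recommendation, I used Zhipu's free GLM-4.5-Flash model. The speed is acceptable and the results are good.

For the first step, APIs generally charge a fee. I later found a web page offering a free trial (it authenticates with a random device ID generated in JS, so the limit is easy to get around).

However, the recognition quality of that free service was not very good, with sentence segmentation problems, so I then tried a local model, whisper-large-v3-turbo.

But it kept raising an error, and setting the return_timestamps parameter did not help: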

'You have passed more than 3000 mel input features (> 30 seconds) which automatically enables long-form generation which requires the model to predict timestamp tokens. Please either pass `return_timestamps=True` or make sure to pass no more than 3000 mel input features.', 'warnings': ['There was an inference error: You have passed more than 3000 mel input features (> 30 seconds) which automatically enables long-form generation which requires the model to predict timestamp tokens. Please either pass `return_timestamps=True` or make sure to pass no more than 3000 mel input features.
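
For context on the numbers in that message: Whisper works on 16 kHz audio and produces one mel frame per 160 samples, so 30 seconds corresponds to 3000 frames. The sketch below only does that arithmetic and shows how a longer file could be cut into windows that stay within the limit. `mel_frames` and `split_windows` are hypothetical helper names, not part of any library, and this is not a description of what any Whisper implementation does internally.

```python
SAMPLE_RATE = 16_000  # Hz, Whisper's input sampling rate
HOP_LENGTH = 160      # samples between consecutive mel frames
MAX_FRAMES = 3000     # limit quoted in the error message


def mel_frames(duration_s: float) -> int:
    """Number of mel frames for audio of the given duration."""
    return int(duration_s * SAMPLE_RATE) // HOP_LENGTH


def split_windows(duration_s: float, window_s: float = 30.0) -> list[tuple[float, float]]:
    """Cut [0, duration_s) into consecutive windows of at most window_s seconds."""
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


print(mel_frames(30.0))
print(split_windows(75.0))
```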

Finally, after some searching, I found the faster_whisper library; calling it directly was enough to process the file.

#!/usr/bin/env python3
"""Transcribe an audio file with faster_whisper and print timestamped segments."""
import sys

import torch  # used only to detect whether CUDA is available
from faster_whisper import WhisperModel

MODEL_SIZE = "large-v3-turbo"


def load_model() -> WhisperModel:
    # FP16 on GPU, INT8 on CPU.
    # Alternative on GPU: compute_type="int8_float16".
    device = "cuda" if torch.cuda.is_available() else "cpu"
    compute_type = "float16" if device == "cuda" else "int8"
    print("Using device: %s, compute type: %s" % (device, compute_type))
    return WhisperModel(MODEL_SIZE, device=device, compute_type=compute_type)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python transcribe.py <audio file path>")
        print("Example: python transcribe.py audio.wav")
        sys.exit(1)

    audio_file = sys.argv[1]

    # Load the model only after the arguments have been validated.
    model = load_model()
    segments, info = model.transcribe(audio_file, beam_size=5)

    print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

    # `segments` is a generator; transcription runs as it is iterated.
    for segment in segments:
        print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
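
Since the goal is subtitles, the printed segments could also be written out in SRT form. The following is a minimal sketch, not part of the script above: `format_timestamp` and `to_srt` are hypothetical helpers, and they only assume objects that expose `start`, `end` and `text`, as the segments in the loop above do.

```python
from types import SimpleNamespace


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from objects exposing .start, .end and .text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Stand-in segments so the sketch runs without a model.
demo = [
    SimpleNamespace(start=0.0, end=2.5, text=" hello"),
    SimpleNamespace(start=3661.5, end=3663.0, text=" world"),
]
print(to_srt(demo))
```

In the script, the `segments` generator returned by `model.transcribe` could be passed to `to_srt` in place of `demo`.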

Closing

This has been integrated into the bilibili-linux project: https://github.com/msojocs/bilibili-linux/blob/master/res/scripts/transcribe.py
