Speech-to-Text, Text-to-Speech: OpenAI Releases 2 New Sets of Models and 1 New Website

2025-03-21 03:16:00　Source: 賽博禪心

At around 1 a.m., OpenAI unexpectedly shipped three things:

Speech-to-text (STT) models
A text-to-speech (TTS) model
A demo website: OpenAI.fm

The conclusion up front:

A modest release, practical tools, and a nice playground.

I'll go through the rest one item at a time.

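Speech-to-text (STT) models

There are two models: gpt-4o-transcribe and gpt-4o-mini-transcribe. Compared with the earlier Whisper, they cost the same or less and perform better, especially with accents, background noise, and varying speech rates.

First, the price comparison: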

Whisper: ~ $0.006/min
gpt-4o-transcribe: ~ $0.006/min
gpt-4o-mini-transcribe: ~ $0.003/min
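
To make the per-minute prices concrete, here is a minimal sketch that estimates the bill for a given amount of audio. The prices are the approximate figures listed above; the helper name estimate_cost and the 90-minute example are only illustrative, and "whisper-1" is used as the key for Whisper because that is the model id in the API calls.

```python
# Approximate per-minute prices in USD, copied from the list above.
PRICE_PER_MINUTE = {
    "whisper-1": 0.006,
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
}


def estimate_cost(model: str, minutes: float) -> float:
    """Return the estimated transcription cost in USD (hypothetical helper)."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return PRICE_PER_MINUTE[model] * minutes


if __name__ == "__main__":
    # Compare the three models on the same 90 minutes of audio.
    for name in PRICE_PER_MINUTE:
        print(f"{name}: ${estimate_cost(name, 90):.2f} for 90 minutes")
```

At these rates the mini model costs half as much as the other two for the same audio.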

Next, the error-rate comparison (lower is better):

Compared with OpenAI's own Whisper (chart)

Compared with competing models (chart)

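The API has two endpoints: transcriptions and translations. The new models work with transcriptions; translations, as far as OpenAI's documentation indicates, still accepts only whisper-1. The first endpoint does plain transcription, and a basic call looks like this: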

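    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Open the audio in binary mode and send it for transcription.
    with open("/path/to/file/audio.mp3", "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",  # or "gpt-4o-transcribe" / "gpt-4o-mini-transcribe"
            file=audio_file,
        )

    print(transcription.text)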

The second endpoint transcribes and translates in one step (into English only). A call looks roughly like this:

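    from openai import OpenAI

    client = OpenAI()

    # The translations endpoint transcribes the audio and translates it into English.
    with open("/path/to/file/speech.mp3", "rb") as audio_file:
        translation = client.audio.translations.create(
            model="whisper-1",
            file=audio_file,
            response_format="text",
        )

    # With response_format="text" the SDK returns a plain string.
    print(translation)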

The rest are updates to the API parameters:

Timestamps: setting the timestamp_granularities parameter returns JSON output with timestamps, at either segment or word level.
Streaming transcriptions: with stream=True, you receive a transcript.text.delta event as soon as the model finishes transcribing each piece of audio, followed by a final transcript.text.done event containing the full transcript.
Realtime API: for ongoing audio streams (such as live meetings or voice input), you can send audio over a WebSocket connection and receive transcription events in real time.
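
The streaming flow described above can be sketched as a small event loop. This is a minimal illustration, not OpenAI's reference code: collect_transcript is a hypothetical helper, and it assumes each event exposes a type attribute, that delta events carry the new text in delta, and that the done event carries the complete text in text. Fake events stand in for a real stream so the sketch runs without an API key.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_transcript(events: Iterable) -> str:
    """Assemble a transcript from streaming events (hypothetical helper).

    Assumes delta events carry new text in `delta` and the done event
    carries the complete transcript in `text`.
    """
    parts = []
    for event in events:
        if event.type == "transcript.text.delta":
            parts.append(event.delta)
            print(event.delta, end="", flush=True)  # show partial text live
        elif event.type == "transcript.text.done":
            print()
            return event.text  # the final event holds the full transcript
    # The stream ended without a done event: fall back to the joined deltas.
    return "".join(parts)


if __name__ == "__main__":
    # Fake events standing in for a real API stream.
    fake_stream = [
        SimpleNamespace(type="transcript.text.delta", delta="Hello, "),
        SimpleNamespace(type="transcript.text.delta", delta="world."),
        SimpleNamespace(type="transcript.text.done", text="Hello, world."),
    ]
    print(collect_transcript(fake_stream))
```

With the real SDK, the events would presumably come from client.audio.transcriptions.create(model="gpt-4o-mini-transcribe", file=audio_file, stream=True), assuming a recent version of the openai package.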

Full documentation:

https://platform.openai.com/docs/guides/speech-to-text

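Text-to-speech (TTS) model

The model is called gpt-4o-mini-tts, and it is a highly controllable TTS:

You can specify what to say, e.g. "I am an individual trainee who has been training for two and a half years"
You can specify the speaking style, e.g. "in a coquettish tone"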

Chinese sample (audio)

English sample (audio)

Personally, I don't find the results very good (though you can re-roll to get a different voice);

On length, it supports input of up to 2,000 tokens;

On price, it costs $0.015/min. Sample code:

    import asyncio

    from openai import AsyncOpenAI
    from openai.helpers import LocalAudioPlayer

    openai = AsyncOpenAI()

    # Text to speak (the Chinese sample).
    input = """大家好，我是練習時長兩年半的個人練習生，你坤坤，喜歡唱、跳、Rap和籃球，music~\n\n在今后的節目中，有我很多作詞，作曲，編舞的原創作品，期待的話多多投票吧！"""

    # Style prompt: "in a coquettish tone, with a cute girlish voice".
    instructions = """用嬌滴滴的語氣，蘿莉音"""


    async def main() -> None:
        # Stream raw PCM audio and play it as it arrives.
        async with openai.audio.speech.with_streaming_response.create(
            model="gpt-4o-mini-tts",
            voice="alloy",
            input=input,
            instructions=instructions,
            response_format="pcm",
        ) as response:
            await LocalAudioPlayer().play(response)


    if __name__ == "__main__":
        asyncio.run(main())
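
The sample above plays the audio directly. If you would rather save it, the raw PCM bytes need a container first. Below is a minimal sketch using only the standard library; it assumes the pcm output is 24 kHz, 16-bit, mono (an assumption worth checking against the documentation), and pcm_to_wav and duration_seconds are hypothetical helpers, not part of the OpenAI SDK.

```python
import io
import wave

# Assumed format of the "pcm" output: 24 kHz, 16-bit, mono.
SAMPLE_RATE = 24_000
SAMPLE_WIDTH = 2  # bytes per sample
CHANNELS = 1


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM samples in a WAV container (hypothetical helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()


def duration_seconds(pcm: bytes) -> float:
    """Length of the audio in seconds, under the assumed format."""
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)


if __name__ == "__main__":
    silence = b"\x00\x00" * SAMPLE_RATE  # one second of silence as stand-in audio
    wav_bytes = pcm_to_wav(silence)
    print(f"{duration_seconds(silence):.1f} s of audio, {len(wav_bytes)} bytes as WAV")
```

The duration is also handy for a rough cost check, since the price above is quoted per minute of audio.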

Full documentation:

https://platform.openai.com/docs/guides/text-to-speech

New website: OpenAI.fm

This is a playground for experimenting with voices, and it's quite fun.

You can also export the code with one click from the top-right corner.

Conclusion

A modest release with practical tools:

STT is very practical; Whisper can be retired
TTS quality is mediocre; not recommended
The playground is well designed, and code export is very convenient
