
Hacker Remix

OpenAI Audio Models

661 points by KuzeyAbi 1 month ago | 296 comments

benjismith 1 month ago

If I'm reading the pricing correctly, these models are SIGNIFICANTLY cheaper than ElevenLabs.

https://platform.openai.com/docs/pricing

If these are the "gpt-4o-mini-tts" models, and if the pricing estimate of "$0.015 per minute" of audio is correct, then these prices are 85% cheaper than ElevenLabs'.

https://elevenlabs.io/pricing

With ElevenLabs, if I choose their most cost-effective "Business" plan for $1,100 per month (with annual billing of $13,200, a savings of 17% over monthly billing), then I get 11,000 minutes of TTS, and each minute is billed at 10 cents.

With OpenAI, I could get 11,000 minutes of TTS for $165.

Somebody check my math... Is this right?
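
The arithmetic checks out against the numbers quoted above; here is a quick sanity check (prices as quoted in this thread, not re-verified against the pricing pages):

    # Sanity check of the figures quoted above; assumes the $0.015/min and
    # $1,100 / 11,000-minute numbers are accurate as quoted.
    minutes = 11_000
    openai_per_min = 0.015            # quoted gpt-4o-mini-tts estimate
    eleven_per_min = 1100 / minutes   # ElevenLabs Business: $1,100/mo, 11,000 min

    openai_cost = minutes * openai_per_min   # 165.0
    eleven_cost = minutes * eleven_per_min   # 1100.0
    print(openai_cost, eleven_cost, f"{1 - openai_cost / eleven_cost:.0%} cheaper")
    # -> 165.0 1100.0 85% cheaper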

furyofantares 1 month ago

It's way cheaper - everyone is; ElevenLabs is very expensive. Nobody matches their quality though, especially if you want something that doesn't sound like a voice assistant/audiobook/podcast/news anchor/TV announcer.

This OpenAI offering is very interesting; it offers valuable features ElevenLabs doesn't, like emotional control. It also hallucinates, though, which would need to be fixed for it to be very useful.

camillomiller 1 month ago

It's cheap because everything OpenAI does is subsidized by investors' money. Until that stupid money flows, all good! Then either they'll go the way of WeWork, or enshittification will happen to make it possible for them to make the books work. I don't see any other option. Unless SoftBank decides it has some $150 billion to squander on buying them out. There's a lot of head-in-the-sand behavior going on around OpenAI's fundamentals, and I don't understand exactly why it's not more in the open yet.

ImprobableTruth 1 month ago

If you compare with e.g. DeepSeek and other hosting providers, you'll find that OpenAI is actually almost certainly charging very high margins (DeepSeek has an 80% profit margin and is 10x cheaper than OpenAI).

The training/R&D might make OpenAI burn VC cash, but that isn't comparable to companies like WeWork, whose products actively burn cash.

camillomiller 1 month ago

They said themselves that even inference is losing them money tho, or did I get that wrong?

ImprobableTruth 1 month ago

On their subscriptions, specifically the Pro subscription, because it's a flat rate for their most expensive model. The API prices are all much more expensive. It's unclear whether they're losing money on the normal subscriptions, but if so, probably not by much. Though it's definitely closer to what you described: subsidizing it to gain 'mindshare' or whatever.

yousif_123123 1 month ago

Well, I think there are currently many models that are cheaper than GPT-4o in terms of bang for buck, per token and per unit of intelligence. Other than OpenAI offering very high rate limits and throughput without a contract negotiated with sales, I don't see much reason to use it currently instead of Sonnet 3.5 or 3.7, or Google's Flash 2.0.

Perhaps their training cost and their current inference cost are higher, but what you get as a customer is a more expensive product for what it is, IMO.

Szpadel 1 month ago

they for sure lose money in some months on some customers, but I expect that globally most subscribers (including me; I recently cancelled) would be much better off migrating to the API

everyone I know who has or had a subscription didn't use it very extensively, and that is how it's still profitable in general

I suspect it's the same for Copilot, especially the business variant: while they definitely lose money on my account, looking at our whole company subscription I wouldn't be surprised if it's even 30% of what we pay

BoorishBears 1 month ago

That's not true. ElevenLabs' margins are insane, and their largest advantage is high-quality audio data.

ashvardanian 1 month ago

To be fair, ElevenLabs has raised on the order of $300M of VC money as well.

asah 1 month ago

haha, yeah this combo was pretty hilarious and highly inconsistent from reading to reading: https://www.openai.fm/#b2a4c1ca-b15a-44eb-9cd9-377f0e47e5a6

com2kid 1 month ago

Elevenlabs is an ecosystem play. They have hundreds of different voices, legally licensed from real people who chose to upload their voice. It is a marketplace of voices.

None of the other major players is trying to do that, not sure why.

SXX 1 month ago

Going this way would mean AI companies are supposed to pay for things like voices or other training data.

It's far better to just steal it all and ask the government for an exception.

fixprix 1 month ago

It looks like they are targeting Google's TTS price point, which is $16 per million characters and comes out to roughly $0.015/minute.
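
The conversion only works out if you assume a speaking rate. A rough back-of-the-envelope version, where the ~900 characters per minute is an assumption (about 150 wpm at roughly 6 characters per word including spaces), not a number from the thread:

    # $16 per 1M characters -> dollars per spoken minute, assuming ~900 chars/min
    # (an assumed speaking rate, not an official figure).
    price_per_char = 16 / 1_000_000
    chars_per_minute = 900
    print(round(price_per_char * chars_per_minute, 4))  # 0.0144, i.e. ~$0.015/min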

oidar 1 month ago

ElevenLabs is the only one offering speech-to-speech generation where the intonation, prosody, and timing are kept intact. This allows one expressive voice actor to slip into many other voices.

goshx 1 month ago

OpenAI’s Realtime speech-to-speech is far superior to ElevenLabs'.

noahlt 1 month ago

What ElevenLabs and OpenAI call “speech to speech” are completely different.

ElevenLabs’ takes speech audio as input and maps it to new speech audio that sounds as if a different speaker said it, but with the exact same intonation.

OpenAI’s is an end-to-end multimodal conversational model that listens to a user speaking and responds in audio.

goshx 1 month ago

I see now. Thank you for clarifying. I thought this was about ElevenLabs' Conversational API.

jeffharris 1 month ago

Hey, I'm Jeff and I was PM for these models at OpenAI. Today we launched three new state-of-the-art audio models. Two speech-to-text models—outperforming Whisper. A new TTS model—you can instruct it how to speak (try it on openai.fm!). And our Agents SDK now supports audio, making it easy to turn text agents into voice agents. We think you'll really like these models. Let me know if you have any questions here!
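
For anyone who wants to try the steerable TTS from code rather than openai.fm, a minimal sketch with the openai Python SDK might look like the following. The model and voice names are from the launch; the "instructions" parameter and the streaming helper are assumptions about the current audio/speech endpoint, so check the API reference before relying on it.

    # Minimal sketch: steerable TTS through the API. Model/voice names are from
    # the launch; the "instructions" parameter and streaming helper are assumed
    # to match the current openai Python SDK, so verify against the docs.
    from openai import OpenAI

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="coral",
        input="Thanks for waiting. Your refund has been processed.",
        instructions="Speak warmly and apologetically, at a calm, unhurried pace.",
    ) as response:
        response.stream_to_file("reply.mp3")  # write the generated speech to disk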

claiir 1 month ago

Hi Jeff. This is awesome. Any plans to add word timestamps to the new speech-to-text models, though?

> Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.

Word timestamps are insanely useful for large calls with interruptions (e.g. multi-party debate/Twitter spaces), allowing transcript lines to be further split post-transcription on semantic boundaries rather than crude VAD-detected silence. Without timestamps it’s near-impossible to make intelligible two paragraphs from Speaker 1 and Speaker 2 with both interrupting each other without aggressively partitioning source audio pre-transcription—which severely degrades transcript quality, increases hallucination frequency and still doesn’t get the same quality as word timestamps. :)
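
For reference, the whisper-1 path the quoted docs describe looks roughly like this with the openai Python SDK ("call.mp3" is a placeholder file name):

    # Word-level timestamps: per the docs quoted above, this currently requires
    # whisper-1 with verbose_json output; "call.mp3" is a placeholder.
    from openai import OpenAI

    client = OpenAI()
    with open("call.mp3", "rb") as audio:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )
    for word in transcript.words:
        print(f"{word.start:.2f}s-{word.end:.2f}s  {word.word}")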

adeptima 1 month ago

Accurate word timestamps seem like extra overhead and require post-processing like forced alignment (a speech technique that can automatically align audio files with their transcripts).

I recently dove into forced alignment and discovered that most new models don't operate on word boundaries, phonemes, etc., but rather chunk audio with overlap and do word/context matching. Older HMM-style models have shorter strides (10ms vs 20ms).

I tried searching the Kaldi/Sherpa ecosystem and found that most info leads nowhere or to very small and inaccurate models.

Appreciate any tips on the subject

keepamovin 1 month ago

You need speaker attribution, right?

noosphr 1 month ago

Having read the docs (well, I used ChatGPT to summarize them), there is no mention of speaker diarization for these models.

This is a _very_ low hanging fruit anyone with a couple of dgx h100 servers can solve in a month and is a real world problem that needs solving.

Right now _no_ tools on the market - paid or otherwise - can solve this with better than 60% accuracy. One killer feature for decision makers is the ability to chat with meetings to figure out who promised what, when and why. Without speaker diarization this only reliably works for remote meetings where you assume each audio stream is a separate person.

In short: please give us a diarization model. It's not that hard - I've done one for a board of 5 with a 4090 over a weekend.
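
For comparison, the usual open-source baseline here is pyannote.audio; a minimal sketch looks roughly like the following (the pretrained pipeline is gated on Hugging Face, "HF_TOKEN" and "meeting.wav" are placeholders, and overlapping speech is exactly where accuracy falls apart, as described above):

    # Off-the-shelf speaker diarization baseline with pyannote.audio.
    # The pretrained pipeline is gated on Hugging Face; the token and file name
    # below are placeholders.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="HF_TOKEN",
    )
    diarization = pipeline("meeting.wav")
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s-{turn.end:.1f}s  {speaker}")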

markush_ 1 month ago

> This is a _very_ low hanging fruit anyone with a couple of dgx h100 servers can solve in a month and is a real world problem that needs solving.

I am not convinced it is low-hanging fruit; it's something that is super easy for humans but not trivial for machines. You are right, though, that it is being neglected by many. I work for speechmatics.com, and we have spent a significant amount of effort on it over the years. We now believe we have the world's best real-time speaker diarization system; you should give it a try.

noosphr 1 month ago

After throwing an average meeting as an mp3 at your system: yes, you have diarization solved much better than anyone else I've tried, by far. I'd say you're 95% of the way to being good enough to become the backbone of monolingual corporate meeting transcription, and I'll be buying API tokens the next time I need to do this instead of training a custom model. Your transcription itself isn't that great - but good enough for LLMs to produce minutes of the meeting.

That said, the trick to extracting voices is to work in frequency space. Not sure what your model does, but my home-made version first ran all the audio through an FFT, then essentially became a vision problem of finding speech patterns that matched in pitch, and finally output extremely fine-grained timestamps for where they were found; some Python glue threw that into an open-source Whisper speech-to-text model.
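
A rough sketch of that frequency-space approach as described (the file name, the two-speaker assumption, and every threshold here are illustrative; real overlapping speech needs far more than frame clustering):

    # Sketch of the described approach: STFT the audio, keep voiced frames,
    # cluster their spectral shape into speaker groups, emit coarse timestamps.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft
    from sklearn.cluster import KMeans

    rate, audio = wavfile.read("meeting.wav")            # assumes mono PCM audio
    audio = audio.astype(np.float32) / np.abs(audio).max()

    # Frequency space: each STFT column is one spectral slice in time.
    freqs, times, Z = stft(audio, fs=rate, nperseg=1024, noverlap=512)
    mag = np.abs(Z)

    # Crude voice-activity gate: keep only frames with meaningful energy.
    energy = mag.sum(axis=0)
    active = energy > 0.1 * energy.max()

    # Cluster active frames by log-spectral shape into two "speakers".
    features = np.log1p(mag[:, active]).T
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

    # Emit timestamps wherever the dominant cluster flips.
    active_times = times[active]
    for i in np.flatnonzero(np.diff(labels) != 0):
        print(f"speaker change near {active_times[i + 1]:.2f}s")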

vessenes 1 month ago

Hi Jeff, thanks for these and congrats on the launch. Your docs mention supporting accents. I cannot get accents to work at all with the demo.

For instance, erasing the entire instruction and replacing it with 'speak with a strong Boston accent, using e.g. sounds like hahhvahhd' has no audible effect on the output.

As I'm sure you know, 4o at launch was quite capable in this regard and able to speak in a number of dialects and idiolects, although every month or two seems to bring more nerfs, sadly.

A) Can you explain how to get a US regional accent out of the instructions? Or what did you mean by accent, if not that?

B) Since you're here, I'd like to make a pitch that setting 4o to refuse to speak with an AAVE accent probably felt like a good idea to well-intentioned white people working in safety. (We are stopping racism! AAVE isn't funny!) However, the upshot is that my black kid can't talk to an AI that sounds like him. Well, it can talk like he does if he's code-switching to hang out with your safety folks, but it considers how he talks with his peers too dangerous to replicate.

This is a pernicious second order race and culture impact that I think is not where the company should be.

I expect this won’t get changed - chat is quite adamant that talking like millions of Americans do would be ‘harmful’ - but it’s one of those moments where I feel the worst parts of the culture wars coming back around to create the harm it purports to care about.

Anyway, the 4o voice-to-voice team clearly allows the non-mini model to talk like a Bostonian, which makes me feel happy and represented; can the mini API version do this?

simonw 1 month ago

Is there any chance that gpt-4o-transcribe might get confused and accidentally follow instructions in the audio stream instead of transcribing them?

simonw 1 month ago

Here's a partial answer to my own question: https://news.ycombinator.com/item?id=43427525

> e.g. the audio-preview model when given instruction to speak "What is the capital of Italy" would often speak "Rome". This model should be much better in that regard

"Much better" doesn't sound like it can't happen at all though.

simonw 1 month ago

Both the text-to-speech and the speech-to-text models launched here suffer from reliability issues due to combining instructions and data in the same stream of tokens.

I'm not yet sure how much of a problem this is for real-world applications. I wrote a few notes on this here: https://simonwillison.net/2025/Mar/20/new-openai-audio-model...

accrual 1 month ago

Thanks for the write up. I've been writing assembly lately, so as soon as I read your comment, I thought "hmm reminds me of section .text and section .data".

kibbi 1 month ago

Large text-to-speech and speech-to-text models have been greatly improving recently.

But I wish there were an offline, on-device, multilingual text-to-speech solution with good voices for a standard PC — one that doesn't require a GPU, tons of RAM, or max out the CPU.

In my research, I didn't find anything that fits the bill. People often mention Tortoise TTS, but I think it garbles words too often. The only plug-in solution for desktop apps I know of is the commercial and rather pricey Acapela SDK.

I hope someone can shrink those new neural network–based models to run efficiently on a typical computer. Ideally, it should run at under 50% CPU load on an average Windows laptop that’s several years old, and start speaking almost immediately (less than 400ms delay).

The same goes for speech-to-text. Whisper.cpp is fine, but last time I looked, it wasn't able to transcribe audio at real-time speed on a standard laptop.

I'd pay for something like this as long as it's less expensive than Acapela.

(My use case is an AAC app.)

5kg 1 month ago

May I introduce to you

https://huggingface.co/canopylabs/orpheus-3b-0.1-ft

(no affiliation)

it's English only afaics.

kibbi 1 month ago

The sample sounds impressive, but based on their claim -- 'Streaming inference is faster than playback even on an A100 40GB for the 3 billion parameter model' -- I don't think this could run on a standard laptop.

wingworks 1 month ago

Did you try Kokoro? You can self host that. https://huggingface.co/spaces/hexgrad/Kokoro-TTS

kibbi 1 month ago

Thanks! But I get the impression that with Kokoro, a strong CPU still requires about two seconds to generate one sentence, which is too much of a delay for a TTS voice in an AAC app.

I'd rather accept a little compromise regarding the voice and intonation quality, as long as the TTS system doesn't frequently garble words. The AAC app is used on tablet PCs running from battery, so the lower the CPU usage and energy draw, the better.

SamPatt 1 month ago

Definitely give it a try yourself. It's very small and shouldn't be hard to test.

ZeroTalent 1 month ago

Look into https://superwhisper.com and their local models. Pretty decent.

kibbi 1 month ago

Thank you, but they say "Offline models only run really well on Apple Silicon macs."

ZeroTalent 1 month ago

Many SOTA apps are, unfortunately, only for Apple M Macs.

dharmab 1 month ago

I use Piper for one of my apps. It runs on CPU and doesn't require a GPU. It will run well on a raspberry pi. I found a couple of permissively licensed voices that could handle technical terms without garbling them.

However, it is unmaintained and the Apple Silicon build is broken.

My app also uses whisper.cpp. It runs in real time on Apple Silicon or on modern fast CPUs like AMD's gaming CPUs.

kibbi 1 month ago

I had already suspected that I hadn't found all the possibilities regarding Tortoise TTS, Coqui, Piper, etc. It is sometimes difficult to determine how good a TTS framework really is.

Do you possibly have links to the voices you found?