Source: OpenAI engineers earlier this month told some colleagues they had figured out a way to more than halve the cost of inference
We closely track efforts by Anthropic, Google and OpenAI to get access to more server chips to run their models. But we don't talk enough about the work …
OpenAI engineers earlier this month developed an optimization that cut inference costs in half for the models it was applied to. Applied to logged-out ChatGPT traffic, it reduced the number of GPUs needed to power that traffic to a couple hundred.