2025-07-18
these results were eye-opening for me... chatgpt agent performed better than i expected on some pretty realistic investment banking tasks
The Verge
OpenAI debuts ChatGPT Agent, which can control an entire computer and perform multi-step tasks, powered by a new dedicated model, rolling out to paid users
One employee uses it to automate his weekly parking requests at OpenAI's San Francisco office.
2025-06-19
new method to address and mitigate emergent misalignment in language models: we show activation monitoring and evals can help catch emergent misalignment early. then, we can re-align models via steering and training. surprisingly, re-aligning models is more data-efficient than
Axios
OpenAI warns that its upcoming models could pose a higher risk of helping create bioweapons and is partnering to build diagnostics, countermeasures, and testing
OpenAI cautioned Wednesday that upcoming models will head into a higher level of risk when it comes to the creation of biological weapons …
TechCrunch
OpenAI details why “emergent misalignment”, where training models on wrong answers in one area can lead to issues in many others, happens and how to mitigate it
Maxwell Zeff / TechCrunch
2025-02-19
Introducing SWE-Lancer: our most realistic coding benchmark to date. $1M in real-world, full-stack freelance SWE tasks, each taking freelancers >21 days to complete on avg. Still some limitations, but better than evals we had before. Congrats @samuelp1002 @michelelwang!
VentureBeat
OpenAI researchers build the SWE-Lancer benchmark and find that real-world freelance software engineering work remains challenging for frontier language models
Large language models (LLMs) may have changed software development, but enterprises will need to think twice …
2024-07-11
latest from preparedness: we're developing wet lab biology evals with @LosAlamosNatLab and @nickgenerous keen to learn how gpt-4o's vision and voice capabilities can assist scientists with real-world lab tasks (e.g., troubleshooting cell culture growth)
Bloomberg
OpenAI and Los Alamos National Laboratory announce a partnership to evaluate how multimodal AI models can be used safely by scientists in laboratory settings
Evan Gorelick / Bloomberg
2024-04-02
https://chatgpt.com/ with no auth! improves model accessibility, so more of the world can grapple with the implications of AI
TechCrunch
OpenAI no longer requires an account to use ChatGPT, but with “slightly more restrictive content policies”, starting in a few markets and rolling out globally
OpenAI is making its flagship conversational AI accessible to everyone, even people who haven't bothered making an account.
2024-02-01
latest from preparedness @ openai: gpt4 at most mildly helps with biothreat creation. method: get bio PhDs in a secure monitored facility. half try biothreat creation w/ (experimental) unsafe gpt4. other half can only use the internet. so far, gpt4 ≈ internet... but we'll...
Bloomberg
OpenAI says GPT-4 poses “at most” a slight risk of helping people create biological threats, per the company's early tests to evaluate “catastrophic” LLM risks
Michael Nuñez / VentureBeat: OpenAI study reveals surprising role of AI in future biological threat creation