paul_cal · TEXXR

TRM etc. won't have much industry impact because the primary benefit of LLMs is not predictive (or task) accuracy, it's that you program them in English & don't need your own clean datasets (or ML engineers)

2025-10-09 View on X

VentureBeat

Samsung introduces the Tiny Recursion Model, a 7M-parameter model that can outperform LLMs 10,000x larger, like Gemini 2.5 Pro and o3-mini, on specific problems

The trend of AI researchers developing new, small open source generative models that outperform far larger …

View original

1,400+ freelancer tasks from Upwork. SWE-Lancer I see ~240 of them released as an open source eval set on github now, including tests, MIT license ("SWE-Lancer Diamond") [image]

2025-02-19 View on X

VentureBeat

OpenAI researchers build the SWE-Lancer benchmark and find that real-world freelance software engineering work remains challenging for frontier language models

Large language models (LLMs) may have changed software development, but enterprises will need to think twice …

View original

Grok 3 on LMSYS is not so based Concerning [image]

2025-02-18 View on X

TechCrunch

xAI launches Grok-3 beta and Grok-3 mini, its latest AI models with reasoning, trained on 200K GPUs, or “10x” more compute than Grok-2, for X Premium+ users

Elon Musk's AI company, xAI, late on Monday released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok iOS and web apps.

View original

Beating o1 w fine-tuned o1-mini via reinforcement fine-tuning! Upload examples (1), choose grading criteria, click go. See progress over passes (2), compare results against other models like o1 full (3) and dig into specific answers and passes (4) Public release early next yr [image]

2024-12-07 View on X

OpenAI

OpenAI expands its Reinforcement Fine-Tuning Research Program to let developers create expert models in specific domains with very little training data

the repo we used to train Tulu 3. Expanding reinforcement learning with verifiable rewards (RLVR) to more domains and with better answer extraction (what OpenAI calls a grader, a [...

View original

Product lead & founding engineer of NotebookLM both leaving Google to work on a new startup, just a short while after NLM grabbed everyone's attention Hard not to read this as: being startuppy is really hard inside a large corp like Google, even w right skills & public acclaim

2024-12-05 View on X

TechCrunch

Three of Google's NotebookLM team members, lead Raiza Martin, a designer, and an engineer, are leaving to launch a startup focused on “a user-first AI product”

Three members of Google's NotebookLM team, including its team lead and designer, have announced they are leaving Google for a new stealth startup.

View original

@OpenAIDevs Canvas for code rapid review - Code review suggests ideas verbally inline before code changes (then you click apply) - nice UX - No diffed view of updated code, much harder to track evolution - Select text & ask/edit is less reliable than cursor - No Artifact frontend previews [image]

2024-10-05 View on X

TechCrunch

OpenAI launches canvas, a ChatGPT interface with a workspace for writing and coding projects, similar to Anthropic's Artifacts, in beta for Plus and Team users

A new way of working with ChatGPT to write and code The image shows … Kevin Raposo / KnowTechie : OpenAI rolls out Canvas: AI sidekick for coding and writing Jorge A. Aguilar / How...

View original

@OpenAIDevs Canvas for code rapid review - Code review suggests ideas verbally inline before code changes (then you click apply) - nice UX - No diffed view of updated code, much harder to track evolution - Select text & ask/edit is less reliable than cursor - No Artifact frontend previews [image]

2024-10-04 View on X

TechCrunch

OpenAI launches canvas, a ChatGPT interface with a workspace for writing and coding projects, similar to Anthropic's Artifacts, in beta for Plus and Team users

A new way of working with ChatGPT to write and code The image shows … Jorge A. Aguilar / How-To Geek : ChatGPT Canvas Wants To Be Your Personal Writing Assistant Harsh Shivam / Bus...

View original

@AnthropicAI My biggest takeaway re: many-shot jailbreak is actually the unreasonable effectiveness of this Cautionary Warning Defense prompt: jailbreak effectiveness tanks from 61% to 2%. I've found final-message appended reminders useful for boosting instruction following generally too [image]

2024-04-03 View on X

TechCrunch

Anthropic researchers detail “many-shot jailbreaking”, which can evade LLMs' safety guardrails by priming them with dozens of harmful queries in a single prompt

How do you get an AI to answer a question it's not supposed to? There are many such “jailbreak” techniques …

View original

@krishnanrohit I think the point is: big, widely used models could easily carry hidden backdoors Highlights existing closed model risk for infosec/natsec & makes it clear that provenance is *critical* for open models. You sure that checkpoint is clean? If not, any downstream app is vulnerable

2024-01-15 View on X

TechCrunch

Anthropic researchers: AI models can be trained to deceive and the most commonly used AI safety techniques had little to no effect on the deceptive behaviors

[images] Abraham Samma / @abesamma@toolsforthought.social : Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — This is some sci-fi stuff right here (e...

View original

@krishnanrohit I think the point is: big, widely used models could easily carry hidden backdoors Highlights existing closed model risk for infosec/natsec & makes it clear that provenance is *critical* for open models. You sure that checkpoint is clean? If not, any downstream app is vulnerable

2024-01-14 View on X

TechCrunch

Anthropic researchers: AI models can be trained to deceive and the most commonly used AI safety techniques had little to no effect on the deceptive behaviors

Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems — and terrifyingly, they're exceptionally good at it.

View original