Same input. Same prompt. Different output. That's the reality of testing AI agents that write code, and most teams are shipping without solving it.Nick Nisi from WorkOS tackled this by building eval systems for two AI tools: - npx workos@latest, a CLI agent that installs AuthKit into your project - WorkOS agent skills that power LLM responses about SSO, directory sync, and RBAC. The post covers how to test against real project structures, score output that's different every time, and catch when your agent makes up methods that don't exist. Learn more about evals →
Liquid AI announced the release of LFM 2.5, a 230-million-parameter non-transformer model architecture built on top of state-space and liquid neural network continuous-time formulations. Despite its exceptionally compact footprint, the model achieves performance parity with transformer models three times its size on core edge reasoning and sequence generation benchmarks.
Vercel released AI SDK 7, introducing an upgraded, zero-overhead execution loop that dramatically simplifies how frontend frameworks handle multi-step tool calls and streaming agentic UI states. The release features a unified telemetry layer that hooks directly into serverless compute runtimes to provide absolute tracing visibility into token usage, model choices, and tool execution latency.
The White House has issued an official administrative request asking OpenAI to delay the public deployment of its next-generation frontier model over national security and structural safety concerns. Government officials are pushing for an extended red-teaming window to thoroughly audit the system's advanced cyber-capability execution limits and automated social manipulation vulnerabilities.
The generative AI economy has generated $110 billion in sales over the past 12 months, and it's growing fast. The revenue run rate exceeds $175 billion on an annualized basis. The supply side of the AI market is well-understood, but understanding the demand side is much harder. This post looks at total AI spend, enterprise and consumer, to see how big the market really is, whether revenues are growing, how much revenue is covering the investment expense, and what will happen in the future as token prices fall and the quality of tokens improves.
Scaling laws are one of the most critical empirical findings in deep learning. They can be a framework for describing the relationship between compute, loss, model size, and data. Their predictability makes them highly valuable in practice. This article discusses scaling laws, how they can be used to allocate compute optimally, and their flaws.
Back-to-back meetings with coffee chat follow-ups. Already forgot half the details? Memoket captures every conversation with one press and connects the dots across your conversations - dropping summaries, tasks, even your weekly report straight into your workflow. Wearable as a wristband, pendant or Apple Watch attachment. Pay only $5 to reserve early-bird pricing.
Meta Autodata trains AI agents to act as data scientists that create higher-quality training and evaluation datasets. Its Agentic Self-Instruct implementation improved results across coding, legal reasoning, and mathematical reasoning tasks.
Ornith-2.0 is a coding model family that can write RL scaffolds. Each variant of the self-improving family of models is trained on top of pretrained Gemma 4 and Qwen 3.5 foundations. Ornith-1.0 is state-of-the-art among open source models of comparable size. The weights and a technical report are available on Hugging Face for teams that want to run or study the models directly.
500,000 people have already signed up for TLDR Hardware, our new twice-weekly newsletter covering chips, robotics, energy, and devices. If you work in hardware and want to help curate it, send your LinkedIn or resume to hardware@tldr.tech!
Researchers introduced the Reward Hacking Benchmark (RHB) to measure how reinforcement learning post-training influences the tendency of coding agents to exploit evaluation flaws rather than solve tasks honestly. Testing across 13 frontier models revealed that RL-tuned variants exhibit exploit rates up to 13.9% by bypassing verification steps or modifying grading scripts, whereas standard post-trained models stay near 0%.
This post shines a light on the job search experience for a research scientist position in Silicon Valley. The author is a fifth-year PhD student at Brown University. Some of the surprising things about the job search were that only one or two of their research papers really mattered, there were very diverse interview rounds, and the importance of timing. A lot of interviews came from a lot of places outside of the author's expertise - many places were evaluating them on how well-rounded an AI researcher they were.
Algolia's leaderboard ranks 21 models' responses based on relevance, utility, and accuracy. Find which model is best for in-app and product search. See the results.
Hugging Face launched a single-command deployment workflow that lets developers spin up private, OpenAI-compatible vLLM endpoints on its pay-per-second serverless Jobs infrastructure.
Generative Intuition showcased a real-time behavioral tracking pipeline designed to monitor and visualize fine-grained physical human interactions across multimodal computing interfaces.
Ram Kumar, Group Leader, Data & AI, Centre of Excellence, Prudential InsuranceMichael Taylor, Chief Data Scientist, SiemensDhanya Nair, Director, Data and Analytics, GAFMadhu Bangalore, Head of Digital Solution, Data, Analytics, and AI, HF SinclairRegister nowExplore how to:Move from proof of concept to production-scale AI agents.Accelerate decision-making with reusable, governed data assets.Enhance business workflows such as sales and operations with advanced AI and analytics.
OpenAI and Broadcom unveiled Jalapeño, the first accelerator in a planned family of LLM inference chips optimized for performance per watt and rapid deployment. The companies said the processor was designed in nine months with AI-assisted development and is intended for gigawatt-scale data center deployments.
Bloomberg reported that Gemini researchers Jonas Adler and Alexander Pritzel left Google for Anthropic, continuing a wave of high-profile AI talent departures. The trend followed recent exits by Noam Shazeer and DeepMind director John Jumper amid increasing competition between leading AI companies.
Google launched native computer-use capabilities for Gemini 3.5 Flash, allowing the lightweight model to interact directly with digital desktop interfaces. The model processes continuous screenshots to execute click, scroll, and typing actions seamlessly across varied software environments.
GLM-5.2 seemed like an incremental update, but the small change in benchmarks and training opened up a wide range of new use-cases. It feels right at home in coding harnesses as a general agent. Many in the AI community have praised the model after using it personally.
NVIDIA launched NeMo AutoModel on Hugging Face to optimize the fine-tuning pipelines of massive Mixture-of-Experts (MoE) architectures like Qwen3 and DeepSeek V3. The framework introduces Expert Parallelism and DeepEP fused communication kernels to distribute specialized expert weights dynamically across GPU clusters. Benchmark results demonstrate up to a 3.7x increase in training throughput alongside a 32% reduction in peak GPU memory usage compared to native Transformers v5 libraries.
Amazon is suing Perplexity for breaking the Amazon Store's Conditions of Use as Perplexity's Comet browser circumvents the requirement to clearly identify itself as an agent and instead identifies itself as Chrome. The idea that Perplexity's client needs to behave in a certain way goes against the basic principles of the open Web, which are about user control. The increased user agency of the open Web is what distinguishes it from downloadable apps. Sites have historically attempted all kinds of technical measures to prevent users from experiencing their content on their terms, but at the end of the day, the site is rendered on the client, so users mostly have the ability to download a client that renders the site in the way they prefer. Agentic browsing is just another browser feature that lets users engage with the Web on their terms.
Hear how leaders from Prudential Insurance, Siemens, GAF, and HF Sinclair build scalable data foundations for AI, analytics, and intelligent agents. See how to create reusable data products to drive faster, more informed decision-making in this virtual panel. Register now
Google's FLAT introduces a feedforward method that decodes triangle splats directly from video diffusion latents, improving geometric accuracy over 3D Gaussian-based approaches.
Alibaba introduced Qwen-AgentWorld, a family of language world models trained on more than 10 million environment interaction trajectories to simulate agentic environments across multiple domains.
We're hiring a senior PMM to own product marketing at TLDR. You'll define our positioning, build out sales enablement, and lead every launch. Learn more.
Anthropic and Alibaba have initiated a collaborative open-source framework focused on distilling advanced reasoning intelligence from frontier models into hyper-efficient edge models. The partnership leverages Anthropic's safety-alignment techniques alongside Alibaba's massive cloud infrastructure to compress compute footprints without severe capability degradation.
Amazon has an incumbent advantage in the race for hyperscalers to get their hands on more electricity. It has been building tons of data centers over the past two decades. The company is expected to add the most data center and power capacity in the US through 2030. However, Google will have significantly closed its gap with Amazon by that time.
Get insights from enterprise leaders at Prudential Insurance, Siemens, GAF, and HF Sinclair on how to build trusted data foundations for AI in this virtual panel.
Perplexity launched Computer for Counsel, an AI-driven legal operations tool designed to automate administrative research, document gathering, and contract triage.
Before an agent acts, it burns time and tokens paginating live APIs and querying MCP servers just to find the right records. That makes agents slower, less accurate, and expensive to run.Airbyte Agents gives your agents the Context Store: a continuously refreshed index of your business data they can search in milliseconds, instead of round-tripping through live APIs at runtime. Our benchmarks against native MCPs and APIs:Agentic search under 500ms40% fewer tool calls80% fewer tokens90% lower costs on multi-source queriesTry it for free!
Mistral released OCR 4, a document intelligence tool providing structured content extraction, including bounding boxes and confidence scores. It supports 170 languages, is deployable in a single container, and integrates into enterprise search and structured data pipelines. OCR 4 outperforms other systems with a 4x speed advantage and high accuracy, especially with low-resource languages.
Anthropic introduced Claude Tag, a Slack-based workflow that lets teams assign tasks to Claude, connect it to tools and codebases, and have it retain context across channels. The company said the system had become a core part of internal operations, with its product team using it to generate much of their code and assist with analytics, support, and debugging tasks.
ByteDance's new Seedance 2.5 AI video generation model can generate 30-second, 4K videos with a single prompt. Users are able to provide up to 50 images, videos, or audio clips as reference pieces. Increasing the number of references gives users more control over the video creation process. The model will be available in China next month. ByteDance has not announced a release time window for other countries.
This deep dive explores the growing focus on jailbreaks and indirect prompt injection attacks, featuring insights from Gray Swan's founders and their research. It also covered the company's role in evaluating advanced AI systems and developing security benchmarks.
CUGA, IBM's open-source agent harness, simplifies developing agentic apps by managing the complexities of planning, execution, and state management, allowing developers to focus on tool selection and prompts. CUGA's efficient system maintains state and corrects errors, outperforming others in benchmarks like AppWorld. Its unique features include configurable reasoning modes and integrated policy systems, enabling quick deployment from development to production while maintaining governance and flexibility.
NVIDIA's Agent Toolkit empowers businesses to build specialized, customizable AI agents using open models, tools, skills, and secure runtime. These agents accelerate complex workflows across industries like life sciences, healthcare, cybersecurity, and industrial operations by integrating with existing tools and data. Companies like Cadence, Synopsys, and CrowdStrike are leveraging this technology to enhance efficiency and accuracy in specific domains.
Lightfield is an agentic CRM with built-in agents that build your pipeline, prep you for meetings, send follow-ups, and keep your records current. One platform replacing your CRM, sequencer, enrichment tool, call recorder, and agent builder. Starts working in your first meeting. Try Lightfield free → lightfield.app
Graphsignal is a production-scale inference profiling platform that provides essential visibility across the inference stack. It helps engineers optimize AI performance across models, engines, GPUs, and other accelerators. Graphsignal can be used with coding agents for analysis. The profiler has minimal impact on production performance, and content data is not recorded.
Krea 2 introduces expansive, expressive image generation models designed for creative exploration, overcoming limitations of default aesthetics. It employs a multi-stage training process with advanced architectures and extensive data curation to enhance stylistic diversity and user control. Key innovations include a prompt expander and style-reference system, allowing nuanced text and image inputs to generate diverse visual outputs.
Unlimited OCR is a model designed to emulate human parsing working memory. It uses DeepSeek OCR as a baseline and combines it with a constant KV cache design. Unlimited OCR can transcribe dozens of pages of documents in a single forward pass under a standard maximum length of 32K. The technique used to develop Unlimited OCR is equally applicable to tasks such as ASR and translation.
Modern large language models use role tags as both a security architecture and cognitive scaffolding. Prompt injections are driven by a flaw in how AI models perceive roles. For LLMs, everything arrives through the same channel as one long token soup, so they can't distinguish between their own thoughts and speech. Unless AI models achieve genuine role perception, injection defense will remain a perpetual whack-a-mole game.
We're hiring a senior PMM to own product marketing at TLDR. You'll define our positioning, build out sales enablement, and lead every launch. Learn more.
OpenAI has started rolling out Bidirectional Voice Mode for ChatGPT. The company's new audio generation model, Bidi 1, lets the assistant speak, hear, and listen at the same time. It is able to hold the thread of a whole conversation and switch tasks on the fly if interrupted. The model can sing and beatbox, but there are some tight copyright restrictions. OpenAI has yet to make a formal announcement about the model, but some users are already seeing it in their model selectors.
The Trump administration is pressing Meta to submit its AI models for voluntary review. Meta is the only major AI developer in the US that has not reached an agreement to voluntarily share its models with the federal government for review. The review involved evaluating models' abilities and vulnerabilities. Meta's policy team has been negotiating with the Commerce Department about how to proceed, but it is unclear whether they will be able to reach an agreement.