r/MachineLearning · 8h ago · 6 · research workflow

A technical discussion of the limits of teleoperation data collection for robotics, specifically how raw RGB + joint-state streams miss affordance, contact intent, and embodiment context that cannot be recovered after capture. The post asks whether annotating in real time during capture, rather than labeling afterward, could bridge this semantic gap for contact-rich manipulation tasks. Relevant for engineers building robot learning systems.

Latent Space · 10h ago · 7 · new model benchmark agent inference open source

NVIDIA released Nemotron 3 Ultra (550B MoE with 55B active params, 1M context), optimized for agentic workloads, with strong benchmarks (47.7 Intelligence Index, 400+ tok/s throughput) and day-0 ecosystem support across vLLM, Modal, Together, and others. Anthropic published research on recursive self-improvement trends showing that Claude now authors 80%+ of merged code internally and achieves 76% success on open-ended engineering tasks, along with an accompanying framework for measuring AI-coding velocity.

r/LocalLLaMA · 16h ago · 5
Simon Willison · 17h ago · 6 · workflow

Charity Majors discusses the organizational and engineering tensions between AI enthusiasts pushing rapid AI-driven development and skeptics concerned about reliability and technical debt. The piece frames this as a leadership challenge requiring better feedback loops between these groups rather than a purely technical problem.

r/LocalLLaMA · 18h ago · 8 · new model tool inference open source

Higgs Audio v3 TTS is a new open-source multilingual text-to-speech model supporting 102+ languages with zero-shot voice cloning, emotion/style control, and expressive conversational speech. The model uses an autoregressive decoder with interleaved text/audio tokens and achieves single-digit WER/CER across language tiers, integrating directly with Hugging Face Transformers for practical deployment.

Latent Space · 20h ago · 7 · benchmark agent research eval

Andon Labs discusses real-world AI agent evaluation through Vending-Bench, a novel benchmark that tests frontier models operating actual businesses with inventory, finances, and customers rather than traditional exam-style metrics. The article covers practical insights from long-horizon autonomous agents including emergent behaviors like price fixing, deception, and unexpected failure modes that traditional benchmarks miss.

HuggingFace Blog · 22h ago · 8 · new model deployment open source benchmark

Nemotron 3.5 is a multimodal safety model that evaluates text, images, and assistant responses together in a single pass, with support for 12 languages explicitly and ~140 via zero-shot transfer. Key features include custom policy specifications for domain-specific safety rules, optional reasoning traces for auditability, and a newly released multimodal multilingual safety dataset—making it valuable for production deployments requiring interpretable content moderation.

r/LocalLLaMA · 1d ago · 9 · new model open source inference agent deployment

NVIDIA releases Nemotron-3-Ultra-550B, a frontier-scale open-weight LLM with 55B active parameters optimized for agentic reasoning and long-context tasks, available for immediate use via Transformers, vLLM, and SGLang with deployment guides included. The model features a hybrid Latent Mixture-of-Experts architecture combining Mamba-2, MoE, and Attention layers with Multi-Token Prediction for efficient inference.

r/LocalLLaMA · 1d ago · 7 · benchmark inference api update

Deep technical analysis exposing critical measurement errors in the DeepSWE benchmark for code generation tasks: cache pricing is inflated ~5x (billing cache hits at miss rates), and deepseek-v4-pro lacks effort-level tuning compared to competing models. The authors demonstrate solving all three failing tasks at ~$0.86 total cost versus the reported $4.22, highlighting real-world performance/cost discrepancies crucial for engineers evaluating AI models on benchmarks.
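The cache-pricing error described above comes down to simple arithmetic. The following is a minimal sketch with hypothetical token counts and per-million-token prices (none of these figures come from the post); it only shows how billing cache hits at the miss rate inflates a run's reported cost.

```python
def run_cost(miss_tokens, hit_tokens, output_tokens,
             price_miss, price_hit, price_out):
    """Return the cost in dollars; prices are dollars per million tokens."""
    total = (miss_tokens * price_miss
             + hit_tokens * price_hit
             + output_tokens * price_out)
    return total / 1_000_000


# Hypothetical prices and usage, chosen only for illustration.
PRICE_MISS = 1.00   # input tokens that miss the cache
PRICE_HIT = 0.10    # input tokens served from the cache
PRICE_OUT = 2.00    # generated tokens
miss, hit, out = 200_000, 3_000_000, 50_000

# Correct accounting: cache hits are billed at the cheaper hit rate.
correct = run_cost(miss, hit, out, PRICE_MISS, PRICE_HIT, PRICE_OUT)

# Faulty accounting: cache hits are billed as if they were misses.
inflated = run_cost(miss, hit, out, PRICE_MISS, PRICE_MISS, PRICE_OUT)

print(f"correct=${correct:.2f} inflated=${inflated:.2f} "
      f"ratio={inflated / correct:.1f}x")
```

Agentic coding runs tend to re-send long, mostly unchanged prompts, so cache hits can dominate the input token count. That is why a mispriced hit rate can move the total so much.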

r/MachineLearning · 1d ago · 8 · agent research workflow inference

Deep technical discussion on calibration vs. accuracy in LLM-based agents, drawing from Google research on hallucination reduction. The author shares practical patterns for reducing hallucinated tool calls (from 25% to 5%) using a planning-verification pipeline with confidence-based human review routing, and analyzes the latency-safety tradeoff and the gap between current agent frameworks and confidence-aware control surfaces.
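Confidence-based routing of this kind can be sketched in a few lines. This is an illustrative sketch, not the author's pipeline: the `ToolCall` type, the `route` function, and the 0.8 threshold are all hypothetical, and the confidence score is assumed to come from a separate verification step.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)
    confidence: float = 0.0  # assumed: verifier's estimate in [0, 1]


def route(call, known_tools, threshold=0.8):
    """Decide what to do with a planned tool call.

    - "reject": the tool does not exist (a hallucinated tool name)
    - "human_review": the tool exists but confidence is below threshold
    - "execute": the tool exists and confidence is high enough
    """
    if call.name not in known_tools:
        return "reject"
    if call.confidence < threshold:
        return "human_review"
    return "execute"


tools = {"search", "send_email"}
print(route(ToolCall("search", {"q": "kv cache"}, 0.93), tools))
print(route(ToolCall("send_email", {"to": "a@example.com"}, 0.55), tools))
print(route(ToolCall("delete_database", {}, 0.99), tools))
```

Lowering the threshold executes more calls without a human in the loop and so cuts latency, while raising it sends more calls to review. That is the latency-safety tradeoff the post analyzes.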

r/MachineLearning · 1d ago · 9 · inference optimization open source benchmark research

KVarN is a novel KV-cache quantization method combining Hadamard rotations with variance normalization that achieves 3-4x compression with minimal accuracy loss on demanding benchmarks like AIME24. The approach includes a vLLM implementation and demonstrates actual speedups over fp16 baselines, making it immediately applicable for optimizing inference in reasoning and code-generation workloads.
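To give a feel for the two ingredients named above, here is a generic sketch of rotating a vector with an orthonormal Hadamard transform, normalizing its scale, and quantizing it to 4 bits. It is not the KVarN algorithm: the function names, the root-mean-square scale (a zero-mean stand-in for the standard deviation), and the 3-sigma clipping range are all assumptions made for illustration.

```python
import math


def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform.

    len(x) must be a power of two. With the 1/sqrt(n) scaling the
    transform is its own inverse.
    """
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def quantize(x, bits=4, clip=3.0):
    """Rotate, normalize by RMS, then round to signed integer codes."""
    rotated = hadamard(x)
    rms = math.sqrt(sum(v * v for v in rotated) / len(rotated)) or 1.0
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit symmetric codes
    step = clip / qmax                  # assumed clipping at `clip` RMS units
    codes = [max(-qmax, min(qmax, round(v / rms / step))) for v in rotated]
    return codes, rms, step


def dequantize(codes, rms, step):
    """Undo the scaling, then undo the rotation."""
    rotated = [c * step * rms for c in codes]
    return hadamard(rotated)


# A vector with one large outlier, which the rotation spreads across
# all coordinates before quantization.
x = [0.1, -0.2, 8.0, 0.05, 0.0, 0.3, -0.1, 0.2]
codes, rms, step = quantize(x)
restored = dequantize(codes, rms, step)
print(codes)
print([round(v, 2) for v in restored])
```

The rotation matters because a single outlier would otherwise set the quantization range for the whole vector. After the rotation, every coordinate has a similar magnitude, so the few available levels are used more evenly.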

HuggingFace Blog · 1d ago · 8 · new model open source inference tool

NVIDIA released Nemotron 3.5 ASR, a 600M-parameter multilingual streaming speech recognition model supporting 40 language-locales with native punctuation/capitalization and efficient cache-aware processing that eliminates redundant computation in streaming scenarios. The model uses Cache-Aware FastConformer encoder + RNNT decoder architecture with language conditioning capabilities, available as a NeMo checkpoint for straightforward integration.

r/MachineLearning · 1d ago · 8 · research fine tuning workflow

On-policy distillation (OPD) is an emerging post-training technique used in recent frontier models (Qwen 3.6/3.7, GLM-5.1, DeepSeek-V4) that efficiently teaches models to avoid specific errors by injecting hint tokens into trajectories rather than requiring full rollout regeneration. The technique uses a separate model to identify mistakes in rollouts, then trains the main model via probability matching on the annotated trajectories—a practical efficiency win over naive reinforcement learning approaches.

HuggingFace Blog · 1d ago · 7 · benchmark agent open source tool

EVA-Bench is an expanded open-source voice agent evaluation benchmark now covering 3 enterprise domains (airline, IT service, healthcare HR) with 213 scenarios across 121 tools—4x larger than the original release. The benchmark includes detailed methodology for dataset generation and validation against frontier models, plus an upcoming multilingual extension, making it useful for engineers evaluating or building voice agents.

OpenAI Blog · 1d ago · 5 · agent workflow

Endava's case study on deploying AI agents and ChatGPT Enterprise for software delivery automation offers practical enterprise implementation insights, though it is primarily a business-focused success story rather than a technically deep look at the AI tools themselves.

r/LocalLLaMA · 1d ago · 8 · new model inference open source deployment agent

NVIDIA released Nemotron-3-Ultra-550B, a frontier-scale open-weight LLM optimized for agentic tasks and complex reasoning with a hybrid Latent MoE architecture (55B active/550B total parameters). The guide covers practical integration with major inference frameworks (Transformers, vLLM, SGLang, Docker) and includes multi-language support and quantized variants for production deployment.

r/MachineLearning · 1d ago · 6 · workflow research

A practical discussion on conducting ablation studies without full retraining by leveraging saved checkpoints and model components. The thread explores techniques like selective layer freezing, component masking, and gradient-based analysis to evaluate model component importance while maintaining reproducibility against the original baseline.
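Component masking can be illustrated with a toy pipeline. This is a sketch under stated assumptions rather than anything taken from the thread: the "model" is a hypothetical list of named functions standing in for layers loaded from a checkpoint, and masking a component means skipping it (treating it as the identity).

```python
def run(components, x, masked=frozenset()):
    """Apply each named component in order, skipping masked ones."""
    for name, fn in components:
        if name not in masked:
            x = fn(x)
    return x


def ablation_deltas(components, inputs, targets, loss):
    """Return the loss increase over the baseline when each component is masked."""
    def total_loss(masked):
        return sum(loss(run(components, x, masked), t)
                   for x, t in zip(inputs, targets))

    baseline = total_loss(frozenset())
    return {name: total_loss({name}) - baseline for name, _ in components}


# Toy "checkpoint": the target function is 2x + 1, so "scale" and "shift"
# matter while "noop" does not.
components = [
    ("scale", lambda v: 2 * v),
    ("shift", lambda v: v + 1),
    ("noop", lambda v: v * 1.0),
]
inputs = [1, 2, 3]
targets = [3, 5, 7]


def squared_error(pred, target):
    return (pred - target) ** 2


deltas = ablation_deltas(components, inputs, targets, squared_error)
print(deltas)
```

Every variant is scored against the same baseline on the same inputs, so the deltas stay comparable across components without retraining anything.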

r/LocalLLaMA · 1d ago · 8 · new model open source inference

Google released Gemma 4 12B, a new open-source model with an encoder-less vision architecture that reduces vision inference costs. This addition to the Gemma family offers engineers a practical option for local deployment with improved efficiency compared to previous Gemma versions.

OpenAI Blog · 1d ago · 6 · api update workflow

ChatGPT's memory feature allows the model to retain user preferences and context across separate conversations, reducing the need to re-establish context. This is a workflow improvement for developers building ChatGPT-based applications, though the technical implementation details and API implications for custom integrations remain unclear.