r/MachineLearning · 2h ago · 7 · benchmark tool open source

Open-source OCR benchmarking tool comparing flagship vs. smaller/older models for document extraction, showing cost-efficiency gains without accuracy loss. Includes 42 standardized documents and 7,560 test calls tracking pass reliability, cost-per-success, latency, and field accuracy, plus a public leaderboard and a free testing tool.

r/MachineLearning · 4h ago · 7 · inference benchmark workflow

A new Kaggle competition for optimizing LLM inference costs by deciding whether to route questions to a 2B model or skip them entirely, using MMLU benchmark data with a weighted cost metric. This directly addresses practical token/compute cost reduction, a key concern for engineers building with LLMs at scale, and encourages exploration of routing strategies and model selection heuristics.
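The route-or-skip decision above can be sketched as an expected-value threshold. This is a minimal illustration: `REWARD` and `TOKEN_COST` are assumed weights, not the competition's actual scoring rules.

```python
# Sketch of a "route to 2B or skip" decision under a weighted cost metric.
# REWARD and TOKEN_COST are illustrative assumptions, not the competition's
# actual scoring weights.

REWARD = 1.0      # assumed credit for a correct answer
TOKEN_COST = 0.3  # assumed compute cost per question routed to the 2B model

def expected_score(confidence: float) -> float:
    """Expected score if the question is routed to the 2B model."""
    return confidence * REWARD - TOKEN_COST

def route(confidence: float) -> str:
    """Route only when the expected score beats skipping (which scores 0)."""
    return "2b" if expected_score(confidence) > 0 else "skip"

route(0.2)  # "skip": a 20% success chance does not cover the token cost
route(0.8)  # "2b": expected score 0.5 beats skipping
```

In practice `confidence` would come from a calibrated estimator of the 2B model's success probability on each question; the interesting part of the competition is building that estimator.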

r/MachineLearning · 4h ago · 6 · research workflow open source

Engineer shares guardd, a host-based anomaly detection system using Isolation Forest on Linux exec/network events with 60-second windowing and unsupervised baseline training. Key challenges discussed: false positives from high-variance processes like browsers, sensitivity to training data distribution, and trade-offs between pure unsupervised approaches versus hybrid methods with time-based features and better normalization.
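The 60-second windowing step described above can be sketched with a simple aggregation: bucket host events by (window, process) and emit per-bucket count features. The event schema here is an assumption for illustration; the resulting vectors are the kind of input that would feed scikit-learn's `IsolationForest` for unsupervised baseline training.

```python
from collections import defaultdict

WINDOW = 60  # seconds, matching the post's 60-second windowing

def window_features(events):
    """Aggregate (timestamp, process, event_type) tuples into per-window,
    per-process feature vectors: [exec_count, net_count, unique_event_types].
    The event schema is an illustrative assumption, not guardd's actual format."""
    buckets = defaultdict(list)
    for ts, proc, etype in events:
        buckets[(int(ts) // WINDOW, proc)].append(etype)
    return {
        key: [etypes.count("exec"), etypes.count("net"), len(set(etypes))]
        for key, etypes in buckets.items()
    }

events = [
    (3.0, "bash", "exec"),
    (10.0, "bash", "exec"),
    (12.0, "firefox", "net"),
    (65.0, "firefox", "net"),  # falls into the second 60s window
]
feats = window_features(events)
# Vectors like these would be stacked into a matrix and passed to
# sklearn.ensemble.IsolationForest(...).fit(X) for baseline training.
```

The false-positive problem the post mentions shows up exactly here: high-variance processes like browsers produce wildly varying count vectors, so per-process normalization of these features is one of the discussed mitigations.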

r/MachineLearning · 10h ago · 8 · research agent prompt engineering

Research analyzing 25,000 AI scientist experiments reveals critical flaws in how AI agents conduct scientific reasoning: 68% ignore gathered evidence, 71% never update beliefs, and only 26% revise hypotheses with contradictory data. The study demonstrates that popular agent architectures (ReAct, chain-of-thought, structured tool-calling) fail to instill proper scientific methodology, suggesting fundamental limitations in current prompting and scaffolding approaches that require architectural rethinking.

Latent Space · 12h ago · 7 · workflow deployment agent research

Shopify's CTO discusses internal AI infrastructure including Tangle (reproducible ML workflows), Tangent (auto-research optimization), and SimGym (customer behavior simulation), with practical insights on code review bottlenecks, deployment stability, and why AI coding's real constraint is now validation/deployment rather than generation.

r/MachineLearning · 13h ago · 7 · open source tool deployment

Open-source GPU pricing catalog that automatically aggregates real-time data from 20+ cloud providers, covering 50 GPU models and 2K+ offerings with spot and on-demand pricing. Useful infrastructure tool for engineers optimizing cloud costs and managing GPU resource allocation across multiple providers.

Simon Willison · 15h ago · 9 · new model open source inference benchmark

Qwen3.6-27B is a new 27B dense model claiming flagship-level coding performance while being 15x smaller than its predecessor (55.6GB vs 807GB), with practical demonstration of local inference using GGUF quantization and llama.cpp achieving strong coding generation at reasonable token throughput.

HuggingFace Blog · 16h ago · 7 · tutorial deployment open source tool

Tutorial for building a multimodal Voice Language Agent (VLA) with Gemma 4 on Jetson Orin Nano, enabling autonomous vision and audio interaction without hardcoded triggers. Covers practical setup with llama.cpp native compilation, STT/TTS integration via Hugging Face, and memory optimization techniques for edge deployment.

r/LocalLLaMA · 17h ago · 8 · new model inference open source deployment

Qwen3.6-27B open-weight model release with 262K context length, optimized for coding and real-world applications. Includes deployment guides for SGLang, vLLM, and other inference frameworks with support for tool use and multi-token prediction.

r/MachineLearning · 19h ago · 7 · benchmark inference tool

Discussion of a practical benchmark that evaluates streaming text-to-speech (TTS) models on real-world failure cases like dates, URLs, and phone numbers, using 1,000+ test sentences and Gemini-based evaluation. Identifies a genuine production challenge in TTS systems, where models succeed on naturalness but fail on structured data normalization.
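The normalization failure class is easy to see with one example: a TTS front-end must expand structured tokens before synthesis. Below is a minimal sketch for just the phone-number case; real normalizers handle far more formats and locales.

```python
import re

# Minimal sketch of TTS text normalization for one structured-data case
# (phone numbers). Illustrative only; production front-ends are far broader.

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

PHONE_RE = re.compile(r"\(?\d{3}\)?-?\d{3}-\d{4}")

def spell_phone(match: re.Match) -> str:
    """Spell a matched phone number digit by digit, pausing at separators."""
    groups = match.group(0).replace("(", "").replace(")", "").split("-")
    return ", ".join(" ".join(DIGIT_WORDS[d] for d in g) for g in groups if g)

def normalize(text: str) -> str:
    return PHONE_RE.sub(spell_phone, text)

print(normalize("Call 555-867-5309 today."))
# -> Call five five five, eight six seven, five three zero nine today.
```

A model that skips this step tends to read "555-867-5309" as an arithmetic expression or a large number, which is exactly the failure mode the benchmark targets.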

OpenAI Blog · 22h ago · 7 · agent workflow tutorial

Guide on building workspace agents in ChatGPT to automate workflows and integrate tools for team operations. Covers practical implementation of agent patterns for connecting external tools and scaling automation across teams.

OpenAI Blog · 22h ago · 6 · agent workflow api update

ChatGPT Workspace agents are cloud-based automation tools powered by Codex that handle multi-step workflows across integrated applications. This is relevant for engineers building AI workflows, though practical value for daily development will depend on details of actual capabilities, API integration patterns, and security architecture.

OpenAI Blog · 22h ago · 8 · agent inference workflow

Technical breakdown of optimization patterns in the Codex agent loop using WebSockets for persistent connections and connection-scoped caching to reduce API overhead and improve model latency. Practical architectural insights for engineers building with AI agents and managing inference performance at scale.
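The connection-scoped caching pattern described above can be sketched generically: a cache that lives exactly as long as one persistent connection, so repeated lookups during an agent session skip redundant round trips. The class and method names below are illustrative assumptions, not Codex's actual API.

```python
import hashlib

class AgentConnection:
    """Sketch of connection-scoped caching: one persistent connection
    (e.g. a WebSocket) owns a cache bound to its lifetime, so repeated
    context lookups within a session avoid extra round trips."""

    def __init__(self, fetch):
        self._fetch = fetch   # expensive call over the wire (assumed)
        self._cache = {}      # scoped to this connection's lifetime
        self.round_trips = 0

    def get(self, key: str):
        digest = hashlib.sha256(key.encode()).hexdigest()
        if digest not in self._cache:
            self.round_trips += 1
            self._cache[digest] = self._fetch(key)
        return self._cache[digest]

    def close(self):
        self._cache.clear()   # cache dies with the connection

conn = AgentConnection(fetch=lambda k: f"payload:{k}")
conn.get("repo_tree")
conn.get("repo_tree")  # served from the connection-scoped cache
assert conn.round_trips == 1
```

Scoping the cache to the connection rather than the process sidesteps invalidation across sessions: when the connection closes, stale entries vanish with it.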

r/MachineLearning · 1d ago · 8 · tool inference open source deployment

Researcher shipped Spiral, a model compression tool using INT3 quantization (+0.14 nats) and custom 2-bit KV cache optimization with fused Metal kernels for M-series Macs. Includes a Qwen 7B preview model, with Triton GPU kernels in development, making it directly applicable for engineers optimizing inference on consumer hardware.

Simon Willison · 1d ago · 7 · api update agent deployment

GitHub Copilot is restructuring pricing and usage limits due to agentic workflows consuming significantly more compute than originally anticipated, shifting from per-request to token-based pricing with restrictions on individual plans. This reflects the real infrastructure costs of AI agents in production and impacts developers using Copilot's expanding agentic capabilities across IDE integrations and CLI tools.

r/MachineLearning · 1d ago · 7 · benchmark inference deployment

Discussion on evaluating quantization impact for DeepSeek V3.2, covering practical benchmark selection for measuring quality degradation from runtime quantization. Relevant for engineers deploying quantized models in production and optimizing inference performance vs. accuracy tradeoffs.
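One common proxy for quantization-induced degradation in that discussion's setting is perplexity on held-out text, computed from per-token log-probabilities. A minimal sketch, using hypothetical numbers rather than real DeepSeek V3.2 measurements:

```python
import math

def perplexity(logprobs):
    """Perplexity from natural-log per-token probabilities: exp(-mean(log p))."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Hypothetical per-token log-probs from full-precision vs. quantized runs
# of the same model on the same held-out text (illustrative values only).
fp_logprobs = [-1.2, -0.8, -2.1, -0.5]
q_logprobs  = [-1.3, -0.9, -2.4, -0.6]

delta = perplexity(q_logprobs) - perplexity(fp_logprobs)
# A small positive delta suggests mild quality loss from quantization;
# task benchmarks (coding, reasoning) catch regressions perplexity misses.
```

Perplexity deltas are cheap to compute but, as the thread notes, they should be paired with task-level benchmarks, since quality loss from runtime quantization is often uneven across capabilities.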

Simon Willison · 1d ago · 6 · api update workflow deployment

Anthropic briefly tested moving Claude Code from the $20/month Pro plan to exclusive availability on $100+/month Max plans, sparking community backlash. The change was quickly reverted, but the incident reveals product strategy shifts around AI coding agent features and competitive positioning against OpenAI's Codex offerings.

Latent Space · 1d ago · 9 · new model api update benchmark agent

OpenAI released GPT-Image-2, a major image generation model now available via API and ChatGPT with significant improvements in text rendering, layout consistency, and multilingual support. The model achieves #1 on Arena leaderboards with a +242 Elo lead on text-to-image tasks and introduces thinking variants that enable web search and self-checking capabilities, positioning image generation as a front-end interface for coding agents.