r/MachineLearning · 8h ago · 7 · research architecture inference

SATFormer introduces a more efficient alternative to recent Transformer variants by replacing static cross-layer pathways with per-token, per-head gating that selectively reuses first-layer representations. The method achieves better efficiency-performance tradeoffs (1.75-1.82× higher throughput than competitors) while improving validation loss at 130M-1.3B scale and showing strong results on retrieval-intensive tasks.
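The gating idea can be sketched in a few lines. This is a generic illustration of per-token, per-head gated reuse of first-layer states, not SATFormer's actual architecture; all names, shapes, and the sigmoid gate are assumptions.

```python
import numpy as np

def gated_reuse(h_current, h_first, W_gate, n_heads):
    """Mix the first layer's representation back into the current layer
    with one learned scalar gate per token per head.
    h_current, h_first: (seq_len, d_model); W_gate: (d_model, n_heads).
    Illustrative sketch only, not the paper's exact formulation."""
    seq_len, d_model = h_current.shape
    head_dim = d_model // n_heads
    # One gate per token per head, computed from the current representation.
    gates = 1.0 / (1.0 + np.exp(-h_current @ W_gate))  # (seq_len, n_heads)
    g = np.repeat(gates, head_dim, axis=1)             # expand to (seq_len, d_model)
    # Convex mix: each head decides how much first-layer signal to reuse.
    return g * h_first + (1.0 - g) * h_current
```

The appeal over static cross-layer pathways is that the reuse decision is input-dependent, so heads that need fresh context can ignore the shortcut.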

r/MachineLearning · 13h ago · 6 · agent benchmark

League of Robot Runners (LoRR) 2026 is a research competition focused on large-scale multi-robot coordination using ML/RL methods for task scheduling and path planning under uncertainty. The competition provides starter kits in C++/Python, automated evaluation with live leaderboards, and welcomes diverse technical approaches including RL, search, optimization, and hybrid techniques.

Latent Space · 13h ago · 6 · prompt engineering workflow research

Article explores the 'Jagged Frontier' concept where modern LLMs like GPT-5 show dramatic capability improvements at research/science frontiers while appearing incremental for everyday tasks. Features physicist Alex Lapskasky using AI (o3/GPT-5) to accelerate theoretical physics research, reproducing complex papers in minutes through prompt engineering techniques like 'priming' with textbook problems.
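The 'priming' technique amounts to prepending a solved problem in the same formalism before the frontier question. A minimal sketch of that prompt structure, with wording that is illustrative rather than taken from the article:

```python
def primed_prompt(textbook_example, research_question):
    """Build a 'primed' prompt: show the model a worked textbook problem
    first, then pose the research question in the same notation.
    Template text is an assumption, not the article's actual prompts."""
    return (
        "Worked example:\n"
        f"{textbook_example}\n\n"
        "Using the same methods and notation, now solve:\n"
        f"{research_question}"
    )
```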

Anthropic Blog · 14h ago · 8 · agent tool api update deployment

Anthropic released 10 pre-built agent templates for financial services workflows (pitchbooks, KYC screening, month-end closing) deployable as Claude plugins or managed agents, plus native integrations with Microsoft 365 apps and expanded MCP/connector ecosystem for real-time data access. The templates package skills, data connectors, and subagents as reference architectures that teams can adapt and deploy in days, with Claude Opus 4.7 achieving 64.37% on Vals AI's Finance Agent benchmark.

r/LocalLLaMA · 16h ago · 7 · benchmark inference research

Comprehensive benchmark comparison of Qwen3.6 vs Qwen3.5 27B and Gemma 4 31B across accuracy, latency, and token efficiency metrics, with extended analysis on thinking-enabled modes. Results show Qwen3.6 excels on math/knowledge tasks but underperforms on instruction-following and some reasoning benchmarks, revealing task-specific trade-offs for practitioners choosing between models.

r/MachineLearning · 16h ago · 7 · deployment workflow

A software engineer shares production cost management challenges with LLM APIs, specifically the difficulty of tracking token usage and costs across features when moving from prototypes to scaled deployments. The core issue is a lack of cost attribution granularity: OpenAI dashboards provide total spend but no per-feature breakdown, forcing manual reconciliation that neither scales nor inspires confidence in the numbers.
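One common workaround is local accounting: tag every API call with a feature label and accumulate token counts and estimated cost per tag. A minimal sketch of that pattern; the model name and per-million-token prices below are hypothetical, and real prices vary by model and over time.

```python
from collections import defaultdict

# Hypothetical prices per 1M tokens; substitute your provider's real rates.
PRICES = {"gpt-mini": {"input": 0.15, "output": 0.60}}

class CostLedger:
    """Accumulates token usage and estimated cost per feature tag."""
    def __init__(self):
        self.usage = defaultdict(lambda: {"input": 0, "output": 0, "cost": 0.0})

    def record(self, feature, model, input_tokens, output_tokens):
        # Called once per API response, using the usage block the API returns.
        p = PRICES[model]
        cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
        row = self.usage[feature]
        row["input"] += input_tokens
        row["output"] += output_tokens
        row["cost"] += cost
        return cost

    def report(self):
        return dict(self.usage)
```

Because the ledger is fed from each response's own usage metadata, the per-feature totals reconcile against the provider's dashboard by construction.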

r/MachineLearning · 17h ago · 8 · library open source inference benchmark

TritonSigmoid is an open-source GPU kernel implementing sigmoid attention with native padding awareness, achieving 515 TFLOPS on H100 and outperforming softmax/FlashAttention on variable-length sequences. Designed for single-cell biology models where multi-token attention is semantically required, it demonstrates both computational efficiency and empirical improvements in loss and representation quality across benchmarks.
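The core difference from softmax attention is easy to show in plain NumPy: each score is squashed independently, so padded keys can simply be zeroed without renormalizing a distribution. This is a reference-semantics sketch, not the Triton kernel; the constant bias term is an assumption borrowed from sigmoid-attention literature.

```python
import numpy as np

def sigmoid_attention(q, k, v, pad_mask, bias=-10.0):
    """Sigmoid attention with native padding awareness.
    q, k, v: (seq, d); pad_mask: (seq,), 1 for real tokens, 0 for padding.
    Each weight is an independent sigmoid, so masking is a plain multiply."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + bias       # (seq, seq)
    weights = 1.0 / (1.0 + np.exp(-scores))    # element-wise, no normalization
    weights = weights * pad_mask[None, :]      # padded keys contribute nothing
    return weights @ v
```

Because there is no row-wise normalization, removing padded columns never redistributes mass to the remaining keys, which is exactly the variable-length-sequence property the kernel exploits.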

r/MachineLearning · 22h ago · 7 · fine tuning tool workflow open source

An engineer shares a practical approach using Qwen2-VL-2B-Instruct with LoRA fine-tuning to detect obfuscated transaction patterns: transaction graphs are rendered as 2D images so the VLM's visual understanding can do the pattern matching. The post is an interesting workflow alternative to standard GNNs and includes published LoRA weights and a synthetic-dataset methodology, all run on AMD/ROCm hardware.
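The graph-to-image step can be as simple as rasterizing the adjacency matrix. A minimal sketch of that idea (the post's actual rendering pipeline is not specified, so this encoding is an assumption):

```python
import numpy as np

def graph_to_image(edges, n_nodes, scale=8):
    """Render a transaction graph's adjacency matrix as a grayscale image
    a VLM can inspect. Each cell is upsampled to a scale x scale block so
    edges are visible at image resolution."""
    adj = np.zeros((n_nodes, n_nodes), dtype=np.uint8)
    for src, dst in edges:
        adj[src, dst] = 255          # white pixel marks a transaction edge
    # Kronecker product with a block of ones upsamples each cell.
    return np.kron(adj, np.ones((scale, scale), dtype=np.uint8))
```

Structural patterns like fan-in/fan-out or layering then show up as visual motifs the fine-tuned VLM can learn to recognize.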

OpenAI Blog · 1d ago · 7 · new model api update

OpenAI released GPT-5.5 Instant as ChatGPT's default model, featuring improvements in reasoning accuracy and hallucination reduction. Engineers building with ChatGPT API should evaluate whether to migrate to this model for better performance on their applications.

r/MachineLearning · 1d ago · 7 · research fine tuning workflow

A software engineer is debugging an implementation of unsupervised hyperbolic contrastive learning on ImageNet-1k, where their hyperbolic version (57% 1-NN accuracy) significantly underperforms standard Euclidean cosine contrastive learning (64%). The issue likely involves manifold constraint enforcement, loss formulation design, or hyperparameter tuning specific to hyperbolic geometry.
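Two of the suspected failure points are concrete enough to sketch: the Poincaré-ball distance that replaces cosine distance in the loss, and the projection that keeps embeddings strictly inside the unit ball after each optimizer step. Standard formulas, shown as a minimal NumPy illustration rather than the poster's code.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-7):
    """Geodesic distance on the Poincare ball:
    d(x, y) = arccosh(1 + 2*||x-y||^2 / ((1-||x||^2)(1-||y||^2))).
    Blows up as either point approaches the boundary, so the manifold
    constraint must be enforced explicitly."""
    num = np.sum((x - y) ** 2)
    denom = max((1 - np.sum(x * x)) * (1 - np.sum(y * y)), eps)
    return np.arccosh(1 + 2 * num / denom)

def project_to_ball(x, max_norm=1 - 1e-5):
    """Clip an embedding back strictly inside the unit ball; typically
    applied after every optimizer step in hyperbolic training."""
    norm = np.linalg.norm(x)
    return x if norm < max_norm else x * (max_norm / norm)
```

Skipping the projection (or using too small an eps) is a common cause of exactly this kind of accuracy gap, since gradients explode near the boundary.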

Simon Willison · 1d ago · 7 · tool workflow open source

Datasette now supports configurable default options for LLM models in plugins, allowing users to specify model selection and parameters like temperature across enrichment operations. This workflow improvement addresses practical concerns for teams building LLM-integrated data tools.

Simon Willison · 1d ago · 7 · tool testing open source

A new testing plugin provides a fake LLM model ('echo') that echoes prompts without actual inference, enabling developers to write automated tests for LLM-based applications. The tool supports faking reasoning blocks and JSON responses, streamlining test development workflows.
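The pattern is worth seeing in miniature: application code takes any model object, and tests inject a fake that echoes its prompt, so assertions can target the prompt the application actually built. This is a generic illustration of the idea, not the plugin's real interface.

```python
class EchoModel:
    """Fake LLM that returns its prompt unchanged: no network, no inference
    cost, fully deterministic. Generic stand-in, not the plugin's API."""
    def prompt(self, text, system=None):
        return text

def summarize(model, document):
    # Application code under test: builds a prompt and calls the model.
    return model.prompt(f"Summarize: {document}")
```

In a test, passing `EchoModel()` lets you assert the exact prompt `summarize` constructed, which is usually the logic you most want covered.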

Simon Willison · 1d ago · 7 · new model open source inference

IBM released Granite 4.1 LLMs (3B, 8B, and 30B sizes) under the Apache 2.0 license with detailed training documentation, and Unsloth published 21 GGUF quantized variants for the 3B model ranging from 1.2 GB to 6.34 GB. The post documents an experimental evaluation of how quantization affects model performance on SVG generation tasks, providing practical insights into size-quality tradeoffs for local deployment.

r/MachineLearning · 1d ago · 6 · workflow research

Reddit discussion on practical strategies for validating expensive diffusion model experiments, covering dataset reduction, batch size/learning rate tradeoffs, and early stopping. While not a formal resource, it discusses real engineering constraints relevant to researchers reproducing compute-heavy papers.

Simon Willison · 1d ago · 6 · tool library research

Explores the TRE regex engine's resistance to ReDoS attacks compared to Python's standard library, with Claude Code used to build experimental Python bindings and test malicious regex patterns. Demonstrates the practical security benefit of non-backtracking regex implementations for AI engineers building systems that process untrusted regex inputs.
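The attack class is easy to reproduce with the classic nested-quantifier pattern: Python's backtracking `re` engine does exponential work on an input that almost matches, while an automaton-based engine stays linear on the same input. A small self-contained demo (kept to modest n so it finishes quickly):

```python
import re
import time

def redos_time(n):
    """Time Python's backtracking re engine on the classic ReDoS pattern
    (a+)+$ against a string that almost matches. The number of ways to
    partition the a-run doubles with each extra character, so work grows
    exponentially in n."""
    text = "a" * n + "b"
    start = time.perf_counter()
    assert re.match(r"(a+)+$", text) is None  # never matches, only backtracks
    return time.perf_counter() - start
```

On untrusted patterns or inputs, this is why the post's backtracking-free approach matters: the equivalent unambiguous pattern `a+$` rejects the same string in linear time.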

r/MachineLearning · 1d ago · 8 · fine tuning open source tool tutorial benchmark

A practical fine-tuning case study using QLoRA to adapt Qwen2.5-1.5B for CEFR English proficiency classification, reaching 84.9% accuracy across 6 difficulty levels. The work includes synthetic dataset generation via Llama-3.3-70B, 4-bit quantization, and FastAPI deployment, demonstrating parameter-efficient tuning (0.28% of weights trainable) for a real-world educational NLP task.
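The 0.28% figure is the kind of number worth sanity-checking with back-of-envelope arithmetic: each LoRA-adapted (d × d) weight adds r·(d + d) = 2·r·d parameters. The layer count, hidden size, rank, and set of adapted projections below are illustrative assumptions, not Qwen2.5-1.5B's exact configuration.

```python
def lora_trainable_fraction(layers, d_model, r, base_params):
    """Rough count of LoRA trainable parameters vs. the base model.
    Assumes rank-r adapters on the four attention projections per layer;
    each adapted (d x d) matrix gains A: (d, r) plus B: (r, d) weights."""
    adapted_matrices_per_layer = 4          # q, k, v, o projections (assumption)
    per_matrix = 2 * r * d_model            # parameters in A plus B
    trainable = layers * adapted_matrices_per_layer * per_matrix
    return trainable, 100.0 * trainable / base_params
```

Plugging in plausible 1.5B-class dimensions lands well under 1% trainable, consistent in magnitude with the post's 0.28%.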

r/MachineLearning · 1d ago · 7 · library open source tool

Parax is a generalized JAX library for parametric modeling. It provides derived/constrained parameters, computed PyTrees, and abstract interfaces for parameter management, with a focus on clean, extensible APIs and opt-in design rather than framework overhead.
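The derived/constrained-parameter pattern the library is built around can be illustrated in a few lines: store an unconstrained raw value (what an optimizer sees) and expose the constrained value on read. This is a generic pure-Python illustration of the pattern, not Parax's actual API.

```python
import math

class Param:
    """Unconstrained storage, constrained access. The optimizer updates
    `raw` freely; consumers read the derived value, which is guaranteed
    to satisfy the constraint (here: strict positivity)."""
    def __init__(self, raw):
        self.raw = raw  # unconstrained real number

    @property
    def positive(self):
        # softplus maps any real to (0, inf), enforcing the constraint
        return math.log1p(math.exp(self.raw))
```

Registering such containers as PyTrees is what lets JAX transformations traverse the raw values while model code only ever sees the derived ones.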