Simon Willison · 13h ago · 7 · tool workflow prompt engineering

A developer built a complete web app entirely on mobile using Claude Code, demonstrating a practical AI-assisted workflow: creating a Python CLI tool, setting up Git scraping automation, and generating a JavaScript frontend from a single LLM prompt. It shows how Claude can handle multi-layer full-stack development, from local tooling to cloud-hosted APIs.

r/MachineLearning · 15h ago · 8 · open source benchmark research

A researcher has assembled and open-sourced a 103.1B-token Usenet corpus (1980–2013) with comprehensive metadata, deduplication, and cleaning — a rare, temporally coherent pretraining dataset spanning 33 years of language evolution, predating modern web-era contamination. The dataset comprises 408M posts across diverse hierarchies with 96.6% English coverage plus 100+ other languages, complete with a published data card and processing methodology on Hugging Face.

r/LocalLLaMA · 19h ago · 8 · library tool open source inference deployment

AutoRound is a mature quantization toolkit for LLMs/VLMs achieving 2-4 bit quantization with minimal accuracy loss using sign-gradient descent, now integrated into major frameworks like vLLM, SGLang, and Transformers. Recent updates include block-wise FP8, mixed-precision schemes, and GGUF format support, making it practical for production deployment with fast quantization times (~10 min for 7B models).
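AutoRound's core trick — learning per-weight rounding offsets with signed-gradient updates against a calibration set — can be sketched in plain NumPy. This is a toy illustration of the idea, not the AutoRound API; all names, shapes, and hyperparameters here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))            # toy weight matrix
X = rng.normal(size=(16, 32))           # calibration activations
scale = np.abs(W).max() / 7             # symmetric 4-bit scale

def out_err(V):
    """Output-space reconstruction error of the quantized layer."""
    Q = np.clip(np.round(W / scale + V), -8, 7)
    return np.linalg.norm((Q * scale - W) @ X)

V = np.zeros_like(W)                    # learnable rounding offsets in [-0.5, 0.5]
baseline = out_err(V)                   # V = 0 is plain round-to-nearest
best_err, best_V, lr = baseline, V.copy(), 0.01
for _ in range(200):
    Q = np.clip(np.round(W / scale + V), -8, 7)
    err = (Q * scale - W) @ X
    # straight-through estimator: treat round() as identity in the backward pass
    grad = 2 * scale * err @ X.T
    V = np.clip(V - lr * np.sign(grad), -0.5, 0.5)  # signed-gradient step
    e = out_err(V)
    if e < best_err:
        best_err, best_V = e, V.copy()
```

The signed update makes step sizes uniform across weights, and clipping keeps each offset from flipping a weight by more than one rounding level — the same constraint AutoRound imposes.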

r/MachineLearning · 19h ago · 6 · tool workflow

A discussion thread about open-source PDF-to-Markdown conversion tools, focused on handling complex tables in financial documents. The poster compares existing solutions (docling, marker, granite-docling) against paid alternatives like LandingAI, seeking recommendations for robust table parsing.

r/MachineLearning · 1d ago · 8 · tool open source inference deployment

Phosphene is a free macOS desktop app that wraps Lightricks' LTX 2.3 video generation model on Apple Silicon, notable for synced audio-video generation in a single forward pass rather than post-processing. It features multiple generation modes (text→video, image→video, frame interpolation), three quality tiers with honest hardware gating based on RAM availability, and local prompt rewriting via Gemma 3 12B, making it a practical tool for engineers building video generation workflows on Apple Silicon.

Latent Space · 1d ago · 7 · new model agent tool inference deployment

OpenAI released GPT-5.5 with strong cyber task performance (71.4% pass rate on multi-step attack simulations) and expanded Codex into a general-purpose agent for non-coding computer work with 42% faster inference, dynamic UI routing, and integrations with Microsoft/Google/Salesforce/creative tools. Anthropic launched Claude Security for code review and expanded creative tool support, while the broader narrative shows AI agents increasingly capable of autonomous task execution across diverse domains.

r/LocalLLaMA · 1d ago · 9 · new model open source inference deployment

Google DeepMind released Gemma 4 26B IT, an open multimodal model supporting text, images, and video, with a 256K context window and a hybrid attention mechanism for efficient inference on consumer GPUs. The NVIDIA-quantized NVFP4 version enables frontier-level performance for reasoning, coding, and agentic workflows, released under Apache 2.0 for both commercial and non-commercial use.

Simon Willison · 1d ago · 7 · agent tool workflow

Codex CLI 0.128.0 introduces a /goal feature that implements agentic looping similar to the Ralph pattern, automatically re-prompting until goal completion or token budget exhaustion. The implementation uses injected continuation and budget-limit prompts, demonstrating a practical approach to autonomous agent workflows with built-in resource constraints.
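The looping mechanic is simple to sketch. This is a hypothetical re-implementation of the pattern, not Codex's actual code; `model` stands in for any LLM call that returns a reply and a token count:

```python
def run_goal(model, goal, max_tokens=2000):
    """Re-prompt the model until it declares the goal done or the token
    budget is exhausted (a sketch of a /goal-style Ralph loop)."""
    transcript, used = [f"GOAL: {goal}"], 0
    while used < max_tokens:
        reply, tokens = model("\n".join(transcript))
        used += tokens
        transcript.append(reply)
        if "DONE" in reply:                      # model signals completion
            return reply, used
        # injected continuation prompt, surfacing the remaining budget
        transcript.append(
            f"Goal not yet complete. Budget left: {max_tokens - used} tokens. Continue."
        )
    return transcript[-1], used                  # budget exhausted
```

The budget check before each call is what keeps the loop from running away — the same built-in resource constraint the post highlights.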

Simon Willison · 1d ago · 7 · benchmark new model research

UK AI Security Institute evaluated GPT-5.5's cybersecurity capabilities, finding it comparable to Claude Mythos for vulnerability detection with broader availability. This is a direct model capability assessment relevant to engineers evaluating LLMs for security applications.

r/MachineLearning · 1d ago · 8 · agent open source tool api update deployment

A practical open-source project demonstrating autonomous AI agents using Llama 3, Qwen, and Gemma playing Pokémon Showdown with structured tool calling for decision-making. The architecture leverages LiteLLM to route through free API tiers (Groq, Cerebras, OpenRouter, Google AI Studio), making it cost-free to run locally with full observability via Langfuse.
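The tool-calling layer in such a setup is model-agnostic: the LLM emits a structured call, and a thin dispatcher routes it to a local handler. A minimal sketch — the tool names and argument schemas here are invented for illustration, not the project's actual ones:

```python
import json

# Registry of local handlers the model is allowed to invoke
TOOLS = {
    "choose_move": lambda move: {"action": "move", "name": move},
    "switch_pokemon": lambda slot: {"action": "switch", "slot": slot},
}

def dispatch(tool_call_json):
    """Route a structured tool call (as the model would emit it)
    to its registered handler."""
    call = json.loads(tool_call_json)
    handler = TOOLS[call["name"]]
    return handler(**call["arguments"])
```

Because the model only ever produces JSON matching this registry, swapping Llama 3 for Qwen or Gemma (or one free API tier for another via LiteLLM) leaves the decision-making layer untouched.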

r/MachineLearning · 1d ago · 9 · tutorial open source inference workflow

A practical reference ML compiler implementation in 5K lines of Python that demonstrates the complete lowering pipeline from PyTorch IR through six intermediate representations down to raw CUDA kernels. The walkthrough shows real compiler transformations (fusion, tiling, scheduling) on concrete examples like matmul+bias+relu, making compiler design accessible without the complexity of TVM/PyTorch Inductor.
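Operator fusion — one of the transformations such a walkthrough covers on the matmul+bias+relu example — can be illustrated on a toy list-of-ops IR. This sketch is illustrative only and skips the multi-consumer checks a real pass needs:

```python
def fuse(ops):
    """Greedily merge elementwise ops into the op producing their input
    (assumes single-consumer dataflow, which a real pass must verify)."""
    fused = []
    for op in ops:
        prev = fused[-1] if fused else None
        if prev and op["kind"] in {"add_bias", "relu"} and op["input"] == prev["output"]:
            # fold the elementwise op into its producer: one kernel, no temporary
            fused[-1] = dict(prev, kind=prev["kind"] + "+" + op["kind"],
                             output=op["output"])
        else:
            fused.append(dict(op))
    return fused
```

Run on `[matmul, add_bias, relu]`, this collapses three kernel launches (and two intermediate buffers) into one fused op — the payoff fusion delivers at every level of the lowering pipeline.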

r/MachineLearning · 1d ago · 8 · rag workflow tutorial

A practical approach to code-specific RAG using AST-derived typed graphs stored in SQLite with BM25 retrieval instead of embeddings, achieving ~5K tokens per query vs ~100K with naive chunking. The method leverages structural code relationships (imports, calls, inheritance) through graph traversal and uses lexical matching on distinctive identifiers, with hierarchical fallback for complex multi-file queries.
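The pipeline can be sketched with only the standard library: Python's `ast` module supplies the typed nodes and call edges, and SQLite's FTS5 extension provides BM25 ranking. A minimal sketch of the idea, not the post's actual schema:

```python
import ast
import sqlite3

SOURCE = '''
def load_config(path):
    return parse(path)

def parse(path):
    pass
'''

# Extract typed graph nodes: each function plus the names it calls
tree = ast.parse(SOURCE)
rows = []
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        calls = [n.func.id for n in ast.walk(node)
                 if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)]
        rows.append((node.name, " ".join([node.name] + calls)))

# Index in SQLite FTS5 and retrieve by BM25 rank -- no embeddings involved
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE symbols USING fts5(name, tokens)")
db.executemany("INSERT INTO symbols VALUES (?, ?)", rows)
hits = db.execute(
    "SELECT name FROM symbols WHERE symbols MATCH ? ORDER BY rank", ("parse",)
).fetchall()
```

Lexical matching on distinctive identifiers works here precisely because code names are rare tokens: a query for `parse` surfaces both the definition and its caller through the stored call edge.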

r/MachineLearning · 1d ago · 7 · agent research fine tuning workflow open source

A software engineer documents their experience adapting Andrej Karpathy's autoresearch LLM-driven training framework to a small transit-industry dataset (33M tokens) on consumer hardware, exploring whether the autonomous experiment loop and single-scalar ratchet mechanism work at a scale six orders of magnitude smaller than the design target. The post details practical challenges (FlashAttention availability, architecture constraints, overfitting on small held-out sets) and offers methodology-validation insights for researchers applying agent-driven ML research loops to domain-specific, data-constrained scenarios.

DeepMind Blog · 1d ago · 6 · research agent benchmark

Google DeepMind announces an AI co-clinician research initiative exploring how AI agents can collaborate with physicians in clinical settings, building on prior work with Med-PaLM and AMIE. The research demonstrates improved performance on evidence-synthesis tasks using a NOHARM framework for evaluating medical AI safety, with physicians preferring the system's responses in blind evaluations.

r/MachineLearning · 1d ago · 7 · rag workflow research deployment

Technical deep-dive on the fundamental tension between vector database performance (ANN algorithms like HNSW/IVF) and privacy-preserving schemes such as partially homomorphic encryption (PHE), with practical architectural questions about hybrid approaches like metadata filtering, secure enclaves, and tiered search for million-scale embeddings. Covers real systems-engineering challenges in building production privacy-aware RAG/semantic-search infrastructure.
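One of the hybrid patterns the post raises — filter on cheap plaintext metadata first, then run exact similarity only over the surviving candidates — is easy to sketch. Illustrative only: in a real privacy-aware deployment the tier-2 scoring would run inside an enclave or over encrypted vectors, and the data here is random:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit-norm embeddings
tenant = rng.integers(0, 10, size=1000)                     # plaintext metadata tag

def tiered_search(query, tenant_id, k=5):
    """Tier 1: metadata filter narrows the candidate set.
    Tier 2: exact cosine scoring over only that subset, small enough
    that the expensive (or encrypted) path stays tractable."""
    idx = np.flatnonzero(tenant == tenant_id)
    q = query / np.linalg.norm(query)
    scores = vectors[idx] @ q
    return idx[np.argsort(-scores)[:k]]
```

The trade-off the post circles: the smaller tier 1 makes the candidate set, the less ANN indexing (and the less plaintext exposure) tier 2 needs — but the metadata itself then leaks access patterns.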

r/LocalLLaMA · 1d ago · 8 · new model tutorial prompt engineering

This article covers practical techniques for controlling and optimizing text generation with Qwen3, including parameter tuning, sampling strategies, and output steering methods that developers can apply to their AI applications.
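The core sampling math being tuned — temperature scaling followed by nucleus (top-p) truncation — is model-agnostic and fits in a few lines of NumPy. A generic sketch of the math, not a Qwen3-specific API:

```python
import numpy as np

def sample(logits, temperature=0.8, top_p=0.9, rng=None):
    """Temperature-scaled softmax, then keep the smallest set of tokens
    whose cumulative probability covers top_p, and sample from it."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))     # stable softmax
    probs /= probs.sum()
    order = np.argsort(-probs)                  # tokens by descending probability
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]   # nucleus
    p = probs[keep] / probs[keep].sum()         # renormalize over the nucleus
    return int(rng.choice(keep, p=p))
```

Lower temperature sharpens the distribution before truncation; lower top-p shrinks the nucleus — the two knobs interact, which is why tuning guides treat them together.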

Anthropic Research · 2d ago · 7 · benchmark research agent

Anthropic's research team released BioMysteryBench, a bioinformatics benchmark evaluating Claude's ability to analyze real-world biological datasets and tackle complex scientific workflows. The benchmark shows Claude's scientific reasoning improving across model generations, now performing on par with human experts in biology tasks that go beyond knowledge tests to include data analysis, hypothesis generation, and experimental design.

Latent Space · 2d ago · 6 · inference deployment research

Analysis of shifting compute infrastructure priorities as AI inference becomes central to production workloads—CPU demand is resurging due to agent systems, RL training, and code execution requirements alongside GPU-driven inference. While strategically important for understanding deployment infrastructure, this is primarily market/industry analysis rather than technical tooling or methodology directly applicable to daily AI engineering work.