Our Research
Active projects. Production-grounded, reproducible, openly published.
SynapseKit
A lightweight, production-focused LLM framework built from the ground up for speed and simplicity. 2 dependencies, 30× faster cold start than LangChain, async-native tool and retriever base classes. Designed for engineers who want batteries-included agent tooling without the middleware tax.
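For illustration only, a minimal sketch of what an async-native tool interface can look like; the names AsyncTool and WebSearchTool are placeholders here, not SynapseKit's actual API:

```python
import asyncio
from abc import ABC, abstractmethod


class AsyncTool(ABC):
    """Illustrative async tool interface: one awaitable entry point per tool."""

    name: str = "tool"

    @abstractmethod
    async def run(self, query: str) -> str:
        """Execute the tool and return its result as text."""


class WebSearchTool(AsyncTool):
    name = "web_search"

    async def run(self, query: str) -> str:
        # Placeholder body: a real tool would await an HTTP call here.
        await asyncio.sleep(0)
        return f"results for {query!r}"


async def main() -> None:
    # Tools sharing this interface can be awaited concurrently by an agent loop.
    results = await asyncio.gather(WebSearchTool().run("cold start benchmarks"))
    print(results)


asyncio.run(main())
```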
ChunkRank
A Python library for ranking and filtering document chunks by relevance before sending them to an LLM. Reduces token cost and improves answer quality in RAG pipelines by ensuring only the most informative chunks make it into the context window.
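The core idea, sketched below with a toy lexical scorer; the rank_chunks helper and its overlap heuristic are illustrative assumptions, not ChunkRank's actual scoring:

```python
def rank_chunks(chunks, query, max_chunks=3):
    """Score chunks by query-term overlap and keep only the top few."""
    query_terms = set(query.lower().split())

    def score(chunk):
        chunk_terms = set(chunk.lower().split())
        # Fraction of the chunk's terms that also appear in the query.
        return len(query_terms & chunk_terms) / (len(chunk_terms) or 1)

    return sorted(chunks, key=score, reverse=True)[:max_chunks]


chunks = [
    "The invoice is due on the last business day of the month.",
    "Our office dog is named Biscuit.",
    "Late invoices accrue a 2% fee per month past the due date.",
]
print(rank_chunks(chunks, "When is the invoice due?", max_chunks=2))
```

A production ranker would use embeddings or a cross-encoder rather than term overlap, but the shape is the same: score, sort, and admit only what fits the context budget.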
LLM Framework Showdown
A 30-notebook reproducible benchmark comparing SynapseKit, LangChain, and LlamaIndex across developer experience, RAG pipelines, agent capabilities, and production concerns. Every benchmark is runnable on Kaggle, and every result is reproducible. We measure cold start time, dependency count, memory footprint, API abstraction depth, streaming latency, and 12 more dimensions that matter when you're shipping to production, not just running a demo.
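As one example of the dimensions involved, cold start can be approximated by timing a fresh interpreter importing the framework; the sketch below shows that idea, not the notebooks' exact methodology:

```python
import subprocess
import sys
import time


def cold_start_seconds(module: str, runs: int = 5) -> float:
    """Average the wall-clock time for a fresh Python process to import a module."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


# Stand-in modules; in the benchmarks the frameworks under test go here.
for name in ("json", "asyncio"):
    print(name, f"{cold_start_seconds(name):.3f}s")
```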
Production Agent Reliability Study
A systematic study of failure modes in deployed LLM agents: uncontrolled loops, tool-call errors, context window exhaustion, and cost overruns. We're building a structured taxonomy of 40+ failure patterns observed in real production systems, along with concrete mitigation strategies for each. The dataset includes traces from multi-step agents, ReAct loops, and tool-augmented pipelines, documenting exactly where and why they break.
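Several of these patterns share a simple mitigation: hard caps on the agent loop. A minimal illustrative sketch, where the run_agent helper and its thresholds are placeholders rather than the study's tooling:

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent loop exceeds its step or cost budget."""


def run_agent(step_fn, max_steps=10, max_cost_usd=1.00):
    """Drive an agent loop with hard caps on steps and spend.

    step_fn returns (done, cost_usd) per iteration; caps like these guard
    against uncontrolled loops and cost overruns.
    """
    total_cost = 0.0
    for step in range(max_steps):
        done, cost = step_fn(step)
        total_cost += cost
        if total_cost > max_cost_usd:
            raise BudgetExceeded(f"cost cap hit after {step + 1} steps (${total_cost:.2f})")
        if done:
            return step + 1, total_cost
    raise BudgetExceeded(f"step cap hit ({max_steps} steps, ${total_cost:.2f})")


# Toy step function: finishes on the fourth iteration, each step costs $0.05.
steps, spent = run_agent(lambda i: (i == 3, 0.05))
print(steps, round(spent, 2))
```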
Open Evals Harness
A provider-agnostic evaluation suite for the agent capabilities of LLM frameworks, portable across OpenAI, Anthropic, Google, and open-source models. Standardized metrics covering accuracy, latency, cost-per-task, and reliability, with a public leaderboard anyone can submit to. Designed so teams can benchmark their own agents against published baselines using the same harness we use internally.
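A hedged sketch of how per-task results might roll up into those headline metrics; the TaskResult fields and summarize helper are illustrative, not the harness's actual schema:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    correct: bool
    latency_s: float
    cost_usd: float


def summarize(results):
    """Aggregate per-task results into accuracy, median latency, and cost per task."""
    latencies = sorted(r.latency_s for r in results)
    return {
        "accuracy": mean(r.correct for r in results),
        "p50_latency_s": latencies[len(latencies) // 2],
        "cost_per_task_usd": mean(r.cost_usd for r in results),
    }


results = [
    TaskResult(True, 1.2, 0.004),
    TaskResult(False, 2.8, 0.009),
    TaskResult(True, 1.5, 0.005),
]
print(summarize(results))
```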
