============================================================
nat.io // BLOG POST
============================================================
TITLE: Reimagining Grep: Searching Code by Meaning Instead of Text
DATE: March 6, 2026
AUTHOR: Nat Currier
TAGS: Developer Tools, AI Systems, Search, Software Engineering
------------------------------------------------------------

The fastest way to feel the gap between old and new retrieval is to watch a common debugging loop in real time. A developer is in a payment incident, a retry path is clearly failing, and they ask what should be a straightforward question: where is the payment retry logic?

In most repos, the first muscle-memory response is still lexical search. You run `grep` or `rg`, try one phrase, miss, try a variant, miss again, and then start manually opening files that feel likely. Ten minutes later, you might have the answer, but you paid for it with switching overhead and unnecessary reconstruction work.

That loop is not a developer failure and it is not a grep failure. It is a *primitive mismatch*. Grep is doing exactly what it was designed to do: match literal text quickly and precisely. The developer is doing exactly what modern software work demands: ask an intent-level question in natural language under time pressure. The mismatch appears because the query is conceptual while the retrieval primitive is lexical.

For years, teams treated that mismatch as unavoidable friction. It no longer is. We can now run local semantic retrieval cheaply enough on everyday hardware that concept-level discovery can become a first-class workflow, without shipping code to external services and without wrapping the repo in a chat interface. In this post, you will get a concrete redesign model, a reproducible, benchmarked demo path, and a practical adoption rule for real teams.
If you're a maintainer, staff engineer, or tooling owner, the scope is intentionally narrow: improve first-pass code discovery for intent questions without introducing cloud dependencies.

> **Thesis:** For concept-level code discovery, semantic retrieval should be a default local primitive, while lexical search remains a precision fallback.
> **Why now:** Local embedding inference is fast enough on CPU for modest repositories, and the security cost of cloud-first code search is increasingly hard to justify.
> **Who should care:** Engineers, staff-plus maintainers, and tooling teams who spend material time in code archaeology and incident recovery.
> **Bottom line:** Keep grep. Change the default discovery path for intent questions.
> **Series posture:** This is AI-enabling for a concrete retrieval purpose, not AI-everything product theater.

[ The Outdated Assumption ]
------------------------------------------------------------

Classic code search tooling encodes a model that used to be rational: the machine stores signals, and the human manually interprets them. That model matched a world where repositories were smaller, service boundaries were simpler, and compute constraints made richer retrieval expensive. In that world, asking humans to manually bridge the final semantic gap was acceptable. It was often the only practical option.

In modern repositories, that assumption creates a compounding tax. Services multiply. Terms diverge across teams. Equivalent concepts appear under different naming conventions. The exact string you need often does not exist, even when the logic clearly does. In other words, lexical precision is still valuable, but lexical-only discovery becomes a weak default for intent-heavy questions.

What changed is not human cognition. What changed is that we can now do the first semantic pass locally, deterministically, and quickly enough that the retrieval step itself can carry more of the interpretation burden.
[ The Constraint Contract ]
------------------------------------------------------------

This redesign only matters if it survives real operating constraints. The prototype therefore commits to local-first execution, secure-by-default data flow, CPU-viable inference, minimal dependencies, no cloud API requirement, and no full-repo prompt handoff pattern.

That constraint contract is the point. If the architecture only works when you add remote orchestration, broad prompt plumbing, or hidden services, the workflow claim is weak. If it works under local constraints, it becomes a practical systems argument.

[ What Actually Changes in the Interaction ]
------------------------------------------------------------

The interaction shift is subtle but consequential. With lexical-first retrieval, the developer must perform query translation manually: concept to token, token to file, file to intent. With semantic-first retrieval, the first translation step is delegated to local vector similarity, and the developer spends effort on verification rather than blind narrowing.

This is not "AI answers your codebase." It is a narrower, more controllable move: retrieve candidate chunks by meaning, then let the developer inspect exact snippets and line spans in terminal output. A concise comparison makes the shift concrete:

| Workflow Question | Lexical Default | Local Semantic Default |
| --- | --- | --- |
| "where is payment retry logic" | often no direct phrase match | payment recovery chunks rank near top |
| "where do we back off after gateway failures" | query depends on exact naming | conceptually related chunks appear even with wording drift |
| Verification step | manual file hunting first | inspect ranked snippets first |

[ Semantic Grep v1 Architecture ]
------------------------------------------------------------

The implementation in `semantic-grep` stays deliberately small and auditable. There are two CLI commands: `index` and `search`.
`index` walks files and writes a local index. `search` embeds the query locally and ranks chunks by cosine similarity.

The pipeline is deterministic where it should be deterministic. File walking uses stable ordering and excludes noise paths. Chunking uses fixed line windows with configured overlap (`chunk_lines`, `chunk_stride`). This makes ranking behavior easier to inspect across revisions because boundaries do not shift unpredictably between runs.

Embedding is local, using `all-MiniLM-L6-v2` through `fastembed`. This is not about maximizing model sophistication. It is about getting sufficient semantic signal with *predictable local performance* and low operational burden.

The index format is JSON in v1 for inspectability, even though it is not the most compact storage option. The ranking path is a linear scan over chunk vectors. For modest repository sizes, this is acceptable and keeps the implementation surface small. ANN acceleration and binary index formats are legitimate v2 paths, but they are not needed to validate the primitive change.

[ Implementation Decisions (v1) ]
------------------------------------------------------------

| Decision | Why | Tradeoff |
| --- | --- | --- |
| deterministic line chunking | stable, debuggable index boundaries | chunk edges do not always align to semantic units |
| local embedding model only | keeps source local and CPU-viable | lower ceiling than large hosted models |
| linear cosine ranking | minimal complexity and transparent behavior | scaling pressure on very large corpora |
| inspectable JSON index | easy artifact inspection and auditability | larger footprint than binary formats |

These were deliberate engineering decisions, not temporary omissions. They optimize for legibility and practical workflow gain before scale complexity.
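To make the deterministic chunking concrete, here is a minimal sketch of fixed-window line chunking. This is illustrative code, not the tool's actual source: the parameter names `chunk_lines` and `chunk_stride` follow the config keys above, but the function name and default window sizes are assumptions.

```python
def chunk_file(lines, chunk_lines=20, chunk_stride=10):
    """Split a file into overlapping fixed-size line windows.

    Deterministic: the same input always yields the same
    (start_line, end_line, text) boundaries, so index chunks
    stay stable across rebuilds.
    """
    chunks = []
    for start in range(0, max(len(lines), 1), chunk_stride):
        window = lines[start:start + chunk_lines]
        if not window:
            break
        # 1-based inclusive line spans, matching grep-style output
        chunks.append((start + 1, start + len(window), "\n".join(window)))
        if start + chunk_lines >= len(lines):
            break
    return chunks
```

With `chunk_lines=20` and `chunk_stride=10`, adjacent windows share ten lines of overlap, so logic that straddles a boundary still appears whole in at least one chunk.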
[ Reproducible Behavior from the Demo Repo ]
------------------------------------------------------------

The companion repository (`semantic-grep-demo`) exists to make the claim falsifiable. It includes a compact fixture corpus, a script that compares text and semantic retrieval, and expected behavior notes.

Canonical query:

```text
where is the payment retry logic
```

Observed lexical step output:

```text
No direct text hits for: 'where is the payment retry logic'
```

Observed semantic top results from the same run:

```text
1. score=0.3371 src/payments/recovery_plan.ts:11-23
2. score=0.3309 src/payments/charge_worker.ts:11-24
3. score=0.2888 src/payments/charge_worker.ts:1-20
4. score=0.2643 src/payments/recovery_plan.ts:1-20
5. score=0.2229 src/notifications/email_delivery.ts:1-14
```

That ranking matters because the fixture deliberately includes a confounder (`email_delivery.ts`) that contains retry semantics outside payment processing. A weak setup might return any retry-related file. A stronger setup should prioritize payment-specific recovery paths for this query intent, which is what the current output does.

At this point, the useful question is no longer "does semantic retrieval ever work?" The useful question is "does it work reliably enough under strict local constraints to justify changing default workflow behavior?" The demo is designed to answer that narrower question. It uses a small but nontrivial corpus, keeps the query natural-language rather than hand-tuned token strings, and exposes scores plus file spans so ranking quality can be inspected directly.

One practical detail worth highlighting is *debuggability under disagreement*. When a developer says the ranking is wrong, the system gives concrete artifacts to inspect: chunk boundaries, indexed text, vector score ordering, and final terminal output. That makes the tool behavior legible in engineering terms.
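The score ordering itself is the plain linear cosine scan described in the architecture section. A minimal sketch of that scan, assuming each index entry stores a `path` and an embedding `vector` (these field names are illustrative, not the v1 index schema):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_chunks(query_vector, index_entries, top_k=5):
    """Linear scan: score every chunk, sort descending, keep top_k."""
    scored = [(cosine_similarity(query_vector, entry["vector"]), entry)
              for entry in index_entries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Nothing here is approximate or hidden: every chunk is scored, so a disputed ranking can be replayed by recomputing a single dot product.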
You do not need to speculate about hidden retrieval chains or remote prompts to explain why a file moved up or down.

[ Performance Shape and Practicality ]
------------------------------------------------------------

On the demo corpus, local timings from `scripts/benchmark.sh` are in the following range on this machine:

```text
index build: ~0.18s
query run: ~0.06-0.07s
```

These are not portability claims for every machine and repository size. They are *operational shape* evidence. The prototype feels like a tool, not an asynchronous experiment, which is a prerequisite for workflow adoption.

[ Security and Governance Implications ]
------------------------------------------------------------

The security argument is straightforward. Source code stays local during indexing and retrieval. There is no default remote prompt stream. There is no requirement for external telemetry to make the feature work. For teams with strict boundary constraints, that immediately changes adoption risk.

This is also governance-friendly because the system is inspectable. You can open the index artifact, inspect chunk boundaries, and reason about why a result appeared. That transparency is hard to preserve in black-box remote architectures.

The governance value is especially visible in mixed-regulation environments where repositories contain payment logic, customer data handling paths, or jurisdiction-specific policy code. A local retrieval primitive does not eliminate compliance work, but it materially reduces the blast radius of everyday developer queries by default.

[ Interactive Behavior Surface ]
------------------------------------------------------------

The simulator below mirrors the canonical demo behavior for the baseline query, including the text-mode no-hit condition and the semantic ranking order.

[Interactive component: Semantic Search Mode Simulator. This section is rendered visually on the full page.]

The component is intentionally local and fixture-driven.
It is there to explain interaction dynamics, not to hide retrieval complexity behind UI polish.

[ What v1 Does Not Attempt ]
------------------------------------------------------------

This prototype does not attempt to be production-complete search infrastructure. It does not solve very large-corpus ANN tuning, language-specific reranking, or deep repository reasoning. Those are later concerns. The claim in v1 is narrower and more important: if the question is conceptual, local semantic retrieval materially improves first-pass discovery under strict constraints.

So far, the most common misuse pattern is trying to evaluate this architecture as if it were a full assistant product. That framing usually leads to two bad conclusions: either "the model is too small" or "it should just answer everything." Both miss the point. The objective is not broad narrative generation; the objective is retrieval primitive quality in terminal workflows. When that objective is explicit, the architecture and tradeoffs become coherent.

[ Where This Goes Next ]
------------------------------------------------------------

A disciplined v2 should preserve the constraint contract while improving scale and specificity. The highest-leverage next steps are binary index loading, optional ANN paths for larger corpora, and a hybrid lexical prefilter plus semantic rerank. None of those steps change the central design correction. They only improve throughput and relevance within the same local-first model.

Next, we can move this from prototype validation to team adoption protocol. That means defining where semantic search is the default (incident triage, unfamiliar modules, dependency archaeology), where lexical search remains preferred (known API signatures, exact constant names, migration sweeps), and how to measure whether discovery latency actually drops over a month of real usage. Without that measurement loop, even good architecture can collapse back into habit-driven workflows.
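The hybrid prefilter-plus-rerank idea is simple enough to sketch. Assuming illustrative index entries that each carry a `text` field and an embedding `vector`, plus some local `embed` function, a v2 pass might look like this; none of these names are the tool's actual API:

```python
import math
import re

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def hybrid_search(query, keyword, index_entries, embed, top_k=5):
    """Lexical prefilter, then semantic rerank of the survivors.

    The prefilter cheaply shrinks the candidate set; on a lexical
    miss it falls back to a full semantic scan, so recall is never
    worse than semantic-only mode.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    candidates = [e for e in index_entries if pattern.search(e["text"])]
    if not candidates:
        candidates = index_entries  # lexical miss: degrade gracefully
    query_vector = embed(query)
    ranked = sorted(candidates,
                    key=lambda e: cosine_similarity(query_vector, e["vector"]),
                    reverse=True)
    return ranked[:top_k]
```

The design point is the fallback branch: hybrid mode should only ever narrow the scan, never silently hide conceptually relevant chunks behind a missed keyword.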
[ Adoption Checklist for Engineering Leads ]
------------------------------------------------------------

If a team wants to adopt this safely, the first step is not tooling rollout. The first step is query taxonomy. Separate retrieval tasks into three groups: exact-token lookups, concept lookups, and mixed lookups where both are useful. This avoids "one mode for everything" policy mistakes and keeps lexical precision visible where it is still optimal.

Then define one narrow success metric for each group. For concept lookups, track time-to-first-relevant-file and the number of query reformulations before success. For exact-token lookups, track whether semantic mode creates unnecessary noise compared with direct grep. For mixed lookups, track whether a hybrid workflow improves confidence and speed together rather than one at the expense of the other.

The operational rule should remain explicit: semantic retrieval is for discovery, not final truth. Engineers should still verify snippets, inspect surrounding code, and reason about behavior in context. This is not bureaucratic caution; it is what keeps a retrieval primitive aligned with engineering accountability.

Here is a practical rollout sequence that has worked well in similar internal tooling changes. Start with a small pilot group during normal maintenance work, not only incident response. Collect low-friction feedback on false positives and missing high-value matches. Iterate chunking and file filters before adding scale infrastructure. Only after ranking behavior is stable should you expand usage and document expected query patterns for the broader team.

This sequence matters because retrieval trust is fragile. If first impressions feel noisy or inconsistent, teams revert to manual search habits and rarely return. If the first week consistently saves time on real tickets, the new primitive becomes sticky quickly.
[ Why Grep Is Still Excellent ]
------------------------------------------------------------

None of this argument works if we pretend grep is obsolete. It is not. Grep remains one of the highest-leverage tools in software engineering because lexical precision solves a large class of problems better than semantic approximation can.

If you already know the token, grep is almost always the right first move. API names, exact error strings, migration markers, feature flags, work-item anchors, configuration keys, protocol constants, and known telemetry identifiers are all lexical tasks. In those cases, semantic retrieval is unnecessary overhead and can even reduce confidence because it introduces score ordering where deterministic matching already gives certainty.

Grep is also unbeatable for codebase-wide mechanical operations. If you are preparing a rename sweep, auditing references to a known deprecated symbol, or proving that a specific string does not exist, lexical search gives crisp answers with minimal ambiguity. Semantic ranking is not designed for that job and should not replace it.

One useful way to frame this is: grep is a precision instrument, semantic retrieval is a discovery instrument. Precision and discovery are both essential, but they should not be confused. Teams that try to make one primitive do both jobs usually end up with weaker workflows.

This is why the redesign is about default *sequence*, not tool replacement. For intent questions, semantic first and lexical second is often optimal. For token questions, lexical first and semantic optional is often optimal. Keeping that split explicit protects the strengths of both modes.

[ Edge Cases Where Semantic Retrieval Can Mislead ]
------------------------------------------------------------

Semantic retrieval is not magic. It fails in specific, predictable ways, and those failure modes need to be operationalized rather than ignored.

The first edge case is vocabulary drift with low-signal queries.
A query like "where is retry" can legitimately map to many unrelated domains: payments, email delivery, queue workers, and client reconnect logic. If the query is broad, the ranking surface will be broad. The mitigation is not bigger models first; it is better query framing and domain hints, such as "payment retry" instead of "retry."

The second edge case is ambiguous architecture language. Some repositories encode behavior through abstractions that are conceptually distant from user language. For example, a team may describe a concept as "resilience policy" in code while operators call it "retry logic." Semantic embeddings often bridge this gap better than lexical search, but they are still limited by training priors and context granularity. If chunk windows are too small or too large, relevant cues can be lost.

The third edge case is generated or boilerplate-heavy code. Large generated files can pollute index surfaces with repetitive patterns that appear semantically similar to many queries. Without file filters or weighting, these chunks may outrank true business logic. Production-ready rollouts should treat generated paths, lockfiles, and vendored code as first-class exclusion candidates in indexing policy.

The fourth edge case is stale indexing. If the index is old relative to working tree changes, ranking quality and trust collapse quickly. This is an engineering operations issue, not a model issue. A semantic search tool needs an explicit freshness contract: when to rebuild, how to detect staleness, and how to warn users when index state lags repository state.

The fifth edge case is over-trusting similarity score values as absolute truth. Cosine scores are ordering signals, not semantic guarantees. A 0.33 score can be very useful in one corpus and weak in another. Teams should treat score thresholds as configurable heuristics tied to observed precision and recall, not universal constants copied from demos.
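A freshness contract like the one the fourth edge case calls for can start very small. The sketch below compares file modification times against the index artifact's own mtime; the function name and exclusion set are illustrative assumptions, not the tool's actual behavior:

```python
import os

def index_is_stale(index_path, repo_root,
                   excluded=frozenset({".git", "node_modules"})):
    """Return True if any walkable file is newer than the index.

    A coarse mtime comparison is enough for a warn-level check;
    content hashing is a later refinement if mtimes prove noisy.
    """
    built_at = os.path.getmtime(index_path)
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # prune excluded directories in place so os.walk skips them
        dirnames[:] = [d for d in dirnames if d not in excluded]
        for name in filenames:
            if os.path.getmtime(os.path.join(dirpath, name)) > built_at:
                return True
    return False
```

A `search` command can run this check before ranking and print a one-line warning, which keeps trust intact without forcing eager rebuilds on every query.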
These edge cases are not arguments against the approach. They are the conditions under which the approach must be engineered carefully. Good defaults, explicit exclusions, index freshness checks, and mode-selection guidance address most of them without introducing architectural bloat.

> **Operational guardrail:** Treat semantic rank as candidate ordering, then validate with exact code reading before acting.

[ Counterarguments Worth Taking Seriously ]
------------------------------------------------------------

A serious redesign should include strong counterarguments, especially from engineers who already get high value from grep.

One counterargument is "good engineers should just search better." There is truth here: experienced maintainers can often reformulate lexical queries quickly. But the systems question is not whether experts can compensate. The systems question is whether the default interaction imposes avoidable cognitive translation on every engineer, especially under incident pressure and team turnover. If a local retrieval primitive can reduce that burden while preserving inspectability, the improvement is still valid.

Another counterargument is "semantic search introduces fuzziness we do not want." Also true in part. Fuzziness is harmful in precision tasks and useful in discovery tasks. This is why taxonomy matters. Semantic retrieval should not become mandatory for all search behavior. It should become the default for concept lookup where lexical misses are common and costly.

A third counterargument is operational complexity: "now we need indexing jobs, model assets, and relevance tuning." This is the strongest practical objection. If rollout demands heavy orchestration and constant babysitting, the gain can evaporate. That is exactly why the v1 architecture constrains scope: local index files, deterministic chunking, CPU-viable embeddings, linear ranking for modest corpora, and explicit non-goals for large-scale optimization.
A fourth counterargument is security skepticism: "any AI layer near source code is a policy risk." This concern is often reasonable because many tools route code through remote pipelines by default. In this design, retrieval and embedding are local, data egress is zero by default, and artifacts are inspectable. The argument is not "ignore risk"; the argument is that local architecture materially changes the risk profile compared with cloud-default alternatives.

A fifth counterargument is that better naming and better documentation would reduce the problem without new tooling. That is correct and should still be pursued. But naming quality and documentation completeness are uneven in real repositories, and even excellent teams face historical drift over time. A retrieval upgrade and better documentation are complementary interventions, not mutually exclusive choices.

The right conclusion from these counterarguments is not "semantic retrieval always wins." The right conclusion is that retrieval mode should align with task type, and teams should instrument that choice rather than relying on habit.

[ A Concrete Mode-Selection Policy ]
------------------------------------------------------------

If you are introducing this in a real engineering organization, convert philosophy into policy quickly. A lightweight decision matrix prevents confusion and keeps adoption grounded.

In practice, lexical-first should remain the default for exact-token work: known symbol lookup, deterministic migration sweeps, and hard existence checks where binary certainty matters more than discovery breadth. Semantic-first should become the default for concept questions: incident phrasing, unfamiliar module discovery, and cross-service language drift where exact wording likely differs from implementation vocabulary.
Hybrid mode is usually best for high-risk edits: use semantic retrieval to find likely implementation zones quickly, then run lexical checks inside those files for exact symbol confirmation before changing behavior.

This policy should be documented in team runbooks with two or three examples from your own codebase. Local examples matter more than generic advice because they align mode selection with real architecture patterns and naming conventions.

> **Adoption rule:** Do not ask engineers to "pick whatever feels right." Define mode defaults per task type in writing.

[ Measuring Whether the Change Is Actually Better ]
------------------------------------------------------------

Teams should resist shipping this as a vibes-driven productivity project. Treat it like any other workflow change and attach measurable outcomes.

For the first month, focus on four metrics: median time-to-first-relevant-file for intent queries, average reformulation count before first relevant hit, the proportion of intent queries that still require manual file-hunting, and the maintainer trust trend from real incident and review usage. Pair these with failure logging that captures query text, top results, and user disposition (useful/not useful). This gives you a feedback loop for chunking policy, indexing filters, and weighting adjustments without guessing.

One practical warning: avoid over-optimizing to a tiny benchmark set. It is easy to produce great-looking demo rankings and poor production behavior if fixtures are too narrow. Include at least one noisy domain, one misleading confounder domain, and one intentionally ambiguous query class in internal evaluation datasets.

A second warning is organizational rather than technical: do not classify success only by perceived speed. A faster query loop that increases false confidence is a regression. Track whether engineers can explain *why* a returned file is relevant, and whether post-change review quality improves alongside retrieval speed.
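Failure logging of this kind needs almost no infrastructure. Here is a sketch of an append-only JSONL log, with field names chosen for illustration rather than taken from the tool:

```python
import json
import time

def log_query_outcome(log_path, query, ranked_results, disposition):
    """Append one retrieval outcome as a JSON line for later review.

    disposition is the user's verdict, e.g. "useful" or "not_useful";
    ranked_results is a list of (score, path) pairs from the search.
    """
    record = {
        "ts": time.time(),
        "query": query,
        "top_results": [{"score": s, "path": p} for s, p in ranked_results[:5]],
        "disposition": disposition,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

A weekly pass over this file is enough to answer the metric questions above (reformulation counts, not-useful rates per domain) without any service dependency.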
> **Evaluation principle:** A better retrieval primitive reduces both search time and interpretation error rate.

The goal is durable workflow utility, not benchmark theater. A semantic search layer that saves time only on curated examples will not survive real team adoption. To make that concrete, teams should also maintain a short anti-pattern checklist in rollout docs so drift is caught early:

- do not treat top-1 rank as authority without snippet verification
- do not index generated or vendored paths without explicit reason
- do not ship semantic-first defaults without freshness checks and rebuild policy

None of these are complicated, but they protect trust. Most failed internal search rollouts do not fail because the vector math is wrong. They fail because workflow safeguards were implicit instead of explicit. A small checklist turns those safeguards into operating behavior and keeps the tool aligned with engineering standards.

There is one more practical consideration: cost envelope clarity. Even local tools have operational cost in index rebuild time, disk footprint, and occasional relevance tuning. Teams should make that cost explicit during adoption so nobody confuses "local" with "free." In practice, the cost is usually modest compared with repeated manual archaeology time, but it still deserves measurement.

A useful framing for engineering managers is to treat semantic retrieval as a small internal platform capability, not a one-off script. Define ownership, define success metrics, define failure escalation, and keep the implementation inspectable. When this is done, the tool tends to remain stable and useful instead of decaying into unowned prototype debt.

This is also where the argument becomes strategically durable. The benefit is not that one query returns a nicer ranking. The benefit is that teams can ask concept-level questions under pressure and get faster, more reliable first-pass discovery without expanding code exposure boundaries.
That combination of speed, control, and inspectability is why this redesign is worth standardizing.

[ Repositories ]
------------------------------------------------------------

Core tool: [github.com/ncurrier/semantic-grep](https://github.com/ncurrier/semantic-grep)

Demo corpus: [github.com/ncurrier/semantic-grep-demo](https://github.com/ncurrier/semantic-grep-demo)

[ Bottom Line ]
------------------------------------------------------------

Grep remains essential, but lexical-only retrieval should no longer be the discovery default for intent questions. The practical baseline today is local semantic retrieval first, lexical precision second, and human verification over explicit ranked snippets.

That is not feature theater. It is a small but meaningful systems redesign.