Show HN: Codebased, an AI Search Engine for Code

codebased.sh

16 points by maxconradt 2 days ago

Codebased combines Tree-sitter for code awareness (it finds functions, data structures, constants, etc., not just lines of code), full-text search using SQLite, and semantic search using OpenAI embeddings + FAISS. Despite being implemented in Python, supporting semantic search, and making multiple API calls for embedding and re-ranking, it is faster than ripgrep for running searches against the Linux kernel (~1 second vs. ~2 seconds; this obviously depends on system, temperature, time of day, tidal forces, etc.).

Up next:

- A Perplexity-like agent for interpreting results, making multiple follow-up searches, etc.
- A custom embedding and re-ranking stack
- An agent for running shell commands, editing code, etc., similar to SWE-agent: https://arxiv.org/pdf/2405.15793
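To make the full-text half of the stack concrete, here is a minimal sketch of keyword search over indexed code blocks using SQLite's built-in FTS5 extension with BM25 ranking. The table name, columns, and sample rows are my own illustration, not Codebased's actual schema:

```python
import sqlite3

# In-memory database for illustration; a real index would be persisted on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE blocks USING fts5(name, kind, body)")
conn.executemany(
    "INSERT INTO blocks VALUES (?, ?, ?)",
    [
        ("amd_pci_dev_to_node_id", "function",
         "static int amd_pci_dev_to_node_id(struct pci_dev *pdev) { ... }"),
        ("node_to_amd_nb", "function",
         "struct amd_northbridge *node_to_amd_nb(int node) { ... }"),
    ],
)

# bm25() is FTS5's built-in relevance function; lower scores mean better matches,
# so ordering ascending returns the most relevant block first.
rows = conn.execute(
    "SELECT name, bm25(blocks) FROM blocks "
    "WHERE blocks MATCH ? ORDER BY bm25(blocks)",
    ("amd_pci_dev_to_node_id",),
).fetchall()
print(rows[0][0])  # "amd_pci_dev_to_node_id"
```

Indexing whole code blocks (extracted with Tree-sitter) rather than raw lines is what lets a search return a complete function definition instead of a single matching line.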

tarasglek a day ago

It would be helpful to document your architecture, e.g. what kind of text search you use (trigrams?), and some sort of benchmark for searching with/without the expensive embeddings.

skeptrune 2 days ago

Really surprised that it's faster than ripgrep given it's using OpenAI embeddings. Every API call tacks on ~300ms right off the jump.

  • maxconradt 2 days ago

    This timing is for making queries after the index is created, so it’s only two API calls for embedding / re-ranking. Overall, during testing Codebased took ~1 second to search the Linux kernel and ripgrep took ~2 seconds because it’s expensive to read gigabytes of text from disk. It’s slower for small projects, but I’m working on that.

    • burntsushi 2 days ago

      Can you share the 2 second benchmark of the Linux kernel for ripgrep? On my workstation, ripgrep can search the Linux repo (using default settings) in under 100ms.

      ripgrep generally shouldn't be reading from disk. If you're benchmarking Codebased after an initial indexing step, then ripgrep should be benchmarked on a warm cache IMO.

      • maxconradt 2 days ago

        That's definitely fair. Running the command `rg "amd_pci_dev_to_node_id"` with default settings on the Linux kernel source tree 5 times takes: 2.48s, 1.39s, 0.816s, 0.448s, 0.438s. I'm using an M2 Max Macbook Pro.
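(The pattern in those numbers, a slow first run followed by faster repeats, is the OS page cache warming up. A quick sketch for reproducing it yourself; the command shown is just a placeholder, and this is not how either tool benchmarks itself:)

```python
import subprocess
import time

def time_command(cmd, runs=5):
    """Run `cmd` repeatedly and return wall-clock durations in seconds.

    The first run typically pays the cost of a cold page cache (reading
    files from disk); later runs are served from memory, so they're faster.
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations

# e.g. time_command(["rg", "amd_pci_dev_to_node_id"]) inside a kernel checkout
```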

        The first time is consistent with what I reported above (I only ran the command once), but the subsequent runs are definitely faster than what I had reported and the last two runs are definitely faster than Codebased. Sorry, that's my bad. With semantic search + re-ranking disabled, that query runs in 102ms on Codebased, but that's not the default.

        Thanks for making ripgrep!

        • burntsushi 2 days ago

          I get 0.371s on my M2 mac mini, which is right in the neighborhood of that, so that makes sense. Now on my i9-12900K, I get 0.096s.

          Thanks for checking!

robertnowell 2 days ago

what’s an example of a code search problem that vscode’s native search fails to solve?

  • maxconradt 2 days ago

    Generally, the problems with the kind of search VSCode does are:

    1. It returns zero results because you made a typo or added a term that doesn't exist in any document. Codebased uses semantic search, so it's better at this, but if you type in something random, e.g. "argle bargle", it won't return results.

    2. It returns too many results, and they're not guaranteed to be in order of relevance. This can happen if you type in a commonly used function or class name. Codebased ranks code blocks using BM25 and L2 distance before sending them to the re-ranker for even better results.
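(For anyone curious what the distance half of that ranking looks like, here is a toy sketch of scoring candidates by L2 distance to a query embedding. The tiny 3-dimensional vectors and names are made up for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions:)

```python
import math

def l2_distance(a, b):
    # Euclidean (L2) distance between two embedding vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_candidates(query_vec, candidates):
    """Sort (name, vector) candidates by L2 distance to the query, closest first."""
    return sorted(candidates, key=lambda c: l2_distance(query_vec, c[1]))

# Toy 3-d "embeddings"; semantically similar code lands near the query.
query = [1.0, 0.0, 0.0]
candidates = [
    ("parse_config", [0.9, 0.1, 0.0]),  # close to the query vector
    ("render_html", [0.0, 1.0, 0.0]),   # unrelated
]
best = rank_candidates(query, candidates)[0][0]
print(best)  # "parse_config"
```

The closest candidates from a stage like this (fused with BM25 scores from full-text search) would then go to a re-ranker for the final ordering.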