anandtopu/vector-databases

Vector Databases, From Scratch

A four-part blog series on vector databases, embeddings, ChromaDB, and Retrieval-Augmented Generation (RAG). Written while working through Real Python's "Vector Databases and Embeddings With ChromaDB" course, with extra detail on the math, model selection, and the production gotchas that the course glides past.

The series

  1. Part 1: Why Vectors, and What "Similar" Actually Means. What an embedding is, the geometry of cosine similarity vs. dot product vs. L2 distance, and why all three rank identically once your vectors are unit-normalized.

  2. Part 2: From Words to Sentences (and How to Pick a Model). Why averaging word vectors fails, what sentence-transformers does instead, how to choose between all-MiniLM-L6-v2 and the asymmetric multi-qa-* family, and a practical take on chunking.

  3. Part 3: ChromaDB, Hands-On. End-to-end walkthrough of the Edmunds car-reviews demo: Polars ETL, building a persistent Chroma collection, batched inserts, and metadata-filtered semantic search. The HNSW index and its three tuning knobs.

  4. Part 4: RAG, the Landscape, and Production Hygiene. The smallest useful RAG loop with Gemini, the decisions that actually move quality (chunking, k, rerankers, hybrid retrieval, eval), an opinionated tour of ChromaDB vs. Pinecone vs. Weaviate vs. Qdrant vs. pgvector vs. Milvus, and a consolidated anti-patterns table.
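As a quick illustration of the Part 1 claim that cosine similarity, dot product, and L2 distance produce the same ranking on unit-normalized vectors, here is a minimal standard-library sketch. The three-dimensional vectors and the names `query` and `docs` are made up for the example and are not taken from the posts.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors (illustrative only), all normalized to unit length.
query = normalize([1.0, 2.0, 0.5])
docs = {
    "a": normalize([1.0, 2.1, 0.4]),
    "b": normalize([-1.0, 0.5, 3.0]),
    "c": normalize([2.0, 0.1, 0.1]),
}

# Higher is better for cosine and dot product; lower is better for L2.
by_cosine = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
by_dot = sorted(docs, key=lambda k: dot(query, docs[k]), reverse=True)
by_l2 = sorted(docs, key=lambda k: l2(query, docs[k]))

print(by_cosine, by_dot, by_l2)
```

The rankings agree because, for unit vectors, the squared L2 distance equals 2 - 2 * (a · b), so sorting by ascending distance is the same as sorting by descending dot product, and the dot product of unit vectors is the cosine.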

Code

The project/ directory contains the two helper modules referenced in Part 3:

  • project/car_data_etl.py: Polars-based ETL that derives Vehicle_Year and Vehicle_Model from the raw CSV and reshapes the data into the (ids, documents, metadatas) shape that ChromaDB expects.
  • project/chroma_utils.py: build_chroma_collection helper wrapping PersistentClient, SentenceTransformerEmbeddingFunction, and batched collection.add(...) calls.
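To make the (ids, documents, metadatas) shape and the batched `collection.add(...)` calls concrete, here is a small sketch. It is not the code in project/; the `Review` column name, the `review{i}` id scheme, the sample rows, and the helper names `to_chroma_records` and `add_in_batches` are assumptions made for illustration. The only ChromaDB call it relies on is `collection.add` with the `ids`, `documents`, and `metadatas` keyword arguments.

```python
def to_chroma_records(rows, text_key="Review"):
    """Split row dicts into parallel ids / documents / metadatas lists.

    `text_key` is a hypothetical column name for the review text.
    """
    ids, documents, metadatas = [], [], []
    for i, row in enumerate(rows):
        ids.append(f"review{i}")  # hypothetical id scheme
        documents.append(row[text_key])
        # Everything except the text becomes filterable metadata.
        metadatas.append({k: v for k, v in row.items() if k != text_key})
    return ids, documents, metadatas


def add_in_batches(collection, ids, documents, metadatas, batch_size=1000):
    """Call collection.add once per slice instead of once for everything."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            metadatas=metadatas[start:end],
        )


# Made-up rows standing in for the output of the ETL step.
rows = [
    {"Review": "Quiet cabin, smooth ride.", "Vehicle_Year": 2015, "Vehicle_Model": "Accord"},
    {"Review": "Transmission failed early.", "Vehicle_Year": 2012, "Vehicle_Model": "Focus"},
    {"Review": "Great mileage for the price.", "Vehicle_Year": 2017, "Vehicle_Model": "Prius"},
]
ids, documents, metadatas = to_chroma_records(rows)
```

With a real collection you would pass the object returned by the client in place of `collection`. The three lists must stay the same length and in the same order, since position `i` in each list describes the same record.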

The Edmunds car-reviews dataset is not checked in. Download it separately and point DATA_PATH at the extracted CSVs.

Dependencies

The code in this repo and the snippets in the posts assume the following Python packages:

  • chromadb
  • sentence-transformers
  • polars
  • more-itertools
  • google-genai (only for the RAG loop in Part 4)

Install with:

pip install chromadb sentence-transformers polars more-itertools google-genai

Author

anand (anandtopu). Feedback and corrections welcome via GitHub Issues.
