A four-part blog series on vector databases, embeddings, ChromaDB, and Retrieval-Augmented Generation (RAG). Written while working through Real Python's "Vector Databases and Embeddings With ChromaDB" course, with extra detail on the math, model selection, and the production gotchas that the course glides past.
- Part 1: Why Vectors, and What "Similar" Actually Means. What an embedding is, the geometry of cosine similarity vs. dot product vs. L2 distance, and why all three rank identically once your vectors are unit-normalized.
- Part 2: From Words to Sentences (and How to Pick a Model). Why averaging word vectors fails, what sentence-transformers does instead, how to choose between `all-MiniLM-L6-v2` and the asymmetric `multi-qa-*` family, and a practical take on chunking.
- Part 3: ChromaDB, Hands-On. End-to-end walkthrough of the Edmunds car-reviews demo: Polars ETL, building a persistent Chroma collection, batched inserts, and metadata-filtered semantic search. The HNSW index and its three tuning knobs.
- Part 4: RAG, the Landscape, and Production Hygiene. The smallest useful RAG loop with Gemini, the decisions that actually move quality (chunking, k, rerankers, hybrid retrieval, eval), an opinionated tour of ChromaDB vs. Pinecone vs. Weaviate vs. Qdrant vs. pgvector vs. Milvus, and a consolidated anti-patterns table.

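As a quick illustration of the Part 1 claim that cosine similarity, dot product, and L2 distance rank identically on unit-normalized vectors, here is a minimal pure-Python sketch. The vectors are made-up toy values, not real embeddings, and the helper names are illustrative only. The reason the rankings agree is that for unit vectors the squared L2 distance equals 2 - 2 * (a . b), so a larger dot product always means a smaller distance.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors (assumed values for illustration, not real embeddings).
query = normalize([1.0, 2.0, 0.5])
docs = [
    normalize(v)
    for v in ([2.0, 4.1, 1.0], [-1.0, 0.3, 2.0], [0.5, 0.1, 3.0], [3.0, 1.0, 0.0])
]

indices = range(len(docs))
# Similarities sort descending, distances sort ascending.
by_cosine = sorted(indices, key=lambda i: -cosine(query, docs[i]))
by_dot = sorted(indices, key=lambda i: -dot(query, docs[i]))
by_l2 = sorted(indices, key=lambda i: l2(query, docs[i]))

# All three orderings agree once the vectors are unit-normalized.
assert by_cosine == by_dot == by_l2
print(by_cosine)
```

Without the `normalize` step the three metrics can disagree, because dot product and L2 distance are then also sensitive to vector length.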
The project/ directory contains the two helper modules referenced in Part 3:
- `project/car_data_etl.py`: Polars-based ETL that derives `Vehicle_Year` and `Vehicle_Model` from the raw CSV and reshapes the data into the `(ids, documents, metadatas)` shape that ChromaDB expects.
- `project/chroma_utils.py`: `build_chroma_collection` helper wrapping `PersistentClient`, `SentenceTransformerEmbeddingFunction`, and batched `collection.add(...)` calls.
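To make the `(ids, documents, metadatas)` shape and the batched-insert idea concrete, here is a rough sketch. It is not the code in `project/`. The `Review` column name, the `review{i}` ID scheme, the batch size, and the `RecordingCollection` stand-in are all assumptions made for illustration. The stdlib `islice` helper stands in for whichever batching utility the real helper uses. The only ChromaDB call assumed is `collection.add(ids=..., documents=..., metadatas=...)`.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items (stdlib stand-in)."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def to_chroma_shape(rows, text_key="Review"):
    """Split row dicts into parallel ids, documents, and metadatas lists.

    `text_key` is a hypothetical column name for the review text.
    """
    ids = [f"review{i}" for i in range(len(rows))]
    documents = [row[text_key] for row in rows]
    metadatas = [{k: v for k, v in row.items() if k != text_key} for row in rows]
    return ids, documents, metadatas


def add_in_batches(collection, ids, documents, metadatas, batch_size=500):
    """Call collection.add(...) once per batch rather than once overall."""
    for chunk in batched(list(zip(ids, documents, metadatas)), batch_size):
        chunk_ids, chunk_docs, chunk_metas = zip(*chunk)
        collection.add(
            ids=list(chunk_ids),
            documents=list(chunk_docs),
            metadatas=list(chunk_metas),
        )


class RecordingCollection:
    """Hypothetical stand-in for a Chroma collection; records batch sizes."""

    def __init__(self):
        self.batch_sizes = []

    def add(self, ids, documents, metadatas):
        self.batch_sizes.append(len(ids))


# Made-up rows; the real data comes from the Edmunds CSVs.
rows = [
    {"Review": "Great handling.", "Vehicle_Year": 2015, "Vehicle_Model": "A"},
    {"Review": "Noisy cabin.", "Vehicle_Year": 2012, "Vehicle_Model": "B"},
    {"Review": "Reliable so far.", "Vehicle_Year": 2018, "Vehicle_Model": "C"},
]
ids, documents, metadatas = to_chroma_shape(rows)
fake = RecordingCollection()
add_in_batches(fake, ids, documents, metadatas, batch_size=2)
print(fake.batch_sizes)
```

With a real collection, batching keeps each `add` call small, which matters because embedding and inserting a large corpus in a single call can be slow or exceed the client's batch limits.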
The Edmunds car-reviews dataset is not checked in. Download it separately and point DATA_PATH at the extracted CSVs.
The code in this repo and the snippets in the posts assume the following Python packages:
- `chromadb`
- `sentence-transformers`
- `polars`
- `more-itertools`
- `google-genai` (only for the RAG loop in Part 4)
Install with:
    pip install chromadb sentence-transformers polars more-itertools google-genai

anand (anandtopu). Feedback and corrections welcome via GitHub Issues.
