Sunbelt Computer Software

Vector Databases, From Scratch

A four-part blog series on vector databases, embeddings, ChromaDB, and Retrieval-Augmented Generation (RAG). Written while working through Real Python's "Vector Databases and Embeddings With ChromaDB" course, with extra detail on the math, model selection, and the production gotchas that the course glides past.

The series

Part 1: Why Vectors, and What "Similar" Actually Means. What an embedding is, the geometry of cosine similarity vs. dot product vs. L2 distance, and why all three rank identically once your vectors are unit-normalized.
Part 2: From Words to Sentences (and How to Pick a Model). Why averaging word vectors fails, what sentence-transformers does instead, how to choose between all-MiniLM-L6-v2 and the asymmetric multi-qa-* family, and a practical take on chunking.
Part 3: ChromaDB, Hands-On. End-to-end walkthrough of the Edmunds car-reviews demo: Polars ETL, building a persistent Chroma collection, batched inserts, and metadata-filtered semantic search. The HNSW index and its three tuning knobs.
Part 4: RAG, the Landscape, and Production Hygiene. The smallest useful RAG loop with Gemini, the decisions that actually move quality (chunking, k, rerankers, hybrid retrieval, eval), an opinionated tour of ChromaDB vs. Pinecone vs. Weaviate vs. Qdrant vs. pgvector vs. Milvus, and a consolidated anti-patterns table.
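
As a quick taste of the Part 1 claim, here is a minimal sketch (not taken from the posts) of why the three measures agree on ranking for unit vectors: cosine similarity reduces to the dot product, and the squared L2 distance equals 2 - 2 * dot. The toy vectors and helper names below are made up for illustration, and the code uses only the standard library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale v to unit length.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-d "embeddings" (illustrative values only), all unit-normalized.
query = normalize([1.0, 2.0, 0.5])
docs = {
    "a": normalize([1.0, 1.9, 0.4]),
    "b": normalize([-1.0, 0.5, 2.0]),
    "c": normalize([0.2, 1.0, 1.0]),
}

# Higher is better for cosine and dot; lower is better for L2 distance.
by_cosine = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
by_dot = sorted(docs, key=lambda k: dot(query, docs[k]), reverse=True)
by_l2 = sorted(docs, key=lambda k: l2(query, docs[k]))

print(by_cosine, by_dot, by_l2)
```

All three lists come out in the same order. Drop the normalize calls and the dot-product and L2 rankings are free to diverge from the cosine one, which is the point Part 1 develops.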

Code

The project/ directory contains the two helper modules referenced in Part 3:

project/car_data_etl.py: Polars-based ETL that derives Vehicle_Year and Vehicle_Model from the raw CSV and reshapes the data into the (ids, documents, metadatas) shape that ChromaDB expects.
project/chroma_utils.py: build_chroma_collection helper wrapping PersistentClient, SentenceTransformerEmbeddingFunction, and batched collection.add(...) calls.
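
For orientation, the reshaping and batching idea can be sketched roughly as follows. This is not the code in project/; it is a simplified illustration. The function names, the "Review" column, the id format, and the sample rows are all hypothetical, and RecordingCollection is a stand-in so the sketch runs without chromadb installed. A real Chroma collection accepts the same ids, documents, and metadatas keyword arguments in collection.add(...).

```python
def to_chroma_records(rows, text_column="Review"):
    """Split row dicts into the parallel ids, documents, metadatas lists."""
    ids, documents, metadatas = [], [], []
    for i, row in enumerate(rows):
        ids.append(f"review{i}")  # hypothetical id scheme
        documents.append(row[text_column])
        # Everything except the document text becomes filterable metadata.
        metadatas.append({k: v for k, v in row.items() if k != text_column})
    return ids, documents, metadatas

def add_in_batches(collection, ids, documents, metadatas, batch_size=1000):
    """Call collection.add(...) once per slice of at most batch_size items."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            metadatas=metadatas[start:end],
        )

class RecordingCollection:
    """Stand-in for a Chroma collection; records the size of each batch."""
    def __init__(self):
        self.batch_sizes = []

    def add(self, ids, documents, metadatas):
        self.batch_sizes.append(len(ids))

# Made-up rows for illustration; the real data comes from the Edmunds CSVs.
rows = [
    {"Review": "Smooth ride.", "Vehicle_Year": 2015, "Vehicle_Model": "Model A"},
    {"Review": "Noisy cabin.", "Vehicle_Year": 2012, "Vehicle_Model": "Model B"},
    {"Review": "Great mileage.", "Vehicle_Year": 2017, "Vehicle_Model": "Model C"},
]

ids, documents, metadatas = to_chroma_records(rows)
collection = RecordingCollection()
add_in_batches(collection, ids, documents, metadatas, batch_size=2)
print(collection.batch_sizes)
```

Batching keeps each add call small, which matters because embedding and inserting an entire dataset in one call can exhaust memory or exceed the client's per-call limits.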

The Edmunds car-reviews dataset is not checked in. Download it separately and point DATA_PATH at the extracted CSVs.

Dependencies

The code in this repo and the snippets in the posts assume the following Python packages:

chromadb
sentence-transformers
polars
more-itertools
google-genai (only for the RAG loop in Part 4)

Install with:

pip install chromadb sentence-transformers polars more-itertools google-genai
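
To confirm that the install worked, a small check along these lines may help. It is a sketch, not part of the repo: check_dependencies is a hypothetical helper, and the mapping records the import name for each pip package, since several differ from the distribution name.

```python
from importlib.util import find_spec

# pip package name -> module name used in import statements
IMPORT_NAMES = {
    "chromadb": "chromadb",
    "sentence-transformers": "sentence_transformers",
    "polars": "polars",
    "more-itertools": "more_itertools",
    "google-genai": "google.genai",
}

def check_dependencies(import_names=IMPORT_NAMES):
    """Return {package: True/False} depending on whether the module is found."""
    status = {}
    for package, module in import_names.items():
        try:
            status[package] = find_spec(module) is not None
        except (ImportError, ValueError):
            # A missing parent package (e.g. no "google" namespace) lands here.
            status[package] = False
    return status

for package, ok in check_dependencies().items():
    print(f"{package}: {'ok' if ok else 'missing'}")
```

Only google-genai is optional here; the rest are needed for the Part 3 code.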

Author

anand (anandtopu). Feedback and corrections welcome via GitHub Issues.

