Machine learning models, and neural networks in particular, can transform “unstructured data” into fixed-length vectors (usually float32) that preserve the semantics of the original object. For example, two similar texts will have similar vectors (a low Euclidean distance or a high cosine similarity).
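That claim is easy to check directly. Below is a minimal sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (both discussed later in this article), that compares a pair of related sentences with an unrelated one:

```python
# Minimal sketch: assumes `pip install sentence-transformers` and the
# all-MiniLM-L6-v2 model mentioned later in the article.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
a, b, c = model.encode([
    "How are you?",
    "How are you doing?",
    "The stock market fell today.",
])

def cosine(u, v):
    # 1.0 = same direction (very similar), ~0.0 = unrelated
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # high: the two greetings are close in meaning
print(cosine(a, c))  # noticeably lower: different topic
```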
Vector databases:
These are specialized data stores designed to efficiently store, search, and compare vectors (usually embeddings) that represent objects such as text, images, audio, or video in numerical form.
What is usually stored in a vector database
Component | What it is | Example |
---|---|---|
ID | Unique record identifier | “doc-001” or 123 |
Vector | A numeric list representing an object | [0.12, -0.56, 0.44, …, -0.03] (usually float32) |
Document | Source text or file (optional) | “How are you?” |
Metadata | Additional fields for filtering, tags, context | {"language": "ru", "user": "petya", "tags": ["faq"]} |
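Put together, a single record might look like the sketch below (the field names are illustrative; every database defines its own schema):

```python
# Illustrative only: field names vary between vector databases.
record = {
    "id": "doc-001",                          # unique identifier
    "vector": [0.12, -0.56, 0.44, -0.03],     # embedding, truncated for readability
    "document": "How are you?",               # optional source text
    "metadata": {"language": "ru", "user": "petya", "tags": ["faq"]},
}
```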
Popular vector databases
Name | Developer | Indexing | Metrics | Pros | Cons |
---|---|---|---|---|---|
Milvus | Zilliz | IVF_FLAT, IVF_SQ8, IVF_PQ, HNSW, ANNOY, Flat | L2, IP, Cosine, Jaccard, Hamming | Scales to billions of vectors; many index types; gRPC/REST APIs | Requires Docker or a standalone deployment; harder to deploy |
Qdrant | Qdrant (written in Rust) | HNSW (modified), Flat | Cosine, Dot, Euclidean (L2) | Fast Rust engine; easy installation; metadata filtering | Fewer index types so far; no built-in clustering |
Weaviate | Weaviate (formerly SeMI Technologies) | HNSW + text (hybrid search) | Cosine, Dot, Euclidean | Hybrid search (BM25 + vector); GraphQL API; built-in data vectorization | Requires more memory; GraphQL is not always convenient |
Chroma | Chroma | Flat (exact), HNSW (roadmap/partial) | Cosine | Very easy installation; ideal for RAG and local runs (see the sketch after this table) | Only Flat for now; metadata filtering only partially available |
FAISS | Meta (Facebook AI Research) | Flat, IVF, PQ, OPQ, HNSW, LSH | L2, Dot, Cosine (via normalization) | Very flexible; GPU support; best-in-class CPU/GPU performance | A library, not a server; manual setup and coding required |
OpenSearch | Amazon / OpenSearch Project | HNSW, FAISS backend, native ANN plugin | L2, Dot, Cosine | Hybrid search (BM25 + ANN); integrates with full-text search; Elasticsearch-compatible | Complex ANN configuration; high memory requirements |
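To illustrate the "very easy installation" point for Chroma, here is a minimal sketch assuming the chromadb package; with no embedding function specified, Chroma falls back to its default local model. The full Milvus example at the end of the article shows the same workflow on a client-server database.

```python
# Minimal sketch: assumes `pip install chromadb`.
import chromadb

client = chromadb.Client()                     # in-memory instance, nothing to deploy
collection = client.create_collection("faq")

# Texts are embedded with Chroma's default local model
collection.add(
    ids=["1", "2"],
    documents=["How are you?", "Milvus is a vector database"],
    metadatas=[{"topic": "smalltalk"}, {"topic": "databases"}],
)

# The query is embedded with the same model and matched by similarity
results = collection.query(query_texts=["vector database"], n_results=1)
print(results["documents"])  # expected: the "Milvus is a vector database" document
```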
Transforming data into vectors (Embedding)
For example, the text “How are you?” can be converted into a vector of 384 values:
[0.12, -0.56, 0.44, ..., -0.03]
For this purpose, specialized embedding models are used, such as:
Name | What does it encode and how does it work? | Advantages |
---|---|---|
all-MiniLM-L6-v2 | Lightweight, fast Transformer-based model; encodes phrases, questions, and paragraphs | Compact (~80 MB); supported in sentence-transformers; works out of the box |
text-embedding-ada-002 (OpenAI) | Commercial model from OpenAI; requires an API key; encodes arbitrary texts | High-quality embeddings; multilingual support; well suited for RAG |
bge-small-en | Modern model from BAAI; supports the prefix templates “query: …” and “passage: …” | High precision; multilingual variants available (e.g. bge-m3); works well with Qdrant and LangChain |
e5-base / e5-large | Universal models from Microsoft Research; suitable for search, clustering, and QA | Strong results on MTEB; multilingual variants (multilingual-e5); work without fine-tuning |
Instructor-XL | Encodes text according to the task; uses instructions in the style “Represent the … for …” | Higher accuracy; suited for task-aware embeddings; good for RAG/FAQ |
mpnet-base-v2 | Context-sensitive model from Microsoft (MPNet); good at distinguishing similar phrases | Good balance between accuracy and speed; suitable for paraphrase detection and general search |
LaBSE | Multilingual model from Google; works best with short sentences | Supports 100+ languages; great choice for cross-language search |
Typical vector sizes (embedding lengths) for different models:
Model | Vector length |
---|---|
all-MiniLM-L6-v2 | 384 |
text-embedding-ada-002 (OpenAI) | 1536 |
bge-small-en | 384 |
bge-base-en | 768 |
bge-large-en | 1024 |
e5-small-v2 | 384 |
e5-base-v2 | 768 |
e5-large-v2 | 1024 |
mpnet-base-v2 | 768 |
LaBSE | 768 |
Instructor-XL | 768 or 1024 |
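The lengths in the table can be verified locally; a small sketch assuming sentence-transformers, with the model names given as Hugging Face IDs:

```python
# Sketch assuming sentence-transformers; each model is downloaded on first use.
from sentence_transformers import SentenceTransformer

for name in ["all-MiniLM-L6-v2", "intfloat/e5-base-v2", "BAAI/bge-small-en"]:
    model = SentenceTransformer(name)
    print(name, model.get_sentence_embedding_dimension())
# all-MiniLM-L6-v2 384
# intfloat/e5-base-v2 768
# BAAI/bge-small-en 384
```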
Indexing Vectors
Vectors corresponding to stored objects are indexed so that, given an incoming query vector, the database can quickly find the closest stored vectors; a short FAISS sketch after the table below contrasts an exact index with an approximate one.
Types of vector indexes:
Name | How it works | Advantages | Drawbacks |
---|---|---|---|
Flat | Compares the query against every stored vector (brute force) | Most accurate search; simple implementation; ideal for debugging and small datasets | Very slow on large volumes; computationally heavy; does not scale |
HNSW | Navigates a layered proximity graph from entry points toward ever-closer neighbors | Very fast; high precision; suitable for large datasets | Requires a lot of memory; long index build time; harder to tune |
IVF | Clusters vectors into groups and searches only the most promising clusters | Faster than Flat; flexible configuration (nprobe); scales well | May miss similar vectors; requires prior training |
PQ | Compresses sub-vectors into short codes (product quantization) | Saves a lot of memory; fast lookup via code tables; ideal for large datasets | Loss of precision; training required (codebook); not for high-precision tasks |
OPQ | Improved PQ: first rotates (“corrects”) the vectors, then quantizes them | Higher accuracy than PQ; works well in FAISS and Milvus; combines with IVF | Harder to train; still an approximate method |
Annoy | Builds many random projection trees and searches across them | Easy to use; light on resources; suitable for CPU and mobile | Less accurate than HNSW; long index build time; cannot be updated after building |
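The practical trade-off between an exact and an approximate index is easy to see with FAISS. A minimal sketch, assuming the faiss-cpu package and random data standing in for real embeddings:

```python
# Sketch assuming `pip install faiss-cpu numpy`.
import faiss
import numpy as np

d = 128                                                 # vector dimension
xb = np.random.random((10_000, d)).astype("float32")    # "database" vectors
xq = np.random.random((5, d)).astype("float32")         # query vectors

# Flat: exact brute-force search, no training required
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 5)

# IVF: vectors are clustered into nlist buckets; only nprobe buckets are scanned
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 128)             # nlist = 128
ivf.train(xb)                                           # clustering step, required before add()
ivf.add(xb)
ivf.nprobe = 8                                          # more buckets -> better recall, slower
D_approx, I_approx = ivf.search(xq, 5)

# Recall@5 of the approximate index relative to exact search
recall = np.mean([len(set(a) & set(b)) / 5 for a, b in zip(I_exact, I_approx)])
print(f"IVF recall vs Flat: {recall:.2f}")
```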
Similarity Search
When a user enters a query, it is converted into a vector, and the database performs a k-nearest-neighbor (kNN) search using the chosen metric (a short numpy sketch after the metrics table shows how the most common metrics are computed).
Types of metrics:
Metric name | How it works | Advantages | Drawbacks |
---|---|---|---|
Cosine Similarity | Compares the angle between two vectors; the closer the angle is to 0°, the greater the similarity | Considers only direction; works well for texts and embeddings; independent of vector length | Ignores magnitude; not suitable when vector length matters |
Euclidean (L2) | Measures the straight-line distance between points; closer means more similar | Simple and intuitive; suitable for coordinates and images | Does not normalize vectors (scale affects the result); not always good for texts |
Inner Product (Dot Product) | Sums the products of corresponding coordinates; the larger the sum, the higher the similarity | Very fast to compute; works well with unnormalized vectors | Sensitive to vector length; values can be hard to interpret |
Manhattan (L1) | Sums the absolute differences per coordinate, like walking along grid cells | Robust to outliers; works better with sparse vectors | Less commonly used; works worse with dense vectors |
Hamming Distance | Counts the number of bits in which two binary vectors differ | Very fast for binary data; suitable for fingerprints and hashes | Works only with binary vectors; not applicable to float vectors |
Jaccard Similarity | Ratio of the intersection to the union of two sets or binary vectors | Ideal for tags and binary features; easy to interpret | Binary vectors only; does not work with float vectors |
Tanimoto | A generalization of Jaccard that also applies to real-valued vectors | Suitable for chemical structures and fingerprints; works with both binary and real values | Rarely used; limited library support |
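The most common of these metrics reduce to a few lines of numpy; a small sketch for reference (for vectors normalized to unit length, L2 distance and cosine similarity rank results identically):

```python
# Plain numpy versions of the most common metrics.
import numpy as np

a = np.array([0.12, -0.56, 0.44, -0.03])
b = np.array([0.10, -0.50, 0.40,  0.00])

cosine    = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # direction only
euclidean = np.linalg.norm(a - b)                                   # straight-line distance
dot       = np.dot(a, b)                                            # fast, length-sensitive
manhattan = np.sum(np.abs(a - b))                                   # sum of coordinate gaps (L1)

print(cosine, euclidean, dot, manhattan)
```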
Example of working with a vector database
The CRUD example below uses Python, the Milvus database, and the e5-base-v2 embedding model.
```python
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
from pymilvus.model.dense import SentenceTransformerEmbeddingFunction

# 1. Connect to Milvus (default localhost:19530)
connections.connect("default", host="localhost", port="19530")

# 2. Initialize the embedding function with the e5-base-v2 model
#    This model requires:
#    - the prefix "passage: " for documents
#    - the prefix "query: " for search queries
ef = SentenceTransformerEmbeddingFunction("intfloat/e5-base-v2")

# 3. Define the collection schema:
#    - "id"   — integer identifier (primary key)
#    - "text" — the original text of the document (string)
#    - "emb"  — embedding vector of dimension 768
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="Collection with embeddings from e5-base-v2")

# 4. Create the collection in Milvus with the given schema
collection = Collection("e5_collection", schema)

# 5. Prepare and insert documents
#    Important: add the prefix "passage: " before passing texts to the model
raw_docs = ["Hello world", "Milvus vector database", "Semantic search with e5 model"]
docs = [f"passage: {d}" for d in raw_docs]  # add prefix
ids = [1, 2, 3]

# Compute embeddings for the documents with e5-base-v2
embs = ef.encode_documents(docs)

# Insert into the collection:
# - identifiers
# - original (clean) texts without prefixes
# - embeddings
collection.insert([ids, raw_docs, embs])

# 6. Create an index on the "emb" field to speed up searching
collection.create_index(
    field_name="emb",
    index_params={
        "index_type": "IVF_FLAT",   # index type
        "params": {"nlist": 128},   # clustering parameter
        "metric_type": "L2",        # distance metric (Euclidean distance)
    },
)

# 7. Load the collection into RAM
#    Without this, search will not work
collection.load()

# 8. Search query
#    Similarly, add the prefix "query: " before the query text
query_docs = ["query: vector database"]
q_emb = ef.encode_queries(query_docs)

# Perform semantic search over the embeddings
results = collection.search(
    data=q_emb,                                             # query embedding(s)
    anns_field="emb",                                       # field to search over
    param={"metric_type": "L2", "params": {"nprobe": 10}},  # search parameters
    limit=2,                                                # number of nearest neighbors
    output_fields=["text"],                                 # additional fields to return
)

# 9. Print search results
for i, hits in enumerate(results):
    print(f"Results for query: '{query_docs[i]}'")
    if not hits:
        print("Nothing found")
        continue
    for rank, hit in enumerate(hits, start=1):
        print(f"  {rank}:")
        print(f"    ID: {hit.id}")
        print(f"    Text: {hit.entity.get('text')}")
        print(f"    Distance: {hit.distance:.4f}")

# Results for query: 'query: vector database'
#   1:
#     ID: 2
#     Text: Milvus vector database
#     Distance: 2.8374
#   2:
#     ID: 3
#     Text: Semantic search with e5 model
#     Distance: 5.4931

# 10. Delete a document by ID
#     Here, the document with id = 1 is deleted
collection.delete(expr="id in [1]")

# 11. Drop the entire collection (if it is no longer needed)
collection.drop()
```
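The example above covers create, read, and delete. For the update step of CRUD, recent pymilvus versions expose Collection.upsert; the following is a sketch under that assumption (on older versions the same effect is achieved with delete followed by insert), to be run before the collection is dropped:

```python
# Sketch assuming pymilvus >= 2.3, where Collection.upsert is available.
# Re-embed the corrected text and overwrite the record with id = 2.
new_text = "Milvus is an open-source vector database"
new_emb = ef.encode_documents([f"passage: {new_text}"])
collection.upsert([[2], [new_text], new_emb])
collection.flush()
```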