Machine learning models, and neural networks in particular, can transform “unstructured data” into fixed-length vectors (usually float32) that preserve the semantics of the original object. For example, two similar texts will have similar vectors (a low Euclidean distance or a high cosine similarity).
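That claim is easy to check directly. Below is a minimal sketch, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (both discussed later in this article), that compares a pair of related sentences with an unrelated one:

```python
# Minimal sketch: assumes `pip install sentence-transformers` and the
# all-MiniLM-L6-v2 model mentioned later in the article.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
a, b, c = model.encode([
    "How are you?",
    "How are you doing?",
    "The stock market fell today.",
])

def cosine(u, v):
    # 1.0 = same direction (very similar), ~0.0 = unrelated
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # high: the two greetings are close in meaning
print(cosine(a, c))  # noticeably lower: different topic
```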
Vector databases:
These are specialized data stores designed to efficiently store, search, and compare vectors (usually embeddings) that represent objects such as text, images, audio, or video in numerical form.
What is usually stored in a vector database
Component | What it is | Example |
---|---|---|
ID | Unique record identifier | “doc-001” or 123 |
Vector | A numeric list representing an object | [0.12, -0.56, 0.44, …, -0.03] (usually float32) |
Document | Source text or file (optional) | “How are you?” |
Metadata | Additional fields for filtering, tags, context | {"language": "ru", "user": "petya", "tags": ["faq"]} |
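Put together, a single record might look like the sketch below (the field names are illustrative; every database defines its own schema):

```python
# Illustrative only: field names vary between vector databases.
record = {
    "id": "doc-001",                          # unique identifier
    "vector": [0.12, -0.56, 0.44, -0.03],     # embedding, truncated for readability
    "document": "How are you?",               # optional source text
    "metadata": {"language": "ru", "user": "petya", "tags": ["faq"]},
}
```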
Popular vector databases
Name | Developer | Indexing | Metrics | Pros | Cons |
---|---|---|---|---|---|
Milvus | Zilliz | IVF_FLAT, IVF_SQ8, IVF_PQ, HNSW, ANNOY, Flat | L2, IP, Cosine, Jaccard, Hamming | Scales to billions of vectors; many index types; gRPC/REST APIs | Requires Docker or a standalone deployment; harder to deploy |
Qdrant | Qdrant (written in Rust) | HNSW (modified), Flat | Cosine, Dot, Euclidean (L2) | Fast Rust engine; easy installation; metadata filtering | Fewer index types so far; no built-in clustering |
Weaviate | Weaviate (formerly SeMI Technologies) | HNSW + text (hybrid search) | Cosine, Dot, Euclidean | Hybrid search (BM25 + vector); GraphQL API; built-in data vectorization | Requires more memory; GraphQL is not always convenient |
Chroma | Chroma | Flat (exact), HNSW (roadmap/partial) | Cosine | Very easy installation; ideal for RAG and local runs (see the sketch after this table) | Only Flat for now; metadata filtering only partially available |
FAISS | Meta (Facebook AI Research) | Flat, IVF, PQ, OPQ, HNSW, LSH | L2, Dot, Cosine (via normalization) | Very flexible; GPU support; best-in-class CPU/GPU performance | A library, not a server; manual setup and coding required |
OpenSearch | Amazon / OpenSearch Project | HNSW, FAISS backend, native ANN plugin | L2, Dot, Cosine | Hybrid search (BM25 + ANN); integrates with full-text search; Elasticsearch-compatible | Complex ANN configuration; high memory requirements |
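To illustrate the "very easy installation" point for Chroma, here is a minimal sketch assuming the chromadb package; with no embedding function specified, Chroma falls back to its default local model. The full Milvus example at the end of the article shows the same workflow on a client-server database.

```python
# Minimal sketch: assumes `pip install chromadb`.
import chromadb

client = chromadb.Client()                     # in-memory instance, nothing to deploy
collection = client.create_collection("faq")

# Texts are embedded with Chroma's default local model
collection.add(
    ids=["1", "2"],
    documents=["How are you?", "Milvus is a vector database"],
    metadatas=[{"topic": "smalltalk"}, {"topic": "databases"}],
)

# The query is embedded with the same model and matched by similarity
results = collection.query(query_texts=["vector database"], n_results=1)
print(results["documents"])  # expected: the "Milvus is a vector database" document
```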
Transforming data into vectors (Embedding)
For example, the text “How are you?” can be converted into a vector of 384 values:
[0.12, -0.56, 0.44, ..., -0.03]
For this purpose, specialized embedding models are used, such as:
Name | What does it encode and how does it work? | Advantages |
---|---|---|
all-MiniLM-L6-v2 | Lightweight, fast Transformer-based model; encodes phrases, questions, and paragraphs | Compact (~80 MB); supported in sentence-transformers; works out of the box |
text-embedding-ada-002 (OpenAI) | Commercial model from OpenAI; requires an API key; encodes arbitrary texts | High-quality embeddings; multilingual support; well suited for RAG |
bge-small-en | Modern model from BAAI; supports the prefix templates “query: …” and “passage: …” | High precision; multilingual variants available (e.g. bge-m3); works well with Qdrant and LangChain |
e5-base / e5-large | Universal models from Microsoft Research; suitable for search, clustering, and QA | Strong results on MTEB; multilingual variants (multilingual-e5); work without fine-tuning |
Instructor-XL | Encodes text according to the task; uses instructions in the style “Represent the … for …” | Higher accuracy; suited for task-aware embeddings; good for RAG/FAQ |
mpnet-base-v2 | Context-sensitive model from Microsoft (MPNet); good at distinguishing similar phrases | Good balance between accuracy and speed; suitable for paraphrase detection and general search |
LaBSE | Multilingual model from Google; works best with short sentences | Supports 100+ languages; great choice for cross-language search |
Typical vector sizes (embedding lengths) for different models:
Model | Vector length |
---|---|
all-MiniLM-L6-v2 | 384 |
text-embedding-ada-002 (OpenAI) | 1536 |
bge-small-en | 384 |
bge-base-en | 768 |
bge-large-en | 1024 |
e5-small-v2 | 384 |
e5-base-v2 | 768 |
e5-large-v2 | 1024 |
mpnet-base-v2 | 768 |
LaBSE | 768 |
Instructor-XL | 768 or 1024 |
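The lengths in the table can be verified locally; a small sketch assuming sentence-transformers, with the model names given as Hugging Face IDs:

```python
# Sketch assuming sentence-transformers; each model is downloaded on first use.
from sentence_transformers import SentenceTransformer

for name in ["all-MiniLM-L6-v2", "intfloat/e5-base-v2", "BAAI/bge-small-en"]:
    model = SentenceTransformer(name)
    print(name, model.get_sentence_embedding_dimension())
# all-MiniLM-L6-v2 384
# intfloat/e5-base-v2 768
# BAAI/bge-small-en 384
```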
Indexing Vectors
Vectors corresponding to stored objects are indexed so that, given an incoming query vector, the database can quickly find the closest stored vectors; a short FAISS sketch after the table below contrasts an exact index with an approximate one.
Types of vector indexes:
Name | How it works | Advantages | Drawbacks |
---|---|---|---|
Flat | Compares the query against every stored vector (brute force) | Most accurate search; simple implementation; ideal for debugging and small datasets | Very slow on large volumes; computationally heavy; does not scale |
HNSW | Navigates a layered proximity graph from entry points toward ever-closer neighbors | Very fast; high precision; suitable for large datasets | Requires a lot of memory; long index build time; harder to tune |
IVF | Clusters vectors into groups and searches only the most promising clusters | Faster than Flat; flexible configuration (nprobe); scales well | May miss similar vectors; requires prior training |
PQ | Compresses sub-vectors into short codes (product quantization) | Saves a lot of memory; fast lookup via code tables; ideal for large datasets | Loss of precision; training required (codebook); not for high-precision tasks |
OPQ | Improved PQ: first rotates (“corrects”) the vectors, then quantizes them | Higher accuracy than PQ; works well in FAISS and Milvus; combines with IVF | Harder to train; still an approximate method |
Annoy | Builds many random projection trees and searches across them | Easy to use; light on resources; suitable for CPU and mobile | Less accurate than HNSW; long index build time; cannot be updated after building |
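The practical trade-off between an exact and an approximate index is easy to see with FAISS. A minimal sketch, assuming the faiss-cpu package and random data standing in for real embeddings:

```python
# Sketch assuming `pip install faiss-cpu numpy`.
import faiss
import numpy as np

d = 128                                                 # vector dimension
xb = np.random.random((10_000, d)).astype("float32")    # "database" vectors
xq = np.random.random((5, d)).astype("float32")         # query vectors

# Flat: exact brute-force search, no training required
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 5)

# IVF: vectors are clustered into nlist buckets; only nprobe buckets are scanned
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 128)             # nlist = 128
ivf.train(xb)                                           # clustering step, required before add()
ivf.add(xb)
ivf.nprobe = 8                                          # more buckets -> better recall, slower
D_approx, I_approx = ivf.search(xq, 5)

# Recall@5 of the approximate index relative to exact search
recall = np.mean([len(set(a) & set(b)) / 5 for a, b in zip(I_exact, I_approx)])
print(f"IVF recall vs Flat: {recall:.2f}")
```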
Similarity Search
When a user enters a query, it is converted into a vector, and the database performs a k-nearest-neighbor (kNN) search using the chosen metric (a short numpy sketch after the metrics table shows how the most common metrics are computed).
Types of metrics:
Metric name | How it works | Advantages | Drawbacks |
---|---|---|---|
Cosine Similarity | Compares the angle between two vectors; the closer the angle is to 0°, the greater the similarity | Considers only direction; works well for texts and embeddings; independent of vector length | Ignores magnitude; not suitable when vector length matters |
Euclidean (L2) | Measures the straight-line distance between points; closer means more similar | Simple and intuitive; suitable for coordinates and images | Does not normalize vectors (scale affects the result); not always good for texts |
Inner Product (Dot Product) | Sums the products of corresponding coordinates; the larger the sum, the higher the similarity | Very fast to compute; works well with unnormalized vectors | Sensitive to vector length; values can be hard to interpret |
Manhattan (L1) | Sums the absolute differences per coordinate, like walking along grid cells | Robust to outliers; works better with sparse vectors | Less commonly used; works worse with dense vectors |
Hamming Distance | Counts the number of bits in which two binary vectors differ | Very fast for binary data; suitable for fingerprints and hashes | Works only with binary vectors; not applicable to float vectors |
Jaccard Similarity | Ratio of the intersection to the union of two sets or binary vectors | Ideal for tags and binary features; easy to interpret | Binary vectors only; does not work with float vectors |
Tanimoto | A generalization of Jaccard that also applies to real-valued vectors | Suitable for chemical structures and fingerprints; works with both binary and real values | Rarely used; limited library support |
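The most common of these metrics reduce to a few lines of numpy; a small sketch for reference (for vectors normalized to unit length, L2 distance and cosine similarity rank results identically):

```python
# Plain numpy versions of the most common metrics.
import numpy as np

a = np.array([0.12, -0.56, 0.44, -0.03])
b = np.array([0.10, -0.50, 0.40,  0.00])

cosine    = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # direction only
euclidean = np.linalg.norm(a - b)                                   # straight-line distance
dot       = np.dot(a, b)                                            # fast, length-sensitive
manhattan = np.sum(np.abs(a - b))                                   # sum of coordinate gaps (L1)

print(cosine, euclidean, dot, manhattan)
```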
Example of working with a vector database
The CRUD example below uses Python, the Milvus database, and the e5-base-v2 embedding model.
```python
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
from pymilvus.model.dense import SentenceTransformerEmbeddingFunction

# 1. Connect to Milvus (default localhost:19530)
connections.connect("default", host="localhost", port="19530")

# 2. Initialize the embedding function with the e5-base-v2 model
#    This model requires:
#    - the prefix "passage: " for documents
#    - the prefix "query: " for search queries
ef = SentenceTransformerEmbeddingFunction("intfloat/e5-base-v2")

# 3. Define the collection schema:
#    - "id"   — integer identifier (primary key)
#    - "text" — the original text of the document (string)
#    - "emb"  — embedding vector of dimension 768
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="Collection with embeddings from e5-base-v2")

# 4. Create the collection in Milvus with the given schema
collection = Collection("e5_collection", schema)

# 5. Prepare and insert documents
#    Important: add the prefix "passage: " before passing texts to the model
raw_docs = ["Hello world", "Milvus vector database", "Semantic search with e5 model"]
docs = [f"passage: {d}" for d in raw_docs]  # add prefix
ids = [1, 2, 3]

# Compute embeddings for the documents with e5-base-v2
embs = ef.encode_documents(docs)

# Insert into the collection:
# - identifiers
# - original (clean) texts without prefixes
# - embeddings
collection.insert([ids, raw_docs, embs])

# 6. Create an index on the "emb" field to speed up searching
collection.create_index(
    field_name="emb",
    index_params={
        "index_type": "IVF_FLAT",   # index type
        "params": {"nlist": 128},   # clustering parameter
        "metric_type": "L2",        # distance metric (Euclidean distance)
    },
)

# 7. Load the collection into RAM
#    Without this, search will not work
collection.load()

# 8. Search query
#    Similarly, add the prefix "query: " before the query text
query_docs = ["query: vector database"]
q_emb = ef.encode_queries(query_docs)

# Perform semantic search over the embeddings
results = collection.search(
    data=q_emb,                                             # query embedding(s)
    anns_field="emb",                                       # field to search over
    param={"metric_type": "L2", "params": {"nprobe": 10}},  # search parameters
    limit=2,                                                # number of nearest neighbors
    output_fields=["text"],                                 # additional fields to return
)

# 9. Print search results
for i, hits in enumerate(results):
    print(f"Results for query: '{query_docs[i]}'")
    if not hits:
        print("Nothing found")
        continue
    for rank, hit in enumerate(hits, start=1):
        print(f"  {rank}:")
        print(f"    ID: {hit.id}")
        print(f"    Text: {hit.entity.get('text')}")
        print(f"    Distance: {hit.distance:.4f}")

# Results for query: 'query: vector database'
#   1:
#     ID: 2
#     Text: Milvus vector database
#     Distance: 2.8374
#   2:
#     ID: 3
#     Text: Semantic search with e5 model
#     Distance: 5.4931

# 10. Delete a document by ID
#     Here, the document with id = 1 is deleted
collection.delete(expr="id in [1]")

# 11. Drop the entire collection (if it is no longer needed)
collection.drop()
```
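The example above covers create, read, and delete. For the update step of CRUD, recent pymilvus versions expose Collection.upsert; the following is a sketch under that assumption (on older versions the same effect is achieved with delete followed by insert), to be run before the collection is dropped:

```python
# Sketch assuming pymilvus >= 2.3, where Collection.upsert is available.
# Re-embed the corrected text and overwrite the record with id = 2.
new_text = "Milvus is an open-source vector database"
new_emb = ef.encode_documents([f"passage: {new_text}"])
collection.upsert([[2], [new_text], new_emb])
collection.flush()
```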