The rise of generative AI applications has heightened the need for semantic and natural language search. These advanced search features help find and retrieve conceptually relevant documents from enterprise content repositories to serve as prompts for generative AI models. Raw data in various source repositories, in the form of text, images, audio, video, and so on, is converted with the help of embedding models into a standard numerical representation called vectors, which powers semantic and natural language search. As organizations harness more sophisticated large language and foundation models to power their generative AI applications, the accompanying embedding models are also evolving to produce large, high-dimensional vector embeddings. As vector volume expands, memory usage and computational requirements increase proportionally, resulting in higher operational costs. To mitigate this issue, various compression techniques can be used to optimize memory usage and computational efficiency.
Quantization is a lossy data compression technique aimed at lowering computation and memory usage, leading to lower costs, especially for high-volume data workloads. There are various techniques to compress data depending on its type and volume. The usual approach is to map a continuous range of values (or a relatively large set of discrete values) to a smaller set of discrete values. Vector compression can be achieved through two primary techniques: product quantization and scalar quantization. In product quantization, the original vector is broken into multiple sub-vectors, and each sub-vector is encoded into a fixed number of bits that represent the original vector. You then store and search across only the encoded sub-vectors instead of the original vector. In scalar quantization, each dimension of the input vector is mapped from a 32-bit floating-point representation to a smaller data type.
Amazon OpenSearch Service, as a vector database, supports scalar and product quantization techniques to optimize memory usage and reduce operational costs.
OpenSearch as a vector database
OpenSearch is a distributed search and analytics service. The OpenSearch k-nearest neighbor (k-NN) plugin allows you to index, store, and search vectors. Vectors are stored in OpenSearch as 32-bit float arrays of type knn_vector, which supports up to 16,000 dimensions per vector.
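As a minimal sketch (the index name, field name, and dimension here are illustrative; the dimension assumes a 384-dimension embedding model), a knn_vector field is declared in an index mapping with the k-NN setting enabled:

```json
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 384
      }
    }
  }
}
```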
OpenSearch uses approximate nearest neighbor search to provide scalable vector search. The approximate k-NN algorithm retrieves results based on an estimation of the nearest vectors to a given query vector. Two main methods for performing approximate k-NN are the graph-based Hierarchical Navigable Small-World (HNSW) and the cluster-based Inverted File (IVF). These data structures are constructed and loaded into memory during the initial vector search operation. As vector volume grows, both the data structures and associated memory requirements for search operations scale proportionally.
For example, each HNSW graph with 32-bit float data takes approximately 1.1 * (4 * d + 8 * m) * num_vectors bytes of memory. Here, num_vectors represents the total number of vectors to be indexed, d is the number of dimensions determined by the embedding model you use to generate the vectors, and m is the maximum number of edges per node in the HNSW graph, an index parameter that can be tuned to balance performance and memory. Using this formula, the memory requirements for a configuration with 384 dimensions and an m value of 16 would be:
- 1 million vectors: 1.83 GB (1.1 * (4 * 384 + 8 * 16) * 1,000,000 bytes)
- 1 billion vectors: 1,830.4 GB (1.1 * (4 * 384 + 8 * 16) * 1,000,000,000 bytes)
Although approximate nearest neighbor search can be optimized to handle massive datasets with billions of vectors efficiently, the memory required to load 32-bit full-precision vectors into memory during search can become prohibitively costly. To mitigate this, OpenSearch Service supports the following four quantization techniques.
- Binary quantization
- Byte quantization
- FP16 quantization
- Product quantization
These techniques fall within the broader categories of scalar and product quantization discussed earlier. In this post, you will learn quantization techniques for optimizing vector workloads on OpenSearch Service, focusing on memory reduction and cost efficiency. The post also introduces the new disk-based vector search approach, which enables efficient querying of vectors stored on disk without loading them into memory. This method integrates seamlessly with quantization techniques through key configurations such as the on_disk mode and the compression_level parameter, which provide built-in, out-of-the-box scalar quantization at indexing time.
Binary quantization (up to 32x compression)
Binary quantization (BQ) is a type of scalar quantization. OpenSearch uses the FAISS engine's binary quantization, enabling up to 32x compression during indexing. This technique reduces each vector dimension from the default 32-bit float to a 1-bit binary value by compressing the vectors into 0s and 1s. OpenSearch supports indexing, storing, and searching binary vectors. You can also choose to encode each vector dimension using 1, 2, or 4 bits, depending on the desired compression factor, as shown in the example below. The compression factor is adjusted using the bits setting: a value of 2 yields 16x compression, while 4 results in 8x compression. The default setting is 1. In binary quantization, the training is handled natively at indexing time, allowing you to avoid an additional preprocessing step.
To implement binary quantization, define the vector type as knn_vector and specify the encoder name as binary with the desired number of encoding bits. Note that the encoder parameter refers to a method used to compress vector data before storing it in the index. Optimize performance by using the space_type, m, and ef_construction parameters. See the OpenSearch documentation for information about the underlying configuration of approximate k-NN.
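A sketch of such a mapping follows (index and field names are illustrative, and the dimension assumes a 384-dimension embedding model). The binary encoder with bits set to 1 yields 32x compression; setting bits to 2 or 4 yields 16x or 8x compression, respectively.

```json
PUT /my-binary-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "l2",
          "parameters": {
            "m": 16,
            "ef_construction": 100,
            "encoder": {
              "name": "binary",
              "parameters": {
                "bits": 1
              }
            }
          }
        }
      }
    }
  }
}
```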
Memory requirements for implementing binary quantization with FAISS-HNSW:
1.1 * (bits * (d / 8) + 8 * m) * num_vectors bytes.
| Compression | Encoding bits | Memory required for 1 billion vectors with d = 384 and m = 16 (in GB) |
| --- | --- | --- |
| 32x | 1 | 193.6 |
| 16x | 2 | 246.4 |
| 8x | 4 | 352.0 |
For detailed implementation steps on binary quantization, see the OpenSearch documentation.
Byte quantization (4x compression)
Byte quantization compresses 32-bit floating-point dimensions to 8-bit integers, ranging from –128 to +127, reducing memory usage by 75%. OpenSearch supports indexing, storing, and searching byte vectors, which must be converted to 8-bit format prior to ingestion. To implement byte vectors, specify the k-NN vector field data_type as byte in the index mapping. This feature is compatible with both Lucene and FAISS engines. An example of creating an index for byte-quantized vectors follows.
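A minimal sketch of such a mapping (index and field names are illustrative, and a small dimension of 3 is used for readability; in practice the dimension matches your embedding model):

```json
PUT /my-byte-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 3,
        "data_type": "byte",
        "method": {
          "name": "hnsw",
          "engine": "lucene",
          "space_type": "l2",
          "parameters": {
            "m": 16,
            "ef_construction": 100
          }
        }
      }
    }
  }
}
```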
This method requires ingesting a byte-quantized vector into OpenSearch for direct storage in the k-NN vector field (of byte type). However, the recently introduced disk-based vector search feature eliminates the need for external vector quantization. This feature will be discussed in detail later in this blog.
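Continuing the illustrative 3-dimension mapping above, a pre-quantized byte vector is ingested like any other document; each value must already be an integer between –128 and 127:

```json
PUT /my-byte-index/_doc/1
{
  "my_vector_field": [12, -38, 54]
}
```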
Memory requirements for implementing byte quantization with FAISS-HNSW:
1.1 * (1 * d + 8 * m) * num_vectors bytes.
For detailed implementation steps, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency, see Byte-quantized vectors in OpenSearch.
FAISS FP16 quantization (2x compression)
FP16 quantization is a technique that uses a 16-bit floating-point scalar representation, reducing memory usage by 50%. Each vector dimension is converted from a 32-bit to a 16-bit floating-point value, effectively halving the memory requirements. The input vector values must be in the range [–65504.0, 65504.0]. To implement FP16 quantization, create the index with the k-NN vector field and configure the following:
- Set k-NN vector field method and engine to HNSW and FAISS, respectively.
- Define the encoder parameter and set name to sq and type to fp16.
Upon uploading 32-bit floating-point vectors to OpenSearch, the scalar quantization FP16 (SQfp16) automatically quantizes them to 16-bit floating-point vectors during ingestion and stores them in the vector field. The following example demonstrates the creation of the index for quantizing and storing FP16-quantized vectors.
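A sketch of such a mapping follows (index and field names are illustrative, and the dimension assumes a 384-dimension embedding model); the sq encoder with type fp16 performs the quantization at ingestion time:

```json
PUT /my-fp16-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "l2",
          "parameters": {
            "m": 16,
            "ef_construction": 100,
            "encoder": {
              "name": "sq",
              "parameters": {
                "type": "fp16",
                "clip": false
              }
            }
          }
        }
      }
    }
  }
}
```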
Memory requirements for implementing FP16 quantization with FAISS-HNSW:
(1.1 * (2 * d + 8 * m) * num_vectors) bytes.
The preceding FP16 example includes an optional Boolean parameter called clip, which defaults to false. When false, vectors with out-of-range values (values not between –65504.0 and +65504.0) are rejected. Setting clip to true rounds out-of-range vector values to fit within the supported range. For detailed implementation steps, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency, see Optimizing OpenSearch with Faiss FP16 scalar quantization: Enhancing memory efficiency and cost-effectiveness.
Product quantization
Product quantization (PQ) is an advanced vector compression technique that offers significantly higher levels of compression. While conventional scalar quantization methods typically achieve up to 32x compression, PQ can provide compression levels of up to 64x, making it a more efficient solution for optimizing storage and cost. OpenSearch supports PQ with both the IVF and HNSW methods from the FAISS engine. Product quantization partitions vectors into m sub-vectors, each encoded with a bit count determined by the code size. The resulting vector's memory footprint is m * code_size bits.
FAISS product quantization involves three key steps:
- Create and populate a training index to build the PQ model, optimizing for accuracy.
- Execute the _train API on the training index to generate the quantizer model.
- Construct the vector index, configuring the k-NN field to use the prepared quantizer model.
The following example demonstrates the three steps to setting up product quantization.
Step 1: Create the training index. Populate the training index with an appropriate dataset, making sure its dimensions match the training index mapping. Note that the training index requires a minimum of 256 documents.
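A sketch of the training index (the index name, field name, and dimension are illustrative; the dimension must match the vectors you plan to index):

```json
PUT /train-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "train-field": {
        "type": "knn_vector",
        "dimension": 384
      }
    }
  }
}
```

Populate train-field with a representative sample of at least 256 vectors (for example, using the _bulk API) before training.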
Step 2: Create a quantizer model called my-model by running the _train API on the training index you just created. Note that the encoder with name defined as pq facilitates native vector quantization. Other encoder parameters include code_size and m. FAISS-HNSW requires a code_size of 8 and a training dataset of at least 256 (2^code_size) documents. For detailed parameter specifications, see the PQ parameter reference.
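A sketch of the training call, assuming the train-index and train-field shown in step 1; the pq encoder parameters code_size and m control the compression:

```json
POST /_plugins/_knn/models/my-model/_train
{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 384,
  "description": "PQ quantizer model for 384-dimension vectors",
  "method": {
    "name": "hnsw",
    "engine": "faiss",
    "space_type": "l2",
    "parameters": {
      "m": 16,
      "ef_construction": 100,
      "encoder": {
        "name": "pq",
        "parameters": {
          "code_size": 8,
          "m": 16
        }
      }
    }
  }
}
```

Training runs asynchronously; you can poll GET /_plugins/_knn/models/my-model until the model state is created.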
Step 3: Map the quantizer model to your vector index.
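A sketch of the vector index that references the trained model; when a knn_vector field specifies a model_id, the dimension and method come from the model, so they aren't repeated here:

```json
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "model_id": "my-model"
      }
    }
  }
}
```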
Ingest the complete dataset into the newly created index, my-vector-index. The encoder automatically processes the incoming vectors, applying encoding and quantization based on the compression parameters (code_size and m) specified in the quantizer model configuration.
Memory requirements for implementing product quantization with FAISS-HNSW:
1.1 * ((code_size / 8) * m + 24 + 8 * hnsw_m) * num_vectors bytes. Here, code_size and m are parameters of the pq encoder (the number of encoding bits and the number of sub-vectors, respectively), hnsw_m is the HNSW graph parameter m discussed earlier, and num_vectors is the total number of vectors.
During quantization, each training vector is broken down into multiple sub-vectors (sub-spaces), defined by the configurable value m. The number of bits used to encode each sub-vector is controlled by the code_size parameter. Each sub-vector is then compressed or quantized separately by running k-means clustering with k defined as 2^code_size. With this technique, each vector is compressed to roughly m * code_size bits.
For detailed implementation guidelines and understanding of the configurable parameters during product quantization, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput and latency using FAISS IVF for PQ, see Choose the k-NN algorithm for your billion-scale use case with OpenSearch.
Disk-based vector search
Disk-based vector search optimizes query efficiency by using compressed vectors in memory while maintaining full-precision vectors on disk. This approach enables OpenSearch to perform searches across large vector datasets without loading entire vectors into memory, improving scalability and resource utilization. Implementation is achieved through two new configurations at index creation: mode and compression level. As of OpenSearch 2.17, the mode parameter can be set to either in_memory or on_disk during indexing. The previously discussed methods default to the in_memory mode. In this configuration, the vector index is constructed using either a graph (HNSW) or bucket (IVF) structure, which is then loaded into native memory during search operations. While offering excellent recall, this approach can impact memory usage and scalability for high-volume vector workloads.
The on_disk mode optimizes vector search efficiency by storing full-precision vectors on disk while using real-time, native quantization during indexing. Coupled with adjustable compression levels, this approach allows only compressed vectors to be loaded into memory, thereby improving memory and resource utilization and search performance. The following compression levels correspond to the scalar quantization methods discussed earlier.
- 32x: Binary quantization (1-bit dimensions)
- 4x: Byte and integer quantization (8-bit dimensions)
- 2x: FP16 quantization (16-bit dimensions)
This method also supports other compression levels, such as 16x and 8x, that aren't available with the in-memory mode. To enable disk-based vector search, create the index with mode set to on_disk, as shown in the following example.
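A minimal sketch (index and field names are illustrative, and the dimension assumes a 384-dimension embedding model):

```json
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 384,
        "space_type": "l2",
        "mode": "on_disk"
      }
    }
  }
}
```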
Configuring just the mode as on_disk employs the default configuration, which uses the FAISS engine and the HNSW method with a 32x compression level (1-bit binary quantization). The ef_construction parameter, which optimizes index-time latency, defaults to 100. For more granular fine-tuning, you can override these k-NN parameters as shown in the example that follows.
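For example, the following sketch (again with illustrative names and values) keeps on_disk mode but overrides the compression level to 16x and raises ef_construction to 512 to favor recall at the cost of indexing time:

```json
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 384,
        "space_type": "l2",
        "mode": "on_disk",
        "compression_level": "16x",
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 512
          }
        }
      }
    }
  }
}
```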
Because quantization is a lossy compression technique, higher compression levels typically result in lower recall. To improve recall during quantization, you can configure disk-based vector search to run in two phases using the search-time configuration parameters ef_search and oversample_factor, as shown in the following example.
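A sketch of such a two-phase query (the query vector is shortened to three values for readability; in a real query it must match the indexed dimension, and the ef_search and oversample_factor values shown are illustrative):

```json
GET /my-vector-index/_search
{
  "size": 5,
  "query": {
    "knn": {
      "my_vector_field": {
        "vector": [0.12, -0.45, 0.91],
        "k": 5,
        "method_parameters": {
          "ef_search": 512
        },
        "rescore": {
          "oversample_factor": 10.0
        }
      }
    }
  }
}
```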
In the first phase, oversample_factor * k results are retrieved from the quantized vectors in memory and the scores are approximated. In the second phase, the full-precision vectors of those oversample_factor * k results are loaded into memory from disk, and scores are recomputed against the full-precision query vector. The results are then reduced to the top k.
The oversample_factor for rescoring is determined by the dimension and compression level configured at indexing time. For dimensions below 1,000, the factor is fixed at 5. For dimensions exceeding 1,000, the default factor varies based on the compression level, as shown in the following table.
| Compression level | Default oversample_factor for rescoring |
| --- | --- |
| 32x (default) | 3 |
| 16x | 2 |
| 8x | 2 |
| 4x | No default rescoring |
| 2x | No default rescoring |
As previously discussed, the oversample_factor can be dynamically adjusted at search time. This value presents a critical trade-off between accuracy and search efficiency. While a higher factor improves accuracy, it proportionally increases memory usage and reduces search throughput. See the OpenSearch documentation to learn more about disk-based vector search and the right usage of oversample_factor.
Performance assessment of quantization methods: Reviewing memory, recall, and query latency
The OpenSearch documentation on approximate k-NN search provides a starting point for implementing vector similarity search. Additionally, Choose the k-NN algorithm for your billion-scale use case with OpenSearch offers valuable insights into designing efficient vector workloads for handling billions of vectors in production environments. It introduces product quantization techniques as a potential solution to reduce memory requirements and associated costs by scaling down the memory footprint.
The following table illustrates the memory requirements for storing and searching through 1 billion vectors using various quantization techniques. The table compares the default memory consumption of full-precision vectors using the HNSW method against the memory consumed by quantized vectors. The model employed in this analysis is sentence-transformers/all-MiniLM-L12-v2, which operates with 384 dimensions. The raw metadata is assumed to be no more than 100 GB.
| | Without quantization | Product quantization | Scalar quantization: FP16 vectors | Scalar quantization: Byte vectors | Scalar quantization: Binary vectors |
| --- | --- | --- | --- | --- | --- |
| m value | 16 | 16 | 16 | 16 | 16 |
| pq_m, code_size | – | 16, 8 | – | – | – |
| Native memory consumption (GB) | 1,830.4 | 184.8 | 985.6 | 563.2 | 193.6 |
| Total storage = 100 GB metadata + vectors (GB) | 1,930.4 | 284.8 | 1,085.6 | 663.2 | 293.6 |
Reviewing the preceding table reveals that for a dataset comprising 1 billion vectors, the HNSW graph with 32-bit full-precision vectors requires approximately 1,830 GB of memory. Compression techniques such as product quantization can reduce this to 184.8 GB, while scalar quantization offers varying levels of compression. The following table summarizes the correlation between compression techniques and their impact on key performance indicators, including cost savings, recall rate, and query latency. This analysis builds upon the previous assessment of memory usage to aid in selecting a compression technique that meets your requirements.
The table presents two key search metrics: search latency at the 90th percentile (p90) and recall at 100.
- Search latency @p90 – 90% of search queries complete within this latency.
- Recall@100 – The fraction of the top 100 ground truth neighbors found in the 100 results returned.
| | Without quantization | Product quantization | FP16 quantization (mode=in_memory) | Byte quantization (mode=in_memory) | Binary quantization (mode=on_disk) |
| --- | --- | --- | --- | --- | --- |
| Preconditions/datasets | Applicable to all datasets | Recall depends on the nature of the training data | Works for vector values in the range [–65504.0, 65504.0] | Works for vector values in the range [–128, 127] | Works well for larger dimensions (>= 768) |
| Preprocessing required? | No | Yes, preprocessing/training is required | No | No | No |
| Rescoring | No | No | No | No | Yes |
| Recall@100 | >= 0.99 | > 0.7 | >= 0.95 | >= 0.95 | >= 0.90 |
| p90 query latency (ms) | < 50 | < 50 | < 50 | < 50 | < 200 |
| Cost (baseline $X) | $X | $0.1*X (up to 90% savings) | $0.5*X (up to 50% savings) | $0.25*X (up to 75% savings) | $0.15*X (up to 85% savings) |
| Sample cost for a billion vectors | $20,923.14 | $2,092.31 | $10,461.57 | $5,230.79 | $3,138.47 |
The sample cost estimate for a billion vectors is based on a configuration optimized for cost. Note that actual savings may vary based on your specific workload requirements and chosen configuration parameters. Notably, product quantization offers up to 90% cost reduction compared to the baseline HNSW graph-based vector search cost ($X). Scalar quantization similarly yields proportional cost savings, ranging from 50% to 85%, relative to the compressed memory footprint. The choice of compression technique involves balancing cost-effectiveness, accuracy, and performance, as it impacts precision and latency.
Conclusion
By leveraging OpenSearch’s quantization techniques, organizations can make informed tradeoffs between cost efficiency, performance, and recall, empowering them to fine-tune their vector database operations for optimal results. These quantization techniques significantly reduce memory requirements, improve query efficiency and offer built-in encoders for seamless compression. Whether you’re dealing with large-scale text embeddings, image features, or any other high-dimensional data, OpenSearch’s quantization techniques offer efficient solutions for vector search requirements, enabling the development of cost-effective, scalable, and high-performance systems.
As you move forward with your vector database projects, we encourage you to:
- Explore OpenSearch’s compression techniques in-depth
- Evaluate the applicability of each technique to your specific use case
- Determine the appropriate compression levels based on your requirements for recall and search latency
- Measure and compare cost savings based on accuracy, throughput, and latency
Stay informed about the latest developments in this rapidly evolving field, and don’t hesitate to experiment with different quantization techniques to find the optimal balance between cost, performance, and accuracy for your applications.
About the Authors
Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience.
Vamshi Vijay Nakkirtha is a software engineering manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems. He is an active contributor to various OpenSearch projects such as k-NN, Geospatial, and dashboard-maps.