
Cost Optimized Vector Database: Introduction to Amazon OpenSearch Service quantization techniques


The rise of generative AI applications has heightened the need to implement semantic search and natural language search. These advanced search features help find and retrieve conceptually similar documents from enterprise content repositories to serve as prompts for generative AI models. Raw data in various source repositories, in the form of text, images, audio, video, and so on, is converted with the help of embedding models into a common numerical representation called vectors, which powers semantic and natural language search. As organizations harness more sophisticated large language and foundation models to power their generative AI applications, the accompanying embedding models are also evolving to handle large, high-dimension vector embeddings. As the vector volume expands, there is a proportional increase in memory usage and computational requirements, resulting in higher operational costs. To mitigate this issue, various compression techniques can be used to optimize memory usage and computational efficiency.

Quantization is a lossy data compression technique aimed at lowering computation and memory usage, leading to lower costs, especially for high-volume data workloads. There are various ways to compress data depending on its type and volume. The typical approach is to map infinite values (or a comparatively large list of finite ones) to a smaller, more discrete set of values. Vector compression can be achieved through two primary techniques: product quantization and scalar quantization. In the product quantization technique, the original vector dimension array is broken into multiple sub-vectors and each sub-vector is encoded into a fixed number of bits that represent the original vector. You then only store and search across the encoded sub-vectors instead of the original vector. In scalar quantization, each dimension of the input vector is mapped from a 32-bit floating-point representation to a smaller data type.

Amazon OpenSearch Service, as a vector database, supports scalar and product quantization techniques to optimize memory usage and reduce operational costs.

OpenSearch as a vector database

OpenSearch is a distributed search and analytics service. The OpenSearch k-nearest neighbor (k-NN) plugin lets you index, store, and search vectors. Vectors are stored in OpenSearch as 32-bit float arrays of type knn_vector, which supports up to 16,000 dimensions per vector.

OpenSearch uses approximate nearest neighbor search to provide scalable vector search. The approximate k-NN algorithm retrieves results based on an estimation of the nearest vectors to a given query vector. Two main methods for performing approximate k-NN are the graph-based Hierarchical Navigable Small World (HNSW) and the cluster-based Inverted File (IVF). These data structures are built and loaded into memory during the initial vector search operation. As vector volume grows, both the data structures and the associated memory requirements for search operations scale proportionally.

For example, each HNSW graph with 32-bit float data takes roughly 1.1 * (4 * d + 8 * m) * num_vectors bytes of memory. Here, num_vectors represents the total number of vectors to be indexed, d is the number of dimensions determined by the embedding model you use to generate the vectors, and m is the number of edges in the HNSW graph, an index parameter that can be tuned for performance. Using this formula, the memory requirements for vector storage with 384 dimensions and an m value of 16 would be as follows (a quick calculation that reproduces these figures appears after the list):

  • 1 million vectors: 1.83 GB (1.1 * (4 * 384 + 8 * 16) * 1,000,000 bytes)
  • 1 billion vectors: 1830 GB (1.1 * (4 * 384 + 8 * 16) * 1,000,000,000 bytes)
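
As a quick check of that formula, here is a minimal Python sketch (the helper name hnsw_memory_gb is illustrative, not an OpenSearch API) that reproduces both figures:

# Minimal sketch: evaluate the HNSW memory approximation
# 1.1 * (4 * d + 8 * m) * num_vectors for 32-bit float vectors.
def hnsw_memory_gb(num_vectors: int, d: int, m: int) -> float:
    bytes_needed = 1.1 * (4 * d + 8 * m) * num_vectors
    return bytes_needed / 10**9  # decimal gigabytes

print(hnsw_memory_gb(1_000_000, d=384, m=16))      # ~1.83 GB
print(hnsw_memory_gb(1_000_000_000, d=384, m=16))  # ~1830.4 GB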

Although approximate nearest neighbor search can be optimized to handle massive datasets with billions of vectors efficiently, the memory requirements for loading 32-bit full-precision vectors into memory during the search process can become prohibitively expensive. To mitigate this, OpenSearch Service supports the following four quantization techniques.

  • Binary quantization
  • Byte quantization
  • FP16 quantization
  • Product quantization

These techniques fall within the broader categories of scalar and product quantization discussed earlier. In this post, you'll learn quantization techniques for optimizing vector workloads on OpenSearch Service, focusing on memory reduction and cost efficiency. The post also introduces the new disk-based vector search approach, which enables efficient querying of vectors stored on disk without loading them into memory. This method integrates seamlessly with quantization techniques, featuring key configurations such as the on_disk mode and the compression_level parameter. These settings provide built-in, out-of-the-box scalar quantization at indexing time.

Binary quantization (up to 32x compression)

Binary quantization (BQ) is a type of scalar quantization. OpenSearch uses the FAISS engine's binary quantization, enabling up to 32x compression during indexing. This technique reduces each vector dimension from the default 32-bit float to a 1-bit binary value by compressing the vectors into 0s and 1s. OpenSearch supports indexing, storing, and searching binary vectors. You can also choose to encode each vector dimension using 1, 2, or 4 bits, depending on the desired compression factor, as shown in the example below. The compression factor can be adjusted using the bits setting: a value of 2 yields 16x compression, while 4 results in 8x compression. The default setting is 1. In binary quantization, training is handled natively at indexing time, allowing you to avoid an additional preprocessing step.

To implement binary quantization, define the vector type as knn_vector and specify the encoder name as binary with the desired number of encoding bits. Note that the encoder parameter refers to a method used to compress vector data before storing it in the index. Optimize performance by using the space_type, m, and ef_construction parameters. See the OpenSearch documentation for details about the underlying configuration of approximate k-NN.

PUT my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "sort": "knn_vector",
        "dimension": 8,
        "methodology": {
          "title": "hnsw",
          "engine": "faiss",
          "space_type": "l2",
          "parameters": {
            "m": 16,
            "ef_construction": 512,
            "encoder": {
              "title": "binary",
              "parameters": {
                "bits": 1
              }
            }
          }
        }
      }
    }
  }
}
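
To exercise the index, you can ingest a float vector and run an approximate k-NN query; the 1-bit quantization happens transparently at index time. The following Python sketch uses the requests library against a hypothetical domain endpoint with placeholder credentials and illustrative vector values:

import requests

HOST = "https://my-domain.us-east-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("admin", "password")                            # placeholder credentials

# Index one 8-dimensional float vector; OpenSearch quantizes it to 1 bit per
# dimension at index time because of the binary encoder in the mapping.
doc = {"my_vector_field": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]}
requests.put(f"{HOST}/my-vector-index/_doc/1", json=doc,
             params={"refresh": "true"}, auth=AUTH)

# Run an approximate k-NN query against the same field.
query = {
    "size": 2,
    "query": {
        "knn": {
            "my_vector_field": {
                "vector": [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
                "k": 2
            }
        }
    }
}
response = requests.post(f"{HOST}/my-vector-index/_search", json=query, auth=AUTH)
print(response.json()["hits"]["hits"])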

Memory requirements for implementing binary quantization with FAISS-HNSW:

1.1 * (bits * (d / 8) + 8 * m) * num_vectors bytes.

Compression    Encoding bits    Memory required for 1 billion vectors with d=384 and m=16 (in GB)
32x            1                193.6
16x            2                246.4
8x             4                352.0

For detailed implementation steps on binary quantization, see the OpenSearch documentation.

Byte quantization (4x compression)

Byte quantization compresses 32-bit floating-point dimensions to 8-bit integers ranging from –128 to +127, reducing memory usage by 75%. OpenSearch supports indexing, storing, and searching byte vectors, which must be converted to 8-bit format prior to ingestion. To implement byte vectors, specify the k-NN vector field data_type as byte in the index mapping. This feature is compatible with both the Lucene and FAISS engines. An example of creating an index for byte-quantized vectors follows.

PUT /my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "sort": "knn_vector",
        "dimension": 3,
        "data_type": "byte",
        "space_type": "l2",
        "methodology": {
          "title": "hnsw",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 100,
            "m": 16
          }
        }
      }
    }
  }
}

This method requires ingesting an already byte-quantized vector into OpenSearch for direct storage in the k-NN vector field (of byte type). However, the recently launched disk-based vector search feature eliminates the need for external vector quantization. That feature is discussed in detail later in this post.
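
For in_memory byte vectors, the quantization step therefore happens outside OpenSearch. As a rough illustration only, the following Python sketch applies simple linear min/max scaling into the signed 8-bit range; the helper name, scaling bounds, and sample values are assumptions, and you should choose a scaling strategy suited to your embedding distribution:

# Minimal sketch: linearly rescale 32-bit float dimensions into the signed
# 8-bit range [-128, 127] before ingestion. The global min/max bounds used
# here are an illustrative choice, not an OpenSearch requirement.
import numpy as np

def to_byte_vector(vec: np.ndarray, lo: float, hi: float):
    scaled = (vec - lo) / (hi - lo)              # map [lo, hi] -> [0, 1]
    quantized = np.round(scaled * 255.0) - 128   # map [0, 1] -> [-128, 127]
    return np.clip(quantized, -128, 127).astype(np.int8).tolist()

float_vector = np.array([1.5, 5.5, 4.5], dtype=np.float32)
print(to_byte_vector(float_vector, lo=-10.0, hi=10.0))
# The resulting int8 list is what you ingest into the byte-typed knn_vector field.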

Memory requirements for implementing byte quantization with FAISS-HNSW:

1.1 * (1 * d + 8 * m) * num_vectors bytes.

For detailed implementation steps, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency, see Byte-quantized vectors in OpenSearch.

FAISS FP16 quantization (2x compression)

FP16 quantization is a technique that uses a 16-bit floating-point scalar representation, reducing memory usage by 50%. Each vector dimension is converted from a 32-bit to a 16-bit floating-point value, effectively halving the memory requirements. The compressed vector dimensions must be in the range [–65504.0, 65504.0]. To implement FP16 quantization, create the index with the k-NN vector field and configure the following:

  • Set the k-NN vector field method and engine to HNSW and FAISS, respectively.
  • Define the encoder parameter and set its name to sq and its type to fp16.

When you upload 32-bit floating-point vectors to OpenSearch, scalar quantization FP16 (SQfp16) automatically quantizes them to 16-bit floating-point vectors during ingestion and stores them in the vector field. The following example demonstrates creating an index for quantizing and storing FP16-quantized vectors.

PUT /my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "sort": "knn_vector",
        "dimension": 3,
        "space_type": "l2",
        "methodology": {
          "title": "hnsw",
          "engine": "faiss",
          "parameters": {
            "encoder": {
              "title": "sq",
              "parameters": {
                "sort": "fp16",
                "clip": true
              }
            },
            "ef_construction": 256,
            "m": 8
          }
        }
      }
    }
  }
}

Memory requirements for implementing FP16 quantization with FAISS-HNSW:

(1.1 * (2 * d + 8 * m) * num_vectors) bytes.

The preceding FP16 example includes an optional Boolean parameter called clip, which defaults to false. When false, vectors with out-of-range values (values not between –65504.0 and +65504.0) are rejected. Setting clip to true rounds out-of-range vector values to fit within the supported range. For detailed implementation steps, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency, see Optimizing OpenSearch with Faiss FP16 scalar quantization: Enhancing memory efficiency and cost-effectiveness.
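
As a quick illustration of the clip behavior (assuming the my-vector-index mapping shown above, with a hypothetical endpoint and placeholder credentials), the following sketch indexes a document whose first two dimensions fall outside the FP16 range:

import requests

HOST = "https://my-domain.us-east-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("admin", "password")                            # placeholder credentials

# 70000.0 and -70000.0 fall outside [-65504.0, 65504.0]; because the encoder was
# created with "clip": true, they are rounded to the nearest bound at ingestion.
# With "clip": false (the default), this document would be rejected instead.
doc = {"my_vector1": [70000.0, -70000.0, 100.0]}
response = requests.put(f"{HOST}/my-vector-index/_doc/1", json=doc, auth=AUTH)
print(response.status_code)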

Product quantization

Product quantization (PQ) is an advanced dimension-reduction technique that offers significantly higher levels of compression. While typical scalar quantization methods usually achieve up to 32x compression, PQ can provide compression levels of up to 64x, making it a more efficient solution for optimizing storage and cost. OpenSearch supports PQ with both the IVF and HNSW methods from the FAISS engine. Product quantization partitions vectors into m sub-vectors, each encoded with a bit count determined by the code size. The resulting vector's memory footprint is m * code_size bits.

FAISS product quantization involves three key steps:

  1. Create and populate a training index to build the PQ model, optimizing for accuracy.
  2. Run the _train API on the training index to generate the quantizer model.
  3. Construct the vector index, configuring the k-NN field to use the prepared quantizer model.

The following example demonstrates the three steps to set up product quantization.

Step 1: Create the training index. Populate the training index with a suitable dataset, making sure its dimensions align with the train-index specification. Note that the training index requires a minimum of 256 documents.

PUT /train-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "train-field": {
        "sort": "knn_vector",
        "dimension": 4
      }
    }
  }
}
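
One way to populate train-index is with the _bulk API. The following Python sketch loads 256 random 4-dimensional vectors purely as a placeholder for a representative training dataset; the endpoint and credentials are hypothetical:

# Minimal sketch: bulk-load 256 random 4-dimensional vectors into train-index so
# the _train API has the minimum number of documents it needs. Replace the random
# data with vectors representative of your real corpus.
import json
import random
import requests

HOST = "https://my-domain.us-east-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("admin", "password")                            # placeholder credentials

lines = []
for i in range(256):
    lines.append(json.dumps({"index": {"_index": "train-index", "_id": str(i)}}))
    lines.append(json.dumps({"train-field": [random.uniform(-1, 1) for _ in range(4)]}))
body = "\n".join(lines) + "\n"

resp = requests.post(
    f"{HOST}/_bulk",
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
    auth=AUTH,
)
print(resp.json()["errors"])  # False if every document was indexed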

Step 2: Create a quantizer model called my-model by running the _train API on the training index you just created. Note that the encoder with name defined as pq enables native vector quantization. Other parameters for the encoder include code_size and m. FAISS-HNSW requires a code_size of 8 and a training dataset of at least 256 (2^code_size) documents. For detailed parameter specifications, see the PQ parameter reference.

POST /_plugins/_knn/models/my-model/_train
{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 4,
  "description": "My take a look at mannequin description",
  "methodology": {
    "title": "hnsw",
    "engine": "faiss",
    "parameters": {
      "encoder": {
        "title": "pq", 
         "parameters": {
           "code_size":8,
           "m":2
         }
      },
      "ef_construction": 256,
      "m": 8
    }
  }
}
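
Training runs asynchronously, so it's worth confirming the model has finished before moving to step 3. A minimal sketch that polls the k-NN get-model API (hypothetical endpoint and placeholder credentials) might look like this:

# Minimal sketch: poll the k-NN model until training completes before mapping it
# to the target index.
import time
import requests

HOST = "https://my-domain.us-east-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("admin", "password")                            # placeholder credentials

while True:
    model = requests.get(f"{HOST}/_plugins/_knn/models/my-model", auth=AUTH).json()
    state = model.get("state")
    print("model state:", state)
    if state != "training":   # "created" means ready; "failed" means training failed
        break
    time.sleep(5)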

Step 3: Map the quantizer model to your vector index.

PUT /my-vector-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2,
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "target-field": {
        "sort": "knn_vector",
        "model_id": "my-model"
      }
    }
  }
}

Ingest the complete dataset into the newly created index, my-vector-index. The encoder automatically processes the incoming vectors, applying encoding and quantization based on the compression parameters (code_size and m) specified in the quantizer model configuration.

Memory requirements for implementing product quantization with FAISS-HNSW:

1.1 * (((code_size / 8) * m + 24 + 8 * m) * num_vectors) bytes. Here, code_size and the first m are the encoder parameters, the second m term is the HNSW graph parameter, and num_vectors is the total number of vectors.

During quantization, each training vector is broken down into multiple sub-vectors, or sub-spaces, defined by a configurable value m. The number of bits used to encode each sub-vector is controlled by the parameter code_size. Each sub-vector is then compressed, or quantized, separately by running k-means clustering with the value k defined as 2^code_size. In this way, the vector is compressed to roughly m * code_size bits.
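
To make those mechanics concrete, the following numpy sketch encodes a single vector with m = 2 and code_size = 8; the random codebooks stand in for the centroids that the _train step learns with k-means, so this is purely illustrative:

# Minimal sketch of product-quantization encoding: split a d-dimensional vector
# into m sub-vectors and replace each sub-vector with the index of its nearest
# centroid. Random codebooks stand in for trained k-means centroids; the real
# quantizer lives inside the FAISS engine.
import numpy as np

d, m, code_size = 4, 2, 8
k = 2 ** code_size                 # 256 centroids per sub-space
sub_dim = d // m

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(m, k, sub_dim))   # placeholder "trained" codebooks
vector = rng.normal(size=d).astype(np.float32)

codes = []
for i in range(m):
    sub = vector[i * sub_dim:(i + 1) * sub_dim]
    distances = np.linalg.norm(codebooks[i] - sub, axis=1)
    codes.append(int(np.argmin(distances)))    # one code_size-bit code per sub-vector

print(codes)  # the encoded vector: m codes of code_size bits = 2 * 8 = 16 bits total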

For detailed implementation guidelines and an explanation of the configurable parameters for product quantization, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency using FAISS IVF for PQ, see Choose the k-NN algorithm for your billion-scale use case with OpenSearch.

Disk-based vector search

Disk-based vector search optimizes query efficiency by using compressed vectors in memory while maintaining full-precision vectors on disk. This approach enables OpenSearch to perform searches across large vector datasets without loading entire vectors into memory, improving scalability and resource utilization. Implementation is achieved through two new configurations at index creation: mode and compression level. As of OpenSearch 2.17, the mode parameter can be set to either in_memory or on_disk during indexing. The previously discussed methods default to in_memory mode. In this configuration, the vector index is built using either a graph (HNSW) or bucket (IVF) structure, which is then loaded into native memory during search operations. While offering excellent recall, this approach can affect memory usage and scalability for high-volume vector workloads.

The on_disk mode optimizes vector search efficiency by storing full-precision vectors on disk while applying real-time, native quantization during indexing. Coupled with adjustable compression levels, this approach allows only compressed vectors to be loaded into memory, thereby improving memory and resource utilization and search performance. The following compression levels correspond to the scalar quantization methods discussed earlier.

  • 32x: Binary quantization (1-bit dimensions)
  • 4x: Byte and integer quantization (8-bit dimensions)
  • 2x: FP16 quantization (16-bit dimensions)

This method also supports other compression levels, such as 16x and 8x, that aren't available with the in_memory mode. To enable disk-based vector search, create the index with mode set to on_disk, as shown in the following example.

PUT /my-vector-index
{
  "settings" : {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "sort": "knn_vector",
        "dimension": 8,
        "space_type": "innerproduct",
        "data_type": "float",
        "mode": "on_disk"
      }
    }
  }
}

Configuring just the mode as on_disk applies the default configuration, which uses the FAISS engine and HNSW method with a 32x compression level (1-bit, binary quantization). The ef_construction parameter, which optimizes index-time latency, defaults to 100. For more granular fine-tuning, you can override these k-NN parameters as shown in the example that follows.

PUT /my-vector-index
{
  "settings" : {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "sort": "knn_vector",
        "dimension": 8,
        "space_type": "innerproduct",
        "data_type": "float",
        "mode": "on_disk",
        "compression_level": "16x",
        "methodology": {
          "title": "hnsw",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 512
          }
        }
      }
    }
  }
}

Because quantization is a lossy compression technique, higher compression levels usually result in lower recall. To improve recall during quantization, you can configure disk-based vector search to run in two phases using the search-time configuration parameters ef_search and oversample_factor, as shown in the following example.

GET my-vector-index/_search
{
  "question": {
    "knn": {
      "my_vector_field": {
        "vector": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5],
        "ok": 5,
        "method_parameters": {
            "ef_search": 512
        },
        "rescore": {
            "oversample_factor": 10.0
        }
      }
    }
  }
}

In the first phase, oversample_factor * k results are retrieved from the quantized vectors in memory and the scores are approximated. In the second phase, the full-precision vectors of those oversample_factor * k results are loaded into memory from disk, and scores are recomputed against the full-precision query vector. The results are then reduced to the top k.
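
The following numpy sketch mirrors that two-phase flow on synthetic data: an approximate pass over quantized (sign-bit) vectors selects oversample_factor * k candidates, and an exact pass over the full-precision vectors re-ranks them down to the top k:

# Minimal sketch of the two-phase disk-based search: phase 1 ranks candidates with
# quantized (1-bit sign) vectors held "in memory"; phase 2 rescores only the
# oversampled candidates with full-precision vectors read from "disk". All data
# here is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(1)
num_vectors, d, k, oversample_factor = 10_000, 64, 5, 10.0

full_precision = rng.normal(size=(num_vectors, d)).astype(np.float32)  # on disk
quantized = np.sign(full_precision)                                    # in memory
query = rng.normal(size=d).astype(np.float32)

# Phase 1: approximate scores against quantized vectors, keep oversample_factor * k.
approx_scores = quantized @ np.sign(query)
candidates = np.argsort(-approx_scores)[: int(oversample_factor * k)]

# Phase 2: exact inner-product scores for the candidates only, reduce to top k.
exact_scores = full_precision[candidates] @ query
top_k = candidates[np.argsort(-exact_scores)[:k]]
print(top_k)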

The default oversample_factor for rescoring is determined by the dimension and compression level configured at indexing. For dimensions below 1,000, the factor is fixed at 5. For dimensions exceeding 1,000, the default factor varies based on the compression level, as shown in the following table.

Compression level    Default oversample_factor for rescoring
32x (default)        3
16x                  2
8x                   2
4x                   No default rescoring
2x                   No default rescoring

As previously discussed, the oversample_factor can be adjusted dynamically at search time. This value presents a critical trade-off between accuracy and search efficiency: a higher factor improves accuracy, but it proportionally increases memory usage and reduces search throughput. See the OpenSearch documentation to learn more about disk-based vector search and the appropriate usage of oversample_factor.

Performance analysis of quantization methods: Reviewing memory, recall, and query latency

The OpenSearch documentation on approximate k-NN search provides a starting point for implementing vector similarity search. Additionally, Choose the k-NN algorithm for your billion-scale use case with OpenSearch offers valuable insights into designing efficient vector workloads that handle billions of vectors in production environments, introducing product quantization as a way to reduce memory requirements and associated costs by shrinking the memory footprint.

The following table illustrates the memory requirements for storing and searching through 1 billion vectors using the various quantization techniques. The table compares the default memory consumption of full-precision vectors using the HNSW method against the memory consumed by quantized vectors. The model used in this analysis is sentence-transformers/all-MiniLM-L12-v2, which operates with 384 dimensions. The raw metadata is assumed to be no more than 100 GB.

Technique                              m value    pq_m, code_size    Native memory consumption (GB)    Total storage = 100 GB + vectors (GB)
Without quantization                   16         –                  1830.4                            1930.4
Product quantization                   16         16, 8              184.8                             284.8
Scalar quantization: FP16 vectors      16         –                  985.6                             1085.6
Scalar quantization: Byte vectors      16         –                  563.2                             663.2
Scalar quantization: Binary vectors    16         –                  193.6                             293.6

Reviewing the preceding table reveals that for a dataset comprising 1 billion vectors, the HNSW graph with 32-bit full-precision vectors requires roughly 1830 GB of memory. Compression techniques such as product quantization can reduce this to 184.8 GB, while scalar quantization offers varying levels of compression. The following comparison summarizes the correlation between compression techniques and their impact on key performance indicators, including cost savings, recall rate, and query latency. This analysis builds on the memory review above to help you select a compression technique that meets your requirements.

It reports two key search metrics: search latency at the 90th percentile (p90) and recall at 100.

  • Search latency @p90 indicates that 90% of search queries complete within that latency.
  • recall@100 – the fraction of the top 100 ground-truth neighbors found in the 100 results returned.
Without quantization (baseline)
  Preconditions/datasets: applicable to all datasets
  Preprocessing required: no
  Rescoring: no
  Recall@100: >= 0.99
  p90 query latency: < 50 ms
  Cost: $X (baseline)
  Sample cost for a billion vectors: $20,923.14

Product quantization
  Preconditions/datasets: recall depends on the nature of the training data
  Preprocessing required: yes, preprocessing/training is required
  Rescoring: no
  Recall@100: > 0.7
  p90 query latency: < 50 ms
  Cost: $0.1 * X (up to 90% savings)
  Sample cost for a billion vectors: $2,092.31

FP16 quantization (scalar, mode=in_memory)
  Preconditions/datasets: works for dimension values in the range [-65504.0, 65504.0]
  Preprocessing required: no
  Rescoring: no
  Recall@100: >= 0.95
  p90 query latency: < 50 ms
  Cost: $0.5 * X (up to 50% savings)
  Sample cost for a billion vectors: $10,461.57

Byte quantization (scalar, mode=in_memory)
  Preconditions/datasets: works for dimension values in the range [-128, 127]
  Preprocessing required: no
  Rescoring: no
  Recall@100: >= 0.95
  p90 query latency: < 50 ms
  Cost: $0.25 * X (up to 75% savings)
  Sample cost for a billion vectors: $5,230.79

Binary quantization (scalar, mode=on_disk)
  Preconditions/datasets: works well for larger dimensions (>= 768)
  Preprocessing required: no
  Rescoring: yes
  Recall@100: >= 0.90
  p90 query latency: < 200 ms
  Cost: $0.15 * X (up to 85% savings)
  Sample cost for a billion vectors: $3,138.47

The sample cost estimate for a billion vectors is based on a configuration optimized for cost. Note that actual savings may vary based on your specific workload requirements and chosen configuration parameters. Notably, product quantization offers up to 90% cost reduction compared to the baseline HNSW graph-based vector search cost ($X). Scalar quantization similarly yields proportional cost savings, ranging from 50% to 85%, relative to the compressed memory footprint. The choice of compression technique involves balancing cost-effectiveness, accuracy, and performance, because it affects precision and latency.

Conclusion

By using OpenSearch's quantization techniques, organizations can make informed trade-offs between cost efficiency, performance, and recall, empowering them to fine-tune their vector database operations for optimal results. These quantization techniques significantly reduce memory requirements, improve query efficiency, and offer built-in encoders for seamless compression. Whether you're dealing with large-scale text embeddings, image features, or any other high-dimensional data, OpenSearch's quantization techniques offer efficient solutions for vector search requirements, enabling the development of cost-effective, scalable, and high-performance systems.

As you move forward with your vector database projects, we encourage you to:

  1. Explore OpenSearch's compression techniques in depth
  2. Evaluate the applicability of each technique to your specific use case
  3. Determine the appropriate compression levels based on your requirements for recall and search latency
  4. Measure and compare cost savings based on accuracy, throughput, and latency

Stay informed about the latest developments in this rapidly evolving field, and don't hesitate to experiment with different quantization techniques to find the optimal balance between cost, performance, and accuracy for your applications.


About the Authors

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise in correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Vamshi Vijay Nakkirtha is a software engineering manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems. He is an active contributor to various OpenSearch projects such as k-NN, Geospatial, and dashboard-maps.
