This pipeline lets users index data with an extra step that adds vector data for a few fields, such as Name. The vector fields (name_vector and desc_vector) are created with the dense_vector type.
ElasticSearch 8.0 or higher is required for the `dense_vector` type to be indexable.
Index
The pipeline can work with any index; for the sake of example, the index in the pipeline file is set to app-store-data. This can be changed according to your requirements.
The index, however, requires a mapping to be set up front. This mapping ensures that the fields being stored as vectors are of the proper type (so that ANN search can run) and are indexable.
The following request sets the name_vector and desc_vector fields to the dense_vector type and makes them indexable, with the similarity set to cosine.
curl --location --request PUT 'https://{{host}}:{{port}}/app-store-data' \
--header 'Content-Type: application/json' \
--data-raw '{
"mappings": {
"properties": {
"name_vector": {
"type": "dense_vector",
"dims": 768,
"index": true,
"similarity": "cosine"
},
"desc_vector": {
"type": "dense_vector",
"dims": 768,
"index": true,
"similarity": "cosine"
}
}
}
}'

In the above, replace `{{host}}` with the host and `{{port}}` with the port on which ElasticSearch is listening.
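Since the same dims value must appear in every vector field of the mapping, the request body can also be built programmatically. The helper below is a hypothetical convenience, not part of the pipeline; it only constructs the JSON body shown above:

```python
import json


def dense_vector_mapping(fields, dims=768, similarity="cosine"):
    """Build an index-creation body with dense_vector mappings for the given fields."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": similarity,
                }
                for field in fields
            }
        }
    }


body = dense_vector_mapping(["name_vector", "desc_vector"], dims=768)
print(json.dumps(body, indent=2))
```

The resulting JSON can then be sent as the body of the PUT request above.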
It is important that the dimensions of the vector fields match the output of the embedding model being used. Since we are using clip-as-service, the dimension will be 512, so the mapping needs to be set as follows (the dimension is indicated by the `dims` field):
"desc_vector": {
"type": "dense_vector",
"dims": 512,
"index": true,
"similarity": "cosine"
}

The endpoint is defined using a pipeline file. The pipeline consists of the following steps:
- authorization: Authorize the user credentials to make sure they are valid.
- convert fields to vector: This stage uses clip-as-service to convert the passed `Name` and `Description` fields into vectors and generates a body to send to ElasticSearch.
- merge vector: The generated vector values are merged into the request body in this stage.
- index data: Utilize the prebuilt `elasticsearchQuery` stage to index the data into ElasticSearch.
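The flow of the stages above can be sketched roughly as follows. All names here are illustrative (the real stages are declared in the pipeline file), and the `encode` callable stands in for a clip-as-service client:

```python
def run_pipeline(doc, encode, index_fn, authorized=True):
    """Illustrative flow: authorization -> convert fields -> merge -> index."""
    # authorization: reject invalid credentials before doing any work
    if not authorized:
        raise PermissionError("invalid credentials")
    # convert fields to vector: encode Name and Description
    # (an async stage in the real pipeline, so results go into the context)
    context = {
        "name_vector": encode(doc["Name"]),
        "desc_vector": encode(doc["Description"]),
    }
    # merge vector: add the new vector fields from the context into the body
    body = {**doc, **context}
    # index data: hand the merged body to ElasticSearch
    return index_fn(body)


# Usage with stand-ins: a stub 512-dim encoder and an index_fn that echoes the body
encode = lambda text: [0.0] * 512
indexed = run_pipeline({"Name": "MyApp", "Description": "demo"}, encode, index_fn=lambda b: b)
print(sorted(indexed.keys()))
```

In the real pipeline, `index_fn` corresponds to the `elasticsearchQuery` stage issuing the indexing request.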
We need a separate merge stage because the conversion stage is asynchronous. The pipelines do not allow asynchronous stages to modify fields that already exist in the context; however, new fields can be added. This is why the converted vector fields are added to the context as new fields. In the merge stage, these vectors are extracted from the context and merged into the request body.
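The merge step can be sketched as below. The field names are the ones used in the mapping, but the function itself is a hypothetical illustration of copying new context fields into the body without mutating the original document:

```python
def merge_vectors(request_body, context, fields=("name_vector", "desc_vector")):
    """Copy vector fields produced by the async conversion stage from the
    pipeline context into the outgoing request body."""
    merged = dict(request_body)  # leave the original body untouched
    for field in fields:
        if field in context:
            merged[field] = context[field]
    return merged


doc = {"Name": "MyApp", "Description": "An example app"}
ctx = {"name_vector": [0.1] * 512, "desc_vector": [0.2] * 512}
print(sorted(merge_vectors(doc, ctx).keys()))
```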