query not returning any result hugging face #244

mickyarun · 2024-11-12T15:46:38Z

I used the example notebook vectorizers_04.ipynb.
Step 1:
Use Hugging Face to embed the data with the following code snippet:
`sentences = ["That is a happy dog", "That is a happy person", "Today is a sunny day"]
embeddings = hf.embed_many(sentences, as_buffer=True, dtype="float32")`

Step 2:
In schema.yml, changed the length of the encoding to 3072.

Step 3:
Load the data:
`from redisvl.index import SearchIndex

# construct a search index from the schema
index = SearchIndex.from_yaml("./schema.yaml")

# connect to local redis instance
index.connect("redis://localhost:6379")

# create the index (no data yet)
index.create(overwrite=True)`

`data = [{"text": t, "embedding": v} for t, v in zip(sentences, embeddings)]

index.load(data)`

Step 4:
Running .query returns no output. I tried all the approaches, and search is not working.

tylerhutcherson · 2024-11-12T15:49:12Z

@justin-cechmanek this might be a small regression in our docs (though I am suspicious since we test those with our CI). Can you look into this today?

mickyarun · 2024-11-12T15:53:29Z

This looks similar to issue #219. After I changed storage_type to json it works; with bytes it doesn't work.

justin-cechmanek · 2024-11-12T18:50:57Z

@mickyarun thanks for bringing this up. Can you tell me which Hugging Face model you're using, and share the full JSON schema and code example that is working for you?

mickyarun · 2024-11-13T04:55:44Z

@justin-cechmanek All models work when the data is stored in JSON format. I am attaching the code that works with JSON; the same thing doesn't work when the data is stored as bytes.
`import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
from redisvl.utils.vectorize import HFTextVectorizer

# create a vectorizer
# choose your model from the huggingface website
hf = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")

# embed a sentence
test = hf.embed("This is a test sentence.")
test[:10]

# You can also create many embeddings at once
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
]
embeddings = hf.embed_many(sentences, dtype="float32")

from redisvl.index import SearchIndex

# construct a search index from the schema
index = SearchIndex.from_yaml("./schema.yaml")

# connect to local redis instance
index.connect("redis://localhost:6379")

# create the index (no data yet)
index.create(overwrite=True)

# notebook shell command: list the existing indices
!rvl index listall

data = [{"text": t, "embedding": v} for t, v in zip(sentences, embeddings)]

index.load(data)

from redisvl.query import VectorQuery

# use the HuggingFace vectorizer again to create a query embedding
query_embedding = hf.embed("That is a happy cat")

query = VectorQuery(
    vector=query_embedding,
    vector_field_name="embedding",
    return_fields=["text"],
    num_results=3
)

results = index.query(query)
for doc in results:
    print(doc["text"], doc["vector_distance"])

## Schema
version: '0.1.0'

index:
  name: vectorizers
  prefix: doc
  storage_type: json

fields:
  - name: text
    type: text
  - name: embedding
    type: vector
    attrs:
      dims: 768
      algorithm: flat
      distance_metric: cosine
      datatype: float32`

With the same code, changing the storage from json to bytes returns no results.
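
For context on the json vs. bytes difference: Step 1 of the original report passes `as_buffer=True` to `embed_many`, while the working JSON script above omits it, so the two runs hand the index differently shaped values. The sketch below is a rough illustration of that difference using only the standard library. The helper names `floats_to_buffer` and `buffer_to_floats` are hypothetical and are not part of redisvl, whose own conversion may differ in detail (for example in byte order).

```python
import struct

def floats_to_buffer(vector):
    """Pack a list of floats into little-endian float32 bytes (4 bytes per value)."""
    return struct.pack(f"<{len(vector)}f", *vector)

def buffer_to_floats(buffer):
    """Unpack little-endian float32 bytes back into a list of Python floats."""
    count = len(buffer) // 4
    return list(struct.unpack(f"<{count}f", buffer))

# Stand-in for one 768-dimension embedding (the dims value in the schema above).
vector = [0.5] * 768
buffer = floats_to_buffer(vector)

print(len(vector))  # number of dimensions: 768
print(len(buffer))  # number of bytes: 3072
```

A 768-dimension float32 vector occupies 3072 bytes, which may be where the 3072 in Step 2 comes from. In the schema shown above, `dims` is 768, which matches the number of elements rather than the byte length, so that value is worth double-checking when switching storage types.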

justin-cechmanek · 2024-11-19T23:42:32Z

In testing your script, we get results for both hash and JSON index types, which require byte and floating-point vector encodings, respectively. You may be running into issues with your two indices overlapping if you use the same prefix for both schemas. Redis supports multi-indexing, so a document may belong to multiple indices. This can cause confusion when you call index.create(overwrite=True), so our advice is to use unique prefixes. Can you try deleting your Redis indices and running the script below to verify that you can query results for both hash and JSON index types?

from redisvl.utils.vectorize import HFTextVectorizer
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery

# create a vectorizer
hf = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")

# You can also create many embeddings at once
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day",
]

# for Hash use embeddings in byte format
byte_embeddings = hf.embed_many(sentences, dtype="float32", as_buffer=True)

# construct a search index from the schema
hash_index = SearchIndex.from_dict({
    "index": {
        "name": "hash_vectorizers",
        "prefix": "hash_docs",
        "storage_type": "hash",
    },
    "fields": [
        {"name": "text", "type": "text"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": 768,
                "algorithm": "flat",
                "distance_metric": "cosine",
                "datatype": "float32",
            },
        },
    ],
})
hash_index.connect("redis://localhost:6379")
hash_index.create(overwrite=True, drop=True)

data = [{"text": t, "embedding": v} for t, v in zip(sentences, byte_embeddings)]
hash_keys = hash_index.load(data)
print(f'loaded {hash_keys} into hash index')

# now repeat the steps with a JSON storage type
json_index = SearchIndex.from_dict({
    "index": {
        "name": "json_vectorizers",
        "prefix": "json_docs",
        "storage_type": "json",
    },
    "fields": [
        {"name": "text", "type": "text"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": 768,
                "algorithm": "flat",
                "distance_metric": "cosine",
                "datatype": "float32",
            },
        },
    ],
})

json_index.connect("redis://localhost:6379")
json_index.create(overwrite=True, drop=True)

# for JSON use embeddings in floating point
float_embeddings = hf.embed_many(sentences, as_buffer=False)
data = [{"text": t, "embedding": v} for t, v in zip(sentences, float_embeddings)]
json_keys = json_index.load(data)
print(f'loaded {json_keys} into json index')

# you can use the same vector to query either json or hash indices
query_embedding = hf.embed("That is a happy cat")

query = VectorQuery(
        vector=query_embedding,
        vector_field_name="embedding",
        return_fields=["text"],
        num_results=3
)

hash_results = hash_index.query(query)
print('hash results:')
for doc in hash_results:
    print(doc["text"], doc["vector_distance"])

json_results = json_index.query(query)
print('json results:')
for doc in json_results:
    print(doc["text"], doc["vector_distance"])
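
The advice about unique prefixes can be checked before creating any index. The following is a minimal sketch, assuming the schemas are available as plain dicts shaped like the ones passed to `SearchIndex.from_dict` above. The helper `find_shared_prefixes` and the index name `vectorizers_hash` are hypothetical and not part of redisvl. The check only compares prefixes for exact equality and does not talk to Redis.

```python
from collections import defaultdict

def find_shared_prefixes(schemas):
    """Return {prefix: [index names]} for every prefix used by more than one index."""
    names_by_prefix = defaultdict(list)
    for schema in schemas:
        index = schema["index"]
        names_by_prefix[index["prefix"]].append(index["name"])
    return {
        prefix: names
        for prefix, names in names_by_prefix.items()
        if len(names) > 1
    }

schemas = [
    {"index": {"name": "hash_vectorizers", "prefix": "hash_docs", "storage_type": "hash"}},
    {"index": {"name": "json_vectorizers", "prefix": "json_docs", "storage_type": "json"}},
    # The reporter's JSON schema, plus a hypothetical hash variant reusing its prefix.
    {"index": {"name": "vectorizers", "prefix": "doc", "storage_type": "json"}},
    {"index": {"name": "vectorizers_hash", "prefix": "doc", "storage_type": "hash"}},
]

shared = find_shared_prefixes(schemas)
print(shared)  # {'doc': ['vectorizers', 'vectorizers_hash']}
```

An empty result means no two schemas use the same prefix. Any prefix that does show up is a candidate for the overlap described above.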

tylerhutcherson assigned justin-cechmanek Nov 12, 2024

tylerhutcherson added the bug Something isn't working label Nov 12, 2024

justin-cechmanek closed this as completed Nov 27, 2024
