
query not returning any result hugging face #244


Closed
mickyarun opened this issue Nov 12, 2024 · 5 comments

@mickyarun

mickyarun commented Nov 12, 2024

I used the notebook from the example vectorizers_04.ipynb.
Step 1:
Use Hugging Face to embed the data with the following code snippet:
sentences = ["That is a happy dog", "That is a happy person", "Today is a sunny day"]
embeddings = hf.embed_many(sentences, as_buffer=True, dtype="float32")

Step 2:
In schema.yml, changed the length of the encoding to 3072.

Step 3:
Loaded the data:
from redisvl.index import SearchIndex

# construct a search index from the schema
index = SearchIndex.from_yaml("./schema.yaml")

# connect to local redis instance
index.connect("redis://localhost:6379")

# create the index (no data yet)
index.create(overwrite=True)

data = [{"text": t,
         "embedding": v}
        for t, v in zip(sentences, embeddings)]

index.load(data)

Step 4:
Running .query produces no output. I tried all the approaches and search is not working.
image

@tylerhutcherson
Collaborator

@justin-cechmanek this might be a small regression in our docs (though I am suspicious since we test those with our CI). Can you look into this today?

@tylerhutcherson added the bug label Nov 12, 2024
@mickyarun
Author

mickyarun commented Nov 12, 2024

This looks similar to issue #219. After I changed the storage_type to json it works; with bytes it doesn't.

@justin-cechmanek
Collaborator

@mickyarun thanks for bringing this up. Can you tell me which Hugging Face model you're using, and share the full json schema and code example that is working for you?

@mickyarun
Author

mickyarun commented Nov 13, 2024

@justin-cechmanek All models work when the data is stored in json format. I am attaching the code that works with json; the same thing doesn't work if the data is stored in bytes.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from redisvl.utils.vectorize import HFTextVectorizer

# create a vectorizer
# choose your model from the huggingface website
hf = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")

# embed a sentence
test = hf.embed("This is a test sentence.")
test[:10]

# You can also create many embeddings at once
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
]
embeddings = hf.embed_many(sentences, dtype="float32")

from redisvl.index import SearchIndex

# construct a search index from the schema
index = SearchIndex.from_yaml("./schema.yaml")

# connect to local redis instance
index.connect("redis://localhost:6379")

# create the index (no data yet)
index.create(overwrite=True)

# notebook shell command: list the existing indices
!rvl index listall

data = [{"text": t,
         "embedding": v}
        for t, v in zip(sentences, embeddings)]

index.load(data)

from redisvl.query import VectorQuery

# use the HuggingFace vectorizer again to create a query embedding
query_embedding = hf.embed("That is a happy cat")

query = VectorQuery(
    vector=query_embedding,
    vector_field_name="embedding",
    return_fields=["text"],
    num_results=3
)

results = index.query(query)
for doc in results:
    print(doc["text"], doc["vector_distance"])

## Schema
version: '0.1.0'

index:
  name: vectorizers
  prefix: doc
  storage_type: json

fields:
  - name: text
    type: text
  - name: embedding
    type: vector
    attrs:
      dims: 768
      algorithm: flat
      distance_metric: cosine
      datatype: float32

With the same code, changing from json to bytes returns no results.
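
For context on the json-versus-bytes distinction, here is a minimal sketch of how a list of floats relates to its byte form. It uses only the standard library `struct` module and a stand-in vector; the little-endian float32 packing is an assumption about what a byte buffer for a `float32` vector field looks like, not something the thread spells out.

```python
import struct

DIMS = 768  # vector size taken from the schema in this thread

# Stand-in for one embedding; a real one would come from the vectorizer.
float_embedding = [i / DIMS for i in range(DIMS)]

# Byte form: pack the values as little-endian float32 (assumed layout).
byte_embedding = struct.pack(f"<{DIMS}f", *float_embedding)

element_count = len(float_embedding)
byte_length = len(byte_embedding)
print(element_count, byte_length)  # 768 3072

# Unpack to confirm the buffer holds exactly DIMS float32 values.
restored = struct.unpack(f"<{DIMS}f", byte_embedding)
```

The byte length (3072) and the element count (768) are different quantities, and the schema's `dims` counts elements. Step 2 of the report mentions 3072, though the thread does not confirm whether that played a role.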

@justin-cechmanek
Collaborator

In testing your script, we get results for both hash and json index types, which require byte and floating-point vector encodings, respectively. You may be running into issues with your two indices overlapping if you use the same prefix for both schemas. Redis supports multi-indexing, so a document may belong to multiple indices. This can cause confusion when you call index.create(overwrite=True), so our advice is to use unique prefixes. Can you try deleting your Redis indices and running the script below to verify that you can query results for both hash and json index types?

from redisvl.utils.vectorize import HFTextVectorizer
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery

# create a vectorizer
hf = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")

# you can also create many embeddings at once
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day",
]

# for Hash use embeddings in byte format
byte_embeddings = hf.embed_many(sentences, dtype="float32", as_buffer=True)

# construct a search index from the schema
hash_index = SearchIndex.from_dict({
    "index": {
        "name": "hash_vectorizers",
        "prefix": "hash_docs",
        "storage_type": "hash",
    },
    "fields": [
        {"name": "text", "type": "text"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": 768,
                "algorithm": "flat",
                "distance_metric": "cosine",
                "datatype": "float32",
            },
        },
    ],
})
hash_index.connect("redis://localhost:6379")
hash_index.create(overwrite=True, drop=True)

data = [{"text": t, "embedding": v} for t, v in zip(sentences, byte_embeddings)]
hash_keys = hash_index.load(data)
print(f'loaded {hash_keys} into hash index')

# now repeat the steps with a JSON storage type
json_index = SearchIndex.from_dict({
    "index": {
        "name": "json_vectorizers",
        "prefix": "json_docs",
        "storage_type": "json",
    },
    "fields": [
        {"name": "text", "type": "text"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": 768,
                "algorithm": "flat",
                "distance_metric": "cosine",
                "datatype": "float32",
            },
        },
    ],
})

json_index.connect("redis://localhost:6379")
json_index.create(overwrite=True, drop=True)

# for JSON use embeddings in floating point
float_embeddings = hf.embed_many(sentences, as_buffer=False)
data = [{"text": t, "embedding": v} for t, v in zip(sentences, float_embeddings)]
json_keys = json_index.load(data)
print(f'loaded {json_keys} into json index')

# you can use the same vector to query either json or hash indices
query_embedding = hf.embed("That is a happy cat")

query = VectorQuery(
        vector=query_embedding,
        vector_field_name="embedding",
        return_fields=["text"],
        num_results=3
)

hash_results = hash_index.query(query)
print('hash results:')
for doc in hash_results:
    print(doc["text"], doc["vector_distance"])

json_results = json_index.query(query)
print('json results:')
for doc in json_results:
    print(doc["text"], doc["vector_distance"])
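
The advice about unique prefixes can also be enforced when the schemas are built. The sketch below uses a hypothetical helper, `build_schema` (not part of redisvl), that derives both the index name and the key prefix from the storage type, so the hash and json schemas cannot share a prefix. The field definitions are copied from the script above.

```python
def build_schema(storage_type: str) -> dict:
    """Hypothetical helper: build a schema dict whose index name and
    key prefix are derived from the storage type, so they never collide."""
    return {
        "index": {
            "name": f"{storage_type}_vectorizers",
            "prefix": f"{storage_type}_docs",
            "storage_type": storage_type,
        },
        "fields": [
            {"name": "text", "type": "text"},
            {
                "name": "embedding",
                "type": "vector",
                "attrs": {
                    "dims": 768,
                    "algorithm": "flat",
                    "distance_metric": "cosine",
                    "datatype": "float32",
                },
            },
        ],
    }


hash_schema = build_schema("hash")
json_schema = build_schema("json")
print(hash_schema["index"]["prefix"], json_schema["index"]["prefix"])  # hash_docs json_docs
```

Each dict can then be passed to `SearchIndex.from_dict`, as the script above does with its inline dictionaries.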
