Skip to content

Conversation

@maxwelljin
Copy link
Contributor

@maxwelljin maxwelljin commented Jun 9, 2023

This PR is designed to enable the indexer (for all supported backends) to check whether a document has already been indexed. We aim to accommodate various backends including in_memory, hnswlib, elastic, qdrant, and weaviate. Given that different backends store and index documents in distinct ways, we need to custom-tailor our function for each backend.

Progress:

  • In-Memory
  • Hnswlib
  • Elastic
  • Qdrant
  • Weaviate

Signed-off-by: maxwelljin2 <gejin@berkeley.edu>
@maxwelljin maxwelljin linked an issue Jun 9, 2023 that may be closed by this pull request
rows = self._sqlite_cursor.fetchall()
return len(rows) > 0
else:
raise NotImplementedError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not seem a proper exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is still in progress, I'll change it later :) It should output proper hint to users

Signed-off-by: maxwelljin2 <gejin@berkeley.edu>
Signed-off-by: maxwelljin2 <gejin@berkeley.edu>
Signed-off-by: maxwelljin2 <gejin@berkeley.edu>
@maxwelljin maxwelljin marked this pull request as ready for review June 12, 2023 08:19
@maxwelljin maxwelljin requested a review from samsja June 13, 2023 07:15
Copy link
Member

@samsja samsja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Good job 🎩

Signed-off-by: maxwelljin2 <gejin@berkeley.edu>
return False

if safe_issubclass(type(item), BaseDoc):
docs = self._get_all_documents()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this is the way to do right? I think u should call __contains__ in every subindex and if one returns True, it is true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's implemented in the line 1205 - 1207. I'll remove all _get_all_documents method, so it would only look for any sub-document inside the DocArray. (so the meaning for this subindex_contains method would similar to subindex_find)

@maxwelljin
Copy link
Contributor Author

With the subindex_contains method, users can verify the presence of specific documents within the indexer, even in the case of nested documents. This method is implemented in the abstract class, allowing various backends to utilize different data storage and retrieval approaches.

Signed-off-by: maxwelljin2 <gejin@berkeley.edu>
Signed-off-by: maxwelljin2 <gejin@berkeley.edu>
Signed-off-by: maxwelljin2 <gejin@berkeley.edu>
Signed-off-by: maxwelljin2 <gejin@berkeley.edu>
@maxwelljin maxwelljin requested a review from JoanFM June 14, 2023 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a method to check if a document is already index

3 participants