Release Note
This release contains 2 breaking changes, 3 new features, 11 bug fixes, and 2 documentation improvements.
💣 Breaking Changes
Terminate Python 3.7 support
We decided to drop it for two reasons:
- Several dependencies of DocArray require Python 3.8.
- Python long-term support for 3.7 is ending this week. This means there will no longer
be security updates for Python 3.7, making this a good time for us to change our requirements.
Changes to DocVec Protobuf definition (#1639)
In order to fix a bug in the DocVec protobuf serialization described in #1561,
we have changed the DocVec .proto definition.
This means that DocVec objects serialized with DocArray v0.33.0 or earlier cannot be deserialized with DocArray
v0.34.0 or later, and vice versa.
We recommend that everyone using DocVec upgrade to DocArray v0.34.0 or later.
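Because the DocVec wire format changed at v0.34.0, a producer and a consumer can only exchange protobuf-serialized DocVec objects when both sit on the same side of that version. The sketch below illustrates such a check. The helper `can_exchange_docvec` is hypothetical and not part of DocArray, and it assumes plain `major.minor.patch` version strings.

```python
# Hypothetical helper, not part of DocArray: checks whether two DocArray
# versions sit on the same side of the v0.34.0 DocVec protobuf change.
PROTO_CHANGE = (0, 34, 0)


def parse_version(version: str) -> tuple:
    """Turn 'v0.33.0' or '0.33.0' into (0, 33, 0). Assumes numeric parts only."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def can_exchange_docvec(writer_version: str, reader_version: str) -> bool:
    """Return True if both versions use the same DocVec .proto definition."""
    writer_new = parse_version(writer_version) >= PROTO_CHANGE
    reader_new = parse_version(reader_version) >= PROTO_CHANGE
    return writer_new == reader_new
```

In practice the installed version could be read with `importlib.metadata.version("docarray")` and compared against the version that produced the stored data.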
🆕 Features
Allow users to check if a Document is already indexed in a DocIndex (#1633)
You can now check if a Document has already been indexed by using the in keyword:
from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = DocList[MyDoc](
    [MyDoc(text="Example text", embedding=np.random.rand(128))
     for _ in range(2000)])
index = InMemoryExactNNIndex[MyDoc](docs)
assert docs[0] in index
assert MyDoc(text='New text', embedding=np.random.rand(128)) not in index

Support subindexes in InMemoryExactNNIndex (#1617)
You can now use the find_subindex method with InMemoryExactNNIndex.
from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = DocList[MyDoc](
    [MyDoc(text="Example text", embedding=np.random.rand(128))
     for _ in range(2000)])
index = InMemoryExactNNIndex[MyDoc](docs)
assert docs[0] in index
assert MyDoc(text='New text', embedding=np.random.rand(128)) not in index

Flexible tensor types for protobuf deserialization (#1645)
You can now deserialize any DocVec protobuf message to any tensor type
by passing the tensor_type parameter to from_protobuf.
This lets you choose at deserialization time whether to work with NumPy, PyTorch, or TensorFlow tensors.
from docarray import BaseDoc, DocVec
from docarray.typing import TensorFlowTensor

class MyDoc(BaseDoc):
    tensor: TensorFlowTensor

da = DocVec[MyDoc](...)  # doesn't matter what tensor_type is here
proto = da.to_protobuf()
da_after = DocVec[MyDoc].from_protobuf(proto, tensor_type=TensorFlowTensor)
assert isinstance(da_after.tensor, TensorFlowTensor)

⚙ Refactoring
Add DBConfig to InMemoryExactNNIndex
InMemoryExactNNIndex used to take a single constructor parameter, index_file_path, unlike the other Document
Indexes, which each accept their own DBConfig. index_file_path is now part of the DBConfig, so the index can be
initialized from a config object.
This will allow us to extend the config if more parameters are needed.
The parameters of DBConfig can also be passed at construction time as **kwargs, which keeps this change compatible
with the old usage.
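The general pattern of accepting either a config object or loose keyword arguments can be sketched in plain Python. This is a simplified illustration under stated assumptions, not DocArray's implementation: `ToyIndex` and this `DBConfig` dataclass are hypothetical stand-ins, and letting keyword arguments override the config object is a design choice of the sketch.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass
class DBConfig:
    """Stand-in config holding only the field named in the text above."""
    index_file_path: Optional[str] = None


class ToyIndex:
    """Illustrative stand-in for an index class, not DocArray code."""

    def __init__(self, db_config: Optional[DBConfig] = None, **kwargs):
        config = db_config or DBConfig()
        # Reject keyword arguments that are not DBConfig fields.
        allowed = {f.name for f in fields(DBConfig)}
        unknown = set(kwargs) - allowed
        if unknown:
            raise TypeError(f"unknown config parameters: {sorted(unknown)}")
        # Assumption of this sketch: keyword arguments override the config object.
        self.db_config = replace(config, **kwargs)


from_config = ToyIndex(db_config=DBConfig(index_file_path="index.bin"))
from_kwargs = ToyIndex(index_file_path="index.bin")
```

Both objects end up with the same configuration, and new fields can later be added to the config class without changing the constructor signature.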
The following two initializations are equivalent:
from docarray.index import InMemoryExactNNIndex

db_config = InMemoryExactNNIndex.DBConfig(index_file_path='index.bin')
index = InMemoryExactNNIndex[MyDoc](db_config=db_config)
index = InMemoryExactNNIndex[MyDoc](index_file_path='index.bin')

🐞 Bug Fixes
Allow protobuf deserialization of BaseDoc with Union type (#1655)
Serialization of BaseDoc types that have Union fields of native Python types is now supported.
from typing import Union

from docarray import BaseDoc, DocList

class MyDoc(BaseDoc):
    union_field: Union[int, str]

docs1 = DocList[MyDoc]([MyDoc(union_field="hello")])
docs2 = DocList[MyDoc].from_dataframe(docs1.to_dataframe())
assert docs1 == docs2

When these Union types involve other BaseDoc types, an exception is raised.
from docarray.documents import ImageDoc, TextDoc

class CustomDoc(BaseDoc):
    ud: Union[TextDoc, ImageDoc] = TextDoc(text='union type')

docs = DocList[CustomDoc]([CustomDoc(ud=TextDoc(text='union type'))])
# raises an Exception
DocList[CustomDoc].from_dataframe(docs.to_dataframe())

Cast limit to integer when passed to HNSWDocumentIndex (#1657, #1656)
If you call find or find_batched on an HNSWDocumentIndex, the limit parameter is now automatically cast to an
integer.
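The effect is comparable to normalizing `limit` before it reaches the backend. The helper below is a hypothetical sketch, not DocArray code. Rejecting non-whole values is an assumption added here for illustration; the release note only states that the value is cast.

```python
def normalize_limit(limit) -> int:
    """Cast limit to int, refusing values that would silently lose information."""
    as_int = int(limit)
    # Assumption of this sketch: 2.5 is treated as an error rather than truncated.
    if as_int != limit:
        raise ValueError(f"limit must be a whole number, got {limit!r}")
    return as_int
```

With this helper, `normalize_limit(5.0)` yields the integer 5, while `normalize_limit(2.5)` raises a ValueError.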
Moved default_column_config from RuntimeConfig to DBConfig (#1648)
default_column_config contains specific configuration information about the columns and tables inside the backend's
database. It previously lived in RuntimeConfig, which caused an error because this information is required at
initialization time. It has been moved to DBConfig, where you can edit it.
from docarray.index import HNSWDocumentIndex
import numpy as np

db_config = HNSWDocumentIndex.DBConfig()
db_config.default_column_config.get(np.ndarray).update({'ef': 2500})
index = HNSWDocumentIndex[MyDoc](db_config=db_config)

Fix issue with Protobuf (de)serialization for DocVec (#1639)
This bug caused raw Protobuf objects to be stored as DocVec columns after they were deserialized from Protobuf, making the
data essentially inaccessible. This has now been fixed, and DocVec objects are identical before and after (de)serialization.
Fix order of returned matches when find and filter are combined in InMemoryExactNNIndex (#1642)
Hybrid search (find + filter) with InMemoryExactNNIndex was ranking low-similarity (low-score) matches first.
This was fixed by adding an option to sort matches in reverse order of their scores.
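The ordering issue can be illustrated with a small, self-contained sketch. It assumes that a higher score means a more similar match. The function `top_matches` and its `reverse` flag are hypothetical and only mirror the idea of the fix, not DocArray's internal code.

```python
from typing import List, Tuple


def top_matches(
    scored: List[Tuple[str, float]], limit: int, reverse: bool = True
) -> List[Tuple[str, float]]:
    """Return up to `limit` (doc_id, score) pairs ordered by score.

    reverse=True puts the highest scores first, which is the right order
    when a higher score means a better match.
    """
    return sorted(scored, key=lambda pair: pair[1], reverse=reverse)[:limit]


scored = [("a", 0.12), ("b", 0.97), ("c", 0.55)]
best_first = top_matches(scored, limit=2)                  # intended order
worst_first = top_matches(scored, limit=2, reverse=False)  # the buggy order
```

Sorting ascending and then truncating to `limit` returns the worst matches and also drops the best ones, which is why the sort direction matters before the cut.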
# prepare a query
q_doc = MyDoc(embedding=np.random.rand(128), text='query')
query = (
    db.build_query()
    .find(query=q_doc, search_field='embedding')
    .filter(filter_query={'text': {'$exists': True}})
    .build()
)
results = db.execute_query(query)
# Before: results was sorted from worst to best matches
# Now: It's sorted in the correct order, showing better matches first

Working with external Qdrant collections (#1632)
Connecting QdrantDocumentIndex to a Qdrant DB initialized outside of DocArray used to raise a KeyError.
This has been fixed, and you can now use QdrantDocumentIndex to connect to externally initialized collections.
Other bug fixes
- Update text search to match the Weaviate client's new signature (#1654)
- Fix DocVec equality (#1641, #1663)
- Fix exception when summary() is called for LegacyDocument (#1637)
- Fix DocList and DocVec coercion (#1568)
- Fix update() on BaseDoc with tensor fields (#1628)
📗 Documentation Improvements
- Enhance DocVec section (#1658)
- Qdrant in memory usage (#1634)
🤟 Contributors
We would like to thank all contributors to this release:
- Johannes Messner (@JohannesMessner)
- Nikolas Pitsillos (@npitsillos)
- Shukri (@hsm207)
- Kacper Łukawski (@kacperlukawski)
- Aman Agarwal (@agaraman0)
- maxwelljin (@maxwelljin)
- samsja (@samsja)
- Saba Sturua (@jupyterjazz)
- Joan Fontanals (@JoanFM)