-
Notifications
You must be signed in to change notification settings - Fork 92
None optimization (part 2) #5409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Stores None's as 1-bit in a bitmap - In case something isn't None, store as the value in a dynamically sized-list Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Add code to generate batch files in different formats and makes sure we can read them correctly in the future. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes the serialization format for tuples containing Option types by implementing a sparse encoding strategy. Instead of storing the full size of each field even when None, the new format uses a bitmap to mark None fields and only serializes non-None values.
Changes:
- Implemented bitmap-based sparse tuple serialization for more efficient storage of tuples with many
Nonevalues - Incremented storage format version from 3 to 4 to support the new tuple encoding
- Added backward compatibility layer to deserialize v3 files using the legacy format
Reviewed changes
Copilot reviewed 21 out of 29 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/feldera-macros/src/tuples.rs | Core implementation of the new bitmap-based tuple serialization format |
| crates/dbsp/src/storage/file/format.rs | Incremented VERSION_NUMBER from 3 to 4 |
| crates/dbsp/src/storage/file.rs | Added Deserializer wrapper with version tracking |
| crates/dbsp/src/storage/file/reader.rs | Updated readers to pass version information for backward compatibility |
| crates/storage-test-compat/* | New compatibility testing crate with golden file tests |
| crates/feldera-macros/tests/* | Added comprehensive tests for new tuple serialization |
| @@ -0,0 +1,419 @@ | |||
| use std::collections::BTreeMap; | |||
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'optimziation' to 'optimization' in the PR title.
| { | ||
| #[inline] | ||
| fn eq(&self, other: &Self) -> bool { | ||
| unsafe { self.get_t0() == other.get_t0() && self.get_t1() == other.get_t1() } |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unsafe block here is unnecessary since get_t0() and get_t1() already contain the necessary unsafe operations internally. The unsafe should be removed to avoid misleading safety implications.
| unsafe { self.get_t0() == other.get_t0() && self.get_t1() == other.get_t1() } | |
| self.get_t0() == other.get_t0() && self.get_t1() == other.get_t1() |
| let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) }; | ||
| if t0_cmp != core::cmp::Ordering::Equal { | ||
| return t0_cmp; | ||
| } | ||
| unsafe { self.get_t1().cmp(other.get_t1()) } |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unsafe block is unnecessary here. The get_t0() method already handles unsafe operations internally, and calling cmp() on the returned references doesn't require unsafe.
| let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) }; | |
| if t0_cmp != core::cmp::Ordering::Equal { | |
| return t0_cmp; | |
| } | |
| unsafe { self.get_t1().cmp(other.get_t1()) } | |
| let t0_cmp = self.get_t0().cmp(other.get_t0()); | |
| if t0_cmp != core::cmp::Ordering::Equal { | |
| return t0_cmp; | |
| } | |
| self.get_t1().cmp(other.get_t1()) |
| let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) }; | ||
| if t0_cmp != core::cmp::Ordering::Equal { | ||
| return t0_cmp; | ||
| } | ||
| unsafe { self.get_t1().cmp(other.get_t1()) } |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unsafe block is unnecessary. Remove it since the comparison of archived values doesn't require unsafe after the references are obtained.
| let t0_cmp = unsafe { self.get_t0().cmp(other.get_t0()) }; | |
| if t0_cmp != core::cmp::Ordering::Equal { | |
| return t0_cmp; | |
| } | |
| unsafe { self.get_t1().cmp(other.get_t1()) } | |
| let t0_cmp = self.get_t0().cmp(other.get_t0()); | |
| if t0_cmp != core::cmp::Ordering::Equal { | |
| return t0_cmp; | |
| } | |
| self.get_t1().cmp(other.get_t1()) |
| Some(false) | ||
| } | ||
| VERSION_NUMBER => Some(true), | ||
| x if x >= 3 => Some(true), |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition accepts any version >= 3, including future versions that may not exist yet or be incompatible. Change to x if x >= 3 && x <= VERSION_NUMBER to only accept known compatible versions.
| x if x >= 3 => Some(true), | |
| 3..=VERSION_NUMBER => Some(true), |
| let version = (deserializer as &mut dyn ::core::any::Any) | ||
| .downcast_mut::<::dbsp::storage::file::Deserializer>() | ||
| .map(|deserializer| deserializer.version()) | ||
| .expect("passed wrong deserializer"); |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message "passed wrong deserializer" is unclear. Consider changing to "Deserializer must be of type dbsp::storage::file::Deserializer for version tracking support" to better explain the issue.
| .expect("passed wrong deserializer"); | |
| .expect("Deserializer must be of type dbsp::storage::file::Deserializer for version tracking support"); |
| .downcast_mut::<::dbsp::storage::file::Deserializer>() | ||
| .map(|deserializer| deserializer.version()) | ||
| .expect("passed wrong deserializer"); | ||
| if version <= 3 { |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version check should use a named constant instead of the magic number 3. Define const LAST_LEGACY_VERSION: u32 = 3; and use that for clarity and maintainability.
| if version <= 3 { | |
| const LAST_LEGACY_VERSION: u32 = 3; | |
| if version <= LAST_LEGACY_VERSION { |
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
blp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked over this, thanks for making this work. I'm impressed that rkyv was flexible enough.
crates/feldera-macros/src/tuples.rs
Outdated
| let mut w = 0usize; | ||
| while w < word { | ||
| idx += 64usize - (self.bitmap[w].count_ones() as usize); | ||
| w += 1; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for w in self.bitmap[..word] {
idx += w.count_zeros() as usize;
}
crates/feldera-macros/src/tuples.rs
Outdated
| let mask = if bit == 0 { 0 } else { (1u64 << bit) - 1 }; | ||
| let none_before = (self.bitmap[word] & mask).count_ones() as usize; | ||
| idx + bit - none_before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if bit > 0 {
let mask = (1u64 << bit) - 1;
idx += (self.bitmap[word] & mask).count_zeros() as usize;
}
idx(I think your code is correct, I just found myself rewriting it to better understand it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the "golden batch"es.
This is an optimization added to the new Tup storage format we used to write things into files as Some(x) but this leads to unnecessary overheads when the Option can not be elided. By extending IsNone with ability to extract the inner type and reconstruct the Option in case of Option types we can avoid storing things as Option<T> and just use T plus a bitfield. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
|
thanks @blp ! sorry you might have to review this again in a little bit because I made a bunch of changes so things are optimal in most (all?) cases of data ... but I'll incorporate your suggestions |
No worries |
422f201 to
1223154
Compare
This changes the way tuples are serialized/deserialized to optimize sparse tuples.
before a TupX<> would be serialized as a struct of
The new format is:
the trade-off is:
some things to consider:
Breaking Changes?
Increased storage format versions.
Describe Incompatible Changes
We are writing a different storage format for Tup<> types. We ensure it's backward compatible by distinguishing new and old formats using the version in the respective files.