Prepare tokenizer for using borrowed strings instead of allocations. #2073
Conversation
This commit is a first step toward addressing issue #2036. It doesn't contain any behavioral changes yet, and the scope is intentionally limited because it touches some core elements of the tokenizer.

I'd appreciate any thoughts on this approach before I continue!
iffyio
left a comment
Thanks for starting to look into this @eyalleshem! Took a look now and I think the changes look reasonable to me overall, left some comments
src/tokenizer.rs
Outdated
```rust
/// return the character after the next character (lookahead by 2) without advancing the stream
pub fn peek_next(&self) -> Option<char> {
```

Suggested change:

```rust
/// Return the `nth` next character without advancing the stream.
pub fn peek_nth(&self, n: usize) -> Option<char> {
```
Thinking we can have this more generic so that it can be reused in other contexts? Similar to the peek_nth* parser methods.
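The generic lookahead the reviewer describes could look something like the sketch below. It clones the underlying iterator, so it is simple but O(n) per call; the `State` shape and all names here are assumptions for illustration, not the actual sqlparser-rs definitions.

```rust
use std::iter::Peekable;
use std::str::Chars;

// Hypothetical, minimal stand-in for the tokenizer's State.
struct State<'a> {
    peekable: Peekable<Chars<'a>>,
}

impl<'a> State<'a> {
    /// Return the `n`th next character (0 = the very next one) without
    /// advancing the stream, by cloning the underlying iterator.
    fn peek_nth(&self, n: usize) -> Option<char> {
        self.peekable.clone().nth(n)
    }
}
```

With this in place, `peek_next()` would just be `peek_nth(1)`, matching the reviewer's reuse argument.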
src/tokenizer.rs
Outdated
```rust
/// return the character after the next character (lookahead by 2) without advancing the stream
pub fn peek_next(&self) -> Option<char> {
    // Use the source and byte_pos instead of cloning the peekable iterator
```
Can we add a guard here that the index is safe? So that we don't panic if we hit a bug or the API is being misused. e.g.

```rust
if self.byte_pos >= self.source.len() {
    return None;
}
```
src/tokenizer.rs
Outdated
```rust
chars.next(); // consume the first char
let ch: String = ch.into_iter().collect();
let word = self.tokenize_word(ch, chars);
// Calculate total byte length without allocating a String
```
The indentation looks a bit off on this line?
Code comment deleted (as a result of the next suggestion).
src/tokenizer.rs
Outdated
```rust
let ch: String = ch.into_iter().collect();
let word = self.tokenize_word(ch, chars);
// Calculate total byte length without allocating a String
let consumed_byte_len: usize = ch.into_iter().map(|c| c.len_utf8()).sum();
```
I wonder if we can instead replace the ch parameter to this tokenize_identifier_or_keyword function with the actual consumed_byte_len: usize?
If I'm understanding the intent/requirement of this change correctly, the caller of this function should have that value on hand, given that we seem to require that ch contains the preceding characters in the stream.
The current flow was initially unclear to me: the caller passes in an iterator of items whose contents we did not use here, and it required digging further into tokenize_word to realise why that was the case.
Agree, moved it out.
The downside is that callers now need to calculate the UTF-8 byte length of their characters.
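For illustration, the byte-length computation that moves to the caller might be a one-liner like this (hypothetical helper, not code from the PR):

```rust
// Illustrative only: compute the UTF-8 byte length of characters the
// caller has already consumed, instead of passing the characters along.
fn consumed_byte_len(consumed: &[char]) -> usize {
    consumed.iter().map(|c| c.len_utf8()).sum()
}
```

Note that `len_utf8()` makes multi-byte characters cheap to account for without re-encoding them into a `String`.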
```rust
/// `consumed_byte_len` is the byte length of the consumed character(s).
fn tokenize_word(&self, consumed_byte_len: usize, chars: &mut State<'a>) -> String {
    // Calculate where the first character started
    let first_char_byte_pos = chars.byte_pos - consumed_byte_len;
```
Can we add a check that the operation doesn't overflow? e.g.

```rust
if consumed_byte_len >= chars.byte_pos {
    return "".to_string();
}
```
src/tokenizer.rs
Outdated
```rust
/// Borrow a slice from the original string until `predicate` returns `false` or EOF is hit.
///
/// # Arguments
/// * `chars` - The character iterator state (contains reference to original source)
/// * `predicate` - Function that returns true while we should continue taking characters
///
/// # Returns
/// A borrowed slice of the source string containing the matched characters
```
Doc-wise I think it's easier to only reference the peeking_take_while method instead, and mention only the difference to this function. That way we only describe the functionality once. Also, the project doesn't use the # Arguments / # Returns documentation format, so we can skip that for consistency I think.
```rust
    chars: &mut State<'a>,
    mut predicate: impl FnMut(char) -> bool,
) -> &'a str {
    // Record the starting byte position
```
We can sanity check the index before using it here as well?
Is a sanity check needed here? The start_pos and end_pos are taken from the iterator, and the iterator is incremented according to the characters in the buffer.
> and the iterator is incremented according to the characters in the buffer

Yeah, I think this is the part that we don't necessarily have a guarantee on, hence the sanity check in case we have a bug somewhere or make a wrong assumption etc.
Ok, added the check here as well. I think this case is slightly different since the position is derived directly from the char iterator, but I don't object to adding the sanity check for extra safety.
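Putting the pieces of this thread together, a minimal sketch of a borrowed take-while with the requested sanity check might look like the following. The `State` struct and every name here are assumptions for illustration, not the PR's actual code:

```rust
use std::iter::Peekable;
use std::str::Chars;

// Hypothetical State: source reference plus a tracked byte position.
struct State<'a> {
    source: &'a str,
    peekable: Peekable<Chars<'a>>,
    byte_pos: usize,
}

impl<'a> State<'a> {
    // Advance one char, keeping byte_pos in sync with the iterator.
    fn next(&mut self) -> Option<char> {
        let ch = self.peekable.next()?;
        self.byte_pos += ch.len_utf8();
        Some(ch)
    }
}

fn borrow_slice_until<'a>(
    chars: &mut State<'a>,
    mut predicate: impl FnMut(char) -> bool,
) -> &'a str {
    let start = chars.byte_pos;
    while let Some(&ch) = chars.peekable.peek() {
        if !predicate(ch) {
            break;
        }
        chars.next();
    }
    let end = chars.byte_pos;
    // Sanity check: if the tracked positions are ever inconsistent
    // (a bug elsewhere), degrade to an empty slice instead of panicking.
    if start > end || end > chars.source.len() {
        return "";
    }
    &chars.source[start..end]
}
```

Because `byte_pos` only ever advances by `len_utf8()` of consumed characters, the slice boundaries always land on char boundaries, so the indexing cannot split a multi-byte character.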
src/tokenizer.rs
Outdated
```rust
pub line: u64,
pub col: u64,
/// Byte position in the source string
pub byte_pos: usize,
```
Suggested change:

```diff
-pub byte_pos: usize,
+byte_pos: usize,
```
src/tokenizer.rs
Outdated
```rust
///
/// # Returns
/// A borrowed slice of the source string containing the matched characters
fn borrow_slice_until_next<'a>(
```
Wondering, given this function is quite similar, would it make sense to implement borrow_slice_until as follows instead? To reuse the same definition/impl:

```rust
fn borrow_slice_until(chars) {
    borrow_slice_until_next(chars, |ch, _| {
        predicate(ch)
    })
}
```

Maybe, but I'm not sure what happens if EOF is reached on the next character. I don't think I want to include that as part of this commit.
```rust
/// Same as peeking_take_while, but also passes the next character to the predicate.
fn peeking_next_take_while(
    chars: &mut State,
/// Borrow a slice from the original string until `predicate` returns `false` or EOF is hit.
```
Just flagging that similar comments for borrow_slice_until apply to this function.
I think it's ok now, let me know if not.
Key points for this commit:
- The peekable trait isn't sufficient for using string slices, as we need the byte indexes (start/end) to create string slices, so added the current byte position to the State struct. (Note: in the long term we could potentially remove peekable and use only the current position as an iterator.)
- Created internal functions that create slices from the original query instead of allocating strings, then converted these functions to return String to maintain compatibility (the idea is to make a small, reviewable commit without changing the Token struct or the parser).
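The compatibility pattern in the second point (borrow internally, allocate only at the existing boundary) can be sketched as follows; the helper names are hypothetical, not the actual sqlparser-rs API:

```rust
// Internal helper: borrows a slice of the original query, no allocation.
fn take_word_borrowed<'a>(source: &'a str, start: usize, end: usize) -> &'a str {
    &source[start..end]
}

// Existing public path: still returns String so the Token struct and the
// parser are untouched by this commit.
fn take_word(source: &str, start: usize, end: usize) -> String {
    // Allocation kept at the boundary for now; a later change can push
    // the borrowed slice into Token itself.
    take_word_borrowed(source, start, end).to_string()
}
```

This keeps the commit reviewable: the allocation moves from the middle of tokenization to a single conversion point, which a follow-up can later remove.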
@iffyio - I think all comments have been addressed. Let me know if there's anything else.

Thanks @eyalleshem! I've created a new branch here that we can use.

Thanks @iffyio! I've changed the PR to target this branch. Do we want to keep the branch protected to enforce reviews?
iffyio
left a comment
LGTM! Thanks @eyalleshem!
I unfortunately don't have permissions to do this, but I think manually enforcing it should be fine.