
feat(bigframes): Support loading avro, orc data #16555

Draft
TrevorBergeron wants to merge 1 commit into main from tbergeron_bf_read_orc_avro

Conversation

@TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for reading ORC and Avro files into BigQuery DataFrames by implementing read_orc and read_avro methods in the Session class and providing corresponding API wrappers. Review feedback identifies a bug in the system tests where to_orc is called on a BigFrames DataFrame instead of a pandas DataFrame. Additionally, several improvements are suggested to maintain alphabetical order in imports and function definitions, along with a minor wording update for an error message to improve clarity.

)
df_write = df_in.reset_index(drop=False)
df_write.index.name = f"ordering_id_{random.randrange(1_000_000)}"
df_write.to_orc(write_path)
Contributor


Severity: high

It appears bigframes.dataframe.DataFrame does not currently implement a to_orc method. Since BigQuery also does not support exporting to ORC, you likely need to convert to a pandas DataFrame first to write the test data to GCS.

Suggested change:
- df_write.to_orc(write_path)
+ df_write.to_pandas().to_orc(write_path)

Comment on lines +464 to +465
read_orc,
read_avro,
Contributor


Severity: medium

The functions should be listed in alphabetical order to maintain consistency with the rest of the file. While the preceding lines are context, swapping these two will at least ensure they are sorted relative to each other.

Suggested change:
- read_orc,
- read_avro,
+ read_avro,
+ read_orc,

Comment on lines +501 to +502
"read_orc",
"read_avro",
Contributor


Severity: medium

The entries in __all__ should be in alphabetical order. Swapping these two improves consistency.

Suggested change:
- "read_orc",
- "read_avro",
+ "read_avro",
+ "read_orc",
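The alphabetical-order convention is easy to enforce mechanically. A hypothetical unit-test sketch (the sample list stands in for the module's real __all__, which a real test would import instead):

```python
# Sketch of a test that enforces alphabetical order in __all__.
# The list below is illustrative; in a real test you would import the
# module and inspect its __all__ attribute directly.
sample_all = [
    "read_avro",
    "read_csv",
    "read_orc",
    "read_parquet",
]


def is_sorted(names: list) -> bool:
    """Return True if the names are already in alphabetical order."""
    return names == sorted(names)


assert is_sorted(sample_all)
assert not is_sorted(["read_orc", "read_avro"])
```

A check like this would have flagged both of the ordering issues raised in this review automatically.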

Comment on lines +592 to +621
def read_orc(
    path: str | IO["bytes"],
    *,
    engine: str = "auto",
    write_engine: constants.WriteEngineType = "default",
) -> bigframes.dataframe.DataFrame:
    return global_session.with_default_session(
        bigframes.session.Session.read_orc,
        path,
        engine=engine,
        write_engine=write_engine,
    )


read_orc.__doc__ = inspect.getdoc(bigframes.session.Session.read_orc)


def read_avro(
    path: str | IO["bytes"],
    *,
    engine: str = "auto",
) -> bigframes.dataframe.DataFrame:
    return global_session.with_default_session(
        bigframes.session.Session.read_avro,
        path,
        engine=engine,
    )


read_avro.__doc__ = inspect.getdoc(bigframes.session.Session.read_avro)
Contributor


Severity: medium

Reorder read_avro and read_orc to maintain alphabetical order. Additionally, these should ideally be placed before read_parquet in the file to follow the alphabetical convention used in this module.

def read_avro(
    path: str | IO["bytes"],
    *,
    engine: str = "auto",
) -> bigframes.dataframe.DataFrame:
    return global_session.with_default_session(
        bigframes.session.Session.read_avro,
        path,
        engine=engine,
    )


read_avro.__doc__ = inspect.getdoc(bigframes.session.Session.read_avro)


def read_orc(
    path: str | IO["bytes"],
    *,
    engine: str = "auto",
    write_engine: constants.WriteEngineType = "default",
) -> bigframes.dataframe.DataFrame:
    return global_session.with_default_session(
        bigframes.session.Session.read_orc,
        path,
        engine=engine,
        write_engine=write_engine,
    )


read_orc.__doc__ = inspect.getdoc(bigframes.session.Session.read_orc)
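The `read_orc.__doc__ = inspect.getdoc(...)` pattern in this diff copies the docstring from the Session method onto the module-level wrapper so the two stay in sync. A self-contained sketch of the same technique (the class and function bodies here are illustrative, not bigframes code):

```python
import inspect


class Session:
    def read_orc(self, path):
        """Read an ORC file into a DataFrame."""
        ...


def read_orc(path):
    # Module-level wrapper; delegates to the default session in the real
    # code. The body is elided in this sketch.
    ...


# Copy the method's docstring onto the wrapper. inspect.getdoc also
# normalizes indentation, unlike reading __doc__ directly.
read_orc.__doc__ = inspect.getdoc(Session.read_orc)

assert read_orc.__doc__ == "Read an ORC file into a DataFrame."
```

Keeping a single source of truth for the docstring avoids the two copies drifting apart as the Session method evolves.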

Comment on lines +1399 to +1400
"please use the 'bigquery' engine by setting `engine='bigquery'` in "
"your configuration."
Contributor


Severity: medium

The phrase 'in your configuration' might be confusing as the engine is typically passed as an argument to the function call. Consider clarifying this to 'in the function call'.

Suggested change:
- "please use the 'bigquery' engine by setting `engine='bigquery'` in "
- "your configuration."
+ "please use the 'bigquery' engine by setting `engine='bigquery'` in "
+ "the function call."
