feat(bigframes): Support loading avro, orc data #16555
TrevorBergeron wants to merge 1 commit into main from
Conversation
Code Review
This pull request introduces support for reading ORC and Avro files into BigQuery DataFrames by implementing read_orc and read_avro methods in the Session class and providing corresponding API wrappers. Review feedback identifies a bug in the system tests where to_orc is called on a BigFrames DataFrame instead of a pandas DataFrame. Additionally, several improvements are suggested to maintain alphabetical order in imports and function definitions, along with a minor wording update for an error message to improve clarity.
| ) | ||
| df_write = df_in.reset_index(drop=False) | ||
| df_write.index.name = f"ordering_id_{random.randrange(1_000_000)}" | ||
| df_write.to_orc(write_path) |
It appears bigframes.dataframe.DataFrame does not currently implement a to_orc method. Since BigQuery also does not support exporting to ORC, you likely need to convert to a pandas DataFrame first to write the test data to GCS.
| df_write.to_orc(write_path) | |
| df_write.to_pandas().to_orc(write_path) |
| read_orc, | ||
| read_avro, |
| "read_orc", | ||
| "read_avro", |
| def read_orc( | ||
| path: str | IO["bytes"], | ||
| *, | ||
| engine: str = "auto", | ||
| write_engine: constants.WriteEngineType = "default", | ||
| ) -> bigframes.dataframe.DataFrame: | ||
| return global_session.with_default_session( | ||
| bigframes.session.Session.read_orc, | ||
| path, | ||
| engine=engine, | ||
| write_engine=write_engine, | ||
| ) | ||
| read_orc.__doc__ = inspect.getdoc(bigframes.session.Session.read_orc) | ||
| def read_avro( | ||
| path: str | IO["bytes"], | ||
| *, | ||
| engine: str = "auto", | ||
| ) -> bigframes.dataframe.DataFrame: | ||
| return global_session.with_default_session( | ||
| bigframes.session.Session.read_avro, | ||
| path, | ||
| engine=engine, | ||
| ) | ||
| read_avro.__doc__ = inspect.getdoc(bigframes.session.Session.read_avro) |
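The wrappers above follow a common delegation pattern: a module-level function forwards to a method on a shared default session, then copies that method's docstring onto itself. A self-contained sketch of the pattern, using a hypothetical `Session` class rather than the real bigframes API:

```python
import inspect


class Session:
    """Hypothetical stand-in for bigframes.session.Session."""

    def read_orc(self, path):
        """Read an ORC file into a DataFrame."""
        return f"read {path}"


_default_session = Session()


def with_default_session(method, *args, **kwargs):
    # Invoke an unbound Session method against the shared default session.
    return method(_default_session, *args, **kwargs)


def read_orc(path):
    return with_default_session(Session.read_orc, path)


# Mirror the PR: reuse the session method's docstring on the module-level
# wrapper so help(read_orc) shows the same text.
read_orc.__doc__ = inspect.getdoc(Session.read_orc)
```

`inspect.getdoc` is used rather than `__doc__` directly because it also cleans indentation from the docstring.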
Reorder read_avro and read_orc to maintain alphabetical order. Additionally, these should ideally be placed before read_parquet in the file to follow the alphabetical convention used in this module.
def read_avro(
path: str | IO["bytes"],
*,
engine: str = "auto",
) -> bigframes.dataframe.DataFrame:
return global_session.with_default_session(
bigframes.session.Session.read_avro,
path,
engine=engine,
)
read_avro.__doc__ = inspect.getdoc(bigframes.session.Session.read_avro)
def read_orc(
path: str | IO["bytes"],
*,
engine: str = "auto",
write_engine: constants.WriteEngineType = "default",
) -> bigframes.dataframe.DataFrame:
return global_session.with_default_session(
bigframes.session.Session.read_orc,
path,
engine=engine,
write_engine=write_engine,
)
read_orc.__doc__ = inspect.getdoc(bigframes.session.Session.read_orc)
| "please use the 'bigquery' engine by setting `engine='bigquery'` in " | ||
| "your configuration." |
The phrase 'in your configuration' might be confusing as the engine is typically passed as an argument to the function call. Consider clarifying this to 'in the function call'.
| "please use the 'bigquery' engine by setting `engine='bigquery'` in " | |
| "your configuration." | |
| "please use the 'bigquery' engine by setting engine='bigquery' in " | |
| "the function call." |