-
Notifications
You must be signed in to change notification settings - Fork 47
Add Data-Juicer Service #181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Data-Juicer Service #181
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @pan-x-c, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces the initial integration of a Data-Juicer service into the system. It primarily focuses on adding the necessary components to interact with this service, including a new operator to process data and a client to manage communication. The goal is to enable external data processing capabilities, with the server-side implementation noted as a work in progress.
Highlights
- Data-Juicer Operator Integration: A new DataJuicerOperator is introduced, allowing the system to leverage the Data-Juicer service for processing Experience objects. This operator can be configured using various methods, including direct operator lists, a configuration file, or a natural language description.
- Dedicated Data-Juicer Client: A DataJuicerClient has been added to handle the communication with the Data-Juicer server. This client manages session initialization, sends and receives data (converted to/from HuggingFace Datasets via Parquet files), and ensures proper session closure.
- Experience Data Conversion Utilities: New utility functions (to_hf_datasets, from_hf_datasets) are now available to seamlessly convert Experience objects to and from HuggingFace Dataset format, which is crucial for efficient data transfer to the Data-Juicer service.
- Enhanced Configuration Management: New configuration classes, DataJuicerConfig and ServiceConfig, have been added to trinity/common/config.py. These allow for detailed configuration of the Data-Juicer service, including server URL, OpenAI API settings for its agent, and options for automatic server startup.
- Improved Resource Cleanup: The base ExperienceOperator now includes a close method, and the ExperiencePipeline has been updated to iterate through and call this close method on all its operators. This ensures that resources held by operators, such as the DataJuicerClient connection, are properly released.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a Data-Juicer service, including a client and an operator to integrate it into the experience processing pipeline. The changes are generally well-structured. However, I've identified a critical issue in the DataJuicerOperator where the process method's signature and return type do not match the base class, which would cause a runtime error. Additionally, I've provided suggestions to improve the DataJuicerClient for better efficiency and robustness, such as using in-memory buffers and adding error handling. Please address these points to ensure the new service is reliable and performs well.
|
/unittest-all |
Summary
Tests
Github Test Reporter by CTRF 💚 |
|
/unittest-all |
Summary
Failed Tests
Tests
Github Test Reporter by CTRF 💚 |
|
/unittest-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a comprehensive Data-Juicer service integration to Trinity, enabling data processing capabilities through a client-server architecture. The implementation includes both standalone server functionality and automated server management.
Key changes:
- Implements Data-Juicer client with auto-start capabilities and experience processing
- Adds Data-Juicer server with session management and HTTP API endpoints
- Integrates Data-Juicer operator into the experience pipeline system
Reviewed Changes
Copilot reviewed 18 out of 20 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| trinity/utils/distributed.py | Adds port utility function for server auto-start |
| trinity/service/data_juicer/server/utils.py | Implements configuration parsing and validation |
| trinity/service/data_juicer/server/session.py | Manages Data-Juicer processing sessions |
| trinity/service/data_juicer/server/server.py | Implements Flask HTTP server with API endpoints |
| trinity/service/data_juicer/client.py | Provides client interface with auto-start capability |
| trinity/service/init.py | Exports Data-Juicer client |
| trinity/explorer/explorer.py | Updates method call to align with pipeline changes |
| trinity/common/models/vllm_model.py | Adds configuration options for response generation |
| trinity/common/experience.py | Enhances Experience class with dataset conversion support |
| trinity/common/config.py | Adds Data-Juicer service configuration |
| trinity/buffer/pipelines/experience_pipeline.py | Renames method and adds cleanup functionality |
| trinity/buffer/operators/experience_operator.py | Adds close method to operator base class |
| trinity/buffer/operators/data_juicer_operator.py | Implements Data-Juicer operator for pipeline integration |
| trinity/buffer/operators/init.py | Exports new Data-Juicer operator |
| tests/service/data_juicer_test.py | Comprehensive test suite for Data-Juicer functionality |
| tests/explorer/explorer_test.py | Updates tests for new configuration options |
| tests/common/experience_test.py | Tests dataset conversion functionality |
| tests/buffer/experience_pipeline_test.py | Updates test for renamed pipeline method |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Summary
Tests
Github Test Reporter by CTRF 💚 |
Description
Add Data-Juicer Service
Checklist
Please check the following items before code is ready to be reviewed.