Normalize Trainer #271
Conversation
Summary of Changes
Hello @pan-x-c, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refactors the core Trainer class to centralize control over key training lifecycle events. It introduces asynchronous operations for train_step and sync_weight to enhance performance and responsiveness, while also standardizing metric collection. Furthermore, checkpoint saving logic is consolidated within the Trainer, removing previous redundancies and preparing for more robust and configurable checkpointing.
Highlights
- **Centralized Checkpoint Management**: The `save_checkpoint` operation has been extracted from the trainer engine and is now managed directly by the `Trainer` class, allowing for more unified control over checkpointing frequency and logic.
- **Asynchronous Training Operations**: The `train_step`, `sync_weight`, and `need_sync` methods within the `Trainer` have been converted to asynchronous functions, enabling non-blocking execution and improved concurrency.
- **Enhanced Metric Recording**: Metric collection has been optimized across the training process, with `Timer` usage updated to use more descriptive keys (e.g., `"time/read_experience"`), and `train_step` and `sync_weight` now returning metrics for consolidated logging.
- **Streamlined verl_trainer Checkpointing**: The `verl_trainer`'s internal checkpointing logic has been simplified by removing redundant save calls from `train_step` and eliminating the direct saving of dataloader state, delegating these responsibilities to the main `Trainer` class.
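To make the control flow described above concrete, here is a minimal sketch (assumed names only; `engine`, `save_interval`, and `logger` are not taken from the repository) of a trainer loop that awaits `train_step` and `sync_weight`, merges their returned metrics into one dict, and saves checkpoints at a configurable interval:

```python
from typing import Dict


class TrainerSketch:
    """Minimal illustration of the described control flow; not trinity's actual Trainer."""

    def __init__(self, engine, save_interval: int, total_steps: int, logger):
        self.engine = engine              # underlying training engine (assumed interface)
        self.save_interval = save_interval
        self.total_steps = total_steps
        self.logger = logger              # any object with a .log(metrics, step=...) method

    async def train_step(self, step: int) -> Dict[str, float]:
        # Awaited so other coroutines (e.g. experience reading) can run concurrently.
        return await self.engine.train_step(step)

    async def sync_weight(self) -> Dict[str, float]:
        return await self.engine.sync_weight()

    async def need_sync(self, step: int) -> bool:
        return await self.engine.need_sync(step)

    async def train(self) -> None:
        for step in range(1, self.total_steps + 1):
            metrics = await self.train_step(step)
            if await self.need_sync(step):
                metrics.update(await self.sync_weight())
            # Checkpoint frequency is decided here, not inside the engine.
            if step % self.save_interval == 0:
                self.engine.save_checkpoint(step)
            self.logger.log(metrics, step=step)
```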
/unittest-module-synchronizer
Code Review
This pull request effectively refactors the Trainer class by making train_step and sync_weight asynchronous, centralizing checkpoint saving logic, and improving metric collection. The main training loop in train is now much cleaner and easier to follow. The changes to use await instead of ray.get are correctly implemented. I've found one issue with a redundant call which I've commented on. Overall, this is a great improvement to the codebase.
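The `await`-instead-of-`ray.get` pattern mentioned in the review can be illustrated with a small, self-contained example (the `Engine` actor below is invented for illustration); Ray object refs are awaitable inside coroutines, so remote calls no longer block the event loop:

```python
import asyncio
import ray


@ray.remote
class Engine:
    def train_step(self, step: int) -> dict:
        return {"step": step, "loss": 0.1}


async def main():
    ray.init()
    engine = Engine.remote()

    # Blocking style: ray.get() stalls the whole event loop while waiting.
    # metrics = ray.get(engine.train_step.remote(1))

    # Non-blocking style: Ray object refs can be awaited directly,
    # letting other coroutines (e.g. experience reading) make progress.
    metrics = await engine.train_step.remote(1)
    print(metrics)
    ray.shutdown()


if __name__ == "__main__":
    asyncio.run(main())
```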
Summary
Tests
Github Test Reporter by CTRF 💚

/unittest-module-manager
Summary
Failed Tests
Tests
Github Test Reporter by CTRF 💚

/unittest-module-manager
Summary
Tests
Github Test Reporter by CTRF 💚

/unittest-module-trainer
Summary
Failed Tests
Skipped
Tests
Github Test Reporter by CTRF 💚

/unittest-all
Pull Request Overview
This PR normalizes the Trainer implementation by extracting checkpoint saving operations from the trainer engine, making certain operations async, and optimizing metrics recording.
- Extract `save_checkpoint` operation from trainer engine to the main `Trainer` class
- Convert `train_step` and `sync_weight` methods to async and update timing metrics
- Add save interval control and optimize metrics collection
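As a rough illustration of the timing-metric convention mentioned above (this `Timer` is a generic stand-in, not necessarily trinity's own helper, and the `time/train_step` key is an assumption), prefixed keys such as `"time/read_experience"` let every phase's duration land in a single metrics dict:

```python
import time
from contextlib import contextmanager


@contextmanager
def Timer(metrics: dict, key: str):
    """Record the elapsed wall-clock time under `key` in `metrics`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        metrics[key] = time.perf_counter() - start


metrics: dict = {}
with Timer(metrics, "time/read_experience"):
    time.sleep(0.01)  # stand-in for reading an experience batch
with Timer(metrics, "time/train_step"):
    time.sleep(0.02)  # stand-in for one optimization step
print(metrics)  # e.g. {"time/read_experience": 0.01..., "time/train_step": 0.02...}
```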
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| trinity/trainer/verl_trainer.py | Remove checkpoint saving from train_step, add save_as_hf parameter |
| trinity/trainer/verl/megatron_workers.py | Add save_as_hf parameter to checkpoint save methods |
| trinity/trainer/verl/fsdp_workers.py | Add save_as_hf parameter to checkpoint save methods |
| trinity/trainer/trainer.py | Convert methods to async, extract checkpoint saving, add timing metrics |
| trinity/algorithm/sample_strategy/sample_strategy.py | Update timer metric names for consistency |
| tests/trainer/trainer_test.py | Add trainer cache validation in test |
| tests/manager/synchronizer_test.py | Update train_step method signature in test mock |
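The `save_as_hf` flag listed for the worker files above can be pictured as a keyword argument threaded from the `Trainer` down to the engine's checkpoint routine; the function below is a hypothetical sketch with made-up names, not the actual verl worker API:

```python
import os


def save_checkpoint(local_path: str, global_step: int, save_as_hf: bool = False) -> None:
    """Hypothetical checkpoint entry point, for illustration only.

    `save_as_hf` defaults to False so existing callers keep the old behavior;
    when True, an additional HuggingFace-format export would be written.
    """
    step_dir = os.path.join(local_path, f"global_step_{global_step}")
    os.makedirs(step_dir, exist_ok=True)
    # ... save native (FSDP/Megatron) checkpoint shards into step_dir ...
    if save_as_hf:
        # ... additionally export the model weights in HuggingFace format ...
        pass
```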
Comments suppressed due to low confidence (1)
trinity/trainer/verl_trainer.py:378
- The removed import statement `from verl.utils.fs import local_mkdir_safe` was only needed by the dataloader-saving code; since that code has also been removed, dropping the import is consistent.
```python
def _save_checkpoint(self, save_as_hf: bool = False):
    # path: given_path + `/global_step_{global_steps}` + `/actor`
    local_global_step_folder = os.path.join(
        self.config.trainer.default_local_dir, f"global_step_{self.global_steps}"
    )
```
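For illustration, with assumed values (`default_local_dir` and `global_steps` below are invented, not taken from the repository's config), the path construction above resolves as follows:

```python
import os

default_local_dir = "/tmp/checkpoints"  # assumed example value
global_steps = 100                      # assumed example value

local_global_step_folder = os.path.join(default_local_dir, f"global_step_{global_steps}")
print(local_global_step_folder)  # -> /tmp/checkpoints/global_step_100
```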
Summary
Skipped
Tests
Github Test Reporter by CTRF 💚

/unittest-module-trainer
Summary
Failed Tests
Skipped
Tests
Github Test Reporter by CTRF 💚

/unittest-module-trainer
Summary
Skipped
Tests
Github Test Reporter by CTRF 💚

/unittest-all

/unittest-module-manager
Summary
Tests
Github Test Reporter by CTRF 💚

/unittest-module-trainer
Summary
Skipped
Tests
Github Test Reporter by CTRF 💚

/unittest-module-common
Summary
Tests
Github Test Reporter by CTRF 💚
Description
- Extract the `save_checkpoint` operation from the trainer engine to the `Trainer`
- Make `train_step` and `sync_weight` in the `Trainer` use async

Checklist
Please check the following items before code is ready to be reviewed.