
Commit `52cfa4e`

Authored by: ChenhanYu, kevalmorabia97, coderabbitai[bot], AAnoosheh
fix: #981 (#983)
### What does this PR do?

**Type of change:** Bug fix

An issue was reported in #981 where `str(v)` on some `TransformerConfig` fields raises `TypeError`. We remove the YAML saving logic entirely, as it is unused and could still cause errors in the future.

### Before your PR is "*Ready for review*"

Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other source, did you follow the IP policy in [CONTRIBUTING.md](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md#-copying-code-from-other-sources)?: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved checkpoint loading stability by handling unusual configuration values more gracefully; such values no longer cause failures and are skipped with a warning instead.
  * Reduced risk of crashes during configuration processing when encountering non-standard or unsupported objects.
* **Chores**
  * Checkpoints no longer include saved run configuration or tool-version metadata, yielding smaller, simpler checkpoint files.

---

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Asha Anoosheh <aanoosheh@nvidia.com>
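For illustration, here is a minimal, hypothetical sketch of the failure mode (none of these names are from the repository; they only mimic the shape of the removed helper): the helper stringified every non-primitive config value, so a single field whose `__str__` raises takes down the entire checkpoint save.

```python
# Hypothetical stand-ins for illustration only; neither name exists in
# Model-Optimizer or Megatron.

class BrokenField:
    """Mimics a TransformerConfig field whose __str__ raises (see #981)."""

    def __str__(self):
        raise TypeError("unstringifiable config field")


def parse_config(config: dict) -> dict:
    """Simplified sketch of the removed _parse_transformer_config helper."""
    out = {}
    for k, v in config.items():
        if isinstance(v, (bool, int, str)):
            out[k] = v
        else:
            out[k] = str(v)  # raises TypeError for values like BrokenField
    return out


try:
    parse_config({"num_layers": 24, "bad_field": BrokenField()})
except TypeError as e:
    print(f"save would fail with: {e}")
```

One such field is enough: because the exception propagates out of the config dump, the whole `save_sharded_modelopt_state` call fails even though the YAML record was only a best-effort convenience.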
1 parent `7e2e85a` · commit `52cfa4e`

1 file changed: `modelopt/torch/opt/plugins/mcore_dist_checkpointing.py`

Lines changed: 0 additions & 39 deletions
```diff
@@ -22,7 +22,6 @@
 from typing import Any
 
 import torch
-import yaml
 from megatron.core import dist_checkpointing, mpu
 from megatron.core.dist_checkpointing.serialization import get_default_load_sharded_strategy
 from megatron.core.dist_checkpointing.strategies.common import COMMON_STATE_FNAME
@@ -36,21 +35,6 @@
 
 SUPPORTED_WRAPPERS[Float16Module] = "module"
 
-DROP_SUBSTRINGS = [
-    "fp4",
-    "fp8",
-    "tp_",
-    "parallel",
-    "cuda_graph",
-    "init_",
-    "cpu",
-    "recompute",
-    "inference",
-    "pipeline",
-    "comm",
-    "batch",
-]
-
 
 def remove_per_module_state(
     modelopt_state: dict[str, Any],
@@ -138,29 +122,6 @@ def save_sharded_modelopt_state(
         sharded_strategy: configures sharded tensors saving behavior and backend
         prefix: the prefix to add to the modelopt_state keys ("model." for NeMo)
     """
-
-    def _parse_transformer_config(transformer_config: dict) -> dict:
-        config = {}
-
-        for k, v in transformer_config.items():
-            if any(substring in k for substring in DROP_SUBSTRINGS):
-                continue
-            if isinstance(v, (bool, int, str)):
-                config[k] = v
-            else:
-                config[k] = str(v)
-
-        return config
-
-    # Save own version of run config, if not already saved by the framework.
-    if dist.is_master() and not os.path.exists(f"{checkpoint_name}/run_config.yaml"):
-        run_config_name = f"{checkpoint_name}/modelopt_run_config.yaml"
-        # We avoid deepcopy since some attributes in Megatron-Bridge config cannot be deepcopied.
-        config_dict = _parse_transformer_config(model[0].config.__dict__)
-        config_dict["nvidia_modelopt_version"] = modelopt.__version__
-        with open(run_config_name, "w") as f:
-            yaml.dump(config_dict, f, default_flow_style=False)
-
     if not mto.ModeloptStateManager.is_converted(model[0]):
         return
     if len(model) > 1:
```
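The merged fix deletes the YAML dump outright. Had the dump been kept, a tolerant variant in the spirit of the release-notes wording ("skipped with a warning") could look like the following sketch; this is a hypothetical alternative, not code from the PR, and `parse_config_safe` is an invented name.

```python
import warnings


def parse_config_safe(config: dict) -> dict:
    """Hypothetical tolerant variant of the removed _parse_transformer_config."""
    out = {}
    for k, v in config.items():
        if isinstance(v, (bool, int, str)):
            out[k] = v
            continue
        try:
            out[k] = str(v)
        except Exception as e:  # e.g. a __str__ that raises TypeError
            warnings.warn(f"Skipping config field {k!r}: {e}")
    return out
```

Skipped fields would simply be absent from the dumped YAML, which is acceptable for a best-effort run-config record; removing the unused dump entirely, as this PR does, avoids the maintenance burden altogether.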
