`OptimizerParamScheduler.step()` updates `self.num_tokens` and `self.num_steps` in two different ways:
```python
if token_num is None:
    args = get_args()
    token_num = args.consumed_train_tokens
self.num_tokens = token_num
self.num_steps += increment
```
Then, in `get_lr()`, the lr is computed from either `self.num_steps` or `self.num_tokens`, depending on whether `self.lr_warmup_tokens` is `None`:
```python
# Use linear warmup for the initial part.
if self.lr_warmup_tokens is None:
    if self.lr_warmup_steps > 0 and self.num_steps <= self.lr_warmup_steps:
        if self.num_steps == self.lr_warmup_steps and \
                self.lr_decay_tokens is not None:
            # The case of step/sample-wise warmup + token-wise decay
            self.lr_warmup_tokens = self.num_tokens
        return self.max_lr * float(self.num_steps) / \
            float(self.lr_warmup_steps)
else:
    if self.lr_warmup_tokens > 0 and self.num_tokens <= self.lr_warmup_tokens:
        return self.max_lr * float(self.num_tokens) / \
            float(self.lr_warmup_tokens)
```
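To make the token-based branch concrete, here is a quick numeric illustration (the warmup values below are made up, not taken from the repo):

```python
# Hypothetical warmup settings, just to evaluate the token-based branch:
max_lr = 1e-4
lr_warmup_tokens = 1_000_000

for num_tokens in (0, 1024, 500_000, 1_000_000):
    lr = max_lr * float(num_tokens) / float(lr_warmup_tokens)
    print(num_tokens, lr)  # lr stays at 0.0 as long as num_tokens is 0
```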
However, `args.consumed_train_tokens` is only updated after the train step, so when `OptimizerParamScheduler.step()` executes, `args.consumed_train_tokens` has not been updated yet. As a result, `self.num_tokens` is not increased, and the second optimizer step still uses the old lr. For example, at the beginning of training with `self.lr_warmup_tokens` specified, the optimizer uses lr = 0 for both the first and the second step.
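A minimal, runnable sketch of the reported ordering (names like `TinyScheduler` and `tokens_per_step` are hypothetical stand-ins, not Megatron code):

```python
class TinyScheduler:
    """Hypothetical, simplified stand-in for OptimizerParamScheduler."""
    def __init__(self, max_lr, lr_warmup_tokens):
        self.max_lr = max_lr
        self.lr_warmup_tokens = lr_warmup_tokens
        self.num_tokens = 0
        self.lr = self.get_lr()  # lr the optimizer will apply next

    def get_lr(self):
        # Token-based warmup branch from get_lr() above (simplified).
        if self.num_tokens <= self.lr_warmup_tokens:
            return self.max_lr * float(self.num_tokens) / float(self.lr_warmup_tokens)
        return self.max_lr

    def step(self, consumed_train_tokens):
        # Mirrors the real step(): num_tokens is read from the global counter.
        self.num_tokens = consumed_train_tokens
        self.lr = self.get_lr()


consumed_train_tokens = 0   # stands in for args.consumed_train_tokens
tokens_per_step = 1024      # made-up number
sched = TinyScheduler(max_lr=1e-4, lr_warmup_tokens=1_000_000)

for it in range(3):
    print(f"step {it}: optimizer applies lr = {sched.lr}")
    # The scheduler steps BEFORE the token counter is advanced,
    # so it reads a stale value ...
    sched.step(consumed_train_tokens)
    # ... and only afterwards is the counter updated by the train loop.
    consumed_train_tokens += tokens_per_step

# Prints lr = 0.0 for step 0 AND step 1; only step 2 sees a non-zero lr.
```

If the counter were advanced before the scheduler step (or the fresh count were passed in as `token_num`), step 1 would already see a non-zero lr.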
Is this design based on some special consideration, or is it a possible bug?

Thanks