
Gradient and optimizer in TensorFlow

2023-07-13

1. gradient

Gradients are computed by automatic differentiation based on the chain rule. The differentiation is "automatic" because the set of (matrix-operation) ops is fixed and finite, so a gradient function can be maintained for every op and hard-coded directly into the source.
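
For instance, a minimal sketch with tf.gradients (TF 1.x graph mode; the concrete ops are only for illustration): the graph below chains Square and Sin, and tf.gradients assembles dz/dx from the gradient functions registered for those two ops.

import tensorflow as tf

x = tf.constant(3.0)
y = tf.square(x)            # y = x^2
z = tf.sin(y)               # z = sin(x^2)

# dz/dx = cos(x^2) * 2x, assembled from the gradient functions of Sin and Square
grad = tf.gradients(z, x)[0]

with tf.Session() as sess:
    print(sess.run(grad))   # cos(9) * 6 ≈ -5.47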

2. optimizer

The optimizer performs gradient computation and parameter updates.

2.1 The Optimizer base class

  • class tensorflow.python.training.optimizer.Optimizer
    The base class for all optimization methods.
  • Optimizer.minimize(self, loss, global_step=None, var_list=None, …)
    Returns a train_op that applies the optimization method to minimize the loss. It is really just a wrapper around the two APIs below (gradient computation and parameter update); when you want to do some custom processing between the two, you can call them separately and explicitly. A common scenario is gradient clipping, see reference [3] and the sketch after this list.
  • Optimizer._slots, Dict[slot_name, Dict[(graph, primary_var), slot_var]]
    The nested dict in which an optimizer stores its slot variables (see 2.3 below).
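
A minimal sketch of the gradient-clipping scenario, calling the two APIs separately (TF 1.x graph mode; the toy variable, loss and hyperparameters are only for illustration):

import tensorflow as tf

x = tf.Variable([1.0, 2.0])
loss = tf.reduce_sum(tf.square(x))                    # toy loss

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# step 1: gradient computation
grads_and_vars = opt.compute_gradients(loss)          # [(grad, var), ...]

# custom processing in between: clip by global norm
grads, variables = zip(*grads_and_vars)
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)

# step 2: parameter update
train_op = opt.apply_gradients(zip(clipped, variables))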

2.2 Simplified source code

@tf_export("train.Optimizer")
class Optimizer(checkpointable.CheckpointableBase):
    def __init__(self, use_locking, name):
        self._name = name
        self._slots = {}

    def minimize(self, loss, global_step=None, var_list=None,
                 gate_gradients=GATE_OP, aggregation_method=None,
                 colocate_gradients_with_ops=False, name=None,
                 grad_loss=None):
        """Add operations to minimize `loss` by updating `var_list`.
        This method simply combines calls `compute_gradients()` and
        `apply_gradients()`. If you want to process the gradient before applying
        them, call `compute_gradients()` and `apply_gradients()` explicitly instead
        of using this function.
        """
        grads_and_vars = self.compute_gradients(
            loss, var_list=var_list, gate_gradients=gate_gradients,
            aggregation_method=aggregation_method,
            colocate_gradients_with_ops=colocate_gradients_with_ops,
            grad_loss=grad_loss)

        return self.apply_gradients(grads_and_vars, global_step=global_step,
                                    name=name)

    def compute_gradients(self, loss, var_list=None,
                          gate_gradients=GATE_OP,
                          aggregation_method=None,
                          colocate_gradients_with_ops=False,
                          grad_loss=None):
        # (Elided) calls gradients.gradients(loss, var_list) and pairs each
        # gradient with its variable.
        return grads_and_vars  # List[(gradient, variable)]

    def apply_gradients(self, grads_and_vars, global_step=None, name=None):
        # (Elided) each (grad, var) pair is wrapped with a variable processor into
        # converted_grads_and_vars; var_list is the list of variables involved.
        with ops.init_scope():
            self._create_slots(var_list)
        update_ops = []
        with ops.name_scope(name, self._name) as name:
            self._prepare()
            for grad, var, processor in converted_grads_and_vars:
                with ops.name_scope("update_" + scope_name), ops.colocate_with(var):
                    # Dispatches to the subclass's _apply_dense() / _apply_sparse()
                    update_ops.append(processor.update_op(self, grad))
            with ops.control_dependencies([self._finish(update_ops, "update")]):
                with ops.colocate_with(global_step):
                    apply_updates = state_ops.assign_add(
                        global_step, 1, name=name)
        return apply_updates

    def _create_slots(self, var_list):
        pass

    def _get_or_make_slot_with_initializer(self, var, initializer, shape, dtype,
                                           slot_name, op_name):
        # self._slot_dict(slot_name) returns self._slots[slot_name], creating the
        # inner dict if it does not exist yet.
        new_slot_variable = slot_creator.create_slot_with_initializer(var, initializer, shape, dtype, op_name)
        self._slot_dict(slot_name)[_var_key(var)] = new_slot_variable
        return new_slot_variable


# slot_creator.py
def create_slot_with_initializer(primary, initializer, shape, dtype, name,
                                 colocate_with_primary=True):
    validate_shape = shape.is_fully_defined()
    prefix = primary.op.name
    # The slot lives under "<primary_op_name>/<slot_name>" in the graph.
    with variable_scope.variable_scope(None, prefix + "/" + name):
        with distribution_strategy.colocate_vars_with(primary):
            return _create_slot_var(primary, initializer, "", validate_shape,
                                    shape, dtype)

def _create_slot_var(primary, val, scope, validate_shape, shape, dtype):
    current_partitioner = variable_scope.get_variable_scope().partitioner
    # Temporarily disable any partitioner so the slot variable is created whole.
    variable_scope.get_variable_scope().set_partitioner(None)
    slot = variable_scope.get_variable(
        scope, initializer=val, trainable=False,  # slots are never trained directly
        use_resource=resource_variable_ops.is_resource_variable(primary),
        shape=shape, dtype=dtype,
        validate_shape=validate_shape)
    variable_scope.get_variable_scope().set_partitioner(current_partitioner)
    return slot

2.3 slot

An optimizer slot in TensorFlow is an auxiliary variable that the optimizer creates to store and update per-variable state. Each trainable variable can have one or more slots holding its state during optimization. For example, the Adam optimizer uses two slots per variable to store the first- and second-moment estimates; on every step these slots are updated and then combined with the gradient to compute the variable update. Slots are also used for other per-variable state, such as momentum accumulators and moving averages.
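
As a quick illustration (a minimal sketch; the variable w and the toy loss are assumptions for the example), the slots of an AdamOptimizer can be inspected through the public get_slot_names()/get_slot() API:

import tensorflow as tf

w = tf.Variable(tf.zeros([3]), name="w")
loss = tf.reduce_sum(tf.square(w - 1.0))

opt = tf.train.AdamOptimizer(learning_rate=0.01)
train_op = opt.minimize(loss)        # slots are created inside apply_gradients()

print(opt.get_slot_names())          # ['m', 'v']: first/second-moment estimates
m = opt.get_slot(w, "m")             # the slot variable paired with w
print(m)                             # e.g. <tf.Variable 'w/Adam:0' shape=(3,) ...>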

Case walkthrough

In the _create_slots() method of the AdagradDecayOptimizer shown below, when primary_var is
scope_emb/input_from_feature_columns/word_embedding/weights, the corresponding _slots content is:

{'accumulator': {
    (<tensorflow.python.framework.ops.Graph object at 0x000002871512F390>, 'scope_emb/input_from_feature_columns/word_embedding/weights'): 
    <tf.Variable 'OptimizeLoss/scope_emb/input_from_feature_columns/word_embedding/weights/AdagradDecay:0' shape=(1000, 10) dtype=float32_ref>
    }, 
 'accumulator_decay_power': {
     (<tensorflow.python.framework.ops.Graph object at 0x000002871512F390>, 'scope_emb/input_from_feature_columns/word_embedding/weights'): 
     <tf.Variable 'OptimizeLoss/scope_emb/input_from_feature_columns/word_embedding/weights/AdagradDecay_1:0' shape=(1000, 10) dtype=int64_ref>
     }
}

3. Common subclasses

3.1 GradientDescentOptimizer

  • class GradientDescentOptimizer(optimizer.Optimizer)
    Class implementing the gradient descent algorithm.
  • __init__(self, learning_rate)
    The learning rate is specified in the constructor.

3.2 AdagradOptimizer

@tf_export("train.AdagradOptimizer")
class AdagradOptimizer(optimizer.Optimizer):
    def _create_slots(self, var_list):
        for v in var_list:
            dtype = v.dtype.base_dtype
            if v.get_shape().is_fully_defined():
                init = init_ops.constant_initializer(self._initial_accumulator_value,
                                                     dtype=dtype)
            else:
                init = self._init_constant_op(v, dtype)
            # Method inherited from the Optimizer base class
            self._get_or_make_slot_with_initializer(v, init, v.get_shape(), dtype,
                                                    "accumulator", self._name)

3.3 AdagradDecayOptimizer

class AdagradDecayOptimizer(optimizer.Optimizer):
    def _create_slots(self, var_list):
        for v in var_list:
            # (Elided) dtype, v_shape and init are derived from v, as in
            # AdagradOptimizer._create_slots() above.
            with ops.colocate_with(v):
                self._get_or_make_slot_with_initializer(v, init, v_shape, dtype,
                                                        "accumulator", self._name)
                self._get_or_make_slot_with_initializer(
                    v, init_ops.zeros_initializer(self._global_step.dtype),
                    v_shape, self._global_step.dtype, "accumulator_decay_power",
                    self._name)


3.4 AdamOptimizer

  • AdamOptimizer(optimizer.Optimizer)
    Class implementing the Adam algorithm, a variant of stochastic gradient descent. Per variable it keeps two slots, "m" and "v"; see the sketch below.
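
For reference, a minimal NumPy-style sketch of the per-variable Adam update, following the formulas in the AdamOptimizer docstring (a conceptual sketch; the epsilon placement matches the TF implementation rather than the original paper):

import numpy as np

def adam_step(var, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are the two slots of this variable, t is the step."""
    m = beta1 * m + (1.0 - beta1) * grad            # slot "m": 1st-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad     # slot "v": 2nd-moment estimate
    lr_t = lr * np.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)   # bias correction
    var = var - lr_t * m / (np.sqrt(v) + eps)
    return var, m, v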

4. Using multiple optimizers together

Building a computation graph is like building with blocks: the graph can be split into modules, and different optimizers can naturally be applied to different modules.
The minimize method above takes a var_list argument, which lets different optimizers optimize different modules. How, then, do we partition the parameters by module?

collection

tf.add_to_collection(name, value)
Provides a global storage mechanism that is not affected by variable scopes: store something once, retrieve it anywhere.
tf.get_collection(key, scope=None)
The counterpart of add_to_collection: reads back what was stored, optionally filtered by scope. A sketch combining this with var_list follows below.
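
A minimal sketch of two optimizers coexisting (the scope and variable names module_a / module_b are only for illustration): the trainable-variables collection is read back with a scope filter to build each optimizer's var_list.

import tensorflow as tf

with tf.variable_scope("module_a"):
    wa = tf.get_variable("w", shape=[4, 4])
with tf.variable_scope("module_b"):
    wb = tf.get_variable("w", shape=[4, 4])

loss = tf.reduce_sum(tf.square(wa)) + tf.reduce_sum(tf.square(wb))   # toy loss

# get_collection with a scope filter partitions the trainable variables by module
vars_a = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="module_a")
vars_b = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="module_b")

train_a = tf.train.AdamOptimizer(1e-3).minimize(loss, var_list=vars_a)
train_b = tf.train.GradientDescentOptimizer(1e-1).minimize(loss, var_list=vars_b)

train_op = tf.group(train_a, train_b)    # run both updates in one step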
