一. gradient
Gradients are computed by the automatic-differentiation machinery based on the chain rule. It is "automatic" because the set of matrix-operation ops is fixed and finite, so a gradient function can be maintained for every op and hard-coded directly into the source, as sketched below.
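A minimal sketch, assuming TF 1.x graph mode. The commented-out registration mirrors, in simplified form, how tensorflow/python/ops/math_grad.py hard-codes the gradient of the Square op (re-registering an existing op would raise an error, so it is shown as pseudocode); tf.gradients then looks these registered functions up while walking the graph backwards:

import tensorflow as tf

# Simplified shape of a per-op gradient registration (do not actually re-register):
#
# @ops.RegisterGradient("Square")
# def _SquareGrad(op, grad):
#     x = op.inputs[0]
#     return grad * 2.0 * x

x = tf.constant(3.0)
y = tf.square(x)                # forward op
dy_dx = tf.gradients(y, x)      # backward pass looks up the registered Square gradient

with tf.Session() as sess:
    print(sess.run(dy_dx))      # [6.0]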
二. optimizer
The optimizer is responsible for gradient computation and parameter updates.
2.1 The Optimizer base class
- class tensorflow.python.training.optimizer.Optimizer
  The base class of all optimization methods.
- Optimizer.minimize(self, loss, global_step=None, var_list=None, …)
  Returns a train_op that applies the optimization method to minimize the loss. It is really just a wrapper around the two APIs below, compute_gradients() (gradient computation) and apply_gradients() (parameter update). When we want to insert custom processing between the two, we can call them separately and explicitly; a common scenario is gradient clipping, see reference [3] and the sketch after this list.
- Optimizer._slots
  Dict[slot_name, Dict[(graph, primary_var), slot_var]], the optimizer's auxiliary per-variable state; see section 2.3.
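A minimal sketch of the gradient-clipping scenario (assumes TF 1.x; the toy variable, loss, and clip norm are made up for illustration):

import tensorflow as tf

w = tf.Variable([1.0, 2.0])
loss = tf.reduce_sum(tf.square(w))
global_step = tf.train.get_or_create_global_step()

opt = tf.train.AdagradOptimizer(learning_rate=0.1)
grads_and_vars = opt.compute_gradients(loss)            # step 1: compute gradients
clipped = [(tf.clip_by_norm(g, 5.0), v)                 # custom step in between: clip
           for g, v in grads_and_vars if g is not None]
train_op = opt.apply_gradients(clipped, global_step=global_step)  # step 2: apply updates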
2.2 Simplified source code
@tf_export("train.Optimizer")
class Optimizer(checkpointable.CheckpointableBase):

  def __init__(self, use_locking, name):
    self._name = name
    # slot_name -> {(graph, primary_var_name): slot_variable}, see section 2.3.
    self._slots = {}

  def minimize(self, loss, global_step=None, var_list=None,
               gate_gradients=GATE_OP, aggregation_method=None,
               colocate_gradients_with_ops=False, name=None,
               grad_loss=None):
    """Add operations to minimize `loss` by updating `var_list`.

    This method simply combines calls `compute_gradients()` and
    `apply_gradients()`. If you want to process the gradient before applying
    them, call `compute_gradients()` and `apply_gradients()` explicitly instead
    of using this function.
    """
    grads_and_vars = self.compute_gradients(
        loss, var_list=var_list, gate_gradients=gate_gradients,
        aggregation_method=aggregation_method,
        colocate_gradients_with_ops=colocate_gradients_with_ops,
        grad_loss=grad_loss)
    return self.apply_gradients(grads_and_vars, global_step=global_step,
                                name=name)

  def compute_gradients(self, loss, var_list=None,
                        gate_gradients=GATE_OP,
                        aggregation_method=None,
                        colocate_gradients_with_ops=False,
                        grad_loss=None):
    # Builds the backward graph for `loss` w.r.t. `var_list` (body elided here).
    return grads_and_vars

  def apply_gradients(self, grads_and_vars, global_step=None, name=None):
    # In the full source, `converted_grads_and_vars`, `var_list` and
    # `scope_name` are derived from `grads_and_vars`; that code is elided here.
    with ops.init_scope():
      self._create_slots(var_list)   # subclasses create their slot variables here
    update_ops = []
    with ops.name_scope(name, self._name) as name:
      self._prepare()
      for grad, var, processor in converted_grads_and_vars:
        with ops.name_scope("update_" + scope_name), ops.colocate_with(var):
          # One update op per variable, dispatched to _apply_dense()/_apply_sparse().
          update_ops.append(processor.update_op(self, grad))
      with ops.control_dependencies([self._finish(update_ops, "update")]):
        with ops.colocate_with(global_step):
          # global_step is incremented only after all variable updates finish.
          apply_updates = state_ops.assign_add(global_step, 1, name=name)
      return apply_updates

  def _create_slots(self, var_list):
    # No slots by default; subclasses such as AdagradOptimizer override this.
    pass

  def _get_or_make_slot_with_initializer(self, var, initializer, shape, dtype,
                                         slot_name, op_name):
    new_slot_variable = slot_creator.create_slot_with_initializer(
        var, initializer, shape, dtype, op_name)
    self._slot_dict(slot_name)[_var_key(var)] = new_slot_variable
    return new_slot_variable


# The two helpers below live in tensorflow/python/training/slot_creator.py,
# not in the Optimizer class (also simplified).
def create_slot_with_initializer(primary, initializer, shape, dtype, name,
                                 colocate_with_primary=True):
  validate_shape = shape.is_fully_defined()
  prefix = primary.op.name
  # Slot variables are scoped under "<primary_var_name>/<op_name>".
  with variable_scope.variable_scope(None, prefix + "/" + name):
    with distribution_strategy.colocate_vars_with(primary):
      return _create_slot_var(primary, initializer, "", validate_shape, shape,
                              dtype)

def _create_slot_var(primary, val, scope, validate_shape, shape, dtype):
  # Temporarily disable any partitioner so the slot is created unpartitioned,
  # then restore it afterwards.
  current_partitioner = variable_scope.get_variable_scope().partitioner
  variable_scope.get_variable_scope().set_partitioner(None)
  slot = variable_scope.get_variable(
      scope, initializer=val, trainable=False,
      use_resource=resource_variable_ops.is_resource_variable(primary),
      shape=shape, dtype=dtype,
      validate_shape=validate_shape)
  variable_scope.get_variable_scope().set_partitioner(current_partitioner)
  return slot
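To see how _create_slots() and the per-variable update ops plug into the machinery above, here is a hedged toy subclass; the class name ToyMomentumOptimizer, its hyper-parameters, and the "accum" slot name are made up for illustration, and it only handles dense gradients on classic ref variables:

from tensorflow.python.ops import state_ops
from tensorflow.python.training import optimizer

class ToyMomentumOptimizer(optimizer.Optimizer):
    """Toy momentum update: accum = mu * accum + grad; var -= lr * accum."""

    def __init__(self, learning_rate, momentum=0.9, use_locking=False,
                 name="ToyMomentum"):
        super(ToyMomentumOptimizer, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._mu = momentum

    def _create_slots(self, var_list):
        # One zero-initialized "accum" slot per variable, analogous to the
        # "accumulator" slot of AdagradOptimizer shown below.
        for v in var_list:
            self._zeros_slot(v, "accum", self._name)

    def _apply_dense(self, grad, var):
        accum = self.get_slot(var, "accum")
        accum_t = state_ops.assign(accum, self._mu * accum + grad,
                                   use_locking=self._use_locking)
        return state_ops.assign_sub(var, self._lr * accum_t,
                                    use_locking=self._use_locking)

It can then be used exactly like the built-in optimizers, e.g. ToyMomentumOptimizer(0.1).minimize(loss).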
2.3 slot
An optimizer slot in TensorFlow is an auxiliary variable that the optimizer creates to store and update per-variable state. Each trainable variable can own one or more slots holding its state during optimization. For example, AdamOptimizer uses two slots per variable, storing the first- and second-moment estimates of its gradients; at every optimization step these slots are updated and then used to compute the variable's update. Other optimizers rely on slots in the same way, e.g. a momentum accumulator is kept as a slot.
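Slots can be inspected through the public get_slot_names()/get_slot() methods; a minimal sketch (TF 1.x, Adam chosen arbitrarily, toy variable and loss made up):

import tensorflow as tf

w = tf.Variable([1.0, 2.0], name="w")
loss = tf.reduce_sum(tf.square(w))

opt = tf.train.AdamOptimizer(learning_rate=0.01)
train_op = opt.minimize(loss)      # _create_slots() runs inside apply_gradients()

print(opt.get_slot_names())        # ['m', 'v'] for Adam
print(opt.get_slot(w, "m"))        # the slot variable, named like 'w/Adam:0'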
A concrete case
Corresponding to the _create_slots() method of AdagradDecayOptimizer below, when the primary_var is
scope_emb/input_from_feature_columns/word_embedding/weights
the corresponding contents of _slots are:
{'accumulator': {
    (<tensorflow.python.framework.ops.Graph object at 0x000002871512F390>, 'scope_emb/input_from_feature_columns/word_embedding/weights'):
        <tf.Variable 'OptimizeLoss/scope_emb/input_from_feature_columns/word_embedding/weights/AdagradDecay:0' shape=(1000, 10) dtype=float32_ref>
 },
 'accumulator_decay_power': {
    (<tensorflow.python.framework.ops.Graph object at 0x000002871512F390>, 'scope_emb/input_from_feature_columns/word_embedding/weights'):
        <tf.Variable 'OptimizeLoss/scope_emb/input_from_feature_columns/word_embedding/weights/AdagradDecay_1:0' shape=(1000, 10) dtype=int64_ref>
 }
}
三. Common subclasses
3.1 GradientDescentOptimizer
- class GradientDescentOptimizer(optimizer.Optimizer)
  The implementation of plain gradient descent.
- __init__(self, learning_rate)
  The learning rate is specified in the constructor (a usage sketch follows below).
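A minimal usage sketch (TF 1.x; the toy quadratic loss is made up for illustration):

import tensorflow as tf

w = tf.Variable(5.0)
loss = tf.square(w - 2.0)
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)
    print(sess.run(w))   # converges toward 2.0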
3.2 AdagradOptimizer
@tf_export("train.AdagradOptimizer")
class AdagradOptimizer(optimizer.Optimizer):

  def _create_slots(self, var_list):
    for v in var_list:
      dtype = v.dtype.base_dtype
      if v.get_shape().is_fully_defined():
        init = init_ops.constant_initializer(self._initial_accumulator_value,
                                             dtype=dtype)
      else:
        init = self._init_constant_op(v, dtype)
      # One "accumulator" slot per variable, holding the running sum of
      # squared gradients.
      self._get_or_make_slot_with_initializer(v, init, v.get_shape(), dtype,
                                              "accumulator", self._name)
3.3 AdagradDecayOptimizer
class AdagradDecayOptimizer(optimizer.Optimizer):

  def _create_slots(self, var_list):
    for v in var_list:
      with ops.colocate_with(v):
        # init / v_shape / dtype are computed as in AdagradOptimizer above
        # (elided here). Two slots per variable: the accumulator itself and
        # its decay power, whose dtype follows global_step.
        self._get_or_make_slot_with_initializer(v, init, v_shape, dtype,
                                                "accumulator", self._name)
        self._get_or_make_slot_with_initializer(
            v, init_ops.zeros_initializer(self._global_step.dtype),
            v_shape, self._global_step.dtype, "accumulator_decay_power",
            self._name)
3.4 AdamOptimizer
- class AdamOptimizer(optimizer.Optimizer)
  An optimizer implementing the Adam algorithm, a stochastic gradient-based optimization method.
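How its two moment slots ("m" and "v", see section 2.3) enter the parameter update, as a hedged single-step numeric sketch of the standard Adam formulas with toy scalar values (the real kernel is a fused op):

learning_rate, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
grad, m, v, var, t = 0.5, 0.0, 0.0, 1.0, 1            # toy scalars, first step

m = beta1 * m + (1 - beta1) * grad                    # "m" slot: first-moment estimate
v = beta2 * v + (1 - beta2) * grad * grad             # "v" slot: second-moment estimate
m_hat = m / (1 - beta1 ** t)                          # bias correction (beta1_power)
v_hat = v / (1 - beta2 ** t)                          # bias correction (beta2_power)
var -= learning_rate * m_hat / (v_hat ** 0.5 + eps)
print(var)                                            # ~ 0.999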
四. Multiple optimizers side by side
Building a computation graph is like building with blocks: it can be divided into several modules, and different optimizers can naturally be applied to different modules.
The minimize method above takes a var_list argument, which lets each optimizer update only the variables of its own module. So how do we partition the parameters by module? Collections are one way (see the sketch after the list below).
collection
- tf.add_to_collection(name, value)
  Provides a global storage mechanism that is not affected by variable name scopes: store in one place, fetch anywhere.
- tf.get_collection(key, scope=None)
  The counterpart of add_to_collection: reads back what was stored, optionally filtered by scope.
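A hedged sketch of the whole recipe (TF 1.x; the scope names "embedding" and "dnn", the shapes, and the learning rates are made up). Here the built-in TRAINABLE_VARIABLES collection plus the scope filter of get_collection does the partitioning, and each minimize call receives its own var_list:

import tensorflow as tf

with tf.variable_scope("embedding"):
    emb = tf.get_variable("emb", shape=[1000, 16])
with tf.variable_scope("dnn"):
    w = tf.get_variable("w", shape=[16, 1])

loss = tf.reduce_mean(tf.matmul(emb, w))   # toy loss touching both modules

# Partition trainable variables by scope via the TRAINABLE_VARIABLES collection.
emb_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="embedding")
dnn_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="dnn")

train_emb = tf.train.AdagradOptimizer(0.05).minimize(loss, var_list=emb_vars)
train_dnn = tf.train.AdamOptimizer(1e-3).minimize(loss, var_list=dnn_vars)
train_op = tf.group(train_emb, train_dnn)  # run both updates in one step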