To assist with everyday human activities, robots must solve complex long-horizon tasks and generalize to new settings.
Recent deep reinforcement learning (RL) methods show promise in fully autonomous learning, but they struggle to reach long-term goals in large environments.
On the other hand, Task and Motion Planning (TAMP) approaches excel at solving and generalizing across long-horizon tasks, thanks to their powerful state and action abstractions.
More importantly, LEAGUE learns manipulation skills in-situ of the task planning system, continuously growing its capability and the set of tasks that it can solve.
First, complex real-world tasks are often long-horizon. This requires a learning agent to explore a prohibitively large space of possible action sequences that scales exponentially with the task horizon.
Opal: Offline primitive discovery for accelerating offline reinforcement learning
Learning to coordinate manipulation skills via skill behavior diversification
Efficient bimanual manipulation using learned task schemas
The authors' critique of these works: low sample efficiency, lack of interpretability, and fragile generalization; they are task-specific and fall short in cross-task and cross-domain generalization.
How do the authors introduce TAMP, the method their approach builds on?
The authors first lay out the idea behind TAMP: it "leverages symbolic action abstractions to enable tractable planning and strong generalization." "Specifically, the symbolic action operators divide a large planning problem into pieces that are each easier to solve." "The 'lifted' action abstraction allows skill reuse across tasks and even domains." Here, "lifted" means the operators are parameterized over object variables rather than bound to specific objects, which is what lets the same skill be reused across tasks and even domains.
LEAGUE (LEarning and Abstraction with GUidancE): an integrated task planning and skill learning framework that learns to solve and generalize across long-horizon tasks.
State abstraction allows agents to focus on task-relevant features of the environment. Action abstraction enables temporally-extended decision-making for long-horizon tasks.
Augmenting reinforcement learning with behavior primitives for diverse manipulation tasks;
Accelerating robotic reinforcement learning via parameterized action primitives;
Compared with the works above, the authors' advantage is that their method can continuously expand its own set of motion primitives (skills) on its own.
III. METHOD
A. Background
MDP
This is essentially the standard reinforcement learning MDP framework, defined by the tuple:
$\langle \chi, A, R(x,a), T(x' \mid x, a), p(x^{0}), \gamma \rangle$
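As a concrete illustration of this tuple, here is a minimal Python sketch that packs the MDP components into one container and instantiates a toy two-state example. The names (`MDP`, `reward_fn`, `transition_fn`, `init_state_fn`) are hypothetical and not from the paper.

```python
from typing import Callable, NamedTuple, Sequence

# Hypothetical container for the MDP tuple <chi, A, R, T, p(x0), gamma>.
class MDP(NamedTuple):
    actions: Sequence[str]                       # A: the action set
    reward_fn: Callable[[str, str], float]       # R(x, a)
    transition_fn: Callable[[str, str], str]     # samples x' ~ T(x' | x, a)
    init_state_fn: Callable[[], str]             # samples x0 ~ p(x0)
    gamma: float                                 # discount factor

# A toy two-state instance: move from "start" to "goal".
toy_mdp = MDP(
    actions=["stay", "move"],
    reward_fn=lambda x, a: 1.0 if (x == "start" and a == "move") else 0.0,
    transition_fn=lambda x, a: "goal" if a == "move" else x,
    init_state_fn=lambda: "start",
    gamma=0.99,
)
```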
The objective is the expected discounted return:

$J = \mathbb{E}_{x^{0}, a^{0}, a^{1}, \cdots, a^{T-1}, x^{T} \sim \pi,\, p(x^{0})} \big[ \sum_{t} \gamma^{t} R(x^{t}, a^{t}) \big]$
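To make the objective concrete, the following sketch estimates $J$ by Monte Carlo rollouts: sample $x^0 \sim p(x^0)$, act with a policy, and accumulate discounted rewards. The function names (`rollout_return`, `estimate_J`) and the toy dynamics (copied inline from the previous sketch so the block is self-contained) are illustrative assumptions.

```python
import random
from typing import Callable

def rollout_return(policy: Callable[[str], str],
                   reward_fn: Callable[[str, str], float],
                   transition_fn: Callable[[str, str], str],
                   init_state_fn: Callable[[], str],
                   gamma: float,
                   horizon: int) -> float:
    """Discounted return of one episode: sum_t gamma^t * R(x^t, a^t)."""
    x, total, discount = init_state_fn(), 0.0, 1.0
    for _ in range(horizon):
        a = policy(x)
        total += discount * reward_fn(x, a)
        x = transition_fn(x, a)
        discount *= gamma
    return total

def estimate_J(policy, reward_fn, transition_fn, init_state_fn,
               gamma: float, horizon: int, n_episodes: int = 100) -> float:
    """Monte Carlo estimate of J = E[ sum_t gamma^t R(x^t, a^t) ]."""
    returns = [rollout_return(policy, reward_fn, transition_fn,
                              init_state_fn, gamma, horizon)
               for _ in range(n_episodes)]
    return sum(returns) / len(returns)

# Example: a random policy on the toy MDP from the previous sketch.
print(estimate_J(policy=lambda x: random.choice(["stay", "move"]),
                 reward_fn=lambda x, a: 1.0 if (x == "start" and a == "move") else 0.0,
                 transition_fn=lambda x, a: "goal" if a == "move" else x,
                 init_state_fn=lambda: "start",
                 gamma=0.99, horizon=20))
```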
Task planning space
$\langle O, \Lambda, \hat{\Psi}, \hat{\Omega}, G \rangle$
$O$: the set of objects.

$\Lambda$: a finite set of object types. Each object $o \in O$ has a type $\lambda \in \Lambda$, and $\mathrm{dim}(\lambda)$ is the number of features an object of that type carries (3D position, roll-pitch-yaw angles, and so on).

The following mapping exists: for every state $x \in \chi$ and object $o \in O$, the state assigns the object a feature vector $x(o) \in \mathbb{R}^{\mathrm{dim}(\mathrm{type}(o))}$. Written in a conditional-probability style, $x \mid o = x(o) \in \mathbb{R}^{\mathrm{dim}(\mathrm{type}(o))}$: querying a state $x \in \chi$ for an object $o \in O$ yields a real-valued vector whose dimension matches the type of $o$.
$\hat{\Psi}$: the set of predicates, which describe relations among objects $o \in O$. Each predicate $\psi \in \hat{\Psi}$ (e.g., Holding) consists of a tuple of object types $(\lambda_{1}, \cdots, \lambda_{m})$ and a binary classifier that determines whether the relation holds:

$c_{\psi}: \chi \times O^{m} \rightarrow \{\mathrm{True}, \mathrm{False}\}$

where each argument $o_{i} \in O$ has its own type $\lambda_{i} \in \Lambda$.
A task goal $g \in G$ can be expressed as a set of ground atoms. The symbolic state $x_{\Psi}$ is obtained by evaluating the set of predicates $\hat{\Psi}$ on the current state and keeping all positive (true) atoms.
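Putting the pieces together, the sketch below evaluates a set of predicates on a low-level state to form the symbolic state $x_{\Psi}$ (the set of atoms that currently hold) and checks whether a goal, given as a set of ground atoms, is satisfied. The `On` predicate, thresholds, and objects are toy assumptions, not taken from the paper.

```python
from itertools import permutations

import numpy as np

# Toy objects with their types and feature vectors (xyz positions).
OBJECT_TYPES = {"block_a": "block", "block_b": "block"}
state = {"block_a": np.array([0.1, 0.1, 0.05]),
         "block_b": np.array([0.1, 0.1, 0.00])}

def on_classifier(state, top, bottom):
    """Assumed 'On' predicate: top is roughly above bottom and close to it."""
    t, b = state[top], state[bottom]
    return bool(np.linalg.norm(t[:2] - b[:2]) < 0.03 and 0.0 < t[2] - b[2] < 0.08)

# Predicates as (name, argument-type tuple, classifier).
PREDICATES = [("On", ("block", "block"), on_classifier)]

def symbolic_state(state, object_types, predicates):
    """x_Psi: evaluate every predicate on every type-compatible object tuple
    and keep the positive (true) ground atoms."""
    atoms = set()
    for name, arg_types, clf in predicates:
        for objs in permutations(object_types, len(arg_types)):
            if tuple(object_types[o] for o in objs) == arg_types and clf(state, *objs):
                atoms.add((name,) + objs)
    return atoms

goal = {("On", "block_a", "block_b")}        # a goal g as a set of ground atoms
x_psi = symbolic_state(state, OBJECT_TYPES, PREDICATES)
print(x_psi, goal <= x_psi)                  # goal is satisfied if its atoms hold
```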
Comments: having worked through the definitions above, things become easier to understand.
Each object $o \in O$ is simply a physical object in the scene, such as a peg, a hole, or a small block. These objects naturally carry their own attributes, so a type $\lambda \in \Lambda$ describes an object's attributes (pose, ...); since different objects need different attribute information, the dimension of each $\lambda \in \Lambda$ differs as well.

The mapping then follows naturally: within any MDP state, each object maps to its attribute values at the current time step, i.e., a vector in $\mathbb{R}^{\mathrm{dim}(\mathrm{type}(o))}$. Next come the predicates, which emphasize "relations among multiple objects"; my understanding is "object A + predicate &