Offline Learning of Controllable Diverse Behaviors

¹Ubisoft La Forge, ²University of Angers, ³H Company

Abstract

Imitation Learning (IL) techniques aim to replicate human behaviors in specific tasks. While IL has gained prominence due to its effectiveness and efficiency, traditional methods often focus on datasets collected from experts to produce a single efficient policy. Recently, extensions have been proposed to handle datasets of diverse behaviors by mainly focusing on learning transition-level diverse policies or on performing entropy maximization at the trajectory level. While these methods may lead to diverse behaviors, they may not be sufficient to reproduce the actual diversity of demonstrations or to allow controlled trajectory generation. To overcome these drawbacks, we propose a different method based on two key features: a) Temporal Consistency, which ensures consistent behaviors across entire episodes and not just at the transition level, and b) Controllability, obtained by constructing a latent space of behaviors that allows users to selectively activate specific behaviors based on their requirements. We compare our approach to state-of-the-art methods over a diverse set of tasks and environments.

1. Diverse Behavior Learning

Challenge: Generating the same trajectory distribution $p_{\mathcal{M},\mu}(\tau)$ as in a dataset $\mathcal{D}_e$ generated by the experts $\Pi_e = \{\pi_e^{(1)}, \pi_e^{(2)}, \ldots, \pi_e^{(k)}\}$ with $1\le k \le |\mathcal{D}_e|$ in a fully offline manner.

Limits: Traditional imitation learning such as Behavior Cloning (BC) fails when $k>1$ because it captures only transition‑level diversity.
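
A toy illustration of this failure mode, not taken from the poster: assume two hypothetical experts in a setting where the state does not reveal which expert is acting, one always choosing action 0 and the other always action 1. A BC policy fit per transition recovers the 50/50 action frequencies, yet its rollouts rarely reproduce either expert's full trajectory. All names and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5  # episode length (hypothetical)

# Two hypothetical experts (k = 2): one always plays action 0, the other always action 1.
demos = np.array([[0] * T, [1] * T] * 50)  # shape (100, T)

# With an uninformative state, the maximum-likelihood BC policy is simply
# the empirical action frequency over all transitions.
p_action1 = demos.mean()

# Roll out the BC policy: every step is sampled independently of the previous ones.
rollouts = (rng.random((10_000, T)) < p_action1).astype(int)

# Fraction of rollouts that keep the same action for the whole episode,
# i.e. that look like one of the two experts.
consistent = np.all(rollouts == rollouts[:, :1], axis=1).mean()
expected = 2 * 0.5 ** T
```

The per-step action distribution matches the data, but only a fraction close to `expected` (2 · 0.5^T) of the rollouts is temporally consistent, whereas every demonstration is.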

Question: How to learn from $\mathcal{D}_e$ a policy $\pi$ capable of:

  •  Diversity: Reconstructing the trajectory distribution?
  •  Controllability: Being prompted to generate specific behaviors?
  •  Robustness: Displaying robustness to compounding errors and environment stochasticity?

2. ZBC: Stylized BC

ZBC addresses Diversity & Controllability by minimising

$$ \mathcal{L}_{ZBC}(\theta) = -\mathbb{E}_{\textcolor{blue}{\tau_i}\sim \mathcal{D}}\Bigl[\,\mathbb{E}_{(\textcolor{blue}{s_t^i},\textcolor{blue}{a_t^i})\sim \textcolor{blue}{\tau_i}}\bigl[\log\pi_{\theta}(\textcolor{blue}{a_t^i}\mid \textcolor{blue}{s_t^i},\textcolor{blue}{z_i})\bigr]\Bigr]. $$

and reconstructing

$$ p_{\mathcal{M},\pi_{\theta}}^{ZBC}(\tau)=\frac{1}{|\mathcal{D}_e|}\sum_{\textcolor{blue}{i}=0}^{|\mathcal{D}_e|-1}p_{\mathcal{M},\pi_{\theta}(\cdot\mid\cdot,\textcolor{blue}{z_i})}(\tau). $$
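
The objective and the uniform mixture above can be sketched numerically as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: actions are discrete, the policy is a linear softmax with weights `W`, and the styles are rows of a table `Z` with one entry per trajectory (how $z_i$ is obtained is not specified here). All function and variable names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def zbc_loss(W, Z, trajectories):
    """Negative log-likelihood of each trajectory's actions under its own style.

    W: (state_dim + style_dim, n_actions) weights of a linear softmax policy (assumed form).
    Z: (n_trajectories, style_dim) style table, one row z_i per trajectory (assumed form).
    trajectories: list of (states (T_i, state_dim), actions (T_i,)) pairs.
    """
    per_trajectory = []
    for i, (states, actions) in enumerate(trajectories):
        z = np.broadcast_to(Z[i], (len(states), Z.shape[1]))
        inputs = np.concatenate([states, z], axis=1)
        log_probs = log_softmax(inputs @ W)
        # Inner expectation: average over the transitions of trajectory i.
        per_trajectory.append(log_probs[np.arange(len(actions)), actions].mean())
    # Outer expectation: average over trajectories, then negate.
    return -float(np.mean(per_trajectory))

def sample_style(Z, rng):
    # Uniform draw over the stored styles, matching the 1/|D_e| mixture weights.
    return Z[rng.integers(len(Z))]

rng = np.random.default_rng(0)
state_dim, style_dim, n_actions = 2, 3, 4
trajectories = [
    (rng.normal(size=(T, state_dim)), rng.integers(n_actions, size=T))
    for T in (5, 7, 6)
]
Z = rng.normal(size=(len(trajectories), style_dim))
W = np.zeros((state_dim + style_dim, n_actions))

loss_at_init = zbc_loss(W, Z, trajectories)  # uniform policy, so the loss is log(n_actions)
z_rollout = sample_style(Z, rng)             # style held fixed for one whole episode
```

Holding `z_rollout` fixed for a whole episode is what turns the per-style policies into the trajectory-level mixture, instead of re-sampling behavior at every step.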

Yet compounding errors and environment stochasticity can create out-of-distribution (OOD) $(s,z)$ pairs, and ZBC tends to overfit.

3. WZBC: Weighted Stylized BC

WZBC achieves Diversity, Controllability and Robustness with

$$ \mathcal{L}_{WZBC}(\theta) = -\mathbb{E}_{(\textcolor{blue}{\tau_i},\textcolor{red}{\tau_j})\sim \mathcal{D}}\Bigl[\,\mathbb{E}_{(\textcolor{blue}{s_t^i},\textcolor{blue}{a_t^i})\sim \textcolor{blue}{\tau_i}}\bigl[e^{-\beta\,\nu(\textcolor{blue}{\tau_i},\textcolor{red}{\tau_j})}\,\log\pi_{\theta}(\textcolor{blue}{a_t^i}\mid \textcolor{blue}{s_t^i},[\textcolor{red}{z_j}]_{i\neq j})\bigr]\Bigr]. $$

where the style distance $\nu$ is

$$ \forall (\textcolor{blue}{\tau_i},\textcolor{red}{\tau_j})\in \mathcal{D}_e:\;\;\nu(\textcolor{blue}{\tau_i},\textcolor{red}{\tau_j}) = \frac{\left\|\operatorname{pad}(\textcolor{blue}{\tau_i^s})\!-\!\operatorname{pad}(\textcolor{red}{\tau_j^s})\right\|}{\displaystyle\max_{\textcolor{green}{\tau_k}\in \mathcal{D}_e}\left\|\operatorname{pad}(\textcolor{blue}{\tau_i^s})\!-\!\operatorname{pad}(\textcolor{green}{\tau_k^s})\right\|}. $$
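
A minimal sketch of the style distance and the resulting WZBC weights, assuming the norm is Euclidean over flattened state sequences and that $\operatorname{pad}$ repeats the last state up to the longest trajectory length. The padding scheme and all names below are assumptions made for illustration.

```python
import numpy as np

def pad_states(states, length):
    # Pad a (T, state_dim) sequence to `length` steps by repeating its last state.
    # The padding scheme is an assumption; the source only names pad(.).
    extra = length - len(states)
    return np.concatenate([states, np.repeat(states[-1:], extra, axis=0)], axis=0)

def style_distance_matrix(state_sequences):
    # nu[i, j] = ||pad(tau_i) - pad(tau_j)|| / max_k ||pad(tau_i) - pad(tau_k)||
    # Assumes every trajectory differs from at least one other (non-zero row maximum).
    max_len = max(len(s) for s in state_sequences)
    flat = np.stack([pad_states(s, max_len).ravel() for s in state_sequences])
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    return dists / dists.max(axis=1, keepdims=True)

def wzbc_weights(nu, beta):
    # Weight applied to log pi(a | s, z_j) for transitions taken from tau_i.
    return np.exp(-beta * nu)

# Three toy state sequences of unequal length (hypothetical data).
state_sequences = [np.zeros((3, 2)), np.ones((2, 2)), 2 * np.ones((3, 2))]
nu = style_distance_matrix(state_sequences)
weights = wzbc_weights(nu, beta=5.0)
```

Because $\nu$ is normalized per row, it lies in $[0, 1]$ and is not symmetric in general. If pairs with $i = j$ are included, $\beta = 0$ weights every style equally for every trajectory, so the style carries no information, as in plain BC, while a large $\beta$ suppresses all pairs except $i = j$ and recovers the ZBC objective.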

4. Environments & Datasets

Maze2D (One‑side k=2, Only‑forward k=12)

[Figures: Maze2D one-side and only-forward environments, each shown with its behavior histogram]

D3IL (Avoiding k=24, Aligning k=2)

[Figures: D3IL avoiding and aligning environments, each shown with its behavior histogram]

We use the balanced datasets from D3IL as well as unbalanced datasets obtained from their filtered versions.

5. Diversity Results

Evaluating how well ZBC and WZBC capture diversity. ZBC generates behavior histograms with the lowest L1 distance to $\mathcal{D}_e$ on most datasets, yet attains lower success rates in the more complex settings. WZBC is slightly less precise but succeeds more often there.

| Dataset (distance) | BC | ZBC | WZBC | BESO | DDPM‑ACT | DDPM‑GPT |
|---|---|---|---|---|---|---|
| medium_maze-only_forward | 1.74 ± 0.054 | 0.256 ± 0.023 | 0.248 ± 0.047 | 0.744 ± 0.041 | 0.916 ± 0.252 | 0.604 ± 0.082 |
| medium_maze-one_side | 1.4 ± 0.49 | 0.044 ± 0.032 | 0.06 ± 0.033 | 0.140 ± 0.049 | 0.640 ± 0.390 | 0.100 ± 0.075 |
| d3il_avoiding | 1.917 ± 0.0 | 0.265 ± 0.0 | 0.482 ± 0.026 | 0.901 ± 0.091 | 0.781 ± 0.184 | 0.531 ± 0.093 |
| d3il_unbalanced_avoiding | 1.925 ± 0.062 | 0.102 ± 0.116 | 1.457 ± 0.087 | 1.283 ± 0.067 | 1.342 ± 0.134 | 1.26 ± 0.022 |
| d3il_aligning | 1.0 ± 0.0 | 0.172 ± 0.17 | 0.552 ± 0.224 | 0.472 ± 0.111 | 0.488 ± 0.075 | 0.296 ± 0.104 |
| d3il_unbalanced_aligning | 0.4 ± 0.0 | 0.172 ± 0.057 | 0.364 ± 0.037 | 0.256 ± 0.066 | 0.212 ± 0.053 | 0.288 ± 0.063 |

| Dataset (success rate) | BC | ZBC | WZBC | BESO | DDPM‑ACT | DDPM‑GPT |
|---|---|---|---|---|---|---|
| medium_maze-only_forward | 1.0 ± 0.0 | 1.0 ± 0.0 | 0.99 ± 0.0 | 0.998 ± 0.004 | 0.9 ± 0.12 | 1.0 ± 0.0 |
| medium_maze-one_side | 0.6 ± 0.49 | 1.0 ± 0.0 | 1.0 ± 0.0 | 1.0 ± 0.0 | 0.994 ± 0.012 | 1.0 ± 0.0 |
| d3il_avoiding | 1.0 ± 0.0 | 0.996 ± 0.005 | 0.954 ± 0.024 | 0.998 ± 0.008 | 0.904 ± 0.08 | 0.986 ± 0.006 |
| d3il_unbalanced_avoiding | 0.6 ± 0.49 | 0.75 ± 0.092 | 0.802 ± 0.113 | 1.0 ± 0.0 | 0.996 ± 0.005 | 0.999 ± 0.0 |
| d3il_aligning | 0.21 ± 0.395 | 0.52 ± 0.432 | 0.806 ± 0.105 | 0.910 ± 0.12 | 0.872 ± 0.047 | 0.852 ± 0.055 |
| d3il_unbalanced_aligning | 1.0 ± 0.0 | 0.328 ± 0.054 | 0.762 ± 0.126 | 0.922 ± 0.013 | 0.882 ± 0.038 | 0.844 ± 0.015 |
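
The distance reported above compares behavior histograms. A minimal sketch of such a metric, assuming each trajectory has already been assigned a discrete behavior label and that both histograms are normalized to sum to 1 (the labeling procedure and the normalization are assumptions, and the names are hypothetical):

```python
import numpy as np

def behavior_histogram(labels, n_behaviors):
    # Normalized frequency of each behavior label in a set of trajectories.
    counts = np.bincount(labels, minlength=n_behaviors)
    return counts / counts.sum()

def l1_distance(p, q):
    # L1 distance between two normalized histograms; ranges from 0 to 2.
    return float(np.abs(p - q).sum())

n_behaviors = 2
dataset_labels = np.array([0, 1] * 50)       # balanced dataset with two behaviors
collapsed_labels = np.zeros(100, dtype=int)  # policy that always shows behavior 0
matched_labels = np.array([1, 0] * 50)       # policy that reproduces both behaviors equally

reference = behavior_histogram(dataset_labels, n_behaviors)
d_collapsed = l1_distance(behavior_histogram(collapsed_labels, n_behaviors), reference)
d_matched = l1_distance(behavior_histogram(matched_labels, n_behaviors), reference)
```

Under these assumptions, a policy that collapses onto one of two equally frequent behaviors scores 1.0, a policy that matches the dataset scores 0.0, and the upper bound of 2 is reached only when the two histograms share no behavior.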

6. Controllable Generation

[Figure: control histograms for length_metric]

For control, we:

  • Select trajectory styles in $\mathcal{D}_e$ that satisfy a property $\Psi(\tau)$ with $\Psi(\tau)=\bigl(\text{length}(\tau)\!\in[70,80]\bigr)$.
  • Generate adequate trajectories by conditioning $\pi$ on the selected styles.
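
The two steps above can be sketched as follows, assuming a style table `Z` with one row per trajectory of $\mathcal{D}_e$ and a list of trajectory lengths. Both are hypothetical stand-ins for whatever the trained model stores.

```python
import numpy as np

def select_styles(lengths, low=70, high=80):
    # Indices i whose trajectory satisfies Psi(tau_i): length within [low, high].
    lengths = np.asarray(lengths)
    return np.flatnonzero((lengths >= low) & (lengths <= high))

rng = np.random.default_rng(0)
dataset_lengths = [65, 72, 80, 95, 70, 81]      # hypothetical lengths of the trajectories in D_e
Z = rng.normal(size=(len(dataset_lengths), 3))  # hypothetical style table, one row per trajectory

selected = select_styles(dataset_lengths)       # trajectories 1, 2 and 4 satisfy Psi
z = Z[rng.choice(selected)]                     # style to condition pi on for the next rollout
```

Sampling uniformly among the selected styles is one possible design choice; any other distribution over `selected` would still restrict generation to behaviors that satisfy $\Psi$ in the dataset.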

7. Robustness to Stochasticity

Evaluating ZBC and WZBC under stochastic dynamics. WZBC generally surpasses ZBC in both L1-distance and success rate as uncertainty increases.

| Configuration | L1 distance (ZBC) | L1 distance (WZBC) | Success rate (ZBC) | Success rate (WZBC) |
|---|---|---|---|---|
| medium_maze-only-forward (determinist) | 0.256 ± 0.023 | 0.248 ± 0.047 | 1.0 ± 0.0 | 0.99 ± 0.0 |
| medium_maze-only-forward (pseudo-r-init) | 1.152 ± 0.094 | 0.828 ± 0.349 | 0.448 ± 0.031 | 0.684 ± 0.152 |
| medium_maze-only-forward (r-init) | 1.556 ± 0.079 | 1.552 ± 0.037 | 0.858 ± 0.046 | 0.978 ± 0.019 |
| medium_maze-only-forward (noise-transi) | 0.729 ± 0.134 | 0.744 ± 0.029 | 0.632 ± 0.066 | 0.744 ± 0.038 |

8. Conclusion

ZBC learns a style‑conditioned policy enabling trajectory diversity capture and controllable generation in a fully unsupervised offline setting. WZBC bridges BC and ZBC by mixing styles to enhance robustness.

Limits: A simple Euclidean distance does not scale to all observation modalities, and performance can degrade in highly complex environments.

Future work: Incorporating IRL techniques for richer observation spaces and leveraging RL signals to improve task performance.