An autonomous agent often encounters various tasks within a single complex environment. **Our two-stage framework** first builds a simple directed weighted graph abstraction of the world in an unsupervised, task-agnostic manner, and then uses it to accelerate hierarchical reinforcement learning on a diverse set of downstream tasks. For details, please refer to the paper and its appendix.

- To discover the nodes, i.e. *pivotal states*, a recurrent variational autoencoder takes state-action pairs as inputs, infers binary latent variables, each of which indicates whether its corresponding input state is activated, and recovers the original action sequence from the activated states alone.
- The binary latent variables follow a Beta prior and a Hard Kumaraswamy approximate posterior. The prior signifies, on average, how critical a state is for action reconstruction, i.e. how likely it is to be a pivotal state.
- The overall loss consists of the evidence lower bound, a sparsity penalty that regularizes the desired ratio of pivotal states, and a transition penalty that distributes the pivotal states evenly.
- The training trajectories are collected via random walks and a curiosity-driven goal-conditioned policy that is optimized in alternation with the latent model.
- To forge the edges once the nodes are finalized, use random walks to decide the adjacency matrix, i.e. the neighbors of each pivotal state, and then use both random-walk trajectories and the goal-conditioned policy to approximate the edge weights as well as to form the transitions between nodes.
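The Hard Kumaraswamy posterior mentioned above admits a simple reparameterized sampler: draw from a Kumaraswamy via its inverse CDF, stretch the support slightly beyond (0, 1), and rectify back, so samples can land exactly on 0 (state skipped) or 1 (state activated). The sketch below is a minimal NumPy illustration of that stretch-and-rectify construction, not the paper's implementation; the function name and the stretch limits `l`, `r` are our own choices.

```python
import numpy as np

def hard_kuma_sample(a, b, rng, l=-0.1, r=1.1):
    """Sample from a stretched-and-rectified ("Hard") Kumaraswamy.

    Draws x ~ Kumaraswamy(a, b) via the inverse CDF, stretches the support
    from (0, 1) to (l, r), then clamps back to [0, 1] so the sample has
    point mass at exactly 0 and exactly 1. `a` and `b` broadcast.
    """
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.broadcast(a, b).shape)
    x = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)  # inverse-CDF sample
    t = l + (r - l) * x                              # stretch to (l, r)
    return np.clip(t, 0.0, 1.0)                      # rectify onto [0, 1]

rng = np.random.default_rng(0)
z = hard_kuma_sample(np.full(8, 0.5), np.full(8, 0.5), rng)
print(z)  # eight relaxed-binary gate values, all within [0, 1]
```

Because the clipping is the only non-smooth step, gradients with respect to `a` and `b` flow through the unclipped region, which is what makes the binary activations trainable inside the ELBO.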

- Baselines in our work include (but are not limited to) non-hierarchical A2C and its hierarchical extension, Feudal Network (FN).
- Wide-then-Narrow (WN) sequential Manager instruction: pre-define a set of states, e.g. the set of pivotal states, from which the Manager proposes a *Wide Goal*; within the local neighborhood of the Wide Goal, the Manager then proposes a *Narrow Goal*.
- World Graph traversal: send the agent from its nearby pivotal state to the potentially distant Wide Goal, which improves long-horizon planning and exploration. The traversal path is planned via dynamic programming and executed using edge paths or the goal-conditioned policy.
- Initialization with the goal-conditioned policy weights: achieves implicit skill transfer and more stable optimization.
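As a rough illustration of how a world graph can be forged from random walks and then used for traversal, the sketch below estimates directed edge weights as the average step count between consecutively visited pivotal states, then plans a minimum-cost node path with Dijkstra's algorithm. This is a hedged sketch under our own assumptions: the paper's exact dynamic-programming planner and edge-weight estimator may differ, and all names here (`build_edges`, `plan_traversal`) are hypothetical.

```python
import heapq
from collections import defaultdict

def build_edges(trajectories, pivotal):
    """Estimate directed edge weights between pivotal states from random walks.

    Each pair of consecutively visited pivotal states in a trajectory becomes
    a directed edge; the weight is the average number of environment steps
    taken between them (fewer steps = cheaper to traverse).
    """
    steps = defaultdict(list)
    for traj in trajectories:
        last, last_t = None, None
        for t, s in enumerate(traj):
            if s in pivotal:
                if last is not None and s != last:
                    steps[(last, s)].append(t - last_t)
                last, last_t = s, t
    return {e: sum(v) / len(v) for e, v in steps.items()}

def plan_traversal(edges, start, goal):
    """Dijkstra over the weighted world graph; returns the pivotal-node path.

    Assumes `goal` is reachable from `start`; a full implementation would
    handle disconnected nodes.
    """
    adj = defaultdict(list)
    for (u, v), w in edges.items():
        adj[u].append((v, w))
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Toy example: pivotal states A..D visited along three random walks.
trajs = [
    ["A", "x", "B", "y", "y", "C"],            # A -> B in 2 steps, B -> C in 3
    ["B", "z", "D"],                           # B -> D in 2 steps
    ["A", "x", "x", "x", "x", "x", "D"],       # A -> D directly in 6 steps
]
edges = build_edges(trajs, {"A", "B", "C", "D"})
print(plan_traversal(edges, "A", "D"))  # ['A', 'B', 'D'] (cost 4 beats direct cost 6)
```

Executing the planned node path then reduces to following stored edge trajectories, or handing each intermediate pivotal state to the goal-conditioned policy as a subgoal.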

- Tasks: MultiGoal (mean reward with std), in Dense Reward, Sparse Reward, and Stochastic variants, and Door-Key (mean success rate in % with std).
- 3 different Wide Goal sets: all valid states, pivotal states, randomly selected states.
- With or without goal-conditioned policy.
- With or without World Graph traversal.

- Compare baselines *A2C* and *FN*, both with initialization (without it, training fails):

## A2C
## FN

**without graph traversal, with initialization** (without initialization, training fails), compare using different Wide Goal sets:

## All Feasible States
## Random States
## Pivotal States

**with graph traversal, without initialization**, compare *pivotal states* and *randomly selected states* as Wide Goals:

## Random States
## Pivotal States

**with graph traversal, with initialization**, compare using *pivotal states* and *randomly selected states* as Wide Goals:

## Random States
## Pivotal States

- Compare *the proposed framework* **with WN, pivotal-state graph traversal, and initialization** against the *A2C* and *FN* baselines:

## A2C
## FN
## Proposed

**with initialization and graph traversal**, but when *all feasible states* are considered, traversal becomes trivial and hence performs the same as without traversal:

## All Feasible States
## Pivotal States
## Random States

**with graph traversal**, compare the impact of goal-conditioned policy initialization when using *pivotal states*:

## with Goal-Conditioned Policy Initialization
## without Goal-Conditioned Policy Initialization

**with graph traversal**, compare the impact of goal-conditioned policy initialization when using *randomly selected states*:

## with Goal-Conditioned Policy Initialization
## without Goal-Conditioned Policy Initialization