Steeven Janny1 Hervé Poirier1 Leonid Antsfeld1 Guillaume Bono1 Gianluca Monaci1 Boris Chidlovskii1 Francesco Giuliari3 Alessio Del Bue2 Christian Wolf1
Abstract
Progress in Embodied AI has made it possible for end-to-end-trained agents to navigate in photo-realistic environments with high-level reasoning and zero-shot or language-conditioned behavior, but benchmarks are still dominated by simulation. In this work, we focus on the fine-grained behavior of fast-moving real robots and present a large-scale experimental study involving 262 navigation episodes in a real environment with a physical robot, where we analyze the type of reasoning emerging from end-to-end training. In particular, we study the presence of realistic dynamics which the agent learned for open-loop forecasting, and their interplay with sensing. We analyze the way the agent uses latent memory to hold elements of the scene structure and information gathered during exploration. We probe the planning capabilities of the agent, and find in its memory evidence for somewhat precise plans over a limited horizon. Furthermore, we show in a post-hoc analysis that the value function learned by the agent relates to long-term planning. Put together, our experiments paint a new picture of how tools from computer vision and sequential decision making have led to new capabilities in robotics and control. An interactive tool is available [here].
1 Introduction
1 Naver Labs Europe, France. firstname.lastname@naverlabs.com. 2 Istituto Italiano di Tecnologia, Genova, Italy. 3 Fondazione Bruno Kessler, Trento, Italy.
Historically, the importance of computer vision for robot navigation tasks largely depended on which field addressed the problem. On one hand, roboticists consider navigation as a perception and reconstruction problem combined with planning. This naturally required robots to deal with noise and uncertainty [75] and thus with computer vision, leading to the development of visual odometry and SLAM solutions [75, 55, 22, 74, 53]. On the other hand, research in machine learning addressed it as a form of sequential decision making, and a large body of work modeled it as a POMDP solved with Reinforcement Learning (RL). Early work was vision-less and focused on simple grid-like environments, finite states and observations, and tabular representations. The introduction of function approximation into RL enabled end-to-end training of agents in photo-realistic environments like Habitat [67] and AI2Thor [41], and very recently on large-scale offline datasets like Open-X-Embodiment [18]. Now, computer vision is enabling a switch to realistic conditions in what is now known as Embodied AI (EAI).
End-to-end navigation policies are typically trained with RL in an environment with simple motion — stepwise teleportation, leading to learning high-level decisions only. When deployed on a real robot, the displacements predicted by the policy are passed to a classical low-level controller, tasked with reaching the desired position. While this strategy led to breakthroughs in navigation, it results in very slow motion, as the agent stops between each step, and heavily relies on the quality of the low-level controller.
Interestingly, robotic manipulation followed a similar track, but moved rapidly to dynamics-aware environments [17, 2, 48], integrating a dynamical model to mimic the behavior of a real robot, which has been shown to be instrumental for sim2real transfer. Recent work [9] indicates that this also applies to navigation: empowering the simulator with a numerical model of the robot motion plays an important role in transferring the navigation skills of the end-to-end agent to a real robot. This policy does not perform position control, but outputs target velocities, which are input to a dynamical model simulating the behavior of the real physical robot, enabling smooth and quick navigation.
In this work we raise the question of what is actually learned by these policies. In contrast to a large part of the EAI community, which recently focused on tasks requiring high-level reasoning and semantics, up to language understanding, we here aim to understand the capabilities of trained agents in another task dimension: acting for precise motion enabled by fine-grained geometry/perception and reliable estimates of robot dynamics. There is evidence, for instance, that generalist robot policies are capable of addressing broad classes of tasks, but tend to fail at skills which require finer precision in motion planning [57]. An essential question remains: what are the inner workings of agents trained on large-scale visual data?
We experimentally show that a policy trained end-to-end with RL leads to the internal identification of a latent dynamical model from its interaction with the simulator alone, if realistic motion is simulated. We show that prediction/correction steps akin to a Kalman filter are learned, introduce a method to estimate the importance of each step by analyzing sensitivity of the policy to changes of the real dynamics of the agent, and show that a potential gap in dynamics can be closed with proper techniques. We introduce a new probing method evaluating the dynamics prediction performance of an agent.
We further show how the agent memory encodes information on the scene structure and on the explored area, and we measure the impact of memory ablations. We probe the role of long-horizon planning in these agents through comparisons with expert planners and by analyzing the agent's value function. Altogether, these experiments, a majority of which have been executed with real robots in a real environment, shed light on how large-scale training on sequential visual data with realistic motion leads to the emergence of a useful dynamical model in visual agents.
2 Related work
Visual navigation has been classically solved in robotics using explicit modeling [11, 49, 51], which requires solutions for mapping and localization [10, 44, 76], for planning [42, 70] and for low-level control [25, 64]. These methods depend on accurate sensor models, filtering, dynamical models and optimization. End-to-end trained models directly map input to actions and are typically trained with RL [35, 54, 81, 77] or IL [20]. They learn flat representations [9], occupancy maps [13], semantic maps [12], latent metric maps [5, 31, 58], topological maps [4, 14, 71], explicit episodic memory [16, 21, 24, 62], implicit representations [52] or navigability [8]. Our study targets end-to-end policies featuring recurrent memory and benefiting from an additional motion model in simulation.

Dynamical models were rarely considered a key concern for learning-based solutions in indoor navigation, which typically focus on planning and high-level control in simulation [12, 13, 26, 7] and rely on handcrafted low-level controllers for deployment. The notion of dynamics is more common in manipulation, where joint inertia, response time and slack must be simulated to accurately reproduce displacements and interactions [17, 2, 48]; consequently, they played a large role in sim2real gap mitigation and domain adaptation [79, 23]. Frequent choices are domain randomization and a stochastic dynamical model to maintain good behavior on the real platform [15, 56, 59, 50]. In navigation, the separation between high-level and low-level controllers, as in [12, 13, 26, 7], leads to delays and limits reactivity. Our work builds on [9], which solves this problem by augmenting the Habitat simulator with a dedicated realistic motion model.

Learning dynamical systems – A large body of work also focuses explicitly on learning dynamical systems from offline datasets of {action, observation} pairs, a task inherited from System Identification in control theory. Learned models then target different tasks, such as simulation [60, 3, 37], state estimation [73, 19] or control [3, 83, 36], by coupling the learned model with a classical algorithm.

Model-based ML – Learning accurate models of the environment has been classically done in RL as State Representation Learning [47]. Model-based RL directly integrates a model of the environment into the learning algorithm and/or the agent, e.g. as World Models [27, 28]. They have seen a recent resurgence in variants trained at large scale with next-token prediction [33, 82]. Our work is based on model-free RL, which obtains SOTA performance in the targeted navigation tasks, and evaluates what kind of models are learned from large-scale training.

Evaluating navigation models – The sim2real gap represents one of the main challenges [32, 38, 59]. Experiments with mobile robots are time-consuming, and the reproduction of identical evaluation conditions is difficult [65]. Evaluation in calibrated simulation has been proposed as an alternative [38].
| Method & action space | dyn* | Sim (+dyn.): SR | SPL | SCT | Sim (+dyn.): SR | SPL | SCT | Real: SR | SPL | SCT |
|---|---|---|---|---|---|---|---|---|---|---|
| (a) D4 [66] | ✗ | 29.1 | 18.1 | 2.0 | 10.0 | 7.3 | 0.9 | 0.0 | 0.0 | 0.0 |
| (b) D28-instant [80] | ✗ | 27.6 | 11.6 | 5.0 | 25.0 | 11.1 | 5.8 | 10.0 | 5.3 | 1.7 |
| (c) D28-dynamics [9] | ✓ | 97.6 | 82.3 | 52.5 | 100.0 | 82.1 | 64.5 | 92.5 | 61.1 | 22.4 |
2.9 | 5.0 | 3.0 |
3 Preliminaries
We reproduced the work by Bono et al. [9] and use PPO to train a policy in simulation. The conclusions of this work should hold for a variety of navigation tasks, but for the sake of simplicity we focus on point goal navigation of fast-moving agents in real environments, with a particular emphasis on sim2real transfer. In this section, we recall their setup and state our modifications, which are not considered as contributions. More details are in sup. mat. J. At each time step the agent receives a set of observations in the form of an RGB image and a Lidar-like vector of ranges, the "scan", and takes actions to reach a goal position given as polar coordinates relative to its starting position. In our case, the scan is collected by a set of four RealSense depth sensors. The action space is discrete; each of the 28 action classes corresponds to a pair of linear and angular velocity commands.
The agent does not have direct access to a map and outputs actions from sensory inputs. However, some form of onboard localization wrt. the starting position at t=0 is required, which allows the task to communicate the goal to the agent. We provide the agent with two forms of localization: integrated odometry sensed from wheel encoders, and Adaptive Monte Carlo Localization from the ROS package AMCL, which uses a 1D-Lidar. Both are given wrt. the start of the episode, as are the static point goal coordinates, and both inputs contain the estimated current position and velocity. Similar to a large body of work [30, 35, 40, 6, 7], the agent disposes of a latent memory vector $h_t$, defined as the hidden state of a recurrent network,
$$h_t = \mathrm{GRU}\big(\,[\,r(\mathbf{I}_t),\ s(\mathbf{s}_t),\ e(a_{t-1}),\ \dots\,],\ h_{t-1}\big), \qquad (1)$$
$$a_t \sim \pi(h_t), \qquad (2)$$
where GRU is a two-layer GRU with hidden state $h_t$; gating equations have been omitted for ease of notation. The different inputs go through dedicated encoders: a ResNet-18 $r(\cdot)$ for the RGB image, a 1D-CNN $s(\cdot)$ for the scan, and an embedding $e(\cdot)$ of the previous action. The other inputs are encoded with MLPs, omitted from the notation. The policy $\pi$ is linear.
The goal is static, i.e., constant over the episode and given in the coordinate frame centered on the agent at t=0. As the agent needs to transform it into its egocentric frame while moving, we add an auxiliary linear head that dynamically predicts the relative goal position, supervised with privileged information in simulation.
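To make this architecture concrete, the PyTorch sketch below reproduces the structure described above (two-layer GRU over encoded RGB, scan, previous-action and vector inputs, with linear policy, value and auxiliary goal heads). Layer sizes, the scan length and the vector-input dimensionality are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class NavPolicy(nn.Module):
    """Minimal sketch of the recurrent navigation policy described above.
    Hidden size, action count (28) and encoder types follow the paper where
    stated; everything else is an illustrative assumption."""
    def __init__(self, hidden=512, n_actions=28):
        super().__init__()
        self.rgb_enc = resnet18(num_classes=hidden)            # RGB encoder
        self.scan_enc = nn.Sequential(                          # 1D-CNN over ranges
            nn.Conv1d(1, 32, 7, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, 5, stride=4), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(hidden))
        self.act_emb = nn.Embedding(n_actions + 1, 32)          # previous action
        self.vec_enc = nn.LazyLinear(64)                        # goal + odometry + AMCL pose
        self.gru = nn.GRU(hidden * 2 + 32 + 64, hidden, num_layers=2)
        self.policy = nn.Linear(hidden, n_actions)              # linear policy head
        self.value = nn.Linear(hidden, 1)                       # PPO critic
        self.aux_goal = nn.Linear(hidden, 2)                    # auxiliary egocentric goal

    def forward(self, rgb, scan, prev_a, vec, h):
        # rgb: (B,3,H,W), scan: (B,L), prev_a: (B,), vec: (B,D), h: (2,B,hidden)
        x = torch.cat([self.rgb_enc(rgb),
                       self.scan_enc(scan.unsqueeze(1)),
                       self.act_emb(prev_a),
                       self.vec_enc(vec)], dim=-1)
        out, h = self.gru(x.unsqueeze(0), h)                    # one time step
        out = out.squeeze(0)
        return self.policy(out), self.value(out), self.aux_goal(out), h
```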
Training for realistic motion – As in [9], we integrate realistic motion into the Habitat simulator by identifying the physical parameters of the robot: a second-order dynamical model is identified from real collected trajectories. Remarkably, the policy trained in this simulator offers better performance in real-world experiments and leads to faster and smoother trajectories.
Evaluation – The experimental results in the paper are color-coded to indicate the different evaluation settings. Real experiments evaluate the agent trained in simulation on a real Rookie robot. This form of evaluation is the most challenging as it deals with the sim2real gap, but limits the number of episodes: we perform 20 episodes per experiment in our test building, and do certain analyses on a subset of 14 episodes. Simulation experiments evaluate in the Habitat simulator augmented with the realistic motion model identified from the real robot. This allows evaluating the model on a large number of scenes and episodes while still taking into account realistic non-linear motion: we evaluate on the 2,500 episodes of the HM3D [61] validation set, and certain experiments are done on the 250 episodes of HM3D val-mini. To compare performance with the real experiments, we also add experiments on a scanned and simulated version of the real episodes.
As usual, we report Success Rate (SR), i.e., the proportion of successful episodes (terminated within a fixed distance of the goal, with the agent not moving and no motion ordered in sim, and within a fixed distance in real), and SPL [1], i.e., SR weighted by path length,
$$\mathrm{SPL} = \frac{1}{N}\sum_{i} S_i\,\frac{l_i}{\max(p_i,\, l_i)},$$
where $p_i$ is the agent path length and $l_i$ the GT shortest path length. We also provide a lower bound on Success weighted by Completion Time (SCT) [80], which weighs success by time, explicitly taking the agent's dynamics model into consideration,
$$\mathrm{SCT} = \frac{1}{N}\sum_{i} S_i\,\frac{T^*_i}{\max(T_i,\, T^*_i)}.$$
Here, $T_i$ is the episode time and $T^*_i$ is the optimal time on the shortest path (as $T^*_i$ is an estimate obtained in simulation, $T_i < T^*_i$ may happen).
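For concreteness, a minimal sketch of how these two metrics can be computed from per-episode statistics is shown below; function and variable names are ours, not from the authors' code base.

```python
import numpy as np

def spl(success, agent_path_len, shortest_path_len):
    """Success weighted by Path Length (Anderson et al., 2018):
    mean over episodes of S_i * l_i / max(p_i, l_i)."""
    s = np.asarray(success, dtype=float)
    p = np.asarray(agent_path_len, dtype=float)
    l = np.asarray(shortest_path_len, dtype=float)
    return np.mean(s * l / np.maximum(p, l))

def sct(success, episode_time, optimal_time):
    """Success weighted by Completion Time (Yokoyama et al., 2021):
    mean over episodes of S_i * T*_i / max(T_i, T*_i).
    Since T* is estimated in simulation, T < T* can occur, hence the max()."""
    s = np.asarray(success, dtype=float)
    t = np.asarray(episode_time, dtype=float)
    t_star = np.asarray(optimal_time, dtype=float)
    return np.mean(s * t_star / np.maximum(t, t_star))
```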
Our modifications – We reproduce the work from [9], up to variations in experimental conditions inherent to evaluations in real conditions and in our own buildings. We substantially improved their performance through modifications in engineering details, which we did not ablate, as we do not consider them to be scientific contributions: (1) we extend training from 200M to 500M environment steps, (2) we replace raw angle inputs by sin/cos projections, and (3) we apply RGB data augmentations also at test time, which alone improved SR by 15% in the real-world setup.
Our results are shown in Table 3. The end-to-end agent obtains excellent navigation performance with a close-to-optimal success rate of 92.5% in real-world scenarios. In particular, the policy (c) learned with the simulator augmented with the dynamical model significantly outperforms the baselines: (b), the same agent trained with the classical simulator without the dynamical model, and (a), an agent with relative position commands FORWARD 25cm, TURN_LEFT, TURN_RIGHT, STOP, also trained without the dynamical model. This confirms the necessity of adding a realistic motion model, as reported in [9]. In the next sections, we provide a detailed analysis of what the agent has actually learned.
4 Do e2e agents learn a dynamical system?
Generally speaking, an agent has several options to estimate its current state wrt. the goal and to take actions. Two extreme solutions would be (1) to use sensory inputs alone, or (2) to learn latent dynamics alone and integrate it step by step. In the context of the studied trained navigation policies, a common (but as yet untested) belief is that these agents use a combination of both, often achieved through prediction-correction steps akin to a Kalman filter [39]: during the prediction step, the agent exploits a learned dynamical model to estimate its next state (open-loop forecasting), and during the correction step it corrects its estimate using sensory inputs (closed-loop improvement of the state estimate). Optimally, as the policy improves, the agent learns to better estimate its state by balancing the uncertainty of its internal dynamical model and the uncertainty in its learned sensor model. This would be, again, a generalization of the (Extended) Kalman filter, where these respective models are designed by experts and the uncertainties encoded explicitly in probability distributions. In this section, we test this hypothesis with a series of targeted experiments, partially in simulation, partially in the real environment.
– To quantify to what extent the trained policy is sensitive to changes in input vs. changes in the process dynamics, we propose a new analysis method, which evaluates the agent in environments subject to normalized and thus comparable corruptions and measures the drop in success rate. We disturb the dynamical model used in the simulator by modifying the damping ratio (lower values make the dynamics closer to instantaneous at the cost of larger overshoot), the response time (higher values require more time to reach a target velocity), or the maximal velocity. On the input side, we corrupt the odometry sensors by gradually increasing the parameters (mean and std. dev.) of the Gaussian noise applied to them.

Comparing changes in environment parameters is difficult, since these parameters do not have comparable ranges and have completely different meanings. In order to superpose the impact of these parameters on agent performance in the same plots in Figure 4, we introduce an intermediate quantity we call "distance to belief", which quantifies the spatial distance (expressed in meters) between the behavior of the agent in the corrupted environment and its behavior in the training environment. To this end, we collected a set of 1,000 5s-long action sequences and measured the average spatial distance between the agent paths obtained by replaying them in both versions of the environment, before and after corruption (see Fig. 13 and more details in appendix B).
Figure 5 shows the underlying causal graph: a change in an environment parameter translates into a measured distance to belief, comparable between parameters. It also gives rise to a change in navigation performance through two mediators: the out-of-distribution situation of the agent, and a possible difference in task complexity.
We report our results in Figure 4: for both agents, trained with (left) or without (right) the dynamical model, the largest sensitivity is observed for odometry inputs, indicating that the agent significantly relies on its sensing capabilities to correct a state potentially estimated by open-loop forecasting. For the agent trained with the dynamical model, changes in the parameters of the dynamics have a large effect, in particular response time and damping, almost comparable to changes in input odometry. Performance is low for the agent trained without dynamics: it seems to overfit to the "teleportation" motion seen in simulation. The remaining gap between the two agents is related to the OOD situation, i.e., the minimal difference between in-domain dynamics (here, instant and constant velocities) and corrupted dynamics (more details in Sec. C).
– Inspired by RMA [43], we train a new policy with a simulator which randomly samples environment changes during each training episode, aiming for a policy capable of adapting to them. We encode the environment parameters into an embedding vector, which is passed to the policy as an additional input, replacing eq. (2). Figure 6 shows the drop in performance between in-domain and OOD settings, and the gain in performance recovered by the RMA-augmented policy, which has access to the embedded ground-truth environment parameters. This provides evidence that changes in dynamics have an important impact on agent performance, and that the agent can adapt if it has access to them or can estimate them.
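A minimal sketch of this RMA-style conditioning is given below; the parameter list, embedding size and module names are assumptions for illustration, not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class EnvParamEncoder(nn.Module):
    """Sketch: embeds the (privileged) environment parameters, e.g. damping,
    response time and maximum velocity, into a vector that is concatenated to
    the other input features before the GRU, in the spirit of RMA.
    Sizes are illustrative assumptions."""
    def __init__(self, n_params=3, dim=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_params, 64), nn.ReLU(),
                                 nn.Linear(64, dim))

    def forward(self, env_params):        # (B, n_params)
        return self.mlp(env_params)       # (B, dim)

# During training, the simulator samples env_params per episode and the
# embedding is appended to the policy input; at deployment this embedding
# would have to be estimated, e.g. by an adaptation module as in RMA.
```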

– We further analyze the agent by directly probing whether it is capable of predicting its future pose. We generate a dataset of pairs of hidden states and agent poses from trajectories on the HM3D train split and train a model to predict the future pose from the hidden state $h_t$. We train two variants, each consisting of a separate network per prediction horizon. One is a linear model; the second one additionally has access to the goal and the previous action.
We stress that neither of the models has access to observations beyond time step $t$. Visualizations on 4 trajectories and numerical errors from scenes unseen during training are given in Figure 7. The linear variant (orange curve) shows quite low prediction error and good qualitative behavior over a reasonable horizon, but quickly seems to break down for larger horizons. The linear + prev-action variant (green curve) has lower error and shows better long-term behavior, indicating that providing some elementary information on the plan (the goal and a single previous action) to the probing network is helpful. We also explore a version which uses the latent dynamics of the agent itself, rolled out auto-regressively into the future (blue curve). It tests whether the recurrent update the agent performs over its past, as given by eq. (1), can also be used to roll out the latent state auto-regressively into the future for prediction, but without future observations (see sup. mat. D for more details and equations). Needless to say, the policy does not have access to such a model and, as expected, the results are lukewarm. The numerical results in the inset of Figure 7 are prediction errors for 39k trajectories in unseen scenes, given for 20 steps, i.e., 6s into the future. A mean error of 0.76m after 6s is arguably quite low, given that in many cases an accurate prediction cannot be made without future observations, as can easily be seen in the upper-left trajectory of Figure 9. We argue for the presence of short-to-medium horizon dynamics in the latent representation $h_t$.
– The agent has been trained for a maximum velocity of 1 m/s, and in Table 3 we test the impact of clipping its maximum speed on the real robot. Limiting the speed to 0.7 m/s is beneficial: we conjecture that the disadvantage of being OOD wrt. the training speed is compensated by the slightly easier task. Decreasing the speed limit further does not help. We set the limit to 0.7 m/s as the default behavior in all our experiments. Retraining the agent with this limitation could potentially increase performance further and is left for future work. In all tables, lines marked "(def)" indicate the default parameter settings used in the rest of the experiments.
– The decision loop of 3 Hz provides 333 ms to perform observation collection, pre-processing and the network forward pass, which our onboard Nvidia Jetson AGX Orin executes in 100 ms. We exploit the unused 230 ms in real-world experiments by taking decisions faster than the 333 ms used during training, cf. Table 3. The combined effects of being OOD and deciding faster on less "stale" observations do not seem to have a significant effect on navigation performance.
| Delay (ms) | SR | SPL | SCT |
|---|---|---|---|
| 325 | 100.0 | 66.5 | 24.1 |
| 270 | 100.0 | 61.1 | 23.8 |
| 225 | 92.9 | 57.7 | 22.0 |
| 190 | 92.9 | 62.5 | 24.6 |
| 150 (def) | 100.0 | 63.3 | 25.9 |
| 130 | 100.0 | 66.0 | 24.5 |
| Max. Vel. | SR | SPL | SCT |
|---|---|---|---|
| 100% | 85.0 | 54.0 | 20.9 |
| 70% (def) | 100.0 | 63.3 | 25.9 |
| 50% | 85.0 | 66.1 | 23.3 |
| Periodic | 2m | SR | SPL | SCT |
|---|---|---|---|---|
| ✗ | ✗ | 40.0 | 32.3 | 10.7 |
| ✗ (def) | ✓ (def) | 100.0 | 63.3 | 25.9 |
| 30 sec | ✓ | 85.0 | 58.9 | 23.7 |
| 10 sec | ✓ | 80.0 | 59.9 | 24.5 |
| 3 sec | ✓ | 75.0 | 55.9 | 18.8 |
5 Does e2e training lead to planning?
The agent analyzed in our work does not have an explicit baked-in inductive bias for planning, and we raise the question whether planning emerges through RL training alone.
– In Section 4 we introduced probing of the future pose from the hidden state $h_t$. The low prediction errors reported in Figure 7 on trajectories of the unseen HM3D validation scenes can only be achieved if the agent (i) is able to leverage a learned notion of dynamics, as argued, and (ii) holds a latent plan in its hidden state, as predicting the future pose also requires access to future actions. The visualizations also provide evidence of short-to-medium horizon plans, e.g., direction changes aligned between GT and prediction.
– As the agent has been trained with PPO, its learned critic provides an estimate of the value function. While the value estimate does not provide a long-horizon plan of the agent, it does provide the agent's estimate of the expected cumulated future reward, and as such gives some indication of what the agent plans to achieve. In Figure 8, we provide a post-hoc analysis of a single episode collected during a demonstration of the agent in a crowded scenario. The episode provides evidence that navigation strategies, in the form of choices of paths, are taken, tested, and rejected to be replaced by better options. In particular, these choices seem to have an effect on the value estimate: abandoning a navigation option for a more promising one increases the value estimate, as the agent now expects a higher future return. This gives some evidence that the agent has an idea of where it stands in a plan structured at the level of paths, and that its estimate of success goes beyond the effect of the next action.
– We introduce an expert in the form of the Fast Marching Square method [78], working on a floor-plan of our test building, which produces a cost function taking into consideration the effect of the agent's action on the geodesic distance from its position to the target. It also considers the alignment of the agent with the gradient of the geodesic distance, as well as a velocity term that encourages high velocity when far from the target and low velocity when approaching it, in order to stop correctly — cf. sup. mat. F for details.
We compare trajectories of the agent in the real environment with expert trajectories, using the cost function to estimate a planning quality measure $q_t$ for each time step (Eq. 3).
We translate this into a heatmap using kernel density estimation. To be more precise, we separately estimate positive and negative values and superpose them, as seen in Figure 10 (top) for the 60 episodes of the 3 experiments from Table 3(c). Figure 10 (bottom) shows the positive and negative actions of a single episode. Episodes are defined as navigation between consecutive goals, e.g., 0→1.
The heatmap indicates compatibility with expert decisions near bottleneck areas like doors (near goal numbers 16, 8, 15). The episode from 4 to 5 is particularly difficult due to the large difference between its Euclidean distance and its geodesic path: it always failed due to problems in long-term planning, corroborated by the bad decisions visible in the heatmap. Other complicated areas are narrow passages (e.g., near goal 1), where the agent fails to execute its plans.


6 Do agents use episodic memory?
The agent's latent representation $h_t$ is a compression of its history of observations, trained to keep the most relevant information for the task. We probe its impact on navigation.
– In Table 3 we test the agent in the real environment by zeroing its memory periodically, every 30s, 10s or 3s. Note that this requires resetting the episode start such that the goal is redefined, since the agent potentially uses its latent state to translate the static goal into an egocentric frame (see sup. mat. E). We see a steady decrease of agent performance as the frequency of memory ablations increases, with a drop of 25% in SR when memory is zeroed every 3s. As expected, the agent loses its capacity to remember which parts of the scene have been explored and gets stuck.
In our experiments we also zero memory when the onboard sensors estimate that the agent is closer than 2m to the goal. Without this, performance drops to 40% SR (first line in Table 3). We link this behavior to the task setting of static point goal navigation, where the goal coordinates fed to the agent are constant and given wrt. the episode start. Refreshing memory helps the agent to better deal with the very last meter to the goal and prevents it from prematurely deciding to stop.
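The sketch below illustrates this ablation protocol: the GRU state is zeroed either periodically or within 2 m of the goal, and the episode frame is re-anchored so that the static goal is re-expressed relative to the current pose (see sup. mat. E). The helper names, the 3 Hz step time and the state layout are assumptions for illustration.

```python
import numpy as np

def rebase_goal(goal_xy, pose):
    """Re-express a goal given in the old episode frame in a new frame
    centered on the current pose (x, y, yaw)."""
    x, y, yaw = pose
    dx, dy = goal_xy[0] - x, goal_xy[1] - y
    c, s = np.cos(-yaw), np.sin(-yaw)
    return np.array([c * dx - s * dy, s * dx + c * dy])

def maybe_reset_memory(h, t, period_s, goal_in_frame, pose_now,
                       dist_to_goal, step_dt=1.0 / 3.0):
    """Sketch of the memory-ablation protocol: zero the recurrent state
    either periodically or when the onboard estimate says the goal is
    closer than 2 m, and re-anchor the episode coordinate frame."""
    periodic = period_s is not None and (t * step_dt) % period_s < step_dt
    near_goal = dist_to_goal < 2.0
    if periodic or near_goal:
        h = np.zeros_like(h)                                # wipe latent memory
        goal_in_frame = rebase_goal(goal_in_frame, pose_now)  # reset episode frame
    return h, goal_in_frame
```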
– We trained a probing network to predict a local occupancy map centered on the agent's position from the agent memory $h_t$. We trained on the HM3D train split and show results on our test building, in simulation and on real data, in Figure 9. We also integrated the probing results from the 14 real episodes and superimposed them over the map of the building, both in simulation and using probes from the real data (inputs and memory); for the latter, we used the onboard pose estimates of the agent. The occupancy predictions align very well with the real occupancy of the scene, also in difficult areas with doors and transparent walls.
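A probing decoder of this kind could look as follows; the grid size, architecture and loss are illustrative assumptions rather than the exact probe used in the paper.

```python
import torch
import torch.nn as nn

class OccupancyProbe(nn.Module):
    """Sketch of an occupancy probe: decode the agent memory h_t into a local
    egocentric occupancy grid. Grid size (64x64) and layer sizes are
    assumptions; supervision would come from the simulator's GT maps."""
    def __init__(self, hidden=512, grid=64):
        super().__init__()
        self.fc = nn.Linear(hidden, 128 * (grid // 8) * (grid // 8))
        self.deconv = nn.Sequential(
            nn.Unflatten(1, (128, grid // 8, grid // 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))   # occupancy logits

    def forward(self, h):                    # h: (B, hidden)
        return self.deconv(self.fc(h))       # (B, 1, grid, grid)

# Trained with a binary cross-entropy loss against ground-truth local maps:
# loss = nn.functional.binary_cross_entropy_with_logits(probe(h), gt_map)
```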
7 Sensitivity of observation types
– We performed a Shapley value analysis [72] to assess the impact of the various input modalities on navigation performance. Rooted in cooperative game theory, the Shapley value fairly allocates a payoff among the players of a game; in our case, the players are the different inputs, and the Shapley value quantifies the contribution of each input to the agent's SR and SPL. To this end, we created a 'background' dataset with alternative inputs derived from previously unseen scenes and systematically perturbed one input modality at every step. This was conducted separately for odometry, localization, RGB, scan, and previous-action inputs. Fig. 11 shows that the agent heavily relies on odometry and scans, while RGB, localization, and previous actions contribute significantly less to the final outcome.
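The sketch below shows a standard Monte Carlo permutation estimator of such per-modality Shapley values, built around the 'background' replacement idea; the evaluation callback is hypothetical and stands in for running the agent with a chosen subset of true inputs.

```python
import random

def shapley_modalities(eval_sr, modalities, n_perm=200):
    """Monte Carlo estimate of Shapley values for input modalities.
    `eval_sr(active)` is a hypothetical callback returning the success rate
    when only the modalities in `active` receive their true inputs, the
    others being replaced by 'background' inputs from unseen scenes."""
    phi = {m: 0.0 for m in modalities}
    for _ in range(n_perm):
        order = random.sample(modalities, len(modalities))   # random coalition order
        active = set()
        prev = eval_sr(active)                               # all inputs perturbed
        for m in order:
            active.add(m)
            cur = eval_sr(active)
            phi[m] += (cur - prev) / n_perm                  # marginal contribution
            prev = cur
    return phi

# Example usage (hypothetical evaluation callback):
# phi = shapley_modalities(run_agent_with_inputs,
#                          ["rgb", "scan", "odometry", "localization", "prev_action"])
```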
– We explore replacing the pose input computed with AMCL from the Lidar input by visual localization computed from the real robot's onboard sensors. We use visual localization with repeatable keypoints (R2D2 [63]) on images retrieved from a dataset of observations [46] — cf. sup. mat. I. Results in Table 4 show that visual localization underperforms as measured by navigation metrics, which we link to failures to call STOP precisely enough. Since in (static) PointNav the goal is not specified visually but with episode-centric coordinates, localization is used heavily during the last meters, and sufficiently precise localization from visual input does not emerge from the RL loss alone.
| Pose input | SR (%) | SPL (%) | SCT (%) |
|---|---|---|---|
| Visual local. w. R2D2 [63] | 42.9 | 23.2 | 7.8 |
| AMCL from 1D Lidar (def) | 100.0 | 63.3 | 25.9 |
8 Discussion and conclusion
In this paper we raised the question of whether, and what kind of, models emerge in a navigation agent trained with model-free RL, in particular when trained on realistic motion. Our findings showed that some limited information on planning is present in the end-to-end trained memory representation, which can be exploited by a linear probe to predict poses over a short-to-medium time horizon with reasonable precision. This capacity emerges although planning has not been baked into the agent — the architecture is purely recurrent and updates a hidden memory over time up to the present, without performing any rollouts into the future, as for instance MPC or TD-MPC-like models do [29].
It can be argued that the actor-critic RL algorithm we used for training, PPO [68], did provide some form of long-horizon training signal through the loss of its value estimate (Generalized Advantage Estimation [69]). Our instance-based post-hoc analyses gave some evidence that the PPO value critic is indeed capable of providing value estimates which are linked to long-term plans of the agent, as can be seen by sharp discontinuities in the value estimate during changes of strategy and moments of observed "re-planning". This was confirmed in real-world settings, transferred from simulation. The elements of a long-term plan seem to be present in the agent's memory, as evidenced by the simple form of the PPO value estimate — a non-linear projection from the hidden state. Ablating the memory further confirmed its importance, leading to a drop in metrics as the agent cannot retain information on unsuccessful trajectories. We again stress that a large part of these experiments have been done on a real robot, over 262 episodes.
Further experiments probed the same agent memory for the presence of a precise fine-grained motion model, which was confirmed by numerical results up to a reasonable precision. This arguably comes as no surprise, as controlling a fast-moving robot with a low-frequency decision loop of 3 Hz requires some form of anticipation of the robot's future pose. We proposed a new sensitivity analysis that compares the effects of OOD behavior wrt. varying dynamics and odometry inputs in a consistent manner. It showed the reliance of the agent on odometry input, but also the strong positive impact of training the agent on realistic motion. This method also suggests that the agent has learned a Kalman-like procedure of prediction and correction steps.
What we did not see – our experiments failed to provide evidence for correct long-term plans of the agent. We noticed a form of "tunnel vision", where the agent made strategic choices for long-horizon paths which a human could have quickly evaluated as fruitless using high-level geometric understanding and situation awareness — see sup. mat. H for an example. We argue that additional work on large-scale training of geometric foundation models could provide further improvements in navigation and Embodied AI when used as pre-trained visual encoders.
References
- Anderson et al. [2018] Peter Anderson, Angel X. Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir Roshan Zamir. On evaluation of embodied navigation agents. arXiv preprint, 2018.
- Andrychowicz et al. [2020] OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 2020.
- Bauersfeld et al. [2021] Leonard Bauersfeld, Elia Kaufmann, Philipp Foehn, Sihao Sun, and Davide Scaramuzza. NeuroBEM: Hybrid aerodynamic quadrotor model. arXiv preprint, 2021.
- Beeching etal. [2020a]E. Beeching, J. Dibangoye, O. Simonin, and C. Wolf.Learning to reason on uncertain topological maps.In European Conference on Computer Vision (ECCV),2020a.
- Beeching etal. [2020b]Edward Beeching, Jilles Dibangoye, Olivier Simonin, and Christian Wolf.Egomap: Projective mapping and structured egocentric memory for deepRL.In ECML-PKDD, 2020b.
- Bongratz etal. [2024]Fabian Bongratz, Vladimir Golkov, Lukas Mautner, Luca DellaLibera, FrederikHeetmeyer, Felix Czaja, Julian Rodemann, and Daniel Cremers.How to choose a reinforcement-learning algorithm.CoRR, 2024.
- Bono etal. [2024a]Guillaume Bono, Leonid Antsfeld, Boris Chidlovskii, Philippe Weinzaepfel, andChristian Wolf.End-to-End (Instance)-Image Goal Navigation through Correspondenceas an Emergent Phenomenon,.In ICLR, 2024a.
- Bono etal. [2024b]Guillaume Bono, Leonid Antsfeld, Assem Sadek, Gianluca Monaci, and ChristianWolf.Learning with a Mole: Transferable latent spatial representationsfor navigation without reconstruction.In ICLR, 2024b.
- Bono etal. [2024c]Guillaume Bono, Hervé Poirier, Leonid Antsfeld, Gianluca Monaci, BorisChidlovskii, and Christian Wolf.Learning to navigate efficiently and precisely in real environments.In CVPR, 2024c.
- Bresson etal. [2017]Guillaume Bresson, Zayed Alsayed, Li Yu, and Sébastien Glaser.Simultaneous localization and mapping: A survey of current trends inautonomous driving.IEEE Transactions on Intelligent Vehicles, 2017.
- Burgard et al. [1998] Wolfram Burgard, Armin B. Cremers, Dieter Fox, Dirk Hähnel, Gerhard Lakemeyer, Dirk Schulz, Walter Steiner, and Sebastian Thrun. The interactive museum tour-guide robot. In AAAI/IAAI, pages 11–18, 1998.
- Chaplot et al. [2020a] Devendra Singh Chaplot, Dhiraj Gandhi, Abhinav Gupta, and Ruslan Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. In NeurIPS, 2020a.
- Chaplot et al. [2020b] Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural SLAM. In ICLR, 2020b.
- Chaplot et al. [2020c] Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological SLAM for visual navigation. In CVPR, 2020c.
- Chebotar et al. [2019] Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In International Conference on Robotics and Automation (ICRA), 2019.
- Chen et al. [2022] Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation. In CVPR, 2022.
- Chua etal. [2018]Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine.Deep reinforcement learning in a handful of trials usingprobabilistic dynamics models.NeurIPS, 2018.
- Collaboration [2024]Embodiment Collaboration.Open x-embodiment: Robotic learning datasets and rt-x models.In ICRA, 2024.
- deJesúsRubio and Yu [2007]José de JesúsRubio and Wen Yu.Nonlinear system identification with recurrent neural networks anddead-zone kalman filter algorithm.Neurocomputing, 2007.
- Ding etal. [2019]Yiming Ding, Carlos Florensa, Pieter Abbeel, and Mariano Phielipp.Goal-conditioned imitation learning.In NeurIPS, 2019.
- Du etal. [2021]Heming Du, Xin Yu, and Liang Zheng.Vtnet: Visual transformer network for object goal navigation.In International Conference on Learning Representations, 2021.
- Engel etal. [2014]J. Engel, T. Schöps, and D. Cremers.LSD-SLAM: Large-scale direct monocular SLAM.In ECCV, 2014.
- Eysenbach etal. [2021]Benjamin Eysenbach, Shreyas Chaudhari, Swapnil Asawa, Sergey Levine, and RuslanSalakhutdinov.Off-dynamics reinforcement learning: Training for transfer withdomain classifiers.In ICLR, 2021.
- Fang etal. [2019]Kuan Fang, Alexander Toshev, Li Fei-Fei, and Silvio Savarese.Scene memory transformer for embodied agents in long-horizon tasks.In CVPR, 2019.
- Fox etal. [1997]Dieter Fox, Wolfram Burgard, and Sebastian Thrun.The dynamic window approach to collision avoidance.IEEE Robotics & Automation Magazine, 1997.
- Gervet etal. [2023]Theophile Gervet, Soumith Chintala, Dhruv Batra, Jitendra Malik, andDevendraSingh Chaplot.Navigating to objects in the real world.Science Robotics, 8(79), 2023.
- Ha and Schmidhuber [2018]David Ha and Jürgen Schmidhuber.World Models.In NeurIPS, 2018.
- Hafner etal. [2021]Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba.Mastering atari with discrete world models.In ICLR, 2021.
- Hansen etal. [2024]Nicklas Hansen, Hao Su, and Xiaolong Wang.Td-mpc2: Scalable, robust world models for continuous control.In ICLR, 2024.
- Hausknecht and Stone [2015]Matthew Hausknecht and Peter Stone.Deep recurrent q-learning for partially observable mdps.In AAAI, 2015.
- Henriques and Vedaldi [2018]JoãoF. Henriques and Andrea Vedaldi.Mapnet: An allocentric spatial memory for mapping environments.In CVPR, 2018.
- Höfer et al. [2020] Sebastian Höfer, Kostas E. Bekris, Ankur Handa, Juan Camilo Gamboa Higuera, Florian Golemo, Melissa Mozifian, Christopher G. Atkeson, Dieter Fox, Ken Goldberg, John Leonard, C. Karen Liu, Jan Peters, Shuran Song, Peter Welinder, and Martha White. Perspectives on sim2real transfer for robotics: A summary of the RSS 2020 workshop. CoRR, 2020.
- Hu etal. [2023]Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, AlexKendall, Jamie Shotton, and Gianluca Corrado.Gaia-1: A generative world model for autonomous driving.In arXiv preprint, 2023.
- Humenberger et al. [2020] Martin Humenberger, Yohann Cabon, Nicolas Guerin, Julien Morat, Jérôme Revaud, Philippe Rerole, Noé Pion, Cesar de Souza, Vincent Leroy, and Gabriela Csurka. Robust image retrieval-based visual localization using Kapture, 2020.
- Jaderberg et al. [2017] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In ICLR, 2017.
- Janny etal. [2021]Steeven Janny, Vincent Andrieu, Madiha Nadri, and Christian Wolf.Deep kkl: Data-driven output prediction for non-linear systems.In Conference on Decision and Control (CDC), 2021.
- Janny etal. [2024]Steeven Janny, Madiha Nadri, Julie Digne, and Christian Wolf.Space and time continuous physics simulation from partialobservations.In International Conference on Learning Representations(ICLR), 2024.
- Kadian etal. [2020]Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexander Clegg, Erik Wijmans,Stefan Lee, Manolis Savva, Sonia Chernova, and Dhruv Batra.Sim2Real Predictivity: Does Evaluation in SimulationPredict Real-World Performance?IEEE Robotics and Automation Letters, 5(4):6670–6677, 2020.
- Kalman [1960]R.E. Kalman.A New Approach to Linear Filtering and PredictionProblems.Journal of Basic Engineering, 82(1):35–45, 1960.
- Khanna et al. [2024] Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. GOAT-Bench: A benchmark for multi-modal lifelong navigation. In CVPR, 2024.
- Kolve etal. [2017]Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and AliFarhadi.AI2-THOR: An Interactive 3D Environment for Visual AI.CoRR, 2017.
- Konolige [2000]Kurt Konolige.A gradient method for realtime robot control.In IROS, 2000.
- Kumar etal. [2021]Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik.RMA: Rapid motor adaptation for legged robots.In Robotics: Science and Systems XVII, Virtual Event, 2021,2021.
- Labbé and Michaud [2019]Mathieu Labbé and François Michaud.RTAB-Map as an open-source lidar and visual simultaneouslocalization and mapping library for large-scale and long-term onlineoperation.Journal of Field Robotics, 2019.
- Lee et al. [2021a] Donghwan Lee, Soohyun Ryu, Suyong Yeon, Yonghan Lee, Deokhwa Kim, Cheolho Han, Yohann Cabon, Philippe Weinzaepfel, Nicolas Guerin, Gabriela Csurka, and Martin Humenberger. Large-scale localization datasets in crowded indoor spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3227–3236, 2021a.
- Lee et al. [2021b] Donghwan Lee, Soohyun Ryu, Suyong Yeon, Yonghan Lee, Deokhwa Kim, Cheolho Han, Yohann Cabon, Philippe Weinzaepfel, Nicolas Guérin, Gabriela Csurka, and Martin Humenberger. Large-scale localization datasets in crowded indoor spaces. In CVPR, 2021b.
- Lesort etal. [2018]Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François Goudou, and DavidFilliat.State Representation Learning for Control: An Overview.Neural Networks, 2018.
- Loquercio etal. [2021]Antonio Loquercio, Elia Kaufmann, René Ranftl, Matthias Müller, VladlenKoltun, and Davide Scaramuzza.Learning high-speed flight in the wild.Science Robotics, 2021.
- Macenski etal. [2020]Steve Macenski, Francisco Martín, Ruffin White, and JonatanGinésClavero.The marathon 2: A navigation system.In IROS, 2020.
- Mankowitz et al. [2020] Daniel J. Mankowitz, Nir Levine, Rae Jeong, Abbas Abdolmaleki, Jost Tobias Springenberg, Yuanyuan Shi, Jackie Kay, Todd Hester, Timothy Mann, and Martin Riedmiller. Robust reinforcement learning for continuous control with model misspecification. In International Conference on Learning Representations (ICLR), 2020.
- Marder-Eppstein etal. [2010]Eitan Marder-Eppstein, Eric Berger, Tully Foote, Brian Gerkey, and KurtKonolige.The office marathon: Robust navigation in an indoor officeenvironment.In ICRA, 2010.
- Marza etal. [2023]Pierre Marza, Laetitia Matignon, Olivier Simonin, and Christian Wolf.Multi-Object Navigation with dynamically learned neural implicitrepresentations.In ICCV, 2023.
- Matsuki etal. [2024]Hidenobu Matsuki, Riku Murai, Paul H.J. Kelly, and AndrewJ. Davison.Gaussian Splatting SLAM.In CVPR, 2024.
- Mirowski etal. [2017]Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andy Ballard, AndreaBanino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu,Dharshan Kumaran, and Raia Hadsell.Learning to navigate in complex environments.In ICLR, 2017.
- Mur-Artal and Tardós [2017]Raúl Mur-Artal and JuanD. Tardós.ORB-SLAM2: an open-source SLAM system for monocular, stereo andRGB-D cameras.IEEE Transactions on Robotics, 2017.
- Muratore etal. [2022]Fabio Muratore, Fabio Ramos, Greg Turk, Wenhao Yu, Michael Gienger, and JanPeters.Robot learning from randomized simulations: A review.Frontiers in Robotics and AI, 2022.
- Nakamoto etal. [2024]Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine.Steering your generalists: Improving robotic foundation models viavalue guidance.In CORL, 2024.
- Parisotto and Salakhutdinov [2018]Emilio Parisotto and Ruslan Salakhutdinov.Neural map: Structured memory for deep reinforcement learning.In ICLR, 2018.
- Peng etal. [2018]XueBin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel.Sim-to-real transfer of robotic control with dynamics randomization.In International Conference on Robotics and Automation (ICRA),2018.
- Pfaff etal. [2020]Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter Battaglia.Learning mesh-based simulation with graph networks.In International Conference on Learning Representations, 2020.
- Ramakrishnan et al. [2021] Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M. Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X. Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In NeurIPS Datasets and Benchmarks Track, 2021.
- Reed et al. [2022] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A Generalist Agent. Transactions on Machine Learning Research, 2022.
- Revaud etal. [2019]Jerome Revaud, CesarDe Souza, Martin Humenberger, and Philippe Weinzaepfel.R2D2: Reliable and Repeatable Detector and Descriptor.In NeurIPS, 2019.
- Rösmann etal. [2015]Christoph Rösmann, Frank Hoffmann, and Torsten Bertram.Timed-elastic-bands for time-optimal point-to-point nonlinear modelpredictive control.In European Control Conference (ECC), 2015.
- Sadek etal. [2022a]Assem Sadek, Guillaume Bono, Boris Chidlovskii, and Christian Wolf.An in-depth experimental study of sensor usage and visual reasoningof robots navigating in real environments.In ICRA, 2022a.
- Sadek etal. [2022b]A. Sadek, G. Bono, B. Chidlovskii, and C. Wolf.An in-depth experimental study of sensor usage and visual reasoningof robots navigating in real environments.In ICRA, 2022b.
- Savva et al. [2019] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied AI research. In IEEE/CVF International Conference on Computer Vision, 2019.
- Schulman etal. [2017]John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov.Proximal policy optimization algorithms.arXiv preprint, 2017.
- Schulman etal. [2018]John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and PieterAbbeel.High-dimensional continuous control using generalized advantageestimation.In ICLR, 2018.
- Sethian [1996]JamesA Sethian.A fast marching level set method for monotonically advancing fronts.Proceedings of the National Academy of Sciences, 1996.
- Shah and Levine [2022]Dhruv Shah and Sergey Levine.ViKiNG: Vision-based kilometer-scale navigation with geographichints.In RSS, 2022.
- Shapley [1951]LloydS Shapley.Notes on the n-person game—ii: The value of an n-person game.1951.
- Stubberud etal. [1991]A Stubberud, H Wabgaonkar, and S Stubberud.A neural-network-based system identification technique.In Conference on Decision and Control, 1991.
- Sucar etal. [2021]Edgar Sucar, Shikun Liu, Joseph Ortiz, and AndrewJ. Davison.imap: Implicit mapping and positioning in real-time.In ICCV, 2021.
- Thrun etal. [2005a]Sebastian Thrun, Wolfram Burgard, and Dieter Fox.Probabilistic Robotics.MIT Press, 2005a.
- Thrun etal. [2005b]Sebastian Thrun, Wolfram Burgard, and Dieter Fox.Probabilistic robotics, 2005b.
- Uppal et al. [2024] Shagun Uppal, Ananye Agarwal, Haoyu Xiong, Kenneth Shaw, and Deepak Pathak. SPIN: Simultaneous perception, interaction and navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- Valero-Gomez etal. [2013]Alberto Valero-Gomez, JavierV. Gomez, Santiago Garrido, and Luis Moreno.The path to efficiency: Fast marching method for safer, moreefficient mobile robot trajectories.IEEE Robotics & Automation Magazine, 2013.
- VanBaar etal. [2019]Jeroen VanBaar, Alan Sullivan, Radu Cordorel, Devesh Jha, Diego Romeres, andDaniel Nikovski.Sim-to-real transfer learning using robustified controllers inrobotic tasks involving complex dynamics.In 2019 International Conference on Robotics and Automation(ICRA), 2019.
- Yokoyama etal. [2021]Naoki Yokoyama, Sehoon Ha, and Dhruv Batra.Success weighted by completion time: A dynamics-aware evaluationcriteria for embodied navigation.In IROS, 2021.
- Zeng etal. [2024]Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi Salvador, AlvaroHerrasti, Ross Girshick, Aniruddha Kembhavi, and Luca Weihs.Poliformer: Scaling on-policy rl with transformers results inmasterful navigators.CoRL, 2024.
- Zhang etal. [2024]Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, and Raquel Urtasun.Copilot4d: Learning unsupervised world models for autonomous drivingvia discrete diffusion.In ICLR, 2024.
- Zoboli etal. [2023]Samuele Zoboli, Steeven Janny, and Mattia Giaccagli.Deep learning-based output tracking via regulation and contractiontheory.In International Federation of Automatic Control (IFAC), 2023.
Supplementary Material
Appendix A Interactive website
We developed an interactive website to support our findings and help better visualize the results of our experiments. In particular, our project page features an interactive second-order dynamical model similar to the one implemented in the simulator. Several sliders control the values of the physical parameters of the model, and the animated figure displays the impact on the step response, the trajectory and the action space in real time. We also replay real episodes from the different methods in Table 3, synchronized on the same scene, to better compare them — although these episodes are replayed in the simulator, they were recorded with the real agent, with poses estimated onboard and then shown in the simulator. Figure 4 is replicated with different metrics and a visualization of the distance to belief for each point in the figure. The planning quality map (Figure 10) is also reproduced, with control over the parameters of the density estimation. Figure 12 shows some of the tools available on the website.
Appendix B Calculation of the distance to belief
The distance to belief measures the discrepancy between nominal trajectories in the in-domain environment and out-of-domain trajectories in the corrupted environment, hence modeling the impact of a change in configuration. Formally, let us define a function corresponding to the forward step of the environment parametrized by physical parameters $c$; it maps an action and a current state (position + velocity) to the next state. To simplify, this function models solely the dynamics of the robot; collisions are ignored in the computation.
The distance to belief is computed on a fixed set $A$ of action sequences of length $T$ and an initial state $s_0$. The metric is defined as follows:
$$d_b(\tilde{c}) = \frac{1}{|A|\,T}\sum_{a \in A}\sum_{t=1}^{T} \big\| p_t(a, c) - p_t(a, \tilde{c}) \big\|_2, \qquad (4)$$
where $\tilde{c}$ is the set of corrupted environment parameters and $p_t(a, c)$ is the position reached at step $t$ when replaying action sequence $a$ through the dynamics with parameters $c$. In simple words, the distance to belief is proportional to the area between the in-domain and out-of-domain trajectories; it is measured in meters, so a value of $d_b$ can be interpreted as the corrupted trajectory diverging from the in-domain one by $d_b$ meters on average (see Figure 13). For Figure 4 we compute $d_b$ using 1,000 sequences of 15 steps (corresponding to 5 s at the 3 Hz decision rate). The action sequences are collected by sampling navigation episodes from the HM3D train set solved by the model being studied (the corresponding agent for each panel of Figure 4). The distance to belief is actually independent of the policy, which is only used to collect meaningful action sequences corresponding to realistic movements; its calculation depends only on the physical parameters of the environment.
As a rule of thumb, large values of the distance to belief correspond to highly corrupted environments, while moderate values are arguably comparable with the sim2real distance between the real robot dynamics and the simulated model. Our interactive website shows the relation between the corrupted trajectory and the distance to belief.
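A direct implementation of Eq. (4), as reconstructed above, could look as follows; `step_dynamics` is a placeholder for the second-order model of Appendix C.1, and the state layout (position in the first two components) is an assumption.

```python
import numpy as np

def distance_to_belief(action_seqs, params_nominal, params_corrupted,
                       step_dynamics, s0):
    """Average spatial distance (in meters) between trajectories produced by
    the nominal and the corrupted dynamics when replaying the same action
    sequences (Eq. 4). `step_dynamics(state, action, params)` stands in for
    the second-order model of Appendix C.1; collisions are ignored."""
    total, count = 0.0, 0
    for actions in action_seqs:
        s_nom, s_cor = s0.copy(), s0.copy()
        for a in actions:
            s_nom = step_dynamics(s_nom, a, params_nominal)
            s_cor = step_dynamics(s_cor, a, params_corrupted)
            # states are assumed to hold (x, y) in their first two components
            total += np.linalg.norm(s_nom[:2] - s_cor[:2])
            count += 1
    return total / count
```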

Appendix C Details on Prediction vs. Correction
C.1 Dynamical model
The dynamics of the real robot is modeled in the simulator using a second-order dynamical model similar to [9]. Let $v_t$ and $\omega_t$ be the linear and angular velocities of the agent, and $a^v_t$ and $a^\omega_t$ the linear and angular velocity commands taken at time $t$. The dynamical model is
$$\tau^2\, \ddot{v}_t + 2\zeta\tau\, \dot{v}_t + v_t = a^v_t, \qquad \tau^2\, \ddot{\omega}_t + 2\zeta\tau\, \dot{\omega}_t + \omega_t = a^\omega_t, \qquad (5)$$
where $\tau$ and $\zeta$ are the response time and damping factor. We apply different values depending on the motion direction (acceleration or braking) and type (linear or angular), resulting in four different constants for each parameter. We also apply saturation on the acceleration (absent from this study) and on the velocity. This model is inserted between the policy and the simulator: the agent predicts a velocity command at a 3 Hz rate. This command is sent to the dynamical system, which integrates (5) using a symplectic Euler scheme running at a higher rate; the command is held constant during integration. The resulting position is then sent to the simulator (Habitat), which directly teleports the agent. We discretize the action space into 28 discrete actions, each corresponding to a pair of linear and angular velocity commands. We used the same numerical values as in [9].
Note that a modification of the maximum velocity changes not only the saturation value, but also the distribution of the discrete actions; in other words, the 28 discrete actions are always scaled to fit the range of possible velocities. The dynamical model runs faster than the policy to prevent aliasing effects.
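For illustration, one symplectic Euler step of a second-order velocity model of this form could be implemented as follows; the exact parametrization and saturation handling of [9] may differ, and the parameter values are placeholders.

```python
import numpy as np

def integrate_velocity(v, v_dot, cmd, tau, zeta, dt, v_max):
    """One symplectic Euler step of the second-order velocity model
    tau^2 * v_ddot + 2*zeta*tau * v_dot + v = cmd   (cf. Eq. 5),
    applied independently to the linear and angular components; tau and zeta
    may differ for acceleration vs. braking."""
    v_ddot = (cmd - v - 2.0 * zeta * tau * v_dot) / (tau ** 2)
    v_dot = v_dot + dt * v_ddot            # update the derivative first ...
    v = v + dt * v_dot                     # ... then integrate it (symplectic)
    return np.clip(v, -v_max, v_max), v_dot

# The command is held constant between two 3 Hz policy decisions while the
# model is integrated at a higher internal rate to avoid aliasing.
```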
C.2 Corrupted environments
Corrupted environments are generated by multiplying one of the physical parameters by a constant change factor, while leaving the other parameters untouched. Such a change results in a drop of performance on one hand, and an increase of the distance to belief on the other. As mentioned in the main paper, the change factor causes a change in environment parameters which has different interpretations depending on the physical quantity and its impact on the dynamics. We unwrap Figure 4 and expose the change factor in Figure 14. In particular, we show the relationship between the change factor and the distance to belief, and its impact on SPL. We manually define suitable ranges for each corruption type (damping, max. velocity and response time), up to 0% SR, and evaluate the agents (b) and (c) from Table 3 using linearly sampled change factors within each range.
Figure 14 shows that a fixed value of the change factor can correspond to very different values of the distance to belief depending on the corrupted parameter, which motivates the use of a proxy metric such as the distance to belief. We also observe steeper SPL degradation for the variant trained without dynamics than for the dynamics-aware variant, which confirms that training a dynamics-aware agent enables adaptation to dynamics unseen during training.
Appendix D Probing future pose: variants
Our goal is to probe the existence of a plan in the latent state of the navigation agent. To do so, we collected a dataset of 500,000 navigation episodes generated by a trained agent and stored the latent states, actions and path. We split this dataset into proper train/validation/test sets (80%, 10%, 10%). During training, a random time instant $t$ is sampled in the episode, and the probing network is supervised to predict the future positions from the latent state $h_t$ at that instant. Future latent states are not provided to the probing network, which therefore cannot use new observations to improve the predicted path. The probing network is trained to minimize
$$\mathcal{L} = \sum_{k=1}^{H} \big\|\, \hat{p}_{t+k} - p_{t+k} \,\big\|_2^2, \qquad (6)$$
where $p_{t+k}$ is the ground-truth future pose and $\hat{p}_{t+k}$ the predicted pose. We used the Adam optimizer with a batch size of 64 and a prediction horizon $H$, and performed 100,000 gradient updates. We tested different architectures of the probing network:
- Linear uses a straightforward linear layer per time step to predict the future position from the initial latent state,
  $$\hat{p}_{t+k} = W_k\, h_t + b_k. \qquad (7)$$
- Linear + non-linear(action, goal) additionally exploits a non-linear embedding $\phi$ of the previous action and the goal direction (given in polar coordinates with respect to the episode start),
  $$\hat{p}_{t+k} = W_k\, [\, h_t;\ \phi(a_{t-1}, g)\,] + b_k. \qquad (8)$$
- GRU-agent uses the latent dynamics of the trained agent itself for prediction. Recall that the hidden state is updated by a GRU, cf. eq. (1), which we reproduce here in a simplified notation without gates and with only a single layer,
  $$h_{t+1} = \sigma(A\, h_t + B\, o_t), \qquad (9)$$
  where $o_t$ is a concatenation of all observation features, $A$ is the matrix modeling the latent dynamics, $B$ projects observations into the latent space, and $\sigma$ is an activation function. Yet, since future observations are not available during probing, we replace them by a transformation of the previous latent state performed by an MLP (3 layers, 1024 units, TanH activated), which compensates for the absence of observations,
  $$h_{t+k+1} = \sigma\big(A\, h_{t+k} + B\, \mathrm{MLP}(h_{t+k})\big). \qquad (12)$$
All variants use a linear projection layer to map the latent space to the predicted position .
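As an illustration, a minimal PyTorch sketch of the Linear probing variant and its training objective could look as follows; the horizon, learning rate, pose dimensionality and module names are assumptions made for the example, not the exact configuration used.

```python
import torch
import torch.nn as nn

HIDDEN, HORIZON, POSE_DIM = 1024, 10, 2  # horizon and pose dim are placeholders

class LinearProbe(nn.Module):
    """One linear layer per future time step, predicting the pose from h_t."""
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(HIDDEN, POSE_DIM) for _ in range(HORIZON)])

    def forward(self, h_t):                                  # h_t: (B, HIDDEN)
        return torch.stack([head(h_t) for head in self.heads], dim=1)  # (B, K, POSE_DIM)

probe = LinearProbe()
optim = torch.optim.Adam(probe.parameters(), lr=1e-4)        # lr is a placeholder

def training_step(h_t, future_poses):
    """h_t: latent state at a randomly sampled instant t;
    future_poses: ground-truth poses p_{t+1..t+K}, shape (B, K, POSE_DIM)."""
    pred = probe(h_t)
    loss = ((pred - future_poses) ** 2).sum(-1).mean()       # eq. (6), averaged over the batch
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```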
Appendix E Zeroing the hidden state of the agent
In Table 4 of the main paper we reported results ablating the memory of the agent, i.e. setting the hidden GRU state to zero. We argue that zeroing only makes sense if, at the same time, we reset the episode-specific coordinate frame of the agent, which defines the static goal vector and the pose inputs. This is motivated by the fact that the PointGoal task requires the agent to know where it is with respect to the goal, i.e. it needs an estimate of the (unobserved!) dynamic goal vector, which is defined in its own egocentric frame. It can only obtain this by combining the static goal vector, defined w.r.t. the episode start, with localization information provided in the same frame. The calculation is a simple rigid transform, but the noise of these inputs leads to a noisy signal. This noise can be filtered, and the hidden memory of the agent is likely used for this purpose. For this reason, zeroing memory without resetting the episode-centric coordinate frame can lead to very undesirable effects.
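The rigid transform mentioned above can be sketched in 2D as follows (symbols and frames are illustrative).

```python
import numpy as np

def dynamic_goal(static_goal_xy, agent_xy, agent_yaw):
    """Express the goal in the agent's egocentric frame, given the static
    goal and the agent pose, both expressed in the episode-start frame."""
    delta = np.asarray(static_goal_xy) - np.asarray(agent_xy)
    c, s = np.cos(-agent_yaw), np.sin(-agent_yaw)
    rot = np.array([[c, -s], [s, c]])   # rotate from world to egocentric frame
    return rot @ delta                  # dynamic goal vector

# With noisy pose inputs, this vector is noisy as well; the agent's
# recurrent memory can act as a filter over time.
```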
Appendix F Details on the planning heatmap
In order to gain deeper insights into the specific areas where our robot encounters difficulties in navigation, we generated heatmaps that visually highlight challenging locations within our environment. This allows us to focus our efforts on improving the robot's navigation in these identified areas. Below is a detailed description of the heatmap generation.
Let $s = (x, y, v, \omega)$ represent the state of the agent, where $(x, y)$ are its 2D coordinates, $v$ is the linear velocity, and $\omega$ is the angular velocity. Given a goal, we define the time to goal $T(x, y)$ as the travel time required to reach the goal from $(x, y)$. This value is computed by solving the Eikonal equation with fast marching, assuming the agent navigates at full speed in open space and slows down near walls. For each navigable point on the grid, the local speed is computed from the maximum velocity of the agent, the distance to the nearest wall, and a weighting coefficient.
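A minimal sketch of this computation using scikit-fmm, with an assumed speed profile that decays near walls (the exact profile and coefficients may differ from the ones used in the paper):

```python
import numpy as np
import skfmm

def time_to_goal(occupancy, goal_ij, v_max=1.0, w=0.5, dx=0.05):
    """Travel time to the goal via fast marching (Eikonal equation).
    occupancy: boolean grid, True at walls. The speed profile and the
    coefficients below are illustrative choices."""
    # Distance to the nearest wall, used to slow the agent near obstacles.
    wall_phi = np.where(occupancy, -1.0, 1.0)
    d_wall = np.clip(skfmm.distance(wall_phi, dx=dx), 0.0, None)

    # Speed field: full speed far from walls, slower close to them
    # (one possible form; the paper's exact profile may differ).
    speed = v_max * np.clip(d_wall / w, 0.05, 1.0)

    # Zero level set at the goal cell; walls are masked out.
    phi = np.ones_like(speed)
    phi[goal_ij] = -1.0
    phi = np.ma.MaskedArray(phi, mask=occupancy)
    return skfmm.travel_time(phi, speed, dx=dx)
```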
We introduce a cost function $C(s, a)$ for taking action $a$ in state $s$:
$C(s, a) = c_{\text{pos}}(s') + c_{\text{align}}(s') + c_{\text{slow}}(s') + c_{\text{rot}}(s') + c_{\text{col}}(s'), \quad s' = f(s, a)$,   (13)
where $s' = f(s, a)$ is the next state of the agent after taking action $a$, as given by the dynamical model $f$; the collision indicator used in $c_{\text{col}}$ equals 1 if the agent collides and 0 otherwise, and a braking-strength parameter controls how rapidly the agent can decelerate.
Each term in the cost function has a specific purpose:
- Position cost: based on the estimated time to reach the goal.
- Angle alignment: encourages alignment with the goal direction.
- Slowing near goal: slows the agent down as it approaches the goal.
- Rotation speed cost: discourages high angular velocities.
- Collision penalty: a large penalty applied if the agent collides.
This cost function is designed to balance reaching the goal quickly, maintaining alignment, slowing down near the goal, and avoiding high rotation speeds and collisions. We then replay the recorded trajectory and, knowing the action taken at every position, compute a cost-based metric at each visited position. To create a smooth, continuous heatmap, we place a Gaussian kernel at each position, centered on the agent and with a fixed standard deviation. This results in a heatmap that provides a clear spatial representation of the robot's performance across the environment.
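A sketch of the heatmap accumulation is given below; since the exact per-position metric is not reproduced here, the cost regret of the taken action with respect to the best available action is used as one plausible instantiation, and the kernel width is a placeholder.

```python
import numpy as np

def accumulate_heatmap(trajectory, actions, cost_fn, action_set,
                       grid_shape, dx=0.05, sigma=0.15):
    """Replay a recorded trajectory and build a Gaussian-smoothed heatmap.
    cost_fn(state, action) plays the role of C in eq. (13); the per-position
    metric below (cost of the taken action minus the best achievable cost)
    is an illustrative choice."""
    heat = np.zeros(grid_shape)
    ys, xs = np.mgrid[0:grid_shape[0], 0:grid_shape[1]]
    for state, taken in zip(trajectory, actions):
        metric = cost_fn(state, taken) - min(cost_fn(state, a) for a in action_set)
        # Splat a Gaussian kernel centered at the agent position (x, y).
        gx, gy = state[0] / dx, state[1] / dx
        kernel = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * (sigma / dx) ** 2))
        heat += metric * kernel
    return heat
```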
Appendix G Evaluation in the training domain
We also evaluate the three agents of Table 3 in a third setting, Sim(train): simulation without a motion model, i.e. with instantaneous velocity changes and constant velocities between time steps, and with their respective action spaces. This measures the difficulty of the training task and does not provide indications on performance in a real environment.
| Method | Sim(train) SR% | Sim(train) SPL% | Sim(train) SCT% | Sim(+dyn) SR% | Sim(+dyn) SPL% | Sim(+dyn) SCT% |
|---|---|---|---|---|---|---|
| (a) D4 | 91.6 | 76.4 | 20.4 | 29.1 | 18.1 | 2.0 |
| (b) D28-instant | 98.3 | 82.4 | 66.5 | 27.6 | 11.6 | 5.0 |
| (c) D28-dynamics | 97.6 | 82.3 | 52.2 | 97.6 | 82.3 | 52.2 |
Appendix H Evidence of Tunnel vision
We found some evidence of “tunnel vision”, by which we mean that the agent attempts strategies which a human could easily discard, even without having access to a map. This is not necessarily a problem for successful completion of the episodes, as the agent detects blockages and searches for alternatives, eventually finding the goal. However, it is not efficient and translates into lower than optimal SPL measures.
An example is shown in Figure 15. In this episode, starting at position ① and aiming for the goal at ④, the agent tries to pass along the path indicated by the red trajectory, turning into the area indicated by ③, although it is clearly visible (already from position ①) that there is no possible path between ② and ③. Although the dotted part is occluded, a human would be able to infer that it is blocked.
Appendix I Details on visual localization
As an alternative to Adaptive Monte-Carlo Localization (AMCL) [76], we experimented with a custom visual localization system, cf. Table 5 in the main paper. Here we provide more details on the setup.
For the pre-mapping part, we follow a procedure similar to the one described in [45]: a dedicated robot is driven through the environment, capturing synchronized 3D LIDAR scans, RGB images and odometry data. Both standard Simultaneous Localization And Mapping (SLAM) using Iterative Closest Point (ICP) on the LIDAR point clouds and Structure-from-Motion (SfM) matching local features in the RGB frames are used to recover the poses of all RGB frames in an absolute, unified coordinate system. An Elasticsearch database stores the 12k RGB frames, each associated with a global descriptor and with local descriptors of keypoints computed by fast-R2D2 [63].
The navigating agent can then query the visual localization system by sending an image captured by its own RGB sensor together with its last pose estimate, which locally combines the result of the last visual localization query with odometry. First, as described in [34], the global descriptor of the query image is used to quickly retrieve a set of nearest neighbors from the database. Then, local R2D2 keypoints in the query image are matched against those of the retrieved neighbors; from the resulting relative pose estimates and the absolute camera poses of the neighbors, a consensus is established on the absolute pose of the agent's camera. This pose is sent back to the agent, where it is fused into its own localization optimization graph together with the previous localization result and odometry.
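As an illustration, the first, coarse retrieval stage can be sketched as follows (descriptor layout and the value of k are placeholders; the local keypoint matching and pose consensus stages are omitted).

```python
import numpy as np

def retrieve_nearest(query_descriptor, db_descriptors, k=10):
    """Coarse retrieval step: rank database frames by cosine similarity of
    global descriptors and return the indices of the k nearest neighbors."""
    q = query_descriptor / np.linalg.norm(query_descriptor)
    db = db_descriptors / np.linalg.norm(db_descriptors, axis=1, keepdims=True)
    similarity = db @ q                      # one similarity score per database frame
    return np.argsort(-similarity)[:k]       # indices of the k most similar frames
```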
Appendix J Architecture and training details
Training – all variants are trained on a single Nvidia A100-80GB GPU for 500 million steps (24 parallel environments, 192 steps per rollout, 4 epochs per rollout, split into 2 batches). The agent is trained with PPO using a reward composed of the gain in distance to the goal, a slack term preventing the agent from stalling, and a penalty for collisions.
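A minimal sketch of such a reward; the coefficients are placeholders, not the values used for training.

```python
# Hypothetical coefficients; the actual values are not reproduced here.
SLACK_PENALTY = -0.002
COLLISION_PENALTY = -0.1

def reward(prev_dist_to_goal, dist_to_goal, collided):
    """Reward = gain in distance to the goal + slack term (discourages
    stalling) + collision penalty."""
    gain = prev_dist_to_goal - dist_to_goal
    return gain + SLACK_PENALTY + (COLLISION_PENALTY if collided else 0.0)
```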
Agent architecture – the RGB encoder is a ResNet18 with 64 base planes and 4 layers of 2 basic blocks each, and the scan encoder is a 1D CNN composed of 3 Conv-ReLU-MaxPool blocks with 64, 128 and 256 channels, kernels of sizes 7, 3 and 3, circular padding of sizes 3, 1 and 1, and pooling windows of widths 3, 3 and 5. The encoder ends with a linear layer flattening the representation to an embedding of size 512. Odometry, localization and goal are encoded by MLPs with a single hidden layer of size 1024 and an output size of 64 (ReLU activations). Angles are transformed into a cos/sin representation before being fed to the network. The action encoder is a discrete set of 29 embeddings of size 32, and the state encoder is a 2-layer GRU with a hidden state of size 1024. The policy head is a linear layer applied to the hidden state of the last GRU layer.
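For concreteness, a sketch of the scan encoder as described above; the scan resolution, and therefore the flattened size before the final linear layer, is an assumption of the example.

```python
import torch
import torch.nn as nn

class ScanEncoder(nn.Module):
    """1D CNN for LIDAR scans: 3 Conv-ReLU-MaxPool blocks (64/128/256 channels,
    kernels 7/3/3, circular padding 3/1/1, pooling 3/3/5), followed by a linear
    layer to a 512-d embedding."""
    def __init__(self, scan_points=1024):   # scan resolution is a placeholder
        super().__init__()
        def block(c_in, c_out, k, pad, pool):
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, k, padding=pad, padding_mode="circular"),
                nn.ReLU(),
                nn.MaxPool1d(pool))
        self.conv = nn.Sequential(
            block(1, 64, 7, 3, 3),
            block(64, 128, 3, 1, 3),
            block(128, 256, 3, 1, 5))
        with torch.no_grad():                # infer the flattened size from a dummy scan
            flat = self.conv(torch.zeros(1, 1, scan_points)).numel()
        self.proj = nn.Linear(flat, 512)

    def forward(self, scan):                 # scan: (B, 1, scan_points)
        return self.proj(self.conv(scan).flatten(1))
```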