We formulate UAV navigation as an iterative target-reaching process in 3D space. At each timestep $t$, the system processes the current visual observation $I_t \in \mathbb{R}^{H\times W\times3}$ along with a natural language instruction $\ell$ to determine the next motion. Our approach leverages vision-language models (VLMs) to transform navigation instructions into interpretable waypoint decisions that can be efficiently converted into UAV control signals.
The system operates through three key stages: (1) Given $\ell$ and $I_t$, we use a VLM to produce structured spatial understanding, predicting 2D waypoints and step sizes; (2) We transform the predicted 2D waypoint into a 3D displacement vector, yielding executable low-level commands; (3) A lightweight reactive controller continuously updates observations and executes motion commands in a closed-loop manner.
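A minimal sketch of this closed loop is shown below. The callables (`capture_frame`, `query_vlm`, `to_displacement`, `send_velocity`) are hypothetical placeholders for the three stages, injected by the caller; they are not part of a released implementation.

```python
import time

def navigation_loop(instruction, capture_frame, query_vlm, to_displacement,
                    send_velocity, max_steps=100, period_s=0.1):
    """Closed-loop sketch of the pipeline: observe, plan, convert, execute.

    The callables are hypothetical placeholders standing in for the VLM
    query, the 2D-to-3D conversion, and the low-level controller
    described in this section.
    """
    for _ in range(max_steps):
        image = capture_frame()               # current egocentric frame I_t
        plan = query_vlm(image, instruction)  # stage 1: structured plan (u, v, d_VLM)
        motion = to_displacement(plan)        # stage 2: 2D waypoint -> 3D displacement
        send_velocity(motion)                 # stage 3: closed-loop command execution
        time.sleep(period_s)                  # re-observe before the next iteration
```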

Pipeline Overview: A camera frame and user instructions enter a frozen vision-language model, which returns structured JSON with 2D waypoints and obstacle information. An Action-to-Control layer converts this into low-level velocity commands for the UAV.
VLM-based Obstacle-Aware Action Planning
We frame this stage as a structured visual grounding task, where the VLM $G$ processes an egocentric UAV camera observation $I_t$ alongside a natural language instruction $\ell$. The VLM outputs a spatial plan $O_t = \{u, v, d_\text{VLM}\}$ specifying a navigation target in image space, where $(u, v)$ are pixel coordinates and $d_\text{VLM} \in \{1, 2, \ldots, L\}$ is a discretized depth label representing the intended travel distance.
The key insight is that $d_\text{VLM}$ represents the VLM's prediction of the appropriate movement magnitude along the UAV's forward direction, rather than a sensor-based depth measurement. When obstacle avoidance is activated, the VLM generates waypoints that guide the UAV toward the goal while avoiding detected obstacles, enabling safe navigation through cluttered environments.
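As a concrete illustration, the sketch below parses a structured response of this form into the spatial plan $O_t$. The JSON field names (`waypoint`, `depth_label`, `obstacles`) are assumptions for illustration, not the exact schema used by the system.

```python
import json
from dataclasses import dataclass, field

@dataclass
class SpatialPlan:
    u: int                        # horizontal pixel coordinate of the waypoint
    v: int                        # vertical pixel coordinate of the waypoint
    d_vlm: int                    # discretized depth label in {1, ..., L}
    obstacles: list = field(default_factory=list)  # obstacle notes from the VLM

def parse_vlm_output(raw_text: str, num_levels: int = 5) -> SpatialPlan:
    """Parse the VLM's JSON response into a spatial plan O_t = {u, v, d_VLM}."""
    data = json.loads(raw_text)
    u, v = data["waypoint"]                    # 2D target in image coordinates
    d_vlm = int(data["depth_label"])
    d_vlm = max(1, min(num_levels, d_vlm))     # clamp to the valid label range
    return SpatialPlan(u=u, v=v, d_vlm=d_vlm,
                       obstacles=data.get("obstacles", []))
```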
Adaptive Travel Distance and 3D Transformation
To address the limitation that VLMs may lack precise understanding of real-world 3D geometry, we employ an adaptive scaling approach: the discrete depth label $d_\text{VLM}$ is converted into an adjusted step size $d_\text{adj}$ using a non-linear scaling curve.
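Since the exact curve is not pinned down here, the sketch below uses one plausible choice, a convex power-law mapping from the discrete label to a step size in meters; the constants `d_min`, `d_max`, and the exponent `gamma` are illustrative assumptions.

```python
def adjust_step_size(d_vlm: int, num_levels: int = 5,
                     d_min: float = 0.5, d_max: float = 4.0,
                     gamma: float = 2.0) -> float:
    """Map a discrete depth label d_VLM in {1, ..., L} to a step size in meters.

    A convex (gamma > 1) curve keeps steps short for small labels (near
    targets and obstacles) and grows them quickly for large labels (open
    space). The specific curve and constants are illustrative, not the
    system's exact parameters.
    """
    frac = (d_vlm - 1) / (num_levels - 1)           # normalize label to [0, 1]
    return d_min + (d_max - d_min) * frac ** gamma  # non-linear scaling to meters
```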
This enables the UAV to take larger steps in open areas while executing smaller, more cautious movements near targets and obstacles. The predicted 2D waypoint $(u, v)$ and adjusted depth $d_\text{adj}$ are then unprojected through the camera model to obtain a 3D displacement vector $(S_x, S_y, S_z)$, which is decomposed into control primitives: yaw, pitch, and throttle commands.
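A minimal sketch of this unprojection is given below, assuming a pinhole camera with known intrinsics $(f_x, f_y, c_x, c_y)$ and a camera frame with x right, y down, z forward; the final mapping of the displacement to yaw, pitch, and throttle commands is one illustrative convention rather than the method's exact control law.

```python
import math

def waypoint_to_commands(u: float, v: float, d_adj: float,
                         fx: float, fy: float, cx: float, cy: float):
    """Unproject a 2D waypoint and adjusted depth into a 3D displacement,
    then decompose it into yaw / pitch / throttle primitives.

    Assumes a pinhole camera model; the decomposition below (yaw turns
    toward the waypoint, pitch drives forward translation, throttle
    drives vertical translation) is an illustrative convention.
    """
    # Pinhole unprojection: scale the normalized ray by the adjusted depth.
    s_x = (u - cx) / fx * d_adj        # lateral displacement (right positive)
    s_y = (v - cy) / fy * d_adj        # vertical displacement (down positive)
    s_z = d_adj                        # forward displacement along the optical axis

    yaw = math.atan2(s_x, s_z)         # heading change toward the waypoint
    forward = math.hypot(s_x, s_z)     # forward translation, commanded via pitch
    vertical = -s_y                    # vertical translation, commanded via throttle
    return (s_x, s_y, s_z), (yaw, forward, vertical)
```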
By outsourcing high-level spatial reasoning to the VLM and employing a lightweight geometric controller, our method achieves robust zero-shot UAV navigation directly from language—without relying on skill libraries, external depth sensors, policy optimization, or model training.

Method Details: Our approach transforms 2D waypoint predictions into 3D displacement vectors through geometric projection, with adaptive step-size control for efficient navigation.