See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation

CoRL 2025
National Yang Ming Chiao Tung University
National Taiwan University
*Indicates Equal Contribution
See, Point, Fly: Teaser Image

Zero-shot language-guided UAV control. SPF enables UAVs to navigate to any goal based on free-form natural language instructions in any environment, without task-specific training. The system demonstrates robust performance across diverse scenarios including obstacle avoidance, long-horizon planning, and dynamic target following.

Abstract

We present See, Point, Fly (SPF), a training-free aerial vision-and-language navigation (AVLN) framework built atop vision-language models (VLMs). SPF is capable of navigating to any goal based on any type of free-form instruction in any kind of environment. In contrast to existing VLM-based approaches that treat action prediction as a text generation task, our key insight is to consider action prediction for AVLN as a 2D spatial grounding task. SPF harnesses VLMs to decompose vague language instructions into iterative annotation of 2D waypoints on the input image. Along with the predicted traveling distance, SPF transforms predicted 2D waypoints into 3D displacement vectors as action commands for UAVs. Moreover, SPF adaptively adjusts the traveling distance to facilitate more efficient navigation. Notably, SPF performs navigation in a closed-loop control manner, enabling UAVs to follow dynamic targets in dynamic environments. SPF sets a new state of the art on the DRL simulation benchmark, outperforming the previous best method by an absolute margin of 63%. In extensive real-world evaluations, SPF outperforms strong baselines by a large margin. We also conduct comprehensive ablation studies to highlight the effectiveness of our design choices. Lastly, SPF shows remarkable generalization to different VLMs.

Method Overview

We formulate UAV navigation as an iterative target-reaching process in 3D space. At each timestep $t$, the system processes the current visual observation $I_t \in \mathbb{R}^{H\times W\times3}$ along with a natural language instruction $\ell$ to determine the next motion. Our approach leverages vision-language models (VLMs) to transform navigation instructions into interpretable waypoint decisions that can be efficiently converted into UAV control signals.

The system operates through three key stages: (1) Given $\ell$ and $I_t$, we use a VLM to produce structured spatial understanding, predicting 2D waypoints and step sizes; (2) We transform each predicted 2D waypoint into a 3D displacement vector, yielding executable low-level commands; (3) A lightweight reactive controller continuously updates observations and executes motion commands in a closed-loop manner.
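To make the control cycle concrete, the following is a minimal Python sketch of this closed loop under stated assumptions: all names (`query_vlm`, `waypoint_to_displacement`, the `drone` interface) are hypothetical placeholders for illustration, not the released API.

```python
# Minimal sketch of the closed-loop navigation cycle described above.
# `query_vlm`, `waypoint_to_displacement`, and the `drone` interface are
# hypothetical placeholders, not the authors' released implementation.
import time

def navigate(drone, instruction: str, max_steps: int = 100):
    """Iteratively query the VLM on the latest frame and execute the result."""
    for _ in range(max_steps):
        frame = drone.get_frame()                 # current egocentric image I_t
        plan = query_vlm(frame, instruction)      # structured plan: (u, v, d_vlm, done)
        if plan.done:                             # VLM signals the goal is reached
            break
        s_xyz = waypoint_to_displacement(plan, frame.shape)  # 2D waypoint -> 3D displacement
        drone.move(*s_xyz)                        # low-level motion command
        time.sleep(0.1)                           # let the observation update before re-planning
```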

SPF Pipeline Overview

Pipeline Overview: A camera frame and user instructions enter a frozen vision-language model, which returns structured JSON with 2D waypoints and obstacle information. An Action-to-Control layer converts this into low-level velocity commands for the UAV.

VLM-based Obstacle-Aware Action Planning

We frame this stage as a structured visual grounding task, where the VLM $G$ processes an egocentric UAV camera observation $I_t$ alongside a natural language instruction $\ell$. The VLM outputs a spatial plan $O_t = \{u, v, d_\text{VLM}\}$ specifying a navigation target in image space, where $(u, v)$ are pixel coordinates and $d_\text{VLM} \in \{1, 2, \ldots, L\}$ is a discretized depth label representing the intended travel distance.

The key insight is that $d_\text{VLM}$ represents the VLM's prediction of the appropriate movement magnitude along the UAV's forward direction, rather than a metric depth measurement from a sensor. When obstacle avoidance is activated, the VLM generates waypoints that guide the UAV toward the goal while avoiding detected obstacles, enabling safe navigation through cluttered environments.
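As an illustration, the structured plan can be parsed from the VLM's reply as sketched below. The JSON field names (`waypoint`, `distance`) and the number of distance levels are assumptions for illustration only; the paper specifies that the plan contains pixel coordinates $(u, v)$ and a discrete distance label $d_\text{VLM} \in \{1, \ldots, L\}$.

```python
# Sketch of parsing the VLM's structured spatial plan O_t = (u, v, d_vlm).
# The JSON schema below ("waypoint", "distance") is an assumed example format.
import json
from dataclasses import dataclass

L_LEVELS = 10  # assumed number of discrete distance labels

@dataclass
class SpatialPlan:
    u: int       # waypoint column in pixels
    v: int       # waypoint row in pixels
    d_vlm: int   # discrete distance label, 1..L

def parse_plan(vlm_text: str) -> SpatialPlan:
    """Extract the spatial plan from the VLM's JSON reply."""
    data = json.loads(vlm_text)
    u, v = data["waypoint"]              # predicted 2D target in image space
    d_vlm = int(data["distance"])        # intended travel distance label
    if not 1 <= d_vlm <= L_LEVELS:
        raise ValueError(f"distance label {d_vlm} outside 1..{L_LEVELS}")
    return SpatialPlan(u=u, v=v, d_vlm=d_vlm)
```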

Adaptive Travel Distance and 3D Transformation

To address the limitation that VLMs may lack precise understanding of real-world 3D geometry, we employ an adaptive scaling approach. The discrete depth label $d_\text{VLM}$ is converted into an adjusted step size using a non-linear scaling curve:

$$d_\text{adj} = \max \left( d_{\min},\; s \times \left( \frac{d_\text{VLM}}{L} \right)^p \right)$$

This enables the UAV to take larger steps in open areas while executing smaller, more cautious movements near targets and obstacles. The predicted 2D waypoint $(u, v)$ and adjusted depth $d_\text{adj}$ are then unprojected through the camera model to obtain a 3D displacement vector $(S_x, S_y, S_z)$, which is decomposed into control primitives: yaw, pitch, and throttle commands.
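The sketch below illustrates the adaptive scaling rule and the subsequent unprojection under a pinhole camera model. The scaling constants and camera intrinsics are placeholder values rather than the paper's settings, and the yaw/pitch/throttle decomposition assumes the standard camera convention (x right, y down, z forward).

```python
# Sketch of the adaptive step-size rule and the 2D -> 3D conversion described
# above, assuming a pinhole camera. Constants (s, p, d_min, intrinsics) are
# placeholder values, not the paper's configuration.
import numpy as np

def adaptive_step(d_vlm: int, L: int = 10, s: float = 3.0,
                  p: float = 2.0, d_min: float = 0.3) -> float:
    """d_adj = max(d_min, s * (d_vlm / L) ** p): large steps in open space,
    small cautious steps when the predicted distance label is low."""
    return max(d_min, s * (d_vlm / L) ** p)

def unproject(u: float, v: float, d_adj: float,
              fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project the 2D waypoint (u, v) at forward distance d_adj into a
    3D displacement (S_x, S_y, S_z) in the camera frame."""
    x = (u - cx) / fx * d_adj
    y = (v - cy) / fy * d_adj
    return np.array([x, y, d_adj])

def to_controls(S: np.ndarray) -> dict:
    """Decompose the displacement into yaw, forward (pitch), and throttle."""
    yaw = float(np.arctan2(S[0], S[2]))    # horizontal bearing to the waypoint
    forward = float(np.hypot(S[0], S[2]))  # distance to travel after yawing
    throttle = float(-S[1])                # image y points down, so climb is -y
    return {"yaw_rad": yaw, "forward_m": forward, "throttle_m": throttle}
```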

By outsourcing high-level spatial reasoning to the VLM and employing a lightweight geometric controller, our method achieves robust zero-shot UAV navigation directly from language—without relying on skill libraries, external depth sensors, policy optimization, or model training.

SPF Method Details

Method Details: Our approach transforms 2D waypoint predictions into 3D displacement vectors through geometric projection, with adaptive step-size control for efficient navigation.

Quantitative Results

We evaluated SPF in both high-fidelity simulation and real-world environments across six task categories: Navigation, Obstacle Avoidance, Long Horizon, Reasoning, Search, and Follow. Our method significantly outperforms existing approaches, achieving a 93.9% success rate in simulation and 92.7% in real-world experiments, compared to TypeFly (0.9% in simulation, 23.6% in the real world) and PIVOT (28.7% and 5.5%, respectively).

SPF demonstrates superior performance in complex scenarios requiring spatial reasoning and planning, with perfect navigation performance in simulation (100%) and strong capabilities in obstacle avoidance (92% vs. 16% for PIVOT), long-horizon tasks (92% vs. 28% for PIVOT), and search tasks (92% vs. 36% for PIVOT).

Real-world experiments confirmed our method's effectiveness with exceptional performance in reasoning tasks (100% success rate) and robust navigation across diverse environments.

Method     | Navigation | Obstacle Avoidance | Long Horizon | Reasoning | Search / Follow | Overall
Simulation
TypeFly    | 1/25       | 0/25               | 0/25         | 0/15      | 0/25            | 0.9%
PIVOT      | 11/25      | 4/25               | 7/25         | 2/15      | 9/25            | 28.7%
SPF (Ours) | 25/25      | 23/25              | 23/25        | 14/15     | 23/25           | 93.9%
Real-world
TypeFly    | 1/5        | 3/10               | 5/10         | 2/20      | 2/10            | 23.6%
PIVOT      | 0/5        | 1/10               | 0/10         | 2/20      | 0/10            | 5.5%
SPF (Ours) | 5/5        | 7/10               | 9/10         | 20/20     | 10/10           | 92.7%

Success counts and overall success rate (%) across task categories: Our framework significantly outperforms the TypeFly and PIVOT baselines in both high-fidelity simulation and real-world DJI Tello experiments. Note that Search tasks were evaluated exclusively in simulation, while Follow tasks were tested only in real-world settings due to environment constraints.

Qualitative Comparison

Simulator Results

Simulator Obstacle Avoidance
Obstacle Avoidance
Simulator Pattern Search
Pattern Search
Simulator Target Identification
Target Identification

■ Our Method | ■ PIVOT | ■ TypeFly

Qualitative comparison of flight trajectories in the simulator: The absence of a colored path indicates the baseline failed to issue any fly command. Full prompts and videos are included below.


Real-World Results

Obstacle Avoidance Task

Our Method - Obstacle Avoidance PIVOT - Obstacle Avoidance
Our Method - Obstacle Avoidance TypeFly - Obstacle Avoidance

Reasoning Task

Our Method - Reasoning Task PIVOT - Reasoning Task
Our Method - Reasoning Task TypeFly - Reasoning Task

Following Task

Our Method - Following Task PIVOT - Following Task
Our Method - Following Task TypeFly - Following Task

Long Horizon Task

Our Method - Long Horizon Task PIVOT - Long Horizon Task
Our Method - Long Horizon Task TypeFly - Long Horizon Task

■ Take Off Trajectory | ■ Task Trajectory

Qualitative comparison of flight trajectories in the real-world: SPF demonstrates robust performance across diverse real-world scenarios, successfully completing complex tasks that require spatial reasoning, obstacle avoidance, and long-term planning. Full prompts and videos are included below.


Prompts and Videos

Ablation Studies

We conducted comprehensive ablation studies to evaluate the effectiveness of each component in our framework. Our analysis covers different VLM backends, prompting strategies, and the impact of adaptive step-size control.

VLM Backend and Prompting Strategy

Method     | Action Prediction    | VLM Model             | Success Rate (%)
Plain VLM  | Text Generation      | Gemini 2.0 Flash      | 7
PIVOT      | Visual Prompting     | Gemini 2.0 Flash      | 40
SPF (Ours) | 2D Waypoint Labeling | Gemini 2.0 Flash-Lite | 87
SPF (Ours) | 2D Waypoint Labeling | Gemini 2.0 Flash      | 100
SPF (Ours) | 2D Waypoint Labeling | Gemini 2.5 Pro        | 100
SPF (Ours) | 2D Waypoint Labeling | GPT-4.1               | 100
SPF (Ours) | 2D Waypoint Labeling | Claude 3.7 Sonnet     | 93.3
SPF (Ours) | 2D Waypoint Labeling | Llama 4 Maverick      | 93.3

Structured Prompting and Grounding: We compared three VLM-based action prediction approaches: our method (prompting the VLM to label 2D waypoints on images), plain VLM (predicting actions as text), and PIVOT (selecting from candidate 2D points on images). On navigation tasks, our approach significantly outperforms the alternatives, achieving a 100% success rate versus just 7% for plain VLM and 40% for PIVOT, demonstrating the effectiveness of our structured visual grounding formulation.

VLM Generalization: Our method performs robustly across multiple VLMs: Gemini 2.5 Pro, Gemini 2.0 Flash, and GPT-4.1 all achieved 100% success rate; Claude 3.7 Sonnet and Llama 4 Maverick reached 93.3%; and even Gemini 2.0 Flash-Lite achieved 87%. This demonstrates our framework's effective generalization across vision-language models of varying capabilities.

Adaptive Step-Size Controller

Task                                                               | Step Size | Completion Time | Success Rate
"Fly to the cones and the next."                                   | Fixed     | 61 s            | 5/5
"Fly to the cones and the next."                                   | Adaptive  | 28 s            | 5/5
"I'm thirsty. Find something that can help me."                    | Fixed     | 50.25 s         | 4/5
"I'm thirsty. Find something that can help me."                    | Adaptive  | 35.20 s         | 5/5
"It's raining. Head to the comfiest chair that will keep you dry." | Fixed     | 47 s            | 5/5
"It's raining. Head to the comfiest chair that will keep you dry." | Adaptive  | 30 s            | 5/5
Completion Time Comparison
Completion Time Analysis

Adaptive Travel Distance Scaling: The proposed adaptive distance scaling significantly reduces travel time while maintaining navigation performance, lowering the average completion time from 50.25 to 35.20 seconds. The results are presented in the table above; we refer to the supplementary material for further details on the experimental setup.

BibTeX

@inproceedings{hu2025spf,
	title        = {See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation},
	author       = {Chih Yao Hu and Yang-Sen Lin and Yuna Lee and Chih-Hai Su and Jie-Ying Lee and Shr-Ruei Tsai and Chin-Yang Lin and Kuan-Wen Chen and Tsung-Wei Ke and Yu-Lun Liu},
	year         = 2025,
	booktitle    = {9th Annual Conference on Robot Learning},
	url          = {https://openreview.net/forum?id=AE299O0tph}
}