Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

University College London, University of Glasgow, University of Bath

Motivation and Research Question

Comparison between 360-degree panoramic image-text pairs and conventional perspective image-text pairs.

Motivation: 360-degree panoramic image-text pairs exhibit distinct semantic attributes in both the textual and visual modalities compared to perspective image-text pairs. Textually, prompts for 360-degree panoramic images often include explicit format identifiers such as “a 360 degree view of” or “360 photo”, which convey what we define as 360-degree textual semantics. Visually, 360-degree panoramic images capture a complete spherical field of view (360° × 180°), which makes them semantically invariant under horizontal circular shifts: the scene content remains identical regardless of rotation. We term this invariant semantics 360-degree visual semantics.
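
To make this invariance concrete, below is a minimal sketch of a horizontal circular shift applied to an equirectangular image (NumPy; the array shapes are illustrative):

```python
import numpy as np

def circular_shift(pano: np.ndarray, delta: int) -> np.ndarray:
    """Horizontally roll an equirectangular panorama by `delta` pixels.

    For a full 360-degree panorama of shape (H, W, C), this corresponds to
    rotating the camera about the vertical axis: the depicted scene is
    unchanged, only the seam moves.
    """
    return np.roll(pano, shift=delta, axis=1)

# A shift of W // 2 rotates the viewpoint by 180 degrees, yet the panorama
# still depicts exactly the same scene; shifting by W recovers the original.
pano = np.random.rand(512, 1024, 3)  # stand-in for a real panorama
rotated = circular_shift(pano, delta=512)
assert np.allclose(circular_shift(rotated, delta=512), pano)
```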

Research Question: To what extent can standard CLIP models, predominantly trained on perspective image-text pairs, comprehend the distinct semantics inherent in 360-degree panoramic image-text pairs?

Probing Understanding of 360-Degree Textual Semantics

Overview of our framework to evaluate CLIP models’ understanding of 360-degree textual semantics. The format cue V* is a keyword explicitly identifying the 360-degree panoramic image format (e.g., “360 panorama”, “360 photo”), while U* is a generic cue (e.g., “photo”, “image”) that lacks specific 360-degree panoramic format information.

Statistical hypothesis test:
$H_0\colon ss_u \le 0$, where $ss_u$ denotes the similarity-score gain of the format cue $V^*$ over the generic cue $U^*$; rejecting $H_0$ indicates that the format cue systematically yields higher image-text similarity.
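
A minimal sketch of this probe, assuming $ss_u$ is the per-image similarity difference between the $V^*$ and $U^*$ prompts; the open_clip usage and the prompt templates are illustrative, not necessarily the paper's exact setup:

```python
import torch
import open_clip  # one common CLIP implementation; any CLIP checkpoint works

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def clip_score(image, text: str) -> float:
    """Cosine similarity between one PIL image and one caption."""
    img_f = model.encode_image(preprocess(image).unsqueeze(0))
    txt_f = model.encode_text(tokenizer([text]))
    img_f = torch.nn.functional.normalize(img_f, dim=-1)
    txt_f = torch.nn.functional.normalize(txt_f, dim=-1)
    return (img_f * txt_f).sum().item()

def score_gain(pano_image, scene: str) -> float:
    """ss_u: how much better the format cue V* fits than the generic cue U*."""
    s_v = clip_score(pano_image, f"a 360 panorama of {scene}")  # V* cue
    s_u = clip_score(pano_image, f"a photo of {scene}")         # U* cue
    return s_v - s_u
```

Aggregating $ss_u$ over a set of panoramas and applying a one-sided test against $H_0$ then quantifies whether the format cue reliably raises the similarity.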

Finding: CLIP models can comprehend 360-degree textual semantics.

Probing Understanding of 360-Degree Visual Semantics

Overview of our framework to assess CLIP models’ understanding of 360-degree visual semantics. Image $I_\delta$ is obtained by applying a horizontal circular shift of $\delta$ pixels to an image $I$ of size $H \times W$.

Statistical hypothesis test:
$H_{0,j}\colon |ss_{\delta_j}| \ge \beta$, where $ss_{\delta_j}$ is the change in similarity score induced by a circular shift of $\delta_j$ pixels; rejecting $H_{0,j}$ means the score is stable under that shift.
Definition of the stability bound:
$\beta = Q_3 + 1.5 \times \mathrm{IQR}$.
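
A sketch of the probe under these definitions; `clip_score` is any image-text scoring function (e.g., the one sketched above, after converting arrays to PIL images), and we assume $\beta$ is computed from a reference population of absolute score shifts:

```python
import numpy as np

def score_shifts(clip_score, pano: np.ndarray, caption: str, deltas) -> np.ndarray:
    """ss_delta for each shift: change in similarity relative to the unshifted image."""
    base = clip_score(pano, caption)
    return np.array([
        clip_score(np.roll(pano, shift=d, axis=1), caption) - base
        for d in deltas
    ])

def stability_bound(abs_shifts: np.ndarray) -> float:
    """beta = Q3 + 1.5 * IQR over a reference population of |ss_delta| values."""
    q1, q3 = np.percentile(abs_shifts, [25, 75])
    return q3 + 1.5 * (q3 - q1)

# Rejecting H0_j requires |ss_delta_j| < beta, i.e., the similarity score
# barely moves under the shift; frequent failures to reject indicate that
# the model is not invariant to horizontal rotations of the panorama.
```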

Finding: CLIP models lack a robust understanding of 360-degree visual semantics.

Improving Comprehension of 360-Degree Visual Semantics

Overview of the fine-tuning framework using LoRA.
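
A minimal sketch of attaching LoRA adapters to CLIP, here via Hugging Face transformers and peft; the checkpoint, rank, scaling, and target modules are illustrative assumptions, not the paper's configuration:

```python
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

base = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension (illustrative)
    lora_alpha=16,                        # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections inside CLIP
)
model = get_peft_model(base, lora_cfg)    # base weights stay frozen
model.print_trainable_parameters()        # only the low-rank adapters train
```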

Loss function:
$\mathcal{L}_{\mathrm{FT}} = \lambda \cdot \mathcal{L}_{\mathrm{charb}}(s_{\Delta\theta}, s_{\theta}) + (1 - \lambda) \cdot \mathcal{L}_{\mathrm{charb}}(s_{\Delta\theta}, s)$.
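
A sketch of this objective in PyTorch. The mapping of symbols to quantities is our reading, stated as an assumption: $s_{\Delta\theta}$ as the adapted model's scores on shifted panoramas, $s_{\theta}$ as its scores on the unshifted panoramas (the invariance term), and $s$ as the frozen original CLIP's scores (the anchor term that limits drift from the baseline):

```python
import torch

def charbonnier(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Charbonnier loss, a smooth variant of L1: mean of sqrt((x - y)^2 + eps^2)."""
    return torch.sqrt((x - y) ** 2 + eps ** 2).mean()

def finetune_loss(s_shifted: torch.Tensor,    # assumed s_{Delta theta}
                  s_unshifted: torch.Tensor,  # assumed s_{theta}
                  s_frozen: torch.Tensor,     # assumed s (frozen CLIP)
                  lam: float = 0.5) -> torch.Tensor:
    """L_FT = lam * L_charb(shifted, unshifted) + (1 - lam) * L_charb(shifted, frozen)."""
    return (lam * charbonnier(s_shifted, s_unshifted)
            + (1.0 - lam) * charbonnier(s_shifted, s_frozen))
```

Under this reading, $\lambda$ controls the trade-off noted in the finding below: a larger $\lambda$ pushes harder toward shift invariance, while a smaller $\lambda$ keeps the adapted model closer to the original CLIP's behavior.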

Finding: Fine-tuning that instills shift invariance can improve comprehension of 360-degree visual semantics, while incurring a slight degradation in baseline performance, highlighting the trade-off inherent in adapting CLIP for 360-degree panoramas.

BibTeX

TBD