2025-05-23
REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders
Savya Khosla et al.
2505.18153v1 |
null |
2025-05-23 |
WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions |
Zizhang Li et al.
2505.18151v1 |
null |
2025-05-23 |
TokBench: Evaluating Your Visual Tokenizer before Visual Generation |
Junfeng Wu et al.
2505.18142v1 |
null |
2025-05-23 |
Boosting Open Set Recognition Performance through Modulated Representation Learning |
Amit Kumar Kundu et al.
2505.18137v1 |
null |
2025-05-23 |
VideoGameBench: Can Vision-Language Models complete popular video games? |
Alex L. Zhang et al.
2505.18134v1 |
null |
2025-05-23 |
BiggerGait: Unlocking Gait Recognition with Layer-wise Representations from Large Vision Models |
Dingqing Ye et al.
2505.18132v1 |
null |
2025-05-23 |
One RL to See Them All: Visual Triple Unified Reinforcement Learning |
Yan Ma et al.
2505.18129v1 |
null |
2025-05-23 |
Multi-Modal Spectral Parametrization Method (MMSPM) for analyzing EEG activity with distinct scaling regimes |
Frigyes Samuel Racz et al.
2505.18117v1 |
null |
2025-05-23 |
Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion |
Jacob Hansen et al.
2505.18115v1 |
null |
2025-05-23 |
From Temporal to Spatial: Designing Spatialized Interactions with Segmented-audios in Immersive Environments for Active Engagement with Performing Arts Intangible Cultural Heritage |
Yuqi Wang et al.
2505.18112v1 |
null |
2025-05-23 |
Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking |
Cheng-Yen Yang et al.
2505.18111v1 |
null |
2025-05-23 |
Accelerating Learned Image Compression Through Modeling Neural Training Dynamics |
Yichi Zhang et al.
2505.18107v1 |
null |
2025-05-23 |
F-ANcGAN: An Attention-Enhanced Cycle Consistent Generative Adversarial Architecture for Synthetic Image Generation of Nanoparticles |
Varun Ajith et al.
2505.18106v1 |
null |
2025-05-23 |
Towards more transferable adversarial attack in black-box manner |
Chun Tong Lei et al.
2505.18097v1 |
null |
2025-05-23 |
DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations |
Ziqiao Peng et al.
2505.18096v1 |
null |
2025-05-23 |
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays |
Hyungyung Lee et al.
2505.18087v1 |
null |
2025-05-23 |
The Noether formalism for constructing conserved quantities in teleparallel equivalents of general relativity |
E. D. Emtsova et al.
2505.18084v1 |
null |
2025-05-23 |
Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding |
Xiaoyi Zhang et al.
2505.18079v1 |
null |
2025-05-23 |
DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation |
Junhao Chen et al.
2505.18078v1 |
null |
2025-05-23 |
Semantic Correspondence: Unified Benchmarking and a Strong Baseline |
Kaiyan Zhang et al.
2505.18060v1 |
link |
2025-05-23 |
A Foundation Model Framework for Multi-View MRI Classification of Extramural Vascular Invasion and Mesorectal Fascia Invasion in Rectal Cancer |
Yumeng Zhang et al.
2505.18058v1 |
null |
2025-05-23 |
FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation |
Zherui Zhang et al.
2505.18053v1 |
null |
2025-05-23 |
BOTM: Echocardiography Segmentation via Bi-directional Optimal Token Matching |
Zhihua Liu et al.
2505.18052v1 |
null |
2025-05-23 |
LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision |
Anthony Fuller et al.
2505.18051v1 |
null |
2025-05-23 |
SpikeGen: Generative Framework for Visual Spike Stream Processing |
Gaole Dai et al.
2505.18049v1 |
null |
2025-05-23 |
SHARDeg: A Benchmark for Skeletal Human Action Recognition in Degraded Scenarios |
Simon Malzard et al.
2505.18048v1 |
null |
2025-05-23 |
RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration |
Sudarshan Rajagopalan et al.
2505.18047v1 |
null |
2025-05-23 |
Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation |
Li Zhong et al.
2505.18039v1 |
null |
2025-05-23 |
CAMME: Adaptive Deepfake Image Detection with Multi-Modal Cross-Attention |
Naseem Khan et al.
2505.18035v1 |
null |
2025-05-23 |
Mahalanobis++: Improving OOD Detection via Feature Normalization |
Maximilian Mueller et al.
2505.18032v1 |
null |