Currently on the Job Market

I am currently on the job market, seeking research positions in academia and industry, especially roles related to large language models. Please feel free to contact me for more information.

I am a fourth-year Ph.D. candidate at Fudan University, supervised by Prof. Wenqiang Zhang. I received my B.E. degree in Information Security from Fudan University in 2022.

I have published about 30 papers at top-tier venues, including 10 as first author or co-first author, with 1,000+ citations in total.

My research interests include:

  • Agents
  • Multimodal Large Language Models
  • Computer Vision
  • Video Understanding

πŸ”₯ News

  • 2026.05: πŸŽ‰πŸŽ‰ Two papers are accepted by ICML 2026.

  • 2026.01: πŸŽ‰πŸŽ‰ One paper is accepted by ICLR 2026.

  • 2025.11: πŸŽ‰πŸŽ‰ One paper is accepted by AAAI 2026.

  • 2025.09: πŸŽ‰πŸŽ‰ One paper is accepted by NeurIPS 2025.

  • 2025.09: πŸŽ‰πŸŽ‰ LVOS V2 is accepted by T-PAMI 2025.

  • 2025.07: πŸŽ‰πŸŽ‰ I’m organizing the 7th Large-Scale Video Object Segmentation (LSVOS) Challenge! Welcome to attend!

  • 2025.06: πŸŽ‰πŸŽ‰ One paper is accepted by ICCV 2025. Congratulations to all co-authors!

  • 2025.01: πŸŽ‰πŸŽ‰ One paper is accepted by ICLR 2025.

  • 2024.09: πŸŽ‰πŸŽ‰ One paper is accepted by NeurIPS 2024.

  • 2024.08: πŸŽ‰πŸŽ‰ Two papers are accepted by ACM MM 2024.

  • 2024.07: πŸŽ‰πŸŽ‰ Two papers are accepted by ECCV 2024.

  • 2024.07: πŸŽ‰πŸŽ‰ I’m organizing the 6th Large-Scale Video Object Segmentation (LSVOS) Challenge! Welcome to attend!

  • 2024.04: πŸŽ‰πŸŽ‰ LVOS V2 has been released! Welcome to follow!

  • 2024.03: πŸŽ‰πŸŽ‰ One paper is accepted by CVPR 2024 and selected as a Highlight! Congratulations to all co-authors!

  • 2023.09: πŸŽ‰πŸŽ‰ One paper is accepted by NeurIPS 2023.

  • 2023.08: πŸŽ‰πŸŽ‰ Three papers are accepted by ACM MM 2023. Congratulations to all co-authors!

  • 2023.07: πŸŽ‰πŸŽ‰ LVOS is accepted by ICCV 2023.

  • 2022.11: πŸŽ‰πŸŽ‰ LVOS (the first long-term video object segmentation benchmark) is now public!

πŸ“ Publications

πŸ€– Multimodal Large Language Model

arXiv 2026

REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

Zheng Chu*, Xiao Wang*, Jack Hong*, Huiming Fan, Yuqi Huang, Yue Yang, Guohai Xu, Shengchao Hu, Dongdong Kuang, Chenxiao Zhao, Cheng Xiang, Ming Liu, Bing Qin, Xing Yu

[Paper] [Arxiv] [Homepage] [Github]

  • The first to match the performance of Gemini 3 Pro on complex multimodal search tasks.
  • Scalable task synthesis via graph-structured reasoning with topological complexity control.
  • Cost-efficient training via mid-training of core search-agent subskills.
  • SOTA performance across both text-only and multimodal benchmarks.
ICLR 2026

DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong*, Chenxiao Zhao*, ChengLin Zhu*, Weiheng Lu, Guohai Xu, Xing Yu

[Paper] [Arxiv] [Homepage] [Github]

  • The first to unify code execution and web search within a single reasoning loop, like o3.
  • We construct a carefully curated training corpus through rigorous data filtering and cleaning.
  • DeepEyesV2 shows strong reasoning and tool-use ability.
  • We analyze the dynamics of tool-use behavior in DeepEyesV2, revealing task-adaptive patterns.
  • We find that reinforcement learning enables more complex tool combinations and adaptive, context-aware tool invocation.
ICLR 2026

DeepEyes: Incentivizing β€œThinking with Images” via Reinforcement Learning

Ziwei Zheng*, Michael Yang*, Jack Hong*, Chenxiao Zhao*, Guohai Xu, Le Yang, Chao Shen, Xing Yu

[Paper] [Arxiv] [Homepage] [Github]

  • The first to β€œthink with images” like o3.
  • We incentivize the ability to β€œthink with images” via end-to-end reinforcement learning, without requiring a cold start.
  • We reveal intriguing RL training dynamics, where active perception behavior undergoes distinct stages, evolving from initial exploration to efficient and accurate exploitation.
  • We observe diverse reasoning patterns, such as visual search, comparison, and confirmation.
ICLR 2026

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie

[Paper] [Arxiv] [Homepage] [Github]

  • The first benchmark tailored for evaluating MLLMs’ omni-modal video understanding ability.
  • WorldSense features integrated audio-visual inputs, diverse content, and high-quality question-answering annotations.
  • We expose a significant gap in real-world omni-modal reasoning.
  • We identify the key factors influencing omni-modal understanding.

🧭 Visual Object Tracking and Segmentation

ICCV 2025

General Compression Framework for Efficient Transformer Object Tracking

Lingyi Hong, Jinglun Li, Xinyu Zhou, Shilin Yan, Pinxue Guo, Kaixun Jiang, Zhaoyu Chen, Shuyong Gao, Runze Li, Xingdong Sheng, Wei Zhang, Hong Lu, Wenqiang Zhang

[Paper] [Github]

  • A general compression framework for efficient SOT.
  • Supports any teacher and student architecture, any input resolution, and any number of layers.
  • Balances efficiency and effectiveness (2.17× speedup with 96% accuracy).
CVPR 2024 Highlight

(Highlight) OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning

Lingyi Hong, Shilin Yan, Renrui Zhang, Wanyun Li, Xinyu Zhou, Pinxue Guo, Kaixun Jiang, Yiting Chen, Jinglun Li, Zhaoyu Chen, Wenqiang Zhang

[Paper]

  • The first to unify RGB and RGB+X tracking in a general framework.
  • Introduces foundation models and parameter-efficient tuning into object tracking, moving beyond the traditional full fine-tuning strategy.
  • SOTA performance on 11 benchmarks across 6 tracking tasks.
T-PAMI 2025 & ICCV 2023

LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation

Lingyi Hong, Zhongying Liu, Wenchao Chen, Chenzhi Tan, Yuang Feng, Xinyu Zhou, Pinxue Guo, Jinglun Li, Zhaoyu Chen, Shuyong Gao, Wei Zhang, Wenqiang Zhang

LVOS: A Benchmark for Long-term Video Object Segmentation

Lingyi Hong, Wenchao Chen, Zhongying Liu, Wei Zhang, Pinxue Guo, Zhaoyu Chen, Wenqiang Zhang

[Paper V2] [Paper V1] [Home Page] [Github]

  • The first long-term video object segmentation benchmark.

Others

πŸ“… Organizations

πŸ“– Education

  • 2022.09 - Now, Ph.D. candidate, School of Computer Science, Fudan University, Shanghai, China.
  • 2018.09 - 2022.06, B.E. in Information Security, School of Computer Science, Fudan University, Shanghai, China.

πŸ†š Contests

  • 2025.03: 1st Place, 2nd Cross-Domain Few-Shot Object Detection @ CVPR 2026.

  • 2024.06: 2nd Place, Roboflow-20VL Few-Shot Object Detection Challenge @ CVPR 2025.

  • 2024.05: 4th Place, 1st Cross-Domain Few-Shot Object Detection @ CVPR 2025.

  • 2024.08: 2nd Place, Global Multimedia Deepfake Detection Challenge @ Inclusion 2024.

  • 2022.06: 2nd Place, The 4th Large-scale Video Object Segmentation Challenge @ CVPRW 2022.

πŸ—’ Services

  • Reviewer for T-PAMI, TIP, TCSVT, ICML 2025 - 2026, NeurIPS 2024 - 2025, ICLR 2025 - 2026, CVPR 2024 - 2026, ICCV 2023 - 2025, ECCV 2024 - 2026, ACM MM 2023 - 2026.