CVM 2025 Conference Program

Day 1 Saturday, April 19, 2025
09:00 - 09:20 Opening Session (with HKUST AI Film Festival)
09:20 - 10:20 Keynote Speech I - Prof. Maneesh Agrawala (with HKUST AI Film Festival)
10:30 - 10:50 Tea Break
10:50 - 12:10 Paper Session 1: Geometric and Texture Reconstruction (5 papers, 15 minutes/paper)
Yu Chen, Hongwei Lin
Human Perception Faithful Curve Reconstruction Based on Persistent Homology and Principal Curve
Haichuan Song, Xinyi Chen
FEDNet: A Feature-Enhanced Diffusion Network for Efficient and Universal Texture Synthesis
Hongxiang Huang, Guoyuan An, Jingzhen Lan, Lingfei Wang, Yuchi Huo, Rui Wang
Ultra-High Resolution Facial Texture Reconstruction from a Single Image
Alakh Aggarwal, Ningna Wang, Xiaohu Guo
TexHOI: Reconstructing Textures of 3D Unknown Objects in Monocular Hand-Object Interaction Scenes
Ziqiang Dang, Wenqi Dong, Zesong Yang, Bangbang Yang, Liang Li, Yuewen Ma, Zhaopeng Cui
TexPro: Text-guided PBR Texturing with Procedural Material Modeling
12:10 - 13:30 Lunch
13:30 - 14:50 Paper Session 2: Rendering (5 papers, 15 minutes/paper)
Shi Mao, Chenming Wu, Zhelun Shen, Yifan Wang, Dayan Wu, Liangjun Zhang
NeuS-PIR: Learning Relightable Neural Surface using Pre-Integrated Rendering
Xiaowei Song, Ju Zheng, Shiran Yuan, Huan-ang Gao, Jingwei Zhao, Xiang He, Weihao Gu, Zhicheng Wang, Hao Zhao
SA-GS: Scale-Adaptive Gaussian Splatting for Training-Free Anti-Aliasing
Dongyu Chen, Haoxiang Chen, Qunce Xu, Tai-Jiang Mu
RS-SpecSDF: Reflection-Supervised Surface Reconstruction and Material Estimation for Specular Indoor Scenes
Qi-Yuan Feng, Hao-Xiang Chen, Qun-Ce Xu, Tai-Jiang Mu
SLS4D: Sparse Latent Space for 4D Novel View Synthesis (invited TVCG paper presentation)
Chenhui Wang
DC-APIC: A Decomposed Compatible Affine Particle in Cell Transfer Scheme for Non-sticky Solid-Fluid Interactions in MPM
14:50 - 15:40 Paper Session 3: 3D Generation (3 papers, 15 minutes/paper)
Rengan Xie, Wenting Zheng, Kai Huang, Yizheng Chen, Qi Wang, Qi Ye, Wei Chen, Yuchi Huo
LDM: Large Tensorial SDF Model for Textured Mesh Generation
Chen Wang, Guangshun Wei, James Kit Hon Tsoi, Zhiming Cui, Shuyi Lu, Zhenpeng Liu, Yuanfeng Zhou
Diff-OSGN: Diffusion-based Occlusal Surface Generation Network with Geometric Constraints
Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, Baining Guo
VolumeDiffusion: Feed-forward Text-to-3D Generation with Efficient Volumetric Encoder
15:40 - 16:00 Tea Break
16:00 - 17:30 Poster Session 1: Detection, Segmentation and Medical Imaging
Zhiwei Dong, Genji Yuan, Jinjiang Li
AGTCNet: Hybrid Network Based on AGT and Curvature Information for Skin Lesion Detection
Xin Chi, Yu Sun, Yingjun Zhao, Donghua Lu, Jun Yang, Yiting Zhang
A Comprehensive Framework for Fine-Grained Object Recognition in Remote Sensing
Jiayong Zhu, Tao Zhang
SEA-Net: A Severity-Aware Network with Visual Prompt Tuning for Underwater Semantic Segmentation
Tingwei Wen, Yao Lu, Xiaosheng Chen, Xinhai Lu, Guangming Lu
Among General Spine Segmentation with Multi-scale and Discriminate Feature Fusion
Yiquan Wu, Zhongtian Wang, You Wu, Ling Huang, Hui Zhou, Shuiwang Li
Towards Reflected Object Detection: A Benchmark
Yujie Liu, Zhonghao Du, Xuanting Li, Zongmin Li, Jiayue Fan, Chaozhi Yang
SSCL: A Spatial-Spectral and Commonality Learning Network for Semi-Supervised Medical Image Segmentation
Lisha Cui, Helong Jiao, Tengyue Liu, Chunyan Niu, Ming Ma, Xiaoheng Jiang, Mingliang Xu
LAGNet: A Location-Aware Guidance Network for Weak and Strip Defect Detection
Di Zhou, Jiahui Li, Haiying Wang, Matthew Burns, Meng Liu
Consensus-aware Balance Learning for Sexually Suggestive Video Classification
Hanli Zhao, Binhao Wang, Wanglong Lu, Juncong Lin
Degradation-Aware Frequency-Separated Transformer for Blind Super-Resolution
Fuxian Sui, Hua Wang, Fan Zhang
A Multiscale Edge-Guided Polynomial Approximation Network for Medical Image Segmentation
Shiwei He, Yingjuan Jia, Hanpu Wang, Xinyu Liu, Jianmeng Zhou, Huijie Gu, Mengyan Li, Tong Chen
Weighted Spatiotemporal Feature and Multi-task Learning for Masked Facial Expression Recognition
Junsheng Chang, Qin Shi, Yijun Zhang, Zongtang Hu, Xulun Ye
MAAU-UIE: Multiple Attention Aggregation U-Net for Underwater Image Enhancement
Jinglong Tian, Tianze Zhao, Zhijun Fan, Linlin Shen, Jieyao Wei, Qiumei Pu
MANet-CycleGAN: An Unsupervised LDCT Image Denoising Method Based on Channel Attention and Multi-Scale Features
Zhou Yang, Hua Wang, Fan Zhang
HIFNet: Medical Image Segmentation Network Utilizing Hierarchical Attention Feature Fusion
Ke Xu, Min Li, Guangjian Liu, Chen Chen, Cheng Chen, Enguang Zuo, Xiaoyi Lv
MBGNet: Mamba-Based Boundary-Guided Multimodal Medical Image Segmentation Network
Longtao Chen, Jinjie Zheng, Fenglei Xu, Jing Lou, Huanqiang Zeng
MSD: Mask-Guided and Semantic-Guided Diffusion-Based Framework for Stone Surface Defect Detection
Wenzhe Meng, Xiaoliang Zhu, Yanxiang Li
YNet: medical image segmentation model based on wavelet transform boundary enhancement
Qichang Wang, Ruixia Liu
A New Heterogeneous Mixture of Experts Model for Deepfake Detection
Xinyu Yang, Xiaochen Ma, Xuekang Zhu, Bo Du, Lei Su, Bingkui Tong, Zeyu Lei, Jizhe Zhou
M3: Manipulation Mask Manufacturer for Arbitrary-Scale Super-Resolution Mask
Xinrong Hu, Chao Fang, Yu Chen, Kai Yang, Chun-Mei Feng, Ping Li
Agent-Conditioned Multi-Contrast MRI Super-Resolution for Cross-Subject
Xin Feng, Jie Wang, Siping Wang, Jiehui Zhang
LightStar-Net: A Pseudo-Raw Space Enhancement for Efficient Low-Light Object Detection
Zunwang Ke, YinFeng Wang, Run Guo, Minghua Du, Ji-Sheng Zhou, Gang Wang, Yugui Zhang
An Effective Algorithm for Skin Disease Segmentation Combining inter-channel Features and Spatial Feature Enhancement
Haodong Li, Haicheng Qu
DASSF: Dynamic-Attention Scale-Sequence Fusion for Aerial Object Detection
18:20 - 19:20 AI Film Festival Screening
19:30 - 21:00 Reception (with HKUST AI Film Festival)
Day 2 Sunday, April 20, 2025
09:00 - 10:00 Keynote Speech II - Inference-time Guided Generation Using Diffusion and Flow Models (by Prof. Minhyuk Sung)
10:00 - 10:20 Tea Break
10:20 - 11:30 Paper Session 4: Image/video Enhancement (4 papers, 15 minutes/paper)
Yong Liu, Qingji Dong, Chao Zhu, Yu Guo, Fei Wang
Towards Real-world Image Dehazing: A Tailored Dehazing Method and A High-Quality Dataset
Ji-Wei Wang, Li-Yong Shen
Temporal-Spatial Fusion Transformer for Video Demoiréing
Yue Zhao, Zhonggui Chen, Juan Cao
Palette-based Color Transfer for Images and Videos
Simin Kou, Fang-Lue Zhang, Jakob Nazarenus, Reinhard Koch, Neil A. Dodgson
OmniPlane: A Recolorable Representation for Dynamic Scenes in Omnidirectional Videos (invited TVCG paper presentation)
11:30 - 13:10 Lunch
13:10 - 14:30 Paper Session 5: Multimedia Generation (5 papers, 15 minutes/paper)
Sen Peng, Weixing Xie, Zilong Wang, Xiaohu Guo, Zhonggui Chen, Baorong Yang, Xiao Dong
RMAvatar: Photorealistic Human Avatar Reconstruction from Monocular Video Based on Rectified Mesh-embedded Gaussians
Chenxu Zhang, Chao Wang, Jianfeng Zhang, Hongyi Xu, Guoxian Song, You Xie, Linjie Luo, Yapeng Tian, Jiashi Feng, Xiaohu Guo
MagicTalk: Implicit and Explicit Correlation Learning for Diffusion-based Emotional Talking Face Generation
Boyao Ma, Yuanping Cao, Lei Zhang
Decoupled Two-Stage Talking Head Generation via Gaussian-Landmark-Based Neural Radiance Fields
Zian Wang, Shihao Zou, Shiyao Yu, Mingyuan Zhang, Chao Dong
Semantics-Aware Human Motion Generation from Audio Instructions
Xufei Guo, Xiao Dong, Juan Cao, Zhonggui Chen
CADTrans: A Code Tree-Guided CAD Generative Transformer Model with Regularized Discrete Codebooks
14:30 - 14:50 Tea Break
14:50 - 15:30 Paper Session 6: Action Analysis (2 papers, 15 minutes/paper)
Songmiao Wang, Ruize Han, Wei Feng
Concept-Guided Open-Vocabulary Temporal Action Detection
Yuqing Zhang, Chen Pang, Pei Geng, Xuequan Lu, Lei Lyu
Multi-scale adaptive large kernel graph convolutional network based on skeleton-based recognition
15:30 - 17:00 Poster Session 2: Generative Models, 3D and Geometry
Jingze Chen, Lei Li, Zerui Tang, Qiqin Lin, Junfeng Yao
CMU-Flownet: Exploring Point Cloud Scene Flow Estimation in Occluded Scenario
Ye Wang, Ruiqi Liu, Zili Yi, Tieru Wu, Rui Ma
SingleDream: Attribute-Driven T2I Customization from a Single Reference Image
Zhikun Wen, Honghua Chen, Zhe Zhu, Zeyong Wei, Liangliang Nan, Mingqiang Wei
CosCAD: Cross-Modal CAD Model Retrieval and Pose Alignment from a Single Image
Kangneng Zhou, Yaxing Wang, Shuang Song, Jie Zhang, Ping Li
3DFaceController: Region-Controllable Face Synthesis via Decomposed and Recomposed Neural Radiance Fields
Pengfei Deng, Tianjiao Zhang, Weize Quan, Hanyu Wang, Qinglin Lu, Zhifeng Li, Dong-Ming Yan
Concept-Edge Fusion: Background Generation for Product Presentation Based on Text-to-Image Model
Yishuo Fei, Chao Chen, Haipeng Liao, Mo Chen, Yuhui Yang, Dongming Lu
High-Quality and Efficient Inverse Rendering for Geometry, Material, and Illumination Reconstruction
Ran Zuo, Haoxiang Hu, Xiaoming Deng, Yaokun Li, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, Hongan Wang
Sketch-Guided Scene-level Image Editing with Diffusion Models
Hongyu Chen, Xiao-Diao Chen
An efficient and robust tracing method based on matrix representation for surface-surface intersection
Hao Yu, Ruian Wang, Longdu Liu, Shuangmin Chen, Shiqing Xin, Zhenyu Shu, Changhe Tu
Completing Dental Models While Preserving Crown Geometry and Meshing Topology
Ziqi Zeng, Chen Zhao, Weiling Cai, Yuqing Guo
Semantic-guided Coarse-to-Fine Diffusion Model for Self-supervised Image Shadow Removal
Yujing Sun, Caiyi Sun, Yuan Liu, Yuexin Ma, Siu Ming Yiu
Extreme Two-View Geometry From Object Poses with Diffusion Models
Yu-Jie Yuan, Leif Kobbelt, Jie Yang, Yu-Kun Lai, Lin Gao
TPD-NeRF: Temporally Progressive Reconstruction of Dynamic Neural Radiance Fields from Monocular Video
Longdu Liu, Hao Yu, Shiqing Xin, Shuangmin Chen, Hongwei Lin, Wenping Wang, Changhe Tu
Direct Extraction of High-Quality and Feature-Preserving Triangle Meshes from Signed Distance Functions
Xiangyu Su, Sida Peng, Oliver van Kaick, Hui Huang, Ruizhen Hu
MTScan: Material Transfer from Partial Scans to CAD models
Qifeng Chen, Kai Huang, Yuchi Huo, Qi Wang, Wenting Zheng, Rong Li, Rengan Xie
HR Human: Modeling Human Avatars with Triangular Mesh and High-Resolution Textures from Videos
Xinqi Liu, Chenming Wu
VGA: Reconstructing Vivid 3D Gaussian Avatars from Monocular Videos
Shan Yue, Hai Huang, Zhenqi Tang, Yutong Zheng, Zhou Fang
TCDNet: Texture and Color Dynamic Network for Image Harmonization
Fuyang Liu, Jianjun Li
Unsupervised Monocular Depth Estimation for Foggy Images with Domain Separation and Self-depth Domain Conversion
Qingyi Zhu, Ruochen Jin, Zhiwei Zhang, Yishen Xue, Xin Tan, Lizhuang Ma
TAD: A plug-and-play Task Arithmetic approach for augmenting Diffusion models
Yu Liu, Fatimah Khalid, Cunrui Wang, Mas Rina Mustaffa, Azreen Azman
DiffVecFont: Fusing Dual-Mode Reconstruction Vector Fonts via Masked Diffusion Transformers
Zhenzhen Xiao, Heng Liu, Bingwen Hu
Unwarping Screen Content Images via Structure-texture Enhancement Network and Transformation Self-estimation
Qun-Ce Xu, Yan-Pei Cao, Weihao Cheng, Tai-Jiang Mu, Ying Shan, Yong-Liang Yang, Shi-Min Hu
High-accuracy Fractured Object Reassembly under Arbitrary Poses
17:20 Shuttle Bus Departure to the Banquet
18:30 - 20:00 Conference Banquet
Day 3 Monday, April 21, 2025
09:00 - 10:00 Keynote Speech III - Simulate Everything, Everywhere, All At Once (by Prof. Eitan Grinspun)
10:00 - 10:20 Tea Break
10:20 - 11:40 Paper Session 7: Geometry Processing (5 papers, 15 minutes/paper)
Di Shao, Yaping Jing, Xinkui Zhao, Shasha Mao, Lei Lyu, Xiao Liu, Xuequan Lu
DS-MAE: Dual-Siamese Masked Autoencoders for Point Cloud Analysis
Ao Zhang, Qing Fang, Peng Zhou, Xiao-Ming Fu
Topology-Controlled Laplace-Beltrami Operator on Point Clouds Based on Persistent Homology
Yuan-Yuan Cheng, Qing Fang, Ligang Liu, Xiao-Ming Fu
Developable approximation via Isomap on Gauss image
Gaoyang Zhang, Yingxi Chen, Hanchao Li, Xinguo Liu
Efficient and Structure-Aware 3D Reconstruction via Differentiable Primitives Abstraction
Jiachen Liu, Yuan Xue, Haomiao Ni, Rui Yu, Zihan Zhou, Sharon X. Huang
Computer-Aided Layout Generation for Building Design: A Review
11:40 - 13:00 Lunch
13:00 - 14:10 Paper Session 8: Optimization and Application (4 papers, 15 minutes/paper)
Gang Xu, Haoyu Liu, Biao Leng, Zhang Xiong
ImVoxelENet: Image to Voxels Epipolar Transformer for Multi-View RGB-based 3D Object Detection
Yanchao Bi, Yang Ning, Xiushan Nie, Xiankai Lu, Ruiheng Zhang, Huanlong Zhang
FGHDet: Delving Into Fine-grained Features With Head Selection for UAV Object Detection
Siying Huang, Xin Yang, Zhengda Lu, Hongxing Qin, Huaiwen Zhang, Yiqun Wang
L2-GNN: Graph Neural Networks with Fast Spectral Filters Using Twice Linear Parameterization
Zihan Zhou, Jiacheng Pan, Xumeng Wang, Dongming Han, Fangzhou Guo, Minfeng Zhu, Wei Chen
A Summarization-Based Pattern-Aware Matrix Reordering Approach
14:10 - 14:30 Tea Break
14:30 - 16:00 Poster Session 3: Multimodal Learning, Unsupervised Methods, and Applications
Wenbin Wu, Zhiwei Zhang, Xin Tan, Zhizhong Zhang, Lizhuang Ma
DepthFisheye: Efficient Fine-Tuning of Depth Estimation Models for Fisheye Cameras
Yongbiao Gao, Xiangcheng Sun, Guohua Lv, Deng Yu, Sijie Niu
Reinforced Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
Jiangnan Xia, Zhiyuan Zhang, Yanyin Guo, Qilong Wu, Yi Li, Jianghan Cheng, Junwei Li
Bridging the Modality Gap: Advancing Multimodal Human Pose Estimation with Modality-Adaptive Pose Estimator and Novel Benchmark Datasets
Xiaole Zhu, Zongtao Duan, Junchen Huang, Xing Sheng
Momentum-Based Uni-Modal Soft-Label Alignment and Multi-Modal Latent Projection Networks for Optimizing Image-Text Retrieval
Keyang Lin, Zhijun Fang, Sicong Zang, Hang Wu
Learning Adaptive Basis Fonts to Fuse Content Features for Few-shot Font Generation
Xiaoyu Guan, Yihao Li, Tianyu Huang
TaiCrowd: A High-Performance Simulation Framework for Massive Crowd
Wei Ge, Yongwei Nie, Fei Ma, Keke Tang, Fei Richard Yu, Hongmin Cai, Ping Li
Training-Free Language-Guided Video Summarization via Multi-Grained Saliency Scoring
Benchao Li, Yun Zou, Ruisheng Ran
MCFG with GUMAP: A Simple and Effective Clustering Framework on Grassmann Manifold
Yun Zou, Benchao Li, Ruisheng Ran
Joint UMAP for Visualization of Time-Dependent Data
Chengrong Yang, Qiwen Jin, Xiaoguo Zhang, Yujue Zhou
Feature Disentanglement and Fusion Model for Multi-Source Domain Adaptation with Domain-Specific Features
Hao Tong, Jiawei Liu, Yong Wu, Guozhi Zhao, Fanrui Zhang, Zheng-Jun Zha
Multi-Granularity and Multi-Modal Prompt Learning for Person Re-Identification
Lu Xu, Shuaixin Li, Xin Zhou, Xiaozhou Zhu, Wen Yao
Local and Global Feature Cross-attention Multimodal Place Recognition
Hongchao Zhong, Li Yu, Longkun Zou, Ke Chen
Unsupervised Domain Adaptation on Point Cloud Classification via Imposing Structural Manifolds into Representation Space
Kailang Hu, Yixiao Lu, Huibing Li, Xuan Song
A Trademark Retrieval Method Based on Self-Supervised Learning
Shu Liu, Melikamu Liyih Sinishaw, Luo Zheng
DIMATrack: Dimension Aware Data Association for Multi-Object Tracking
Qinghua Song, Xiaolei Wang
Efficient Transformer Network for Visible and Ultraviolet Object Tracking
Zheng Zhang, RuiQing Yang, ChuanLei Zhang
IML-CMM - A Multimodal Sentiment Analysis Framework Integrating Intra-Modal Learning and Cross-Modal Mixup Enhancement
Junjiang Liu, Dandan Sun, Hailun Xia, Jiangtao Bai, Xinyue Fan
Weaken Noisy Feature: Boosting Semi-Supervised Learning by Noise Estimation
Ruizhong Du, Luman Zhao, Mingyue Li, Yidan Li, Shenyu Li, Caixia Ma
ADMMOA: Attribute-Driven Multimodal Optimization for Face Recognition Adversarial Attacks
Mingming Li, Fei Wu, Yinjie Wang
LightGR-Transformer: Light Grouped Residual Transformer for Multispectral Object Detection
Weiye Peng, Shenghua Zhong
Multi-Dimension Full Scene Integrated Visual Emotion Analysis Network
Shan Huang, Wenhua Qian
Gap-KD: Bridging the Significant Capacity Gap Between Teacher and Student Model
16:00 - 16:30 Closing Session

Keynote Speakers

Prof. Maneesh Agrawala, Stanford University, USA

Title:

Beyond Unpredictable Black Boxes: Designing Generative AI For Iterative Refinement

Abstract:

Human creation of high-quality content often requires significant iteration. People produce a coarse initial draft and refine it step-by-step towards a final result. While modern generative AI tools are capable of producing surprisingly high-quality content from simple text prompts, they do not support this iterative refinement workflow. Instead, today's AI tools are black boxes, making it impossible for users to build a mental/conceptual model that can predict how an input prompt will be transmuted into output content. The lack of predictability forces users to rely on iterative trial-and-error: repeatedly crafting a prompt, using the AI to generate a result, and then adjusting the prompt to try again. In this talk I'll outline some features generative AI tools should provide to support iterative refinement workflows rather than iterative trial-and-error. These features include hierarchical decomposition of the creation task and consistency of the output content from iteration to iteration. Finally, I'll suggest some approaches we might use to build generative AI tools that provide such features, and demonstrate a few implementations of these ideas that we have developed in our lab at Stanford.


Speaker's Biography:

Maneesh Agrawala is the Forest Baskett Professor of Computer Science and Director of the Brown Institute for Media Innovation at Stanford University. He works on computer graphics, human computer interaction and visualization. His focus is on investigating how cognitive design principles can be used to improve the effectiveness of audio/visual media. The goals of this work are to discover the design principles and then instantiate them as constraints and controls in generative AI tools. Honors include an Okawa Foundation Research Grant (2006), an Alfred P. Sloan Foundation Fellowship (2007), an NSF CAREER Award (2007), a SIGGRAPH Significant New Researcher Award (2008), a MacArthur Foundation Fellowship (2009), an Allen Distinguished Investigator Award (2014) and induction into the SIGCHI Academy (2021). He was named an ACM Fellow in 2022.


Prof. Eitan Grinspun, University of Toronto, Canada

Title:

Simulate Everything, Everywhere, All At Once

Abstract:

Reduced order modeling (ROM) has long promised the ability to simulate complex dynamics at lightning speed—at the cost of specialization. Traditional ROMs are typically tied to a specific scenario, geometry, and discretization, making them brittle in the face of changing applications, shapes, or numerical methods.

If we could lift these restrictions, then we could unlock the potential of dramatically more expressive ROMs that can accelerate a wide range of simulations and applications. These ROMs could absorb data from disparate simulations—even those conducted on different discretizations—and generalize across diverse shapes.

I will present first steps toward these goals. Using neural fields to build continuous, differentiable representations of physical phenomena, we can learn a low-dimensional manifold of kinematic representations not tied to one shape or discretization. These new ROMs enable fast, accurate simulations that can train on—and generalize across—grids, meshes, and point clouds alike, and their generalization across shapes has exciting connections to spectral geometry processing. I believe that these approaches point to a new kind of simulation engine: one that is fast, general, and geometry-aware, bringing us one step closer to simulating everything, everywhere, all at once.


Speaker's Biography:

Eitan Grinspun is Associate Chair, Communications, Mentoring and Inclusion at the Department of Computer Science at the University of Toronto, where he is also a Professor of Computer Science and Mathematics. He was previously the Co-Director of the Columbia Computer Graphics Group at Columbia University (2004-2021), Professeur d'Université Invité at l'Université Pierre et Marie Curie in Paris (2009), a Research Scientist at New York University's Courant Institute (2003-2004), a graduate student at the California Institute of Technology (1997-2003), and an undergraduate in Engineering Science at the University of Toronto (1993-1997). He has been an NVIDIA Fellow (2001), Eberhart Distinguished Lecturer (2003), NSF CAREER Awardee (2007), Alfred P. Sloan Research Fellow (2010-2012), one of Popular Science magazine's "Brilliant Ten Scientists" (2011), and Fast Company magazine's "Most Creative People in Business" (2013). Technologies developed by his lab are used in products such as Adobe Photoshop & Illustrator, at major film studios, and in soft matter physics and engineering research. He has been profiled in The New York Times, Scientific American, New Scientist, and mentioned in Variety. His film credits include The Hobbit, Rise of the Planet of the Apes, and Steven Spielberg's The Adventures of Tintin.


Prof. Minhyuk Sung, KAIST, South Korea

Title:

Inference-time Guided Generation Using Diffusion and Flow Models

Abstract:

Recent breakthroughs in generative AI have transformed the creative process, making it easier than ever to generate realistic images and videos. While the quality of generated outputs has reached unprecedented levels of realism, the challenge now lies in improving controllability and alignment with user preferences. Although text-to-image generative models have become prevalent, text input alone is often insufficient to provide precise spatial or stylistic control. Traditionally, users have provided such inputs through direct interactions, such as mouse clicks, but supporting these traditional input methods has become increasingly challenging. Moreover, despite the widespread use of text-to-image generation, text-image alignment remains far from perfect, particularly for complex prompts. To enhance controllability and alignment with user intent, recent advancements in LLMs have shifted focus beyond scaling training data and model size to scaling inference-time computation, as demonstrated by the AGI-level performance of models like GPT-4o and DeepSeek. In this talk, I will discuss inference-time generation techniques for guided image and video generation, categorized into three main approaches. First, noise manipulation leverages the observation that adjusting noise during the denoising process influences the final output, enabling alignment with user-defined spatial guidance. Second, gradient-descent-based algorithms utilize the expectation of the final output at an intermediate denoising step, which can be interpreted through the lens of score distillation and combined with it. Lastly, particle sampling exploits the stochastic nature of the generative process, branching out the generation to scale up and search for the desired output, albeit at the cost of increased inference computation. I will explore their capabilities, limitations, and future directions.


Speaker's Biography:

Minhyuk Sung is an Associate Professor in the School of Computing at KAIST, affiliated with the Graduate Schools of AI and Metaverse. Previously, he was a Research Scientist at Adobe Research. He earned his Ph.D. from Stanford University under Prof. Leonidas J. Guibas. His research focuses on generating, manipulating, and analyzing various visual data, including images, videos, and 3D data. He has served on program committees for SIGGRAPH Asia (2022-2025), Eurographics (2022, 2024-2025), Pacific Graphics (2023, 2025), ICCV (2025), ICLR (2025), and AAAI (2023, 2024). He received the Asia Graphics Young Researcher Award in 2024.