Hi! I am a second-year PhD student in the Willow team at Inria and École Normale Supérieure in Paris, advised by Cordelia Schmid and Shizhe Chen. I am working on diffusion models for vision-language generation.
I completed my Master's by Research in Computer Science at CVIT, IIIT Hyderabad, advised by C.V. Jawahar and Makarand Tapaswi. My thesis was on Situation Recognition for Holistic Video Understanding, exploring structured representations to enable deeper semantic understanding of dynamic scenes. Prior to this, I was a Research Assistant in the Computer Vision lab at IIT Gandhinagar, where I worked with Shanmuganathan Raman. My work centered on computational photography, specifically high dynamic range (HDR) image and video reconstruction.
I am broadly interested in unified large multimodal diffusion models, particularly at the intersection of vision and language, for joint understanding and generation across text, images, and videos. Currently, I am exploring compositional representations for high-fidelity and interpretable text-to-image/video diffusion models.
CV / Google Scholar / GitHub / LinkedIn
May, 2025 : Released our new work, ComposeAnything, for compositional text-to-image generation.
February, 2025 : One paper accepted to CVPR 2025! We create a challenging benchmark for compositional video understanding.
February, 2024 : One paper accepted to CVPR 2024! We design a new framework for identity-aware captioning of movie videos. We also propose a new captioning metric, iSPICE, that is sensitive to wrong identities in captions.
September, 2023 : Started PhD in the Willow team of Inria Paris.
September, 2022 : One paper accepted to NeurIPS 2022! We formulate a new structured framework for dense video understanding and propose a Transformer-based model, VideoWhisperer, that operates on a group of clips and jointly predicts all the salient actions, semantic roles via captioning, and spatio-temporal grounding in a weakly supervised setting.
zeeshan.khan@inria.fr
Office: C-412
Address: 2 Rue Simone Iff, 75012 Paris, France

We propose a data- and dynamics-driven constant stepsize for Adam that removes the need for costly tuning and avoids the pitfalls of decaying schedulers. Our estimate matches the best stepsizes found by exhaustive search, performs on par with popular schedulers, and scales to ImageNet and LLM fine-tuning. We further provide theory showing that Adam with our stepsize converges to critical points for smooth non-convex objectives.
Alokendu Mazumder, Rishab Sabharwal, Bhartendu Kumar, Manan Tayal, Arnab Roy, Chirag Garg, Punit Rathore
In IEEE Transactions on Artificial Intelligence (IF: 6.7), 2026

We establish convergence guarantees for fractional gradient descent on matrix-smooth non-convex functions and show how matrix smoothness accelerates convergence. We introduce Compressed Fractional Gradient Descent (CFGD) with matrix-valued stepsizes, proving faster convergence than scalar stepsizes in both single-node and distributed settings. This is the first work to study fractional gradient descent in federated/distributed optimisation.
Alokendu Mazumder, Keshav Vyas, Punit Rathore
In IEEE Transactions on Neural Networks and Learning Systems (IF: 8.9), 2025

Low-Rank Autoencoder (LoRAE) extends autoencoders with a low-rank regularizer to learn compact and implicit low-dimensional latent spaces. It achieves tighter theoretical error bounds and strong empirical performance in image generation and downstream classification. It even beats highly optimized GANs!
Alokendu Mazumder, Tirthajit Baruah, Bhartendu Kumar, Rishab Sharma, Vishwajeet Pattanaik, Punit Rathore
In the Winter Conference on Applications of Computer Vision (WACV), 2024

We propose an efficient training-free perceptual quality metric for DIBR-synthesized views that exploits saliency-guided deep features extracted from a pretrained VGG-16 network. By focusing on perceptually important regions and fusing feature maps using cosine similarity, the method effectively captures geometric distortions and outperforms existing state-of-the-art quality assessment metrics on standard benchmarks.
Shubham Chaudhary, Alokendu Mazumder, Deebha Mumtaz, Vinit Jakhetiya, Badri Subudhi
In the International Conference on Image Processing (ICIP), 2021