CReLeRI: Explainable, Concept-centric, Representation, Learning, Reasoning, and Interaction Video Analysis System
Michael Francis Perez, Yichi Yang, Yuheng Zha, Enze Ma, Danish Tamboli, Haodi Ma, Reza Shahriari, Vyom Pathak, Dzmitry Kasinets, Rohith Venkatakrishnan, Daisy (Zhe) Wang, Jaime Ruiz, Eric D. Ragan, Zhiting Hu, Eric Xing, Jun-Yan Zhu
Existing video analysis models often lack explainability, perform poorly on long videos, and frequently hallucinate. Commercial solutions are closed-source and costly. We introduce CReLeRI, an open-source (https://github.com/michaelperez023/creleri-video) system for action detection in untrimmed videos. CReLeRI segments videos using scene and action transitions, detects actions and their arguments, and grounds them in 3D space to improve interpretability and reduce hallucinations. The system promotes transparency and trust in AI-driven analysis of complex, real-world videos. A demonstration video is also available (https://youtu.be/XDCue9EYNTU).
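The abstract describes the segmentation step only at a high level. As a rough illustration of what "segmenting at transitions" can mean, the sketch below splits an untrimmed video into segments wherever two consecutive frames differ sharply, using an L1 distance between per-frame colour histograms as a stand-in for scene-transition detection. This is a generic shot-boundary heuristic, not CReLeRI's actual method; the `Segment` type, the functions `histogram_distance` and `segment_at_transitions`, the `threshold` and `min_len` parameters, and the assumption that normalized per-frame histograms are already computed are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    start: int  # index of the first frame in the segment (inclusive)
    end: int    # index one past the last frame in the segment (exclusive)


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two normalized per-frame histograms (hypothetical helper)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def segment_at_transitions(
    frame_hists: List[Sequence[float]],
    threshold: float = 0.5,
    min_len: int = 1,
) -> List[Segment]:
    """Split a frame sequence wherever consecutive histograms differ by more than
    `threshold`. A cut is accepted only if the current segment already spans at
    least `min_len` frames, which suppresses very short fragments."""
    if not frame_hists:
        return []
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(frame_hists)):
        jump = histogram_distance(frame_hists[i - 1], frame_hists[i])
        if jump > threshold and i - start >= min_len:
            segments.append(Segment(start, i))  # close the segment before frame i
            start = i                           # frame i opens a new segment
    segments.append(Segment(start, len(frame_hists)))  # close the final segment
    return segments


if __name__ == "__main__":
    # Three frames of one "scene" followed by two frames of a visually different one.
    frames = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 2
    print(segment_at_transitions(frames))
    # → [Segment(start=0, end=3), Segment(start=3, end=5)]
```

In a real system each resulting segment would then be passed to downstream action detection; a fixed histogram threshold is the simplest possible boundary criterion and would typically be replaced by learned scene- or action-transition detectors.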
Citation
Michael Francis Perez, Yichi Yang, Yuheng Zha, Enze Ma, Danish Tamboli, Haodi Ma, Reza Shahriari, Vyom Pathak, Dzmitry Kasinets, Rohith Venkatakrishnan, Daisy (Zhe) Wang, Jaime Ruiz, Eric D. Ragan, Zhiting Hu, Eric Xing, and Jun-Yan Zhu. 2025. CReLeRI: Explainable, Concept-centric, Representation, Learning, Reasoning, and Interaction Video Analysis System. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25). Association for Computing Machinery, New York, NY, USA, 13528–13530. https://doi.org/10.1145/3746027.3754479
Bibtex
@inproceedings{10.1145/3746027.3754479,
author = {Perez, Michael Francis and Yang, Yichi and Zha, Yuheng and Ma, Enze and Tamboli, Danish and Ma, Haodi and Shahriari, Reza and Pathak, Vyom and Kasinets, Dzmitry and Venkatakrishnan, Rohith and Wang, Daisy (Zhe) and Ruiz, Jaime and Ragan, Eric D. and Hu, Zhiting and Xing, Eric and Zhu, Jun-Yan},
title = {CReLeRI: Explainable, Concept-centric, Representation, Learning, Reasoning, and Interaction Video Analysis System},
year = {2025},
isbn = {9798400720352},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746027.3754479},
doi = {10.1145/3746027.3754479},
abstract = {Existing video analysis models often lack explainability, perform poorly on long videos, and frequently hallucinate. Commercial solutions are closed-source and costly. We introduce CReLeRI, an open-source system for action detection in untrimmed videos. CReLeRI segments videos using scene and action transitions, detects actions and their arguments, and grounds them in 3D space to improve interpretability and reduce hallucinations. The system promotes transparency and trust in AI-driven analysis of complex, real-world videos. A demonstration video is also available.},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
pages = {13528--13530},
numpages = {3},
keywords = {grounding, human-centered computing, interpretability, large language models, multimedia interaction, object detection, video action detection, vision-language models},
location = {Dublin, Ireland},
series = {MM '25}
}
