Existing video analysis models often lack explainability, perform poorly on long videos, and frequently hallucinate, while commercial solutions are closed-source and costly. We introduce CReLeRI, an open-source system (https://github.com/michaelperez023/creleri-video) for action detection in untrimmed videos. CReLeRI segments videos at scene and action transitions, detects actions and their arguments, and grounds them in 3D space to improve interpretability and reduce hallucinations. The system promotes transparency and trust in AI-driven analysis of complex, real-world videos. A demonstration video is also available (https://youtu.be/XDCue9EYNTU).
  • Michael Perez
  • Rohith Venkatakrishnan
  • Jaime Ruiz
  • With: Yichi Yang, Yuheng Zha, Enze Ma, Danish Tamboli, Haodi Ma, Reza Shahriari, Vyom Pathak, Dzmitry Kasinets, Daisy (Zhe) Wang, Eric D. Ragan, Zhiting Hu, Eric Xing, & Jun-Yan Zhu

Michael Francis Perez, Yichi Yang, Yuheng Zha, Enze Ma, Danish Tamboli, Haodi Ma, Reza Shahriari, Vyom Pathak, Dzmitry Kasinets, Rohith Venkatakrishnan, Daisy (Zhe) Wang, Jaime Ruiz, Eric D. Ragan, Zhiting Hu, Eric Xing, & Jun-Yan Zhu. (2025). CReLeRI: Explainable, Concept-centric, Representation, Learning, Reasoning, and Interaction Video Analysis System. In Proceedings of the 33rd ACM International Conference on Multimedia.

@INPROCEEDINGS{Perez2025,
  author={Perez, Michael and Yang, Yichi and Zha, Yuheng and Ma, Enze and Tamboli, Danish and Ma, Haodi and Shahriari, Reza and Pathak, Vyom and Kasinets, Dzmitry and Venkatakrishnan, Rohith and Wang, Daisy (Zhe) and Ruiz, Jaime and Ragan, Eric D. and Hu, Zhiting and Xing, Eric and Zhu, Jun-Yan},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  title={CReLeRI: Explainable, Concept-centric, Representation, Learning, Reasoning, and Interaction Video Analysis System},
  year={2025},
  keywords={Multimedia Interaction;Video Action Detection;Object Detection;Interpretability;Grounding;Vision-Language Models;Large Language Models;Human-Centered Computing},
  doi={10.1145/3746027.3754479}
}