Sarah is a PhD candidate in the Image and Video Computing Group in the Department of Computer Science at Boston University. She is an IBM PhD Fellow and a Hariri PhD Fellow. Her research interests lie at the intersection of Computer Vision and Machine Learning. Sarah is advised by Prof. Stan Sclaroff; they are particularly interested in developing deep learning formulations for the analysis of human motion and activities in video. Sarah received her MSc from the American University in Cairo, after which she became a lecturer of Computer Science at the Gulf University for Science and Technology.
Spatio-temporal Grounding in Visual Data
Deep models are state-of-the-art for many vision tasks, including video action recognition and video captioning. Models are trained to caption or classify activity in videos, but little is known about the evidence used to make such decisions. Grounding decisions made by deep networks has been studied in spatial visual content, giving more insight into model predictions for images. However, such studies are relatively lacking for models of spatiotemporal visual content, that is, videos. In this work, we devise a formulation that simultaneously grounds evidence in space and time, in a single pass, using top-down saliency. We visualize the spatiotemporal cues that contribute to a deep model's classification or captioning output using the model's internal representation. Based on these spatiotemporal cues, we are able to localize segments within a video that correspond to a specific action, or to a phrase from a caption, without explicitly optimizing or training for these tasks.