It is always fun to read and digest research publications about action recognition and video tasks in general. Time is the extra ingredient added to space: you have to expand your thinking from space-only approaches to space-time (spatio-temporal) ones. In classic computer-vision algorithms and problems, this transition might not be as natural and smooth as it is in modern learning-intensive approaches. Anyway, this paper from a prestigious group is all about how to define a spatio-temporal attention mechanism in neural networks to deal with multi-agent scenarios in videos.
A long-standing topic in vision is the role of context in modeling for different tasks. In object recognition with convolutional neural networks (ConvNets), for instance, the aggregation of spatial context emerges naturally from the definition of the convolution operation. While this is remarkable, it should not be treated as a winning ticket: despite the implicit contextual-aggregation role of convolutions, it is not strong and informative enough to attend properly to the various aspects of the domain. The argument generalizes directly to the spatio-temporal domain and the weak aggregation role of 3-dimensional (3D) convolutions. There have been many efforts to propose systematic approaches for dealing with surrounding context in recognition tasks. This paper proposes a novel approach that integrates Transformers into neural networks for human action recognition and detection.
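To make the contrast with convolutions concrete, here is a minimal sketch in PyTorch of a self-attention layer applied over all spatio-temporal positions of a video feature map. This is purely illustrative: the dimensions, the names, and the use of `nn.MultiheadAttention` are my own assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalSelfAttention(nn.Module):
    """Illustrative self-attention over a video feature map.

    A minimal sketch, not the paper's exact Transformer head:
    every spatio-temporal position attends to every other one,
    unlike a 3D convolution whose context is a fixed local window.
    """
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, channels, time, height, width), e.g. trunk features
        b, c, t, h, w = x.shape
        # Flatten all space-time positions into a sequence of tokens.
        tokens = x.flatten(2).transpose(1, 2)       # (b, t*h*w, c)
        out, _ = self.attn(tokens, tokens, tokens)  # global space-time context
        tokens = self.norm(tokens + out)            # residual + layer norm
        return tokens.transpose(1, 2).view(b, c, t, h, w)

if __name__ == "__main__":
    feats = torch.randn(2, 64, 8, 14, 14)  # hypothetical 3D-ConvNet features
    layer = SpatioTemporalSelfAttention(dim=64, heads=4)
    print(layer(feats).shape)  # torch.Size([2, 64, 8, 14, 14])
```

The point of the sketch is the flattening step: once every position in every frame becomes a token, the attention weights can relate an actor in one frame to context anywhere else in the clip, which a fixed convolutional kernel cannot do.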
I have been invited by the AISC team to be one of the Computer Vision stream owners in the coming months. One of the prerequisites is to present a recent paper in the field. Since I completed an internship two years ago developing models for action recognition and detection, I decided to pick a recent paper in this area. Though it is fun to read and learn more about the field, video-processing tasks are practically tough to work on. I remember that one of the obstacles I faced during my internship was dealing with large datasets (e.g., Kinetics) and handling neural network training and inference for such tasks.
In this presentation, I start with a broad overview of action recognition and gradually focus on the particular case of spatio-temporal human action detection. I then talk about the novel approach proposed in the paper, highlight its details, and discuss the experimental setups and results towards the end of the talk. It is always amazing to read such papers and illuminate aspects of novel approaches. The experimental evaluation section is packed with details and is almost always fun to read. It is undeniable that these models are difficult to work with and that care must be taken during the training phase. Modularity has also played a significant role in neural network research in recent years: a module, once designed properly, can bring a significant improvement to a model built for a different task.
One of the good aspects of presenting at AISC, as I mentioned in the previous post, is that you always get exposed to people with strong academic and industrial backgrounds. The fact that you need to address questions from both points of view is remarkable. For that reason, I tried to cover the material at both a high level and in detail, so that the whole audience could take something away from the talk. Let’s hope that was accomplished.
Last but not least, Alireza Darbehani was a great host at the event. It is always refreshing to hang out with him, let alone to have a wonderful discussion at such a great event. You can enjoy reading the research paper, presented at CVPR 2019, at this link, and also watch my presentation at AISC in the YouTube video below. Let me know what you think of the talk. As always, any feedback and comments are very welcome.