Set Constrained Temporal Transformer for Set Supervised Action Segmentation

Action recognition problems are among the most challenging in computer vision, and time adds tremendous complexity to them. Solving a task on still images is hard enough when you want the resulting model to generalize well; once we take the same task, with all its difficulties, to the temporal domain, it becomes far harder. Take object detection as an example. Spatio-temporal object detection must deal not only with spatial variations but also with temporal variations, and with the spatial variations that emerge because of the additional temporal complexity. On top of that, video processing demands a large amount of computational resources.

Action recognition tasks rely heavily on recognizing motion patterns in the time dimension. Not only are spatial features important, but the temporal patterns they form over time add to the modeling complexity. Actions, activities, and events are the three major levels of spatio-temporal processing tasks. Actions are the smallest units in the time domain that assign concepts/semantics to spatio-temporal patterns. Activities are collections of actions happening together. Events sit at the extreme end, where collections of activities happen simultaneously. A football player kicking a ball is an action. A team playing football is an activity. Finally, a long shot of a stadium showing the final game of a football league, audience and all, is an instance of an event.

Action classification is applied to a short, trimmed video clip. The goal is to recognize motion patterns corresponding to atomic semantic labels: running, jumping, standing, and walking, to name a few. In real life, however, videos are untrimmed, so there must be a way of detecting those short clips for classification. This problem is known as action detection: the input is a variable-length video, and the output is the location of known actions within it. Once the locations are known, the clips are cut out and fed into the classification model.
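
To make the distinction concrete, here is a minimal sketch (not any particular published method) of the classic detect-then-classify pipeline: slide a fixed-length window over the untrimmed video, score each window with a clip classifier, and keep the confident ones. The `classify_clip` stub stands in for a trained model and is purely hypothetical.

```python
import numpy as np

def classify_clip(clip):
    # Stand-in for a trained clip classifier (e.g. a 3D CNN);
    # returns a (label, confidence) pair. Purely illustrative.
    return "walking", float(np.random.rand())

def detect_actions(frames, clip_len=16, stride=8, threshold=0.8):
    """Sliding-window action detection over an untrimmed video.

    frames: array of shape (T, H, W, C).
    Returns (start_frame, end_frame, label) triples for confident windows.
    """
    detections = []
    for start in range(0, len(frames) - clip_len + 1, stride):
        clip = frames[start:start + clip_len]
        label, score = classify_clip(clip)
        if score >= threshold:
            detections.append((start, start + clip_len, label))
    return detections

video = np.zeros((120, 224, 224, 3))   # a 120-frame untrimmed video
print(detect_actions(video))
```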

In machine learning, learning problems are categorized by the level of supervision they provide: supervised, semi-supervised, and unsupervised. Supervised problems are further divided into fully-supervised and weakly-supervised ones. Weakly-supervised problems are interesting in the sense that supervision exists, but the supervision signal is weak. Take object detection: in the fully-supervised setting we need not only the class labels but also the instance locations as bounding boxes, whereas in the weakly-supervised regime only the class labels are given and no location supervision is available. Why would that be? Because annotation is a tedious, time-consuming, and expensive task, and the stronger the annotation, the more demanding the labeling process becomes. The community has therefore moved towards approaches that leverage the weakest possible supervision signal when modeling learning problems.
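
As a concrete illustration of that gap, compare what a fully-supervised object detection label looks like against its weakly-supervised counterpart. The field names below are ours, not taken from any specific dataset:

```python
# Fully-supervised: every instance is localized with a bounding box.
full_annotation = {
    "image": "kitchen_001.jpg",
    "instances": [
        {"class": "cup",   "bbox": [34, 80, 96, 150]},    # [x1, y1, x2, y2]
        {"class": "plate", "bbox": [120, 60, 260, 170]},
    ],
}

# Weakly-supervised: only the set of classes present in the image.
weak_annotation = {
    "image": "kitchen_001.jpg",
    "classes": {"cup", "plate"},    # no locations, no instance counts
}
```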

Weakly-supervised formulations are relevant across computer vision tasks: localization, detection, and segmentation. In weakly-supervised instance segmentation, for instance, the goal is to find, given only a set of object classes, the segmentation masks of the instances of those classes in the input image. It is indeed a hard problem, but the community has shown it is not out of reach.

In this session of the AISC computer vision series, I hosted Mohsen Fayyaz, the first author of the paper “Set Constrained Temporal Transformer for Set Supervised Action Segmentation”, which was recently presented at CVPR 2020. The paper extends action detection to action segmentation: rather than outputting action locations, the model outputs dense action labels at every frame. The research objective is to predict action segmentation maps using only the set of action labels as supervision. As we know, the elements of a set are unique and sets are unordered. The proposed model therefore has to predict not only how many segments carry each action label but also the order in which the actions occur. The paper presents a novel multi-stage processing approach that enforces the representation to be as constrained to the required output space as possible. One interesting idea is that the paper sidesteps representation learning for spatio-temporal action recognition by using a pre-trained network. This is becoming a common practice given the computational demand of training or fine-tuning such networks from scratch: researchers build their models on top of embeddings derived from pre-trained networks, which also lets them concentrate on the problem at hand rather than on the difficulties of training a large network from scratch.
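
To give a feel for what set supervision means in practice, here is a minimal PyTorch sketch of the general idea, not the authors' architecture: framewise action scores are computed on top of frozen pre-trained features, the framewise probabilities are pooled over time, and the pooled vector is pushed towards a multi-hot encoding of the video's action set. The full method additionally has to recover the order and extent of the segments; all names and sizes here are assumptions.

```python
import torch
import torch.nn as nn

class FramewiseScorer(nn.Module):
    """Per-frame action scores on top of frozen pre-trained features."""
    def __init__(self, feat_dim=2048, num_actions=48):
        super().__init__()
        # Temporal convolution over the sequence of per-frame features.
        self.temporal = nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2)
        self.classifier = nn.Conv1d(256, num_actions, kernel_size=1)

    def forward(self, feats):                # feats: (B, feat_dim, T)
        h = torch.relu(self.temporal(feats))
        return self.classifier(h)            # logits: (B, num_actions, T)

def set_loss(logits, action_set):
    """Pool framewise probabilities over time and match the label set."""
    probs = torch.sigmoid(logits)            # (B, C, T)
    video_probs = probs.max(dim=2).values    # max over time -> (B, C)
    return nn.functional.binary_cross_entropy(video_probs, action_set)

# Toy usage: one video, 100 frames of 2048-d features (e.g. from a
# frozen backbone), 48 action classes, actions {3, 7, 19} present.
feats = torch.randn(1, 2048, 100)
action_set = torch.zeros(1, 48)
action_set[0, [3, 7, 19]] = 1.0
model = FramewiseScorer()
loss = set_loss(model(feats), action_set)
loss.backward()
```

Max-pooling over time is the simplest way to connect framewise predictions to a video-level set; the paper's set-constrained formulation is considerably more involved.
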
If you are interested in finding out more about the paper and learning the details of the proposed approach, check out the paper at this link. Information about the event is provided at this link, and you can watch the recorded video of the event in the YouTube stream below. I am going to post about subsequent events in the future, so stay tuned for more cool stuff.
