Exploiting Rich Long-term Memory Context in Spatiotemporal Vision Tasks without Annotations
Video is one of the richest and most abundantly produced data types. Due to its high dimensionality and complexity, modeling spatiotemporal vision tasks with classical approaches is challenging. Recent research has therefore focused on solving these tasks with deep learning techniques. Supervised learning has been the most successful of these techniques, but it requires annotations to train the model. As the volume of video data grows rapidly, it is impractical to create labels for every task and every pattern of interest. This research aims to address this annotation scarcity in spatiotemporal vision tasks with self-supervised and unsupervised learning paradigms. To exploit memory context, recent work has used various internal memory modules. However, these internal memory modules do not capture rich long-term memory elements. This research aims to address this limitation by introducing memory networks. Memory networks contain an external memory module synchronized with the deep learning architecture. Unlike internal memory modules, they capture rich long-term past knowledge. Moreover, these memory modules also yield excellent qualitative results, demonstrating their transparency and explainability. To achieve both aims, this proposal suggests testing various hypotheses derived from memory networks and the related literature on three different spatiotemporal vision tasks: (i) self-supervised video object segmentation (Self-VOS), (ii) video prediction, and (iii) unsupervised video anomaly detection.
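To make the notion of an external memory module concrete, the sketch below shows a minimal soft-attention read from a bank of stored key-value slots, the basic mechanism underlying memory networks. This is an illustrative NumPy sketch under assumed conventions (scaled dot-product similarity, softmax addressing), not the specific architecture proposed here; the function name `memory_read` and the toy dimensions are hypothetical.

```python
import numpy as np

def memory_read(query, keys, values):
    """Soft attention read from an external memory bank.

    query:  (d,)   current feature, e.g. a frame embedding
    keys:   (N, d) addressing keys of N stored memory slots
    values: (N, d) content stored in the memory slots
    Returns the retrieved context vector and the attention weights.
    """
    # Scaled dot-product similarity between the query and every slot key.
    scores = keys @ query / np.sqrt(query.shape[0])
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over slots
    context = weights @ values                        # weighted read-out
    return context, weights

# Toy usage: 4 memory slots holding 8-dimensional features.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
values = rng.normal(size=(4, 8))
query = keys[2]                     # query resembling one stored slot
context, weights = memory_read(query, keys, values)
print(context.shape, weights.sum())
```

Because the weights are a distribution over slots, they can be inspected directly, which is one source of the transparency and explainability mentioned above: the model's read-out explicitly reveals which past elements it attended to.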