3.3.3 Embodied Visual Recognition
Passive or fixed agents may fail to recognize objects in a scene when those objects are partially or heavily occluded. Embodiment comes to the rescue here: it gives the agent the ability to move through the environment and actively control its viewing position and angle, resolving ambiguity in object shape and semantics.
Jayaraman and Grauman [37] set out to learn representations that exploit the link between how an agent moves and how that motion changes its visual surroundings. To do this, they used raw unlabeled video together with an external GPS sensor that provided the agent's coordinates, and trained their model to learn a representation tying the two together. Afterward, the agent could predict the outcome of its future actions and anticipate how the scene would look after moving forward or turning to the side.
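To make the idea concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' actual architecture: an encoder embeds the current frame, and a predictor, given that embedding plus the recorded ego-motion, is trained to match the embedding of the frame the agent actually observed next. All module names, dimensions, and the three-component motion vector are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Maps an RGB frame to a compact feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class MotionPredictor(nn.Module):
    """Predicts the next frame's features from the current features
    plus the agent's ego-motion (hypothetically: dx, dy, d_heading)."""
    def __init__(self, feat_dim=128, motion_dim=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + motion_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, feat, motion):
        return self.mlp(torch.cat([feat, motion], dim=1))

# One self-supervised training step on a (frame_t, motion, frame_t1) triple.
encoder, predictor = FrameEncoder(), MotionPredictor()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

frame_t = torch.randn(8, 3, 64, 64)   # current view (batch of 8)
frame_t1 = torch.randn(8, 3, 64, 64)  # view observed after moving
motion = torch.randn(8, 3)            # ego-motion from the GPS/odometry signal

pred = predictor(encoder(frame_t), motion)
loss = F.mse_loss(pred, encoder(frame_t1).detach())  # match the observed outcome
opt.zero_grad(); loss.backward(); opt.step()
```

The supervision here comes entirely from the video stream itself: the "label" for each prediction is simply what the agent saw after it moved, which is why no manual annotation is required.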
This was powerful; in a sense, the agent developed imagination. There was still an issue, however: the agent was being fed prerecorded video as input, learning much like the passive kitten in the kitten carousel experiment described above. Following up on this problem, the authors proposed to train an agent that observes any given object from an arbitrary angle and then predicts, or rather imagines, the remaining views by learning the representation in a self-supervised manner [38].
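A minimal sketch of this kind of view imagination, assuming a simple encoder-decoder and a query angle encoded as sines and cosines, might look as follows. This is an illustration of the general technique, not the model of [38]; every layer size and the two-angle (azimuth, elevation) query are hypothetical.

```python
import torch
import torch.nn as nn

class ViewImaginer(nn.Module):
    """Encode one view of an object, then decode what it would
    plausibly look like from a queried viewing angle."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Flatten(), nn.Linear(64 * 16 * 16, feat_dim),
        )
        # The query angle enters as (sin, cos) of azimuth and elevation.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + 4, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, view, query_angle):
        code = self.encoder(view)  # a single code summarizing the object
        return self.decoder(torch.cat([code, query_angle], dim=1))

model = ViewImaginer()
seen = torch.rand(4, 3, 64, 64)        # the one observed view
angles = torch.rand(4, 2) * 6.2832     # hypothetical azimuth/elevation queries
query = torch.cat([angles.sin(), angles.cos()], dim=1)
target = torch.rand(4, 3, 64, 64)      # ground-truth view at that angle
loss = nn.functional.mse_loss(model(seen, query), target)
loss.backward()  # supervision comes "for free" from multi-view data
```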
Up to this point, the agent made no use of the sounds of its surroundings, whereas humans experience the world in a multisensory manner. We can see, hear, smell, and touch all at the same time, extracting and using whatever information is relevant to the task at hand. That said, understanding and learning the sounds of the objects present in a scene is not easy, since all the sounds overlap and arrive through a single-channel sensor. This is often framed as an audio source separation problem, on which a great deal of work has been done in the literature [39, 43].
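One common family of approaches, sketched below under assumed dimensions and layer choices rather than any specific paper's model, is mask-based separation: the mixture's spectrogram is computed, a network predicts one soft mask per source, and each masked spectrogram is inverted back to a waveform while reusing the mixture's phase.

```python
import torch
import torch.nn as nn

class MaskSeparator(nn.Module):
    """Minimal mask-based source separation sketch: predict one soft
    spectrogram mask per source and apply it to the mixture's STFT."""
    def __init__(self, n_fft=512, n_sources=2):
        super().__init__()
        freq_bins = n_fft // 2 + 1
        self.n_fft, self.n_sources = n_fft, n_sources
        self.net = nn.Sequential(
            nn.Linear(freq_bins, 256), nn.ReLU(),
            nn.Linear(256, freq_bins * n_sources),
        )

    def forward(self, mixture):
        # mixture: (batch, samples) mono waveform from a single-channel mic
        window = torch.hann_window(self.n_fft)
        spec = torch.stft(mixture, self.n_fft, hop_length=self.n_fft // 4,
                          window=window, return_complex=True)   # (B, F, T)
        mag = spec.abs().transpose(1, 2)                         # (B, T, F)
        masks = self.net(mag).view(*mag.shape[:2], self.n_sources, -1)
        masks = masks.softmax(dim=2)           # masks sum to 1 over sources
        sources = []
        for s in range(self.n_sources):
            masked = masks[:, :, s, :].transpose(1, 2) * spec  # keep mixture phase
            sources.append(torch.istft(masked, self.n_fft,
                                       hop_length=self.n_fft // 4,
                                       window=window))
        return torch.stack(sources, dim=1)     # (B, n_sources, samples)

mix = torch.randn(2, 16000)        # two 1-second mono clips at 16 kHz
separated = MaskSeparator()(mix)   # (2, 2, 16000)
```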
Next, it was reinforcement learning's turn to make a difference. Policies must be learned to help agents move around a scene; this is the task of active recognition [44, 48]. The policy is learned jointly with the other tasks and representations, and it tells the agent where and how to move strategically in order to recognize objects faster [49, 50].
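The sketch below illustrates how such joint training could be wired up, using a recurrent state that accumulates evidence across glimpses, a classification head, and a policy head trained with a simple REINFORCE-style objective. The glimpse budget, reward shaping, and all module names are assumptions for illustration, not the setup of [49, 50]; environment interaction is stubbed out with random tensors.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActiveRecognizer(nn.Module):
    """Joint active recognition sketch: one branch classifies from the
    accumulated views, another (the policy) picks the next move."""
    def __init__(self, feat_dim=128, n_classes=10, n_moves=4):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, 256)   # aggregates evidence across views
        self.classifier = nn.Linear(256, n_classes)
        self.policy = nn.Linear(256, n_moves)  # e.g. turn left/right, move fwd/back

    def step(self, view_feat, hidden):
        hidden = self.rnn(view_feat, hidden)
        return (self.classifier(hidden),
                Categorical(logits=self.policy(hidden)), hidden)

# One REINFORCE-style episode (environment calls are stand-ins).
agent = ActiveRecognizer()
hidden = torch.zeros(1, 256)
label = torch.tensor([3])                    # true object class
log_probs, cls_losses = [], []
for t in range(5):                           # hypothetical budget of 5 glimpses
    view_feat = torch.randn(1, 128)          # stand-in for an encoded observation
    logits, move_dist, hidden = agent.step(view_feat, hidden)
    action = move_dist.sample()              # where to look next
    log_probs.append(move_dist.log_prob(action))
    cls_losses.append(nn.functional.cross_entropy(logits, label))
    # ...the environment would now execute `action` and return a new view...

# Reward the movement policy by whether the final prediction was correct.
reward = (logits.argmax(1) == label).float().item() - 0.5
loss = (torch.stack(cls_losses).mean()
        - reward * torch.stack(log_probs).sum())
loss.backward()   # classification and policy are trained together
```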
Results show that learned policies do help the agent achieve better visual recognition performance, and that agents strategize their future moves along paths that usually differ from the shortest ones [51].