Since little computational power was available at the time of this shift, a major challenge for researchers was the trade-off between simplicity and the ability to operate in complex environments. An extensive body of work has explored ways to exploit natural body dynamics, the materials used in the modules, and their morphologies so that robots can move, grasp, and manipulate items without sophisticated processing units [25, 27]. Unsurprisingly, the robots that could exploit their own physical properties and those of the environment were more energy-efficient, but they had their own limitations; the inability to generalize well to complex environments was a major drawback. They were, however, fast, since machines with large processing units needed considerable time to think, plan their next action, and then move their rigid, non-smooth actuators.
Nowadays, a large part of these issues has been solved, and we can see extremely fast, smooth, naturally moving robots capable of performing many types of maneuvers [28]. Still, it is foreseen that advances in artificial muscles, joints, and tendons will push this progress even further.
3.3 Breakdown of Embodied AI
In this section, we try to categorize the broad range of research that has been conducted in the field of embodied AI. Given this diversity, each subsection is necessarily abstract and selective and reflects the authors' personal opinion.
3.3.1 Language Grounding
Communication between machines and humans has always been a topic of interest. As time goes on, more and more aspects of our lives are controlled by AIs, so it is crucial to have ways to talk with them, whether to give new instructions or to receive answers. Since we are talking about general, day-to-day machines, we want this interface to be higher level than programming languages and closer to spoken language. To achieve this, machines must be capable of relating language to actions and to the world. Language grounding is the field that tries to tackle this problem by mapping natural language instructions to robot behavior.
Hermann et al. show that this can be achieved by rewarding an agent upon successful execution of written instructions in a 3D environment, using a combination of unsupervised learning and reinforcement learning [29]. They also argue that, after training, their agent generalizes well: it can interpret new, unseen instructions and operate in unfamiliar situations.
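To make the training signal concrete, the following is a minimal sketch of an instruction-conditioned reward loop: the agent only receives a reward when the stated goal is satisfied. The environment, instruction format, and random policy here are hypothetical toys, not the setup used in [29]; only the idea of rewarding successful execution of a written instruction is taken from the text.

```python
import random

# Illustrative objects the instruction can refer to (hypothetical).
OBJECTS = ["red square", "blue circle", "green key"]

class ToyGridEnv:
    """A 5x5 grid with objects at random cells; the agent must reach the named one."""
    def __init__(self):
        self.size = 5

    def reset(self, instruction):
        self.target = instruction.split("go to the ")[1]
        self.positions = {obj: (random.randrange(self.size), random.randrange(self.size))
                          for obj in OBJECTS}
        self.agent = (0, 0)
        return self.agent

    def step(self, action):
        dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
        x, y = self.agent
        self.agent = (min(max(x + dx, 0), self.size - 1),
                      min(max(y + dy, 0), self.size - 1))
        done = self.agent == self.positions[self.target]
        reward = 1.0 if done else 0.0   # sparse reward on successful execution
        return self.agent, reward, done

def random_policy(_observation):
    # A trained instruction-conditioned policy would replace this random choice.
    return random.choice(["up", "down", "left", "right"])

env = ToyGridEnv()
instruction = "go to the " + random.choice(OBJECTS)
obs, total = env.reset(instruction), 0.0
for _ in range(50):
    obs, reward, done = env.step(random_policy(obs))
    total += reward
    if done:
        break
print(instruction, "-> return:", total)
```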
3.3.2 Language Plus Vision
Now that we know machines can understand language, and sophisticated models exist for exactly this purpose [30], it is time to bring another sense into play. One of the most popular ways to demonstrate the potential of jointly training vision and language is image and video captioning [31, 35].
More recently, a new line of work has been introduced to take advantage of this connection. Visual question answering (VQA) [17] is the task of receiving an image along with a natural language question about that image as input and producing an accurate natural language answer as output. The beauty of this task is that both the questions and the answers can be open-ended, and the questions can target different aspects of the image, such as the objects present in it, their relationships or relative positions, colors, and background.
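As a rough illustration of how such a model can be wired together, the sketch below fuses a pooled CNN image feature with an LSTM encoding of the question and classifies over a fixed answer vocabulary. This is a generic late-fusion baseline in the spirit of the task definition in [17], not the exact architecture of that paper; the vocabulary sizes and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Late-fusion VQA baseline: image features + question features -> answer class."""
    def __init__(self, vocab_size=10000, num_answers=1000,
                 img_dim=2048, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)     # project pooled CNN features
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, img_dim) features from a pretrained CNN
        # question_tokens: (B, T) integer word indices
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                          # (B, hidden_dim)
        v = torch.relu(self.img_proj(img_feats))           # (B, hidden_dim)
        return self.classifier(torch.cat([v, q], dim=1))   # answer logits

model = SimpleVQA()
logits = model(torch.randn(2, 2048), torch.randint(1, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])
```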
Following this research, Singh et al. [36] cleverly added an optical character recognition (OCR) module to the VQA model, enabling the agent to read the text present in the image as well, answer questions about it, or use that additional context indirectly to answer the question better.
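One simple way to picture this extension, continuing the SimpleVQA sketch above, is to encode the detected OCR tokens as an extra pooled feature and concatenate it before classification. This is only a conceptual sketch; the actual model in [36] is more elaborate and can, for instance, select OCR tokens directly as answers.

```python
class OCRAwareVQA(SimpleVQA):
    """Adds a pooled embedding of OCR-detected tokens to the fused representation."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        hidden_dim = self.img_proj.out_features
        self.ocr_proj = nn.Linear(self.embed.embedding_dim, hidden_dim)
        # widen the classifier input to accept the extra OCR feature
        self.classifier[0] = nn.Linear(3 * hidden_dim, hidden_dim)

    def forward(self, img_feats, question_tokens, ocr_tokens):
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]
        v = torch.relu(self.img_proj(img_feats))
        o = torch.relu(self.ocr_proj(self.embed(ocr_tokens).mean(dim=1)))  # pooled OCR text
        return self.classifier(torch.cat([v, q, o], dim=1))
```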
One may ask where this new task stands relative to the previous one: are agents that can answer questions more intelligent than those that produce captions? The answer is yes. In [17], the authors show that VQA agents need a deeper, more detailed understanding of the image, and more reasoning, than captioning models.
3.3.3 Embodied Visual Recognition
Passive or fixed agents may fail to recognize objects in scenes when they are partially or heavily occluded. Embodiment comes to the rescue here: it offers the possibility of moving through the environment to actively control the viewing position and angle, removing ambiguity about object shapes and semantics.
Jayaraman and Grauman [37] set out to learn representations that exploit the link between how the agent moves and how its visual surroundings change. To do this, they used raw unlabeled videos along with an external GPS sensor providing the agent's coordinates and trained their model to learn a representation linking the two. After this, the agent would be able to predict the outcome of its future actions and guess how the scene would look after moving forward or turning to one side.
This was powerful; in a sense, the agent developed imagination. However, there was still an issue: the agent was being fed prerecorded video as input and was learning much like the observer kitten in the kitten carousel experiment described above. The authors therefore went after this problem and proposed training an agent that observes any given object from an arbitrary angle and then predicts, or rather imagines, the other views by learning the representation in a self-supervised manner [38].
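The core idea can be conveyed in a few lines: embed the current frame, condition a predictor on the ego-motion, and train it to match the embedding of the frame actually observed after that motion. The sketch below is a generic motion-conditioned prediction setup, assuming precomputed frame pairs and motion vectors; it is not the exact formulation of [37].

```python
import torch
import torch.nn as nn

class MotionPredictiveEncoder(nn.Module):
    """Learns frame embeddings such that the effect of ego-motion is predictable."""
    def __init__(self, feat_dim=128, motion_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(            # tiny CNN over 64x64 RGB frames
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.predictor = nn.Sequential(          # maps (z_t, motion) -> predicted z_{t+1}
            nn.Linear(feat_dim + motion_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, frame_t, frame_t1, motion):
        z_t, z_t1 = self.encoder(frame_t), self.encoder(frame_t1)
        z_pred = self.predictor(torch.cat([z_t, motion], dim=1))
        return nn.functional.mse_loss(z_pred, z_t1)   # self-supervised prediction loss

model = MotionPredictiveEncoder()
loss = model(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64), torch.randn(4, 3))
loss.backward()
```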
Up until this point, the agent does not use the sound of its surroundings, while humans experience the world in a multisensory manner: we see, hear, smell, and touch all at the same time and extract whatever information is relevant to the task at hand. That said, understanding and learning the sounds of the objects present in a scene is not easy, since all the sounds overlap and are received through a single-channel sensor. This is often treated as an audio source separation problem, on which a large body of work exists in the literature [39, 43].
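A common way to frame the separation step is mask prediction on a time-frequency representation: compute a spectrogram of the single-channel mixture, predict one soft mask per source, apply the masks, and invert back to waveforms. The sketch below, with an untrained placeholder network, shows that pipeline; it is a generic illustration rather than any specific method from [39, 43].

```python
import torch
import torch.nn as nn

N_FFT, HOP, N_SOURCES = 512, 128, 2

class MaskNet(nn.Module):
    """Predicts one soft time-frequency mask per source from the mixture magnitude."""
    def __init__(self, n_bins=N_FFT // 2 + 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, 256), nn.ReLU(),
            nn.Linear(256, n_bins * N_SOURCES),
        )

    def forward(self, mag):                      # mag: (frames, bins)
        masks = self.net(mag).view(-1, N_SOURCES, mag.shape[-1])
        return torch.softmax(masks, dim=1)       # masks over sources sum to one

mixture = torch.randn(16000)                     # 1 s of single-channel audio at 16 kHz
spec = torch.stft(mixture, N_FFT, HOP, return_complex=True)    # (bins, frames)
mag = spec.abs().T                                              # (frames, bins)

masks = MaskNet()(mag)                                          # (frames, sources, bins)
separated = []
for s in range(N_SOURCES):
    masked = masks[:, s, :].T * spec                            # apply mask per source
    separated.append(torch.istft(masked, N_FFT, HOP, length=len(mixture)))
print([w.shape for w in separated])
```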
Now it was reinforcement learning's turn to make a difference. Policies have to be learned to help agents move around a scene; this is the task of active recognition [44, 48]. The policy is learned at the same time as the other tasks and representations, and it tells the agent where and how to move strategically in order to recognize things faster [49, 50].
Results show that such policies indeed help the agent achieve better visual recognition performance, and the agents learn to strategize their future moves and paths, which are often different from the shortest paths [51].
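To make "learning where to look" concrete, here is a minimal REINFORCE-style sketch in which an agent picks a few viewpoints around an object, fuses what it sees, and is rewarded for a correct classification minus a small cost per view. The sensor stub (`render_view`) and all dimensions are hypothetical; this illustrates the joint policy/recognition training idea rather than a specific method from [44, 50].

```python
import torch
import torch.nn as nn

N_VIEWS, FEAT, N_CLASSES, MAX_GLIMPSES = 8, 64, 10, 3

view_encoder = nn.Linear(FEAT, FEAT)                 # stands in for a per-view CNN
policy = nn.Linear(FEAT, N_VIEWS)                    # which viewpoint to visit next
classifier = nn.Linear(FEAT, N_CLASSES)
optim = torch.optim.Adam([*view_encoder.parameters(),
                          *policy.parameters(), *classifier.parameters()], lr=1e-3)

def render_view(view_idx):
    """Hypothetical sensor: returns features of the object seen from this viewpoint."""
    return torch.randn(FEAT)

label = torch.tensor(3)                              # ground-truth class of this object
state = torch.zeros(FEAT)
log_probs = []
for _ in range(MAX_GLIMPSES):
    dist = torch.distributions.Categorical(logits=policy(state))
    view = dist.sample()
    log_probs.append(dist.log_prob(view))
    state = state + torch.relu(view_encoder(render_view(view.item())))  # fuse evidence

logits = classifier(state)
reward = float(logits.argmax() == label) - 0.01 * MAX_GLIMPSES   # accuracy minus view cost
loss = nn.functional.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0)) \
       - reward * torch.stack(log_probs).sum()                   # REINFORCE term
optim.zero_grad(); loss.backward(); optim.step()
```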
3.3.4 Embodied Question Answering
Embodied Question Answering brings QA into the embodied world. The task starts with an agent being spawned at a random location in a 3D environment and being asked a question whose answer can be found somewhere in that environment. To answer it, the agent must first navigate strategically to explore the environment, gather the necessary data through vision, and then answer the question once it has found what it needs [52, 53].
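Structurally, an episode can be pictured as "navigate until the agent decides to stop, then answer." The sketch below uses hypothetical `FakeEnv`, `nav_policy`, and `answerer` stubs just to show that control flow; it is not the system of [52, 53].

```python
import random

# Hypothetical stand-ins for the environment, navigation policy, and answering module.
class FakeEnv:
    def spawn(self):
        return {"rgb": None, "steps": 0}

    def step(self, obs, action):
        return dict(obs, steps=obs["steps"] + 1)

def nav_policy(obs, question):
    """Chooses a motion action, or 'stop' once it believes the answer is in view."""
    if obs["steps"] >= 20 or random.random() < 0.05:
        return "stop"
    return random.choice(["forward", "turn-left", "turn-right"])

def answerer(frames, question):
    """Maps the question plus the frames gathered along the path to an answer."""
    return "brown"

def run_episode(env, question, max_steps=100):
    obs, frames = env.spawn(), []
    for _ in range(max_steps):
        action = nav_policy(obs, question)
        if action == "stop":
            break
        obs = env.step(obs, action)
        frames.append(obs["rgb"])          # keep what the agent has seen so far
    return answerer(frames, question)

print(run_episode(FakeEnv(), "What color is the sofa in the living room?"))
```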
Following this, Das et al. [54] presented a modular approach to further enhance this process by teaching the agent to break the master policy into sub-goals that are also interpretable by humans and to execute them in order to answer the question. This proved to increase the success rate.
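The hierarchy can be pictured as a master policy that emits human-readable sub-goals ("exit-room", "find the kitchen", "answer") and hands each one to a dedicated sub-policy. The sketch below, with hand-written stubs, only illustrates this dispatch structure; in [54] the decomposition and the sub-policies are learned.

```python
# Hypothetical sub-policies, keyed by human-readable sub-goal names.
def exit_room(state):        return dict(state, room=None)
def find_room(state, room):  return dict(state, room=room)
def answer(state, question): return "microwave" if state["room"] == "kitchen" else "unknown"

def master_policy(question):
    """Decomposes the task into interpretable sub-goals (hand-written here for clarity)."""
    return [("exit-room",), ("find-room", "kitchen"), ("answer", question)]

def run(question):
    state = {"room": "bedroom"}
    for subgoal, *args in master_policy(question):
        if subgoal == "exit-room":
            state = exit_room(state)
        elif subgoal == "find-room":
            state = find_room(state, *args)
        elif subgoal == "answer":
            return answer(state, *args)

print(run("What is on the kitchen counter?"))
```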
3.3.5 Interactive Question Answering
Interactive Question Answering (IQA) is closely related to the embodied version. The main difference is that the question is designed so that the agent must interact with the environment to find the answer; for example, it may have to open the refrigerator or pick something up from a cabinet, and therefore it has to plan a series of actions conditioned on the question [55].
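In terms of the episode sketch above, the structural change is that the action set grows beyond navigation to include interactions, and the agent must decide from the question whether such actions are needed. The action names and heuristic below are hypothetical, loosely inspired by interactive environments such as the one used in [55].

```python
# The embodied QA loop only needed navigation; IQA additionally requires
# state-changing interactions before the answer becomes observable.
NAV_ACTIONS = ["forward", "turn-left", "turn-right", "stop"]
INTERACTION_ACTIONS = ["open", "close", "pick-up", "put-down"]   # applied to a target object

def needs_interaction(question):
    """Toy heuristic: questions about containers require opening something first."""
    return any(w in question.lower() for w in ("fridge", "refrigerator", "cabinet", "drawer"))

question = "Is there milk in the refrigerator?"
plan = ["forward", "forward"]
if needs_interaction(question):
    plan.append(("open", "refrigerator"))
plan.append("stop")
print(plan)
```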
3.3.6 Multi‐agent Systems