A smart city is an urban area that employs information and communication technologies (ICT) [1]: an intelligent network of connected devices and sensors that work interdependently [2, 3] and in a distributed manner [4] to continuously monitor the environment, collect data, and share it among the other assets in the ecosystem. The city uses all the available data to make real‐time decisions about its many individual components, improving the quality of life of its citizens and making the whole system more efficient, more environmentally friendly, and more sustainable [5]. This serves as a catalyst for creating a city with faster transportation, fewer accidents, enhanced manufacturing, more reliable medical services and utilities, less pollution [6], and so on. The good news is that any city, even one with traditional infrastructure, can be transformed into a smart city by integrating Internet of Things (IoT) technologies [7].
An undeniable part of a smart city is its use of smart agents. These agents vary widely in size, shape, and functionality. They can be as simple as light sensors that, together with their controllers, act as energy‐saving agents, or they can be more advanced machines with complicated controllers and interconnected components capable of tackling more demanding problems. The latter usually come with an embodiment that has numerous sensors and controllers built in, enabling them to perform high‐level, human‐level tasks such as talking, walking, seeing, and complex reasoning, along with the ability to interact with the environment. Embodied artificial intelligence is the field of study that takes a deeper look at these agents and explores how they can fit into the real world and how they can eventually act as our future community workers, personal assistants, robocops, and more.
Imagine arriving home after a long working day and seeing your home robot waiting for you at the entrance door. Although it is not the most romantic welcome, you walk up to it and ask it to make you a cup of coffee, adding two teaspoons of sugar if there is any in the cabinet. For this to become reality, the robot has to have a vast range of skills. It should be able to understand your language and translate questions and instructions into actions. It should be able to see its surroundings and recognize objects and scenes. Last but not least, it must know how to navigate a large, dynamic environment, interact with the objects within it, and be capable of long‐term planning and reasoning.
In the past few years, there has been significant progress in the fields of computer vision, natural language processing, and reinforcement learning, thanks to advancements in deep learning models. Many things that seemed impossible a few years ago are now within reach. However, most of this work has been done in isolation from other lines of research: the trained model can only take one type of data (e.g. image, text, video) as input and perform the single task it was trained for. Consequently, such a model acts as a single‐sensory machine rather than a multi‐sensory one. Moreover, for the most part, these models belong to Internet artificial intelligence (AI) rather than embodied AI. The goal of Internet AI is to learn patterns in text, images, and videos from datasets collected from the Internet.
If we zoom out and look at the way models in Internet AI are trained, we see that supervised classification is generally the method of choice. For instance, we provide a perception model with a number of dog and cat photos along with the corresponding labels; if that number is large enough, the model can successfully learn the differences between the two animals and discriminate between them. For humans, learning via flashcards falls under the same umbrella.
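To make the flashcard analogy concrete, the following is a minimal, illustrative sketch of supervised classification, assuming PyTorch is available; random tensors stand in for the labelled cat and dog photos so the example is self-contained rather than tied to any particular dataset discussed here.

```python
# Minimal sketch of supervised image classification (cat vs. dog),
# i.e. the "flashcard" style of learning described above.
# Assumption: PyTorch is installed; real labelled photos would normally
# come from a curated dataset, but random tensors stand in for them here.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """A small CNN mapping a 3x64x64 image to two class scores."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 32 -> 16
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = TinyClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for a labelled "flashcard" dataset: images plus cat/dog labels.
images = torch.randn(32, 3, 64, 64)     # batch of placeholder photos
labels = torch.randint(0, 2, (32,))     # 0 = cat, 1 = dog

for epoch in range(5):
    optimizer.zero_grad()
    logits = model(images)
    loss = loss_fn(logits, labels)      # compare prediction to label
    loss.backward()                     # learn from the mistake
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```

The essential point is that every training example pairs an input with a ground-truth label supplied by a human curator; the model never gathers or verifies information on its own.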
An extensive amount of time has been devoted in past years to gathering and building huge datasets for the imaging and language communities. Notable examples include ImageNet [8], MS COCO [9], SUN [10], Caltech‐256 [11], and Places [12] for vision tasks; SQuAD [13], GLUE [14], and SWAG [15] for language objectives; and Visual Genome [16] and VQA [17] for joint vision-and-language purposes, to name a few.
Apart from playing a pivotal role in recent advances in their main fields, these datasets have also proved useful when combined with transfer learning methods to help related disciplines such as biomedical imaging [18, 19] (see the sketch following Figure 3.1). However, the aforementioned datasets are prone to restrictions. Firstly, it can be extremely costly, in both time and money, to gather and label all the required data. Secondly, the collection has to be monitored constantly to ensure that it follows certain rules, both to avoid creating biases that could lead to erroneous results in future work [20] and to make sure that the collected data are normalized and uniform in attributes such as background, object size and position, lighting conditions, etc. In contrast, we know that in real‐world scenarios this cannot be the case, and robots have to deal with a mixture of unnormalized, noisy, irrelevant data alongside the relevant, well‐curated kind. Additionally, an embodied agent can interact with objects in the wild (e.g. picking an object up and looking at it from another angle) and use its other senses, such as smell and hearing, to collect information (Figure 3.1).
Figure 3.1 Embodied AI in smart cities.
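As an illustration of the transfer learning mentioned above, the sketch below reuses an ImageNet‐pretrained backbone for a hypothetical two‐class downstream task (e.g. a small biomedical imaging problem); it assumes torchvision 0.13 or newer, and the class count and data batch are placeholders rather than details taken from the cited works.

```python
# Minimal sketch of transfer learning: reuse features learned on a large
# curated dataset (ImageNet) for a smaller downstream task.
# Assumptions: torchvision >= 0.13 for the `weights=` API; the downstream
# dataset, class count, and batch below are illustrative placeholders.
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained backbone.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained features so only the new head is trained.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final layer with a head sized for the new task.
num_classes = 2  # placeholder class count
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch; in practice this would come from the downstream dataset.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))

backbone.train()
optimizer.zero_grad()
loss = loss_fn(backbone(images), labels)
loss.backward()
optimizer.step()
print(f"fine-tuning loss: {loss.item():.3f}")
```

Because only the small replacement head is trained, the downstream task can get by with far less labelled data, which is precisely why large curated datasets have been so valuable to neighboring disciplines.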
Humans do learn from interaction, and it is a must for true intelligence in the real world. In fact, this applies not only to humans but also to animals. In the kitten carousel experiment [21], Held and Hein demonstrated this beautifully. They studied the visual development of two kittens in a carousel over time. One kitten could touch the ground and control its motion within the restrictions of the device, while the other was just a passive observer. At the end of the experiment, they found that the visual development of the former kitten was normal, whereas that of the latter was not, even though both saw the same thing. This demonstrates that being able to physically experience the world and interact with it is a key element of learning [22].
The goal of embodied AI is to bring interaction and the simultaneous use of multiple senses into play, enabling the robot to continuously learn in a lightly supervised or even unsupervised way in a rich, dynamic environment.
3.2 Rise of the Embodied AI
In the mid‐1980s, a major paradigm shift took place toward embodiment, and computer science began moving from purely theoretical algorithms and approaches toward more practical ones. Embedded systems started to appear in all kinds of forms to aid humans in everyday life. Controllers for trains, airplanes, elevators, and air conditioners, as well as software for translation and audio manipulation, are some of the most important examples [23].
Embodied artificial intelligence is a broad term, and those successes were certainly great ones to start with. Yet there was clearly huge room for improvement. Theoretically, the ultimate goal of AI is not only to master any algorithm or task it is given but also to gain the ability to multitask and reach human‐level intelligence, which, as mentioned, requires meaningful interaction with the real world. There are many specialized robots for a vast set of tasks, especially in large industries, that can do their assigned task to perfection, be it cutting metals, painting, soldering circuits, and many more. However, until a single machine emerges that can perform different tasks, or at least a small subset of them, by itself rather than just by following orders, it cannot be called intelligent.
Humanoids are the first thing that comes to mind when we talk about robots with intelligence. Although human‐level intelligence is the ultimate goal, it is not the only form of intelligence on Earth. Other animals, such as insects, have their own kind of intelligence, and because they are relatively simple compared to humans, they are a very good place to start.
Rodney Brooks famously argued that it took evolution much longer to create insects from scratch than to get from insects to human‐level intelligence. Consequently, he suggested that such simpler biologically inspired robots should be tackled first on the road to building much more complex ones. Genghis, a six‐legged walking robot [24], is one of his contributions to this field.