34 Gao, L., Guo, Z., Zhang, H. et al. (2017). Video captioning with attention‐based LSTM and semantic consistency. IEEE Transactions on Multimedia 19 (9): 2045–2055.
35 Yang, Y., Zhou, J., Ai, J. et al. (2018). Video captioning by adversarial LSTM. IEEE Transactions on Image Processing 27 (11): 5600–5611.
36 Singh, A., Natarajan, V., Shah, M. et al. (2019). Towards VQA models that can read. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8317–8326.
37 Jayaraman, D. and Grauman, K. (2017). Learning image representations tied to egomotion from unlabeled video. International Journal of Computer Vision 125 (1–3): 136–161.
38 Jayaraman, D., Gao, R., and Grauman, K. (2018). ShapeCodes: self‐supervised feature learning by lifting views to viewgrids. Proceedings of the European Conference on Computer Vision (ECCV), pp. 120–136.
39 Gao, R., Feris, R., and Grauman, K. (2018). Learning to separate object sounds by watching unlabeled video. Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–53.
40 Parekh, S., Essid, S., Ozerov, A. et al. (2017). Guiding audio source separation by video object information. 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE, pp. 61–65.
41 Pu, J., Panagakis, Y., Petridis, S., and Pantic, M. (2017). Audio‐visual object localization and separation using low‐rank and sparsity. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 2901–2905.
42 Parekh, S., Essid, S., Ozerov, A. et al. (2017). Motion informed audio source separation. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 6–10.
43 Asali, E., Shenavarmasouleh, F., Mohammadi, F. et al. (2020). DeepMSRF: A novel deep multimodal speaker recognition framework with feature selection. arXiv preprint arXiv:2007.06809.
44 Aloimonos, J., Weiss, I., and Bandyopadhyay, A. (1988). Active vision. International Journal of Computer Vision 1 (4): 333–356.
45 Ballard, D.H. (1991). Animate vision. Artificial Intelligence 48 (1): 57–86.
46 Ballard, D.H. and Brown, C.M. (1992). Principles of animate vision. CVGIP: Image Understanding 56 (1): 3–21.
47 Bajcsy, R. (1988). Active perception. Proceedings of the IEEE 76 (8): 966–1005.
48 Roy, S.D., Chaudhury, S., and Banerjee, S. (2004). Active recognition through next view planning: a survey. Pattern Recognition 37 (3): 429–446.
49 Tung, H.‐Y.F., Cheng, R., and Fragkiadaki, K. (2019). Learning spatial common sense with geometry‐aware recurrent networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2595–2603.
50 Jayaraman, D. and Grauman, K. (2018). End‐to‐end policy learning for active visual categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (7): 1601–1614.
51 Yang, J., Ren, Z., Xu, M. et al. (2019). Embodied visual recognition.
52 Das, A., Datta, S., Gkioxari, G. et al. (2018). Embodied question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2054–2063.
53 Wijmans, E., Datta, S., Maksymets, O. et al. (2019). Embodied question answering in photorealistic environments with point cloud perception. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6659–6668.
54 Das, A., Gkioxari, G., Lee, S. et al. (2018). Neural modular control for embodied question answering. arXiv preprint arXiv:1810.11181.
55 Gordon, D., Kembhavi, A., Rastegari, M. et al. (2018). IQA: Visual question answering in interactive environments. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4089–4098.
56 Hinchey, M.G., Sterritt, R., and Rouff, C. (2007). Swarms and swarm intelligence. Computer 40 (4): 111–113.
57 Wang, J., Feng, Z., Chen, Z. et al. (2018). Bandwidth‐efficient live video analytics for drones via edge computing. 2018 IEEE/ACM Symposium on Edge Computing (SEC), IEEE, pp. 159–173.
58 Camazine, S., Visscher, P.K., Finley, J., and Vetter, R.S. (1999). House‐hunting by honey bee swarms: collective decisions and individual behaviors. Insectes Sociaux 46 (4): 348–360.
59 Langton, C.G. (1995). Artificial Life: An Overview. Cambridge, MA: MIT Press.
60 Hara, F. and Pfeifer, R. (2003). Morpho‐Functional Machines: The New Species: Designing Embodied Intelligence. Springer Science & Business Media.
61 Murata, S., Kamimura, A., Kurokawa, H. et al. (2004). Self‐reconfigurable robots: platforms for emerging functionality. In: Embodied Artificial Intelligence (ed. F. Iida, R. Pfeifer, L. Steels et al.), 312–330. Springer.
62 Steels, L. (2001). Language games for autonomous robots. IEEE Intelligent Systems 16 (5): 16–22.
63 Steels, L. (2003). Evolving grounded communication for robots. Trends in Cognitive Sciences 7 (7): 308–312.
64 Durrant‐Whyte, H. and Bailey, T. (2006). Simultaneous localization and mapping: Part I. IEEE Robotics and Automation Magazine 13 (2): 99–110.
65 Gupta, S., Davidson, J., Levine, S. et al. (2017). Cognitive mapping and planning for visual navigation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625.
66 Zhu, Y., Mottaghi, R., Kolve, E. et al. (2017). Target‐driven visual navigation in indoor scenes using deep reinforcement learning. 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp. 3357–3364.
67 Pomerleau, D.A. (1989). ALVINN: An autonomous land vehicle in a neural network. In: Advances in Neural Information Processing Systems, 305–313. https://apps.dtic.mil/sti/pdfs/ADA218975.pdf.
68 Sadeghi, F. and Levine, S. (2016). CAD2RL: Real single‐image flight without a single real image. arXiv preprint arXiv:1611.04201.
69 Wu, Y., Wu, Y., Gkioxari, G., and Tian, Y. (2018). Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209.
70 Kolve, E., Mottaghi, R., Han, W. et al. (2017). AI2‐THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474.
71 Xia, F., Zamir, A.R., He, Z. et al. (2018). Gibson Env: real‐world perception for embodied agents. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9068–9079.
72 Yan, C., Misra, D., Bennett, A. et al. (2018). CHALET: Cornell house agent learning environment. arXiv preprint arXiv:1801.07357.
73 Savva, M., Chang, A.X., Dosovitskiy, A. et al. (2017). MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931.
74 Savva, M., Kadian, A., Maksymets, O. et al. (2019). Habitat: A platform for embodied AI research. Proceedings of the IEEE International Conference on Computer Vision, pp. 9339–9347.
75 Datta, S., Maksymets, O., Hoffman, J. et al. (2020). Integrating egocentric localization for more realistic point‐goal navigation agents. arXiv preprint arXiv:2009.03231.
76 Song, S., Yu, F., Zeng, A. et al. (2017). Semantic scene completion from a single depth image. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1746–1754.
77 Chang, A., Dai, A., Funkhouser, T. et al. (2017). Matterport3D: learning from RGB‐D data in indoor environments. arXiv preprint arXiv:1709.06158.
78 Jaderberg,