Traditional digital maps have a refresh cycle of 6–12 months. To ensure that HD maps contain the most up-to-date information, however, their refresh cycle needs to be shortened to no more than one week. As a result, generating, maintaining, and operating HD maps can cost upwards of millions of dollars per year for a mid-size city.
1.2.3 Computing Systems
The planning and control algorithms and the object recognition and tracking algorithms have very different behavioral characteristics, which call for different kinds of processors; HD maps, on the other hand, stress the memory subsystem [9]. Therefore, it is imperative to design a computing hardware system that addresses all of these demands within a limited computing-resource and power budget. For instance, as indicated in [9], an early design of an autonomous driving computing system was equipped with an Intel® Xeon E5 processor and four to eight Nvidia® K80 graphics processing unit (GPU) accelerators, connected by a Peripheral Component Interconnect Express (PCIe) bus. At its peak, the whole system was capable of delivering 64.5 Tera Operations Per Second (TOPS) but consumed about 3000 W, consequently generating an enormous amount of heat. Also, at a cost of $30 000, the whole solution would be unaffordable (and unacceptable) to the average consumer.
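To put these figures in perspective, the following back-of-the-envelope sketch computes the energy and cost efficiency of that early reference design, using only the numbers quoted above:

```python
# Efficiency of the early reference design quoted above:
# 64.5 TOPS peak, ~3000 W power draw, ~$30 000 system cost.
peak_tops = 64.5        # Tera Operations Per Second
power_w = 3000.0        # system power draw in watts
cost_usd = 30_000.0     # approximate system cost

gops_per_watt = peak_tops * 1000.0 / power_w   # ~21.5 GOPS/W
usd_per_tops = cost_usd / peak_tops            # ~$465 per TOPS

print(f"Energy efficiency: {gops_per_watt:.1f} GOPS/W")
print(f"Cost efficiency:   ${usd_per_tops:.0f} per TOPS")
```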
1.3 Achieving Affordability and Reliability
Many major autonomous driving companies, such as Waymo, Baidu, and Uber, are engaged in a competition to design and deploy the ultimate ubiquitous autonomous vehicle, one that can operate reliably and affordably even in the most extreme environments. Yet, we have just seen that the cost of all the sensors could be over $100 000, with the cost of the computing system another $30 000, resulting in an extremely high cost for each vehicle: a demo autonomous vehicle can easily cost over $800 000 [10]. Further, beyond the unit cost, it is still unclear how the operational costs for HD map creation and maintenance will be covered.
In addition, even with the most advanced sensors, having autonomous vehicles coexist with human-driven vehicles in complex traffic conditions remains a dicey proposition. As a result, unless we can significantly reduce the costs of sensors, computing systems, and HD maps, and dramatically improve localization, perception, and decision-making algorithms in the next few years, autonomous driving will not be universally adopted.
To address these problems, we have developed a reliable autonomous vehicle for low-speed scenarios, such as university campuses, industrial parks, and areas with limited traffic [11, 12]. This approach starts with low speeds to ensure safety, thus allowing immediate deployment. Then, as the technology improves and operational experience accumulates, higher-speed scenarios will be tackled, until the vehicle's performance eventually equals that of a human driver in any driving scenario. The keys to achieving affordability and reliability are sensor fusion, modular design, and high-precision visual maps (HPVMs).
1.3.1 Sensor Fusion
Using LiDAR for localization or perception is extremely expensive and may not be reliable. To achieve affordability and reliability, multiple affordable sensors (cameras, GNSS receivers, wheel encoders, radars, and sonars) can be used and their data fused synergistically. Each of these sensors has its own characteristics, drawbacks, and advantages, and they complement one another such that when one fails or otherwise malfunctions, others can immediately take over to ensure system reliability. With this sensor fusion approach, sensor costs are kept under $2000.
The localization subsystem relies on GNSS receivers to provide an initial localization with sub-meter accuracy. Visual odometry further improves the localization accuracy down to the decimeter level. In addition, wheel encoders can be used to track the vehicle's movements in case of GNSS receiver and camera failures. Note that visual odometry deduces position changes by examining the overlap between two frames. However, when a sudden motion is applied to the vehicle, such as a sharp turn, visual odometry may fail to maintain localization because there is too little overlap between two consecutive frames.
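As a rough illustration of this fallback behavior (not the actual localization pipeline), the sketch below fuses whichever sources are currently healthy, weighting GNSS and visual odometry by their reported accuracies and falling back to wheel-encoder dead reckoning when both fail; all names and accuracy figures are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Fix:
    x: float           # position estimate (m) in a local map frame
    y: float
    accuracy_m: float  # estimated 1-sigma accuracy in meters

def fuse_localization(gnss: Optional[Fix],
                      visual_odom: Optional[Fix],
                      wheel_odom: Fix) -> Fix:
    """Mirror the fallback order described above: GNSS gives a coarse
    (sub-meter) prior, visual odometry refines it to decimeter level, and
    wheel odometry keeps tracking when both GNSS and the cameras fail."""
    if gnss is not None and visual_odom is not None:
        # Weight each source by the inverse of its error variance
        # (a minimal stand-in for a full Kalman-filter update).
        w_vo = 1.0 / visual_odom.accuracy_m ** 2
        w_gnss = 1.0 / gnss.accuracy_m ** 2
        w = w_vo + w_gnss
        return Fix((w_vo * visual_odom.x + w_gnss * gnss.x) / w,
                   (w_vo * visual_odom.y + w_gnss * gnss.y) / w,
                   (1.0 / w) ** 0.5)
    if visual_odom is not None:
        return visual_odom
    if gnss is not None:
        return gnss
    # Last resort: dead reckoning from the wheel encoders.
    return wheel_odom
```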
The active perception subsystem seeks to assist the vehicle in understanding its environment. It uses a combination of computer vision and millimeter-wave (mmWave) radars to detect and track static or moving objects within a 50 m range, and based on this understanding the vehicle can make action decisions that ensure a smooth and safe trip. With stereo vision, not only can objects such as pedestrians and moving vehicles be easily recognized, but the distance to these detected objects can also be accurately pinpointed. In addition, mmWave radars can detect and track fast-moving objects and their distances under all weather conditions.
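For intuition, the distance to a stereo-matched point follows directly from its disparity. The sketch below uses placeholder intrinsics (focal length and baseline), not the calibration of any particular camera module:

```python
def stereo_depth_m(disparity_px: float,
                   focal_length_px: float = 700.0,  # assumed focal length
                   baseline_m: float = 0.12) -> float:  # assumed baseline
    """Triangulate the distance to a matched point from its disparity:
    depth = focal_length * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px

# With these placeholder intrinsics, a 1.7 px disparity corresponds to
# roughly 49 m, near the 50 m perception range mentioned above.
print(f"{stereo_depth_m(1.7):.1f} m")
```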
The passive perception subsystem aims to detect any immediate danger and acts as the vehicle's last line of defense. It covers the near field, i.e. a range of 0–5 m around the vehicle, using a combination of mmWave radars and sonars: radars are very good at detecting moving objects, while sonars are very good at detecting static objects. Depending on the current vehicle speed, when something is detected within the near field, different policies are put into place to ensure the safety of the vehicle.
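A minimal sketch of such a speed-dependent policy is shown below; the time-to-collision thresholds and action names are illustrative assumptions, not production settings:

```python
def near_field_policy(detection_range_m: float, speed_mps: float) -> str:
    """Choose a reaction for an object reported by radar or sonar in the
    0-5 m near field, based on the current vehicle speed."""
    if detection_range_m > 5.0:
        return "ignore"            # outside the near field
    # Time-to-collision if the vehicle kept its current speed.
    ttc_s = detection_range_m / max(speed_mps, 0.1)
    if ttc_s < 1.0:
        return "emergency_brake"   # immediate danger: hard stop
    if ttc_s < 2.5:
        return "slow_down"         # reduce speed and re-evaluate
    return "monitor"               # keep tracking the object
```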
1.3.2 Modular Design
In the recent past, autonomous driving computing systems have tended to be costly, but affordable computing solutions are possible [9]. They are made possible by modular design principles that push computing to the sensor end so as to reduce the computing demands on the main computing units. Indeed, a quad-camera module such as the DragonFly sensor module [11] alone generates image data at a rate of 400 Mbps. If all of this sensor data were transferred to the main computing unit, that unit would have to be extremely complex, with many consequences in terms of reliability, power, cost, etc.
Our approach is more practical: it breaks the functional units into modules and has each module perform as much computing as possible. This reduces the burden on the main computing system and simplifies its design, with consequently higher reliability. More specifically, a GPU SoM (System on Module) is embedded in the DragonFly module to extract features from the raw images. Then, only the extracted features are sent to the main computing unit, reducing the data transfer rate 1000-fold. Applying the same design principles to the GNSS receiver subsystem and the radar subsystem reduces the cost of the whole computing system to less than $2000.
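The sketch below illustrates the idea of computing at the sensor end, using OpenCV's ORB detector as a stand-in for whatever feature extractor actually runs on the SoM; the function name and parameters are assumptions for illustration only:

```python
import cv2  # OpenCV, assumed available on the camera module's SoM

def extract_features(frame_bgr):
    """Run feature extraction next to the sensor and return only the
    compact results, so the main computing unit never sees raw pixels."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=500)  # stand-in feature detector
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    # Each ORB descriptor is 32 bytes, so 500 of them total ~16 KB per
    # frame -- a small fraction of the megabytes of raw pixels, which is
    # the kind of reduction the modular design relies on.
    return [kp.pt for kp in keypoints], descriptors
```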
1.3.3 Extending Existing Digital Maps
Creating and maintaining HD maps is another important component of deployment costs. Crowd-sourcing the data for creating HD maps has been proposed. However, this would require vehicles with LiDAR units, and we have already seen that LiDARs are extremely expensive and thus not ready for large-scale deployment. On the other hand, crowd-sourcing visual data is a very practical solution as many cars today are already equipped with cameras.
Hence, instead of building HD maps from scratch, our philosophy is to enhance existing digital maps with visual information to achieve decimeter-level accuracy. The resulting maps are called HPVMs. To effectively help with vehicle localization, HPVMs consist of multiple layers:
1 The bottom layer can be any of the existing digital maps, such as Open Street Map; this bottom layer has a resolution of about 1 m.
2 The second layer is the ground feature layer. It records the visual features of the road surfaces to improve the mapping resolution to the decimeter level. The ground feature layer is particularly useful in crowded city environments, where the surroundings are filled with other vehicles and pedestrians.
3 The third layer is the spatial feature layer, which records the visual features from the environments; this provides more visual features compared with the ground feature layer. It also has a mapping resolution at the