HOME

Project Overview

The grand goal of “CogniVision” is to enable the unprecedented capability of ubiquitous real-time vision through novel silicon chips that are untethered, always-on and nearly-perpetual, ultra-miniaturized (<100 mm³), and inexpensive (~$1). From a broad viewpoint, CogniVision introduces a new class of cameras that are “cognitive” and “attentive”. CogniVision cameras are cognitive as they constantly make sense of the scene through extremely energy-efficient circuits for best-in-class machine learning algorithms, i.e., deep learning based on convolutional networks. In the last few years, deep learning and convolutional networks have been extensively demonstrated to achieve outstanding accuracy and to exhibit an uncommon degree of flexibility, as they can be restructured (e.g., adjusting the number of layers and the weight values) to perform a very wide range of vision tasks. Indeed, deep learning has become the de facto standard framework for image and video processing, with remarkable success in content understanding, face detection, object detection and tracking, image classification and segmentation, pedestrian detection, loiterer detection, and abandoned luggage detection.

However, a camera whose network is fixed at design time would preclude important capabilities, such as the ability to:

1) respond to time-varying requirements of the “cloud” server gathering the output of many cameras (e.g., a request to perform a new task or to occasionally send entire frames, triggered by events captured by neighboring cameras, based on the cloud’s global understanding of the scene);

2) upgrade the neural network, using its innate ability to be refined via retraining with new data;

3) save power when degraded processing quality (e.g., approximations) is tolerable in less visually demanding tasks (e.g., optical character recognition, which is simpler than object detection).

A suitable approach to achieve these capabilities is to allow the cloud to push neural network configurations onto individual cameras, which in turn need to be responsive and receptive to the related commands from the cloud. Accordingly, cognitive cameras also need to be attentive, i.e., listen to commands wirelessly sent by the cloud, hence requiring an always-on radio receiver. In general, nearly-perpetual always-on operation is pursued by harvesting power from the environment, which limits the power consumption of CogniVision cameras to ~1 mW in order to maintain the system volume well below 100 mm³ (e.g., power provided by a 0.1-mm thick, 5-cent, 1-2 cm wide organic photovoltaic foil attached to a wall, with a stacked 0.4-mm equally sized battery and an on-foil printed antenna, all commercially available).

Reducing the power consumption of cognitive cameras down to the 1 mW range is the fundamental objective of this project. This entails a power reduction by at least 20-30X compared to the most power-efficient existing cameras that constantly monitor the scene with resolution and frame rate adequate for distributed monitoring and surveillance (e.g., VGA resolution, 30 frames/s). Cognitive cameras with power down to 1 mW will be enabled by drastically limiting the amount of data transmitted wirelessly to the server cloud that makes sense of the scene, thus substantially reducing the traditionally large power due to the transmission of entire video frames (e.g., 40-50 mW for MPEG-compressed VGA frames over Bluetooth Low Energy). This is accomplished by embedding substantial sensemaking capability (e.g., object detection) into the camera silicon chip, leveraging the recent rapid advances in deep learning and convolutional neural networks (widely adopted by Google, Facebook, and Microsoft). As a paradigm shift, CogniVision moves sensemaking from the cloud to cognitive cameras, keeping the power in the mW range in spite of the traditionally high computational complexity of deep learning. This will be achieved via innovation in energy-efficient circuits/architectures for sensemaking (see “Approach” section), including a novel digital energy-quality scalable architecture for general-purpose on-chip acceleration of convolutional networks with an energy efficiency of 50 TOPS/W or better, i.e., 10-20X more energy-efficient than the state of the art. Its ability to execute any convolutional network makes it applicable to the very wide (and ever-expanding) range of applications of convolutional networks, as long as the network fits the available on-chip memory and processing array size, as discussed in the “Subprojects” section.
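As a rough illustration of why a 50 TOPS/W accelerator makes milliwatt-range on-chip sensemaking plausible, the back-of-envelope sketch below estimates the compute power of a lightweight detection network running at 30 frames/s; the per-frame workload is an assumed, illustrative figure, not a measured result from the project.

```python
# Back-of-envelope estimate: compute power of an on-chip CNN accelerator.
# The workload figure below is an illustrative assumption, not a project result.

OPS_PER_FRAME = 0.5e9       # assumed ops/frame for a lightweight detection CNN
FRAME_RATE = 30             # frames/s, per the project's monitoring target
EFFICIENCY_TOPS_PER_W = 50  # target accelerator efficiency from the overview

ops_per_second = OPS_PER_FRAME * FRAME_RATE                      # ~15 GOPS
compute_power_w = ops_per_second / (EFFICIENCY_TOPS_PER_W * 1e12)

print(f"Workload: {ops_per_second / 1e9:.1f} GOPS")
print(f"Compute power: {compute_power_w * 1e3:.2f} mW")          # ~0.3 mW
```

Under these assumptions, the accelerator alone would consume a fraction of a milliwatt, leaving headroom for memory access, sensor readout and the always-on radio within the ~1 mW harvested budget.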

Being “attentive”, CogniVision cameras also have the capability to be responsive to the cloud, and to occasionally be reprogrammed by it in the following ways (see the sketch after the list):

1) transmit a short series of frames to be processed directly by the cloud (e.g., if the visual task exceeds the cognitive capabilities of the camera); 

2) update the neural network to a different one (i.e., uploading layer structure and weights), when the cloud requests a substantial change in the visual task executed by the camera (e.g., the cloud needs to identify very specific objects in a given area being covered by some of the cameras); 

3) statically adjust on-chip energy-quality knobs that can save energy in vision tasks where lower processing accuracy or arithmetic precision is tolerable (e.g., less demanding visual tasks such as optical character recognition, as compared to more challenging tasks such as object detection).
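As a purely illustrative sketch (the actual command encoding is not specified in this overview), the snippet below shows how the three reprogramming actions above could be represented as compact cloud-to-camera messages; all field names and values are hypothetical.

```python
# Hypothetical cloud-to-camera command messages (illustrative only; the real
# CogniVision protocol and field layout are not defined in this overview).
from dataclasses import dataclass
from typing import List

@dataclass
class SendFramesCmd:               # capability 1: stream a short burst of frames
    num_frames: int                # e.g., 5 frames to be processed by the cloud

@dataclass
class UpdateNetworkCmd:            # capability 2: replace the on-chip network
    layer_descriptors: List[dict]  # layer types/shapes, assumed encoding
    weights_blob: bytes            # quantized weights to load into on-chip memory

@dataclass
class SetEnergyQualityCmd:         # capability 3: statically set energy-quality knobs
    arithmetic_bits: int           # e.g., 8-bit vs 4-bit arithmetic precision
    accuracy_mode: str             # e.g., "high" or "reduced"

# Example: the cloud asks a camera to drop to reduced precision for an OCR-like task.
cmd = SetEnergyQualityCmd(arithmetic_bits=4, accuracy_mode="reduced")
print(cmd)
```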

As a side benefit, cognitive cameras solve the traditional issue of data deluge in distributed vision systems. Indeed, frames from cameras are traditionally transmitted wirelessly to the cloud, involving large data volumes (~20 cameras exhaust the capacity of a wireless LAN, and Internet video traffic is increasing alarmingly fast). This is avoided in cognitive cameras, as the transmitted data volume is reduced by several orders of magnitude (from preliminary simulations, they transmit at a data rate of ~1-10 kbps on average, as opposed to the several MB/s of traditional cameras).

Regarding the timeliness of the CogniVision project, embedding vision in energy-autonomous nodes has been pursued for a decade with very limited success, due to the excessive power consumption required by on-chip processing. We are now witnessing the convergence of three technology trends, which are reshaping the areas of machine learning for computer vision and ultra-low power chips. On one hand, deep convolutional neural networks have made tremendous advances in terms of vision capability, although at a substantial power and memory cost that is beyond the capabilities of energy-autonomous systems; their power is now reaching the tens-of-mW range after two very intense years of research on deep learning accelerators. Simultaneously, fundamental advances have recently been made in the area of energy-quality scalable integrated circuits and systems (including deep learning accelerators and vision processors), where a substantial reduction in the intensity of computation and energy is achieved when a moderate reduction in the quality of processing/sensing (e.g., arithmetic precision) is tolerable for the vision task at hand. Also, fundamental advances have recently been made in image sensor design, introducing the ability to embed simple in-sensor processing at low energy cost, limiting the expensive centralized processing that requires full-frame readout. As the convergence of the above trends, CogniVision leverages the well-known exceptional robustness of deep learning/vision against inaccuracies to exploit energy-quality scaling and simple in-sensor processing, which justifies the timeliness of the project.

Recent market trends confirm the timeliness of CogniVision, and the importance that smart untethered cameras can be expected to have in the years to come. For example, in December 2017 Amazon acquired the wireless camera company Blink; in October 2017 Google released the CLIPS wireless camera. Although the capabilities of such cameras are currently limited (e.g., actual lifetime from 3-5 hours with continuous shooting to 2-5 weeks, and they simply record clips when events occur), this clearly shows a technological and market interest in ubiquitous vision. In 2017 Qualcomm announced the intention to pursue a research project on low-resolution (320×240) cameras for smart toys/appliances with low recognition capabilities (e.g., single object detection, ambient light sensing). None of the available cameras can interact with the cloud in real time (i.e., they are not “attentive”). As another example, in March 2018 Sony and other companies formed the NICE alliance to support the creation of a prospective generation of cameras with on-board analytics.

Ubiquitous cognitive cameras can provide novel technological capabilities and societal benefits, enabling for the first time situational awareness with fine spatial granularity across wide areas (from building to city scale). Examples of targeted applications are ubiquitous/augmented surveillance, vehicle/pedestrian detection, intelligent transportation, crowd monitoring, industrial plant monitoring, warehouse management, detection of dangerous objects, and disaster management, among others. In short, CogniVision empowers the Internet of Things (IoT) (i.e., the ubiquitous sensor augmentation of the Internet) with the sense of vision for the first time. As IoT is the next “big wave” of technology (45% annual growth, global value of $11T by 2025), CogniVision will leverage its capabilities and potential growth to create economic value in Singapore, accelerating the Smart Nation vision.
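To make the data-rate claim above concrete, the sketch below compares the payload of detection metadata against a compressed video stream; the per-detection byte count and the video bit rate are illustrative assumptions, not figures from the project’s simulations.

```python
# Illustrative comparison of wireless payloads (assumed numbers, not project data).

# Detection metadata: one compact report per frame (class, score, bounding box).
BYTES_PER_DETECTION = 16       # assumed compact encoding per detection
DETECTIONS_PER_SECOND = 30     # e.g., one report per frame at 30 fps

metadata_bps = BYTES_PER_DETECTION * DETECTIONS_PER_SECOND * 8
print(f"Detection metadata: ~{metadata_bps / 1e3:.1f} kbps")   # ~3.8 kbps

# Compressed VGA video stream (assumed typical bit rate for MPEG VGA at 30 fps).
VIDEO_BPS = 1_500_000
print(f"Compressed video: ~{VIDEO_BPS / 1e6:.1f} Mbps")
print(f"Reduction factor: ~{VIDEO_BPS / metadata_bps:.0f}x")
```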

The success of CogniVision will provide a unique technological competitive advantage, in view of the demonstration of the first camera chip with nearly-perpetual operation that is fully untethered, energy-harvested, millimeter-sized, capable of on-chip real-time sensemaking, and low cost (dollar range). On-chip sensemaking also fundamentally solves the challenges of data deluge and privacy that are currently faced by distributed (tethered) cameras. Accordingly, CogniVision accelerates the Smart Nation vision and contributes to making Singapore a global hub for IoT sensing technologies, in particular high added-value technologies such as visual sensing. To reach the intended impact, local enterprises working on or using distributed sensors (e.g., those belonging to the recently formed IoT Consortium of the Singapore Semiconductor Industry Association (SSIA)) will be engaged during the project via demonstrations in our labs. On a global scale, the Embedded Vision Alliance will be engaged to reach out to leading companies in image sensing applications. These companies can indeed be technological or venture partners in the subsequent translation of CogniVision into a commercial technology. The support of agencies is key to the success of the project, as Singapore is a natural testbed for CogniVision and will benefit from the introduction of ubiquitous vision capability in the Smart Nation vision. Their expertise will facilitate alignment with compelling applications and use cases. At the end of the project, a workshop will be organized to share findings and to demonstrate the outcomes of CogniVision. To make our technologies widely available, we will consider the opportunity of spinning off a Singapore-based company for the commercialization of CogniVision. The CogniVision project will leverage the synergy with local industry in the IoT space, starting from the project’s industrial partners, which cover the key areas related to CogniVision, i.e., system integration (Panasonic) and chips for IoT (Mediatek). A key factor that promises significant impact of CogniVision is its relevance to a very wide range of diverse applications and verticals, ranging from consumer to security, smart cities, industry, and others.

CogniVision project in numbers

Universities: 2
Years of execution: 5
Research publications: 5
Patent applications: 0
Public speeches: 10