
Building Smart Camera applications at an industrial scale by leveraging cutting-edge deep learning techniques


Computer vision is a fast-moving field of research

Smart Cameras are all about Computer Vision, the sub-field of AI that aims to build machines with a human-level understanding of images and videos. Since 2012, the field has seen a series of major breakthroughs across many vision tasks (image classification, object detection, etc.). Part of this progress came from research advances on how to train ever bigger neural networks. Over the past few years, researchers and engineers have also been working on making deep learning more efficient and on running neural networks with fewer resources. For instance, it is now standard industry practice to reduce the arithmetic precision of real numbers when running neural nets in production: roughly, weights and outputs are represented with 32 bits during training, but with much less (16 or 8 bits) in production. Despite all the research done so far, for numerous use cases neural networks still demand too much computing power to meet low-latency constraints.
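To make the idea of reduced precision concrete, here is a minimal, self-contained C++ sketch (illustrative only, with made-up weight values) that quantizes FP32 weights to INT8 using a single symmetric scale factor and prints the round-trip error. Production toolchains such as TensorRT use calibrated, per-tensor or per-channel schemes that are far more sophisticated.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // A handful of FP32 "weights" (illustrative values only).
    std::vector<float> weights = {0.12f, -0.53f, 0.87f, -0.04f, 0.31f};

    // Symmetric quantization: map [-max_abs, +max_abs] onto [-127, 127].
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    const float scale = max_abs / 127.0f;

    for (float w : weights) {
        // Quantize to an 8-bit integer, then dequantize back to float.
        const int8_t q = static_cast<int8_t>(std::lround(w / scale));
        const float w_back = q * scale;
        std::printf("fp32=% .4f  int8=%4d  dequant=% .4f  err=% .5f\n",
                    w, q, w_back, w - w_back);
    }
    return 0;
}
```

The point of the exercise is that storing and computing with 8-bit integers costs a quarter of the memory and bandwidth of FP32, at the price of a small, controllable rounding error.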


Hardware producers keep innovating to keep up with algorithmic breakthroughs

It is worth noting that most of the workload involved in running neural networks is linear algebra (matrix and vector operations). Companies like NVIDIA, Google and Intel have been developing specialized software and hardware to perform this kind of arithmetic very efficiently. In 2017, NVIDIA introduced the Volta micro-architecture, which contains Tensor Cores: programmable fused matrix-multiply-and-accumulate units that run concurrently alongside the regular GPU cores. Tensor Cores implement new 16-bit floating-point HMMA (Half-Precision Matrix Multiply and Accumulate) and 8-bit integer IMMA (Integer Matrix Multiply and Accumulate) instructions to accelerate dense linear algebra. In 2016, Google introduced the Tensor Processing Unit (TPU), specifically designed for its deep learning framework, TensorFlow. Each TPU consists of two Tensor Cores, each containing scalar, vector and matrix units (MXU). Currently, TPUs cannot be purchased directly but are available as a cloud service.

NVIDIA and Google are targeting the edge computing market as well: NVIDIA offers the Jetson product line of embedded computing boards integrating small GPUs, and in March 2019 Google announced Coral, a lightweight edge TPU.


In addition to the hardware trend, we have witnessed the rise of software designed to optimize neural networks for production, such as Intel’s OpenVINO and NVIDIA’s TensorRT. More recently, Facebook open-sourced FBGEMM, a high-performance linear algebra library for running neural nets on CPU servers. All these developments make new industrial AI use cases possible and generate great opportunities for innovation.


At Wintics, we combine the latest innovations in both hardware and software to deliver high-performance solutions

At Wintics, we leverage NVIDIA technology. On the software side, we make extensive use of NVIDIA’s main high-performance low-level libraries in all our products: CUDA, cuDNN and TensorRT. On the hardware side, we rely on high-performance GPUs for server applications and on the Jetson product line for edge computing. We are currently able to process thousands of images per second on a single GPU, and even to run real-time, high-level directional traffic analysis from surveillance cameras on the lightweight Jetson Nano.

Here we share the results of some of our recent work on optimizing Siamese neural networks, an architecture that has become popular for single-object tracking over the past couple of years. We present the latency gains obtained on the SiamRPN architecture by porting the neural network to C++, optimizing it with TensorRT and reducing its arithmetic precision.
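To give a concrete idea of what this optimization step looks like, here is a minimal sketch of building a reduced-precision engine with the TensorRT C++ API (roughly the TensorRT 7-era interface): an ONNX export of the model is parsed, the builder is configured for FP16, and the resulting engine is serialized to disk. This is an illustrative example rather than our production code; the file names are placeholders, and an INT8 build additionally requires a calibration step.

```cpp
#include <fstream>
#include <iostream>

#include "NvInfer.h"
#include "NvOnnxParser.h"

// Minimal logger required by the TensorRT API.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
};

int main() {
    Logger logger;

    // Parse an ONNX export of the network ("siamrpn.onnx" is a placeholder).
    auto* builder = nvinfer1::createInferBuilder(logger);
    const auto flags = 1U << static_cast<uint32_t>(
        nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto* network = builder->createNetworkV2(flags);
    auto* parser = nvonnxparser::createParser(*network, logger);
    if (!parser->parseFromFile("siamrpn.onnx",
            static_cast<int>(nvinfer1::ILogger::Severity::kWARNING))) {
        std::cerr << "Failed to parse the ONNX model" << std::endl;
        return 1;
    }

    // Ask TensorRT for a reduced-precision (FP16) engine.
    auto* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1ULL << 30);  // 1 GiB of scratch space
    if (builder->platformHasFastFp16()) {
        config->setFlag(nvinfer1::BuilderFlag::kFP16);
    }
    auto* engine = builder->buildEngineWithConfig(*network, *config);
    if (!engine) return 1;

    // Serialize the optimized engine so it can be loaded once at startup.
    auto* serialized = engine->serialize();
    std::ofstream out("siamrpn_fp16.engine", std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()), serialized->size());

    serialized->destroy();
    engine->destroy();
    parser->destroy();
    config->destroy();
    network->destroy();
    builder->destroy();
    return 0;
}
```

At runtime, the serialized engine is deserialized once and executed through an execution context, so the optimization cost is paid only at build time.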

We performed benchmarks on four pieces of hardware: an RTX 2080 Ti, a Jetson Xavier, a Jetson TX2 and a Jetson Nano. Each device has a different architecture and delivers a different level of computing power.

Specs

| Hardware      | Computing power                | Architecture |
|---------------|--------------------------------|--------------|
| RTX 2080 Ti   | 4352 cores + 544 Tensor Cores  | Turing       |
| Jetson Xavier | 512 cores + 64 Tensor Cores    | Volta        |
| Jetson TX2    | 256 cores                      | Pascal       |
| Jetson Nano   | 128 cores                      | Maxwell      |

 

We measure processing speed in frames per second (fps) and quantify tracking accuracy with the mean intersection over union (mIoU) against the ground-truth labels. We use three levels of precision: 32-bit and 16-bit floating point (FP32 and FP16, respectively) and 8-bit integers (INT8). The benchmark below compares an open-source implementation against our in-house optimized implementations.

Note: Most real-time applications require at least 30 fps.
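For clarity, the sketch below shows how these two metrics are typically computed: the per-frame intersection over union between a predicted and a ground-truth box, averaged into mIoU over a sequence, and fps as the number of frames processed divided by the elapsed wall-clock time. The boxes are dummy values, and this is a generic illustration rather than the OTB evaluation code.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Axis-aligned bounding box: top-left corner (x, y), width w, height h.
struct Box { float x, y, w, h; };

// Intersection over union of two boxes, in [0, 1].
float iou(const Box& a, const Box& b) {
    const float x1 = std::max(a.x, b.x);
    const float y1 = std::max(a.y, b.y);
    const float x2 = std::min(a.x + a.w, b.x + b.w);
    const float y2 = std::min(a.y + a.h, b.y + b.h);
    const float inter = std::max(0.0f, x2 - x1) * std::max(0.0f, y2 - y1);
    const float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

int main() {
    // Dummy predictions and ground-truth boxes standing in for a tracked sequence.
    std::vector<Box> pred = {{10, 10, 50, 50}, {12, 11, 50, 48}};
    std::vector<Box> gt   = {{11, 10, 50, 52}, {10, 10, 52, 50}};

    const auto start = std::chrono::steady_clock::now();

    float sum_iou = 0.0f;
    for (size_t i = 0; i < pred.size(); ++i) {
        // In a real benchmark the tracker would process frame i here.
        sum_iou += iou(pred[i], gt[i]);
    }

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    const double miou = sum_iou / pred.size();
    const double fps = pred.size() / elapsed.count();
    std::printf("mIoU = %.4f over %zu frames, %.1f fps\n", miou, pred.size(), fps);
    return 0;
}
```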

Frames per second (fps)

| GPU device    | Open source (Python & FP32) | C++ & TensorRT & FP32 | C++ & TensorRT & FP16 | C++ & TensorRT & INT8 |
|---------------|-----------------------------|-----------------------|-----------------------|-----------------------|
| RTX 2080 Ti   | 111.1                       | 126.7                 | 147.1                 | 177.3                 |
| Jetson Xavier | 20.8                        | 29.8                  | 73.6                  | 92.2                  |
| Jetson TX2    | 9.1                         | 10.6                  | 17.3                  | INT8 not supported    |
| Jetson Nano   | 3.7                         | 4.7                   | 8.0                   | INT8 not supported    |

 

To ensure that these speed gains come with no accuracy loss, we tested all the implementations on the OTB-2013 dataset. The table below shows the results.

Score (mIoU)

| GPU device    | Open source (Python & FP32) | C++ & TensorRT & FP32 | C++ & TensorRT & FP16 | C++ & TensorRT & INT8 |
|---------------|-----------------------------|-----------------------|-----------------------|-----------------------|
| RTX 2080 Ti   | 0.6445                      | 0.6409                | 0.6406                | 0.6446                |
| Jetson Xavier | 0.6458                      | 0.6431                | 0.6483                | 0.6344                |
| Jetson TX2    | 0.6443                      | 0.6396                | 0.6487                | 0.6371                |
| Jetson Nano   | 0.6422                      | 0.6401                | 0.6470                | 0.6431                |

 

In summary, Computer Vision is still a fast-moving field: major breakthroughs keep coming from research and engineering. At Wintics, we pay attention to every new development and adapt the best algorithms to our products. We invest in applied research to develop in-house high-performance algorithms. These algorithms are natively and specifically designed to analyze urban scenes and therefore deliver unmatched performance on mobility and smart city use cases.

We illustrated how SiamRPN can be accelerated with C++, TensorRT and reduced arithmetic precision. As a result of this specialization, we currently have an in-house customized implementation of SiamRPN that is even faster and supports multiple object tracking.

Keeping pace and providing cutting-edge deep learning technology is challenging. We tackle this with a very modular software stack that welcomes changes and updates. This is how we develop specialized smart camera technology, aiming to leverage AI to contribute to a more sustainable future.