My deep dive into the core suite of Hitachi Video Analytics (HVA) last year, presented in my article Digital Video Analytics - Test Results, opened the door to a flood of inquiries and use cases. Fortunately, hundreds of testing exercises gave me insight into how the analytical engine for each module could be calibrated for specific use cases, and into whether something within our expanding portfolio could meet a given requirement. That portfolio now includes the Hitachi ToF 3D Sensor, which improves HVA by delivering a digital video stream that is completely unaffected by lighting, shadows and color variances.
In a continued effort to expand our video intelligence capabilities and flexibility, I tested several video analytics products, including licensed third-party commercial-off-the-shelf (COTS) solutions and our own proprietary ones, for a number of specific use cases, varying camera distance, angle, focal length and lighting. This exercise uncovered valuable insights into the differences between analytical engines and the limitations in how algorithms read pixel color, contrast and frames per second. It confirmed that one size does not fit all in analytical capabilities, flexibility or solution architecture.
A popular question was “can HVA Object Detector send an alert if it recognizes X,” with “X” being a classified object (e.g., gun, knife, box). HVA Object Detector identifies objects based on calibration variables, which include a static object identified by the number of overall or connected blocks (pixel size) within the restricted zones, using machine learning. It learns the objects within a zone, without the need to classify them, and continues to learn through variances in illumination states and motion. That’s all it does, and it does it very well, so it can cover a wide variety of use cases, but not all of them. Object recognition and identification require deep learning, even when using the Hitachi 3D LiDAR sensor (more on this later).
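To make the distinction concrete, here is a toy sketch of the general technique described above: learn a background model for a zone, then flag when enough pixel blocks inside that zone differ from it. This is an illustrative simplification under my own assumptions, not Hitachi's actual algorithm; all function names and thresholds here are hypothetical.

```python
# Toy block-based static-object detection (illustrative, NOT the HVA engine).
# A frame is a 2D grid of grayscale values; the background is learned by
# averaging frames, and an alert fires when enough blocks in a zone differ.

def learn_background(frames):
    """Average several frames into a per-pixel background model."""
    h, w = len(frames[0]), len(frames[0][0])
    bg = [[0.0] * w for _ in range(h)]
    for f in frames:
        for y in range(h):
            for x in range(w):
                bg[y][x] += f[y][x] / len(frames)
    return bg

def changed_blocks(frame, bg, zone, block=2, threshold=30):
    """Count block-sized regions inside `zone` whose mean absolute
    difference from the background exceeds `threshold`."""
    (y0, x0), (y1, x1) = zone
    count = 0
    for by in range(y0, y1, block):
        for bx in range(x0, x1, block):
            diff = sum(abs(frame[y][x] - bg[y][x])
                       for y in range(by, min(by + block, y1))
                       for x in range(bx, min(bx + block, x1)))
            cells = min(block, y1 - by) * min(block, x1 - bx)
            if diff / cells > threshold:
                count += 1
    return count

def object_alert(frame, bg, zone, min_blocks=2):
    """Alert when enough pixel blocks change within the restricted zone."""
    return changed_blocks(frame, bg, zone) >= min_blocks
```

Note that nothing here knows *what* the object is; the detector only sees that a cluster of blocks no longer matches the learned scene, which is exactly why classification requires a different, deep-learning approach.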
Below is an example of how HVA Object Detector recognizes a black backpack on a black bench. In another example, it reliably sent an alert when a white garbage bag, peeking out from behind the station structure, was added to the trashcan.
The following examples, which fell completely outside the recommended system requirements, uncovered more insight into HVA Object Detector's effectiveness and capabilities. I used HVA Object Detector on a standard CIF-resolution camera stream monitoring the pantograph of a train car to send real-time alerts when the pantograph was damaged. Although there is constant motion within the video, and the specifications require a fixed field of view, I limited the zone for analysis to an outline of the pantograph itself, ignoring the motion in the rest of the scene.
System requirements for HVA Object Detector include a statically mounted camera with a fixed field of view, a minimum resolution of CIF 352x240 and at least one frame per second. The examples above and below do not meet all of those requirements, but with some persistence, I was able to get positive results.
The unique configuration variables also provide the ability to use HVA Object Detector for loitering alerts (a dedicated HVA Loitering Detector is scheduled for release soon).
This can also work for object theft, as the demo video below presents. Given enough frames for HVA Object Detector to learn the entire scene and its objects, it will send an alert if those objects are removed or disappear.
Another early use case question was: “Can HVA Traffic Analyzer provide alerts for restricted or double parking?” HVA Traffic Analyzer monitors traffic and classifies moving vehicle pixels as a car, truck, bus or bike.
While HVA Traffic Analyzer may not be engineered to identify double-parked vehicles, HVA Object Detector can be calibrated to ignore moving objects, and even static objects, for a predefined time before sending an alert. This makes it possible to account both for how long a vehicle has been double parked and for vehicles merely waiting at a traffic signal. I needed to try it. I was only able to dig up this old CIF-resolution footage from 2008 of a person moving their vehicle from a restricted parking area. Even with the poor standard resolution, and people walking across the restricted zone, I could calibrate the demo video for an accurate alert. This opens the opportunity to turn decade-old traffic cameras into IoT sensors by scheduling pan-tilt-zoom presets within the video management software alongside the automated enable/disable function of HVA.
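The "ignore for a predefined time" idea reduces to a simple dwell timer. The sketch below is my own hypothetical illustration of that logic, not HVA's implementation: a detection only becomes an alert after it has persisted continuously for a configured duration, so a car pausing at a signal never fires.

```python
# Hypothetical dwell-time filter (illustrative, not the HVA implementation):
# a static object in a restricted zone triggers an alert only after it has
# been continuously present for `min_seconds` worth of frames.

class DwellAlert:
    def __init__(self, min_seconds, fps=1):
        self.required = min_seconds * fps  # frames of continuous presence
        self.streak = 0

    def update(self, object_present):
        """Feed one frame's detection result; returns True once the
        presence streak reaches the required dwell time."""
        self.streak = self.streak + 1 if object_present else 0
        return self.streak >= self.required
```

With `DwellAlert(min_seconds=120, fps=1)`, a vehicle waiting 30 seconds at a light resets nothing alarming, while a double-parked car eventually crosses the threshold.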
Although HVA Object Detector may not provide actual object identification or classification, it's flexible enough, and smart enough, for many unique use cases.

I also tested five different people counting video analytics products in four different "people counting" use case scenarios. This was to review configuration and calibration capabilities, algorithm responsiveness, and performance. It was also to prove my theory that the term “people counting” is too vague when it comes to video analytics. There are too many scenarios, environmental factors, camera angles and specific use cases to believe that one size fits all. This is just people counting, not people tracking, people identification, or detecting people falling down or running; just a people-sized pixel object crossing a virtual line or zone in either direction, with tailgating alert capabilities.

One of the four people counting use case scenarios presented here is the typical top-down view, which provides the best separation of individual people for maximum performance. However, I have done live demonstrations on live CCTV cameras, and even a demo clip at night, with a typical 45-degree angle of coverage, with some success (depending on the field of view). These experiments used a top-down view calibrated in each solution by object size, from the same Axis camera. The top-down field of view negates any recreation of the three-dimensional space, so counting is typically done by intelligent people-size identification. All the solutions required some unique calibration for each location, with performance based on how well the algorithm can filter luggage, shadows and reflections. The test results of this simple example proved that, given the calibration effort, all five people counting products could provide 100% accuracy. Please note that no one was holding an infant child during the making of these films.
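The core of any virtual-line people counter is the same bookkeeping: track each person-sized blob's centroid over time and count side-to-side transitions of the line. Here is a minimal sketch of that general technique under my own assumptions (one centroid coordinate per tracked person per frame); it is not any vendor's engine.

```python
# Toy virtual-line people counter (a sketch of the general technique, not
# any vendor's engine). Each tracked person is a sequence of centroid
# y-coordinates; a crossing is counted when the centroid moves from one
# side of the horizontal counting line to the other between frames.

def count_crossings(tracks, line_y):
    """tracks: {person_id: [y0, y1, ...]} centroid positions over time.
    Returns (entries, exits) across the horizontal line at line_y."""
    entries = exits = 0
    for positions in tracks.values():
        for prev, cur in zip(positions, positions[1:]):
            if prev < line_y <= cur:
                entries += 1   # crossed downward through the line
            elif prev >= line_y > cur:
                exits += 1     # crossed upward through the line
    return entries, exits
```

For example, `count_crossings({"a": [10, 40, 80], "b": [90, 60, 20]}, line_y=50)` yields one entry and one exit. The hard part in practice is everything upstream of this function: producing clean, well-separated tracks despite luggage, shadows and reflections, which is exactly where calibration effort goes.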
A top-down field of view, from a single camera source, makes consistent 100% accuracy a challenge with any people counter analytics, not only because of parents carrying children but also the variables at each area of coverage and multiple illumination states. A single people counting solution may work well in five different locations, but how about 50, or 500 dramatically different locations? The reflection of the light off the floor, the shadow on the door, different illumination states, windows with sunlight at different times of the day, during different seasons, all affect the pixel colors and contrast and require continuous machine learning or re-calibration. Although this was a simple example, it wasn't necessarily an easy calibration.
The only method of increasing accuracy and greatly reducing calibration time in the more challenging environments, under any lighting, is to use a 3D LiDAR (ToF) sensor rather than an RGB CCTV camera. The Hitachi 3D LiDAR sensor is integrated into the Hitachi Video Analytics suite and provides the "perfect" image for analysis. Although the Hitachi 3D LiDAR sensor, with its SDK, can do much more than simple people counting, tailgating, loitering or intrusion detection (more on Sensor Fusion in the future), its video stream is unaffected by lighting, shadows and reflections because it generates its own light source: high-resolution pulses of infrared light whose reflections off objects are measured. So even when I turned my office lab into a disco by flipping my lights on and off, there was no change to the video image. It continued to provide 100% accuracy, out of the box. Although it would not be able to identify the infant child cradled in a mother's arms (that would need deep learning), it continued to count people effectively no matter what color or lighting changes occurred in the environment, and it took only a couple of minutes to set up and calibrate.
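The underlying time-of-flight principle is simple: the sensor times each infrared pulse's round trip and converts it to depth, which is why ambient lighting cannot disturb the image. The arithmetic below is illustrative, with made-up example numbers rather than the Hitachi sensor's actual specifications.

```python
# Time-of-flight depth from pulse round-trip time (illustrative numbers):
# the pulse travels to the object and back, so
#     distance = speed_of_light * round_trip_time / 2

C = 299_792_458  # speed of light in a vacuum, m/s

def tof_distance_m(round_trip_seconds):
    """Distance to the reflecting object, in metres."""
    return C * round_trip_seconds / 2
```

A pulse returning after about 20 nanoseconds corresponds to roughly 3 metres, so a people-counting sensor must resolve timing differences of well under a nanosecond per centimetre of depth.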
Now, the integration of the Hitachi 3D LiDAR sensor into HVA provides a solution for those problematic locations with wildly changing lighting, contrast and shadows, extending its usability to even more challenging environments.

Another of the specific use case tests used a typical field of view. Below are two of the five COTS video analytics products that were able to provide a somewhat accurate count of the people crossing the virtual line. The video sample is a purchased stock video clip, and it required a considerable amount of effort: cropping out the giant LCD billboards above the platform, testing multiple locations for the virtual counting line, configuring the simulation of three-dimensional space, adding a pre-recorded video loop for machine learning, and calibrating people size, contrast and so on. Moving the camera angle down another 10 degrees or so would have made this much easier. The field of view is crucial for the analytical engines to identify people on each side of the virtual counting line and to provide more separation between groups, but alas, this was a stock video clip.
If we move from "machine teaching," which uses a finite number of variables, to deep learning, which trains on thousands or millions of sample images and analyzes the video stream at multiple layers (grayscale edge detection for a better understanding of objects, deeper analysis of color across the RGB dimensions for image segmentation, and layered classes), we can achieve a multi-point people counter and tracker, even within a crowded scene.
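As a taste of the lowest of those layers, here is a classic grayscale edge-detection pass (the Sobel operator) over a tiny image grid. This is my own minimal illustration; deep networks effectively learn filters like these in their early layers rather than having them hard-coded.

```python
# Minimal grayscale edge detection with the Sobel operator, one of the
# low-level analysis layers mentioned above (illustrative sketch only).

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel

def sobel_magnitude(img):
    """Approximate gradient magnitude at each interior pixel of a 2D grid
    of grayscale values; border pixels are left at zero."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out
```

Run on a flat region it returns all zeros; run on a hard vertical boundary between dark and bright columns it lights up along the edge, which is the kind of structural cue a segmentation network builds on.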
The samples above show how deep learning goes beyond typical pixel analysis and can meet some of the more challenging requirements.