How to Solve Video Stream Recognition Without Using Any 3D Operators

I previously built a video stream recognition system based on the unet3d model that operated on 10-frame video streams. The model handled two tasks:

Video stream classification
Video stream segmentation

Segmenting the moving objects in the video stream and classifying the behavior it shows was enough to meet some of our business requirements.

However, although the unet3d model performed well on video stream recognition at an acceptable computational cost, it is essentially impossible to deploy on edge hardware.

Through research, I found that almost all edge hardware on the market doesn’t support 3D operators, or even 5D data formats.

At that time, RKNN did not support any 3D operators or 5D data formats at all. Qualcomm’s SNPE appeared to support a few individual 3D operators, but not enough to be useful, and relying on them would likely have led to roadblocks.

This meant our model couldn’t use these operators: BatchNorm3d, Conv3d, Dropout3d, MaxPool3d, ConvTranspose3d.
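As a rough illustration of how such a constraint can be checked before export, here is a minimal sketch that lists the 3D layers in a PyTorch model. The helper name `find_3d_layers` and the two toy models are hypothetical and exist only for this example; the layer list is the one given above.

```python
import torch.nn as nn

# Layer types named in the text as unusable on the target edge runtimes.
UNSUPPORTED_3D = (
    nn.BatchNorm3d,
    nn.Conv3d,
    nn.Dropout3d,
    nn.MaxPool3d,
    nn.ConvTranspose3d,
)


def find_3d_layers(model: nn.Module) -> list[str]:
    """Return the qualified names of submodules that are 3D operators."""
    return [
        name
        for name, module in model.named_modules()
        if isinstance(module, UNSUPPORTED_3D)
    ]


# Hypothetical toy models, used only to exercise the helper.
model_3d = nn.Sequential(nn.Conv3d(3, 8, 3, padding=1), nn.BatchNorm3d(8), nn.ReLU())
model_2d = nn.Sequential(nn.Conv2d(30, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())

print(find_3d_layers(model_3d))  # ['0', '1']
print(find_3d_layers(model_2d))  # []
```

This only catches layers registered as modules. Functional calls and 5D reshapes inside `forward` would still need to be checked by hand or against the exported graph.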

So the first step was to redesign the model so that it could perform video stream recognition without any 3D operators or 5D data formats.

Model Design

Constraints:

The model must be lightweight
Inference speed must be within 500 ms
No support for any 5D data
No support for any 3D operators
One model must satisfy 2 tasks: classification and segmentation
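The 500 ms limit is the easiest of these constraints to check mechanically. The sketch below shows one way to time an inference callable against that budget using only the standard library. `mean_latency_ms` and `fake_inference` are hypothetical names, and the stand-in workload is not the real model; a meaningful number has to be measured on the target device.

```python
import time
from typing import Callable

LATENCY_BUDGET_MS = 500.0  # taken from the constraint list above


def mean_latency_ms(run_once: Callable[[], object], warmup: int = 3, runs: int = 20) -> float:
    """Average wall-clock time of one call to run_once, in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        run_once()
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / runs


def fake_inference() -> int:
    """Hypothetical stand-in for a single model inference."""
    return sum(i * i for i in range(10_000))


latency = mean_latency_ms(fake_inference)
print(f"{latency:.2f} ms, within budget: {latency <= LATENCY_BUDGET_MS}")
```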

Previously, the unet3d model used an input format of [1, 3, 10, 224, 224]. Since 5D data formats aren’t supported, my idea was to merge the channel dimension with the time dimension, giving an input format of [1, 30, 224, 224]. That is:

[batch, channel, time, height, width] => [batch, channel * time, height, width]
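A minimal sketch of that merge in PyTorch, assuming the clip is already a contiguous 5D tensor in preprocessing (the variable names are illustrative). Doing the reshape outside the model is one way to keep the 5D tensor out of the exported graph.

```python
import torch

batch, channels, frames, height, width = 1, 3, 10, 224, 224

# 5D clip in the layout the unet3d model used: [batch, channel, time, height, width].
clip_5d = torch.randn(batch, channels, frames, height, width)

# Fold the time axis into the channel axis. The values are not reordered;
# only the shape changes.
clip_4d = clip_5d.reshape(batch, channels * frames, height, width)
print(clip_4d.shape)  # torch.Size([1, 30, 224, 224])

# Merged channel index c * frames + t holds color channel c of frame t.
c, t = 1, 4
assert torch.equal(clip_4d[0, c * frames + t], clip_5d[0, c, t])
```

Note the resulting channel order: all frames of the first color channel come first, then all frames of the second, and so on.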

In practice, converting data inside the model after merging channels is quite troublesome, because the model’s internals cannot contain any 5D tensors either. So could I reduce the channel count to 1? Training on grayscale images loses some information but also reduces computation.

That is, [1, 30, 224, 224] => [1, 10, 224, 224]
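One way to build such a grayscale input is sketched below. The source does not say how the frames were converted, so the ITU-R BT.601 luma weights and the helper name `clip_to_gray_stack` are assumptions made for this example.

```python
import torch


def clip_to_gray_stack(clip_rgb: torch.Tensor) -> torch.Tensor:
    """Convert [T, 3, H, W] RGB frames to a [1, T, H, W] grayscale stack.

    Each frame becomes one input channel, so the model only ever sees 4D data.
    """
    # Assumed conversion: BT.601 luma weights for R, G, B.
    weights = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
    gray = (clip_rgb * weights).sum(dim=1)  # [T, H, W]
    return gray.unsqueeze(0)                # [1, T, H, W]


frames_rgb = torch.rand(10, 3, 224, 224)  # ten RGB frames with values in [0, 1]
gray_clip = clip_to_gray_stack(frames_rgb)
print(gray_clip.shape)  # torch.Size([1, 10, 224, 224])
```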

The model has two outputs: a segmentation head with shape [1, 10, 224, 224] and a classification head with shape [1, 3].

Of course, given the quality of the 10-frame data produced by the previous hardware, I actually doubled the frame count to 20 here, giving [1, 20, 224, 224].

The main network still follows the UNet structure, with some spatial attention mechanisms added to strengthen the model’s feature extraction for the moving parts.
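The text does not specify which spatial attention mechanism was used. As an illustration only, here is a CBAM-style spatial attention block built entirely from 2D operators; the class name, kernel size, and tensor sizes are assumptions.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (an assumed variant, not the author's exact block)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Two input maps (channel-wise mean and max) -> one attention map.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)  # [B, 1, H, W]
        max_map = x.amax(dim=1, keepdim=True)  # [B, 1, H, W]
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn  # re-weight every channel by the same spatial map


features = torch.randn(1, 16, 56, 56)
attended = SpatialAttention()(features)
print(attended.shape)  # torch.Size([1, 16, 56, 56])
```

Because the attention map lies between 0 and 1, the block can only scale features down, which lets the network suppress static background relative to moving regions.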

Classification Head

The classification head is the interesting part. In unet3d, its input came from the feature layer shared with segmentation. Thanks to the strong feature extraction that 3D operators provide for video streams, the classification head could be trained simultaneously with the segmentation head.

However, without 3D operators and 5D data formats, a design like that almost never converges, and training becomes very difficult.

My approach was to feed the segmentation head’s output into the classification head. Because the two are now sequential, the classification head can only perform well if the segmentation head’s results are good.
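A minimal sketch of a classification head wired this way, assuming the segmentation head emits per-frame logits of shape [batch, frames, H, W]. The class name `MaskClassifier` and the layer sizes are hypothetical; only the input and output shapes come from the text.

```python
import torch
import torch.nn as nn


class MaskClassifier(nn.Module):
    """Hypothetical 2D-only classifier that consumes the segmentation output."""

    def __init__(self, frames: int = 10, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(frames, 32, 3, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # [B, 64, 1, 1]
        )
        self.fc = nn.Linear(64, num_classes)

    def forward(self, seg_logits: torch.Tensor) -> torch.Tensor:
        masks = torch.sigmoid(seg_logits)  # per-frame mask probabilities
        return self.fc(self.features(masks).flatten(1))


classifier = MaskClassifier().eval()
seg_logits = torch.randn(1, 10, 224, 224)  # stand-in for the segmentation head output
with torch.no_grad():
    class_logits = classifier(seg_logits)
print(class_logits.shape)  # torch.Size([1, 3])
```

For the 20-frame variant, `frames=20` would be passed instead.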

As a result, training became a two-stage process:

First stage training: train the main network and segmentation head.
Second stage training: freeze the main network and segmentation head, only train the classification head.
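The second stage can be sketched in PyTorch roughly as follows. The three tiny modules are hypothetical stand-ins for the main network, segmentation head, and classification head, and the optimizer and learning rate are assumptions. In the first stage the optimizer would instead receive the backbone and segmentation head parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the three parts named in the text.
backbone = nn.Conv2d(10, 16, 3, padding=1)
seg_head = nn.Conv2d(16, 10, 1)
cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(10, 3))

# Stage 2: freeze the main network and segmentation head.
for module in (backbone, seg_head):
    module.requires_grad_(False)  # no gradients for these parameters
    module.eval()                 # fix dropout / batch-norm behavior, if any

# Only the classification head's parameters are given to the optimizer.
optimizer = torch.optim.Adam(cls_head.parameters(), lr=1e-3)

frames = torch.randn(2, 10, 64, 64)  # dummy batch of two grayscale clips
labels = torch.tensor([0, 2])        # dummy class labels

with torch.no_grad():                # frozen part runs without building a graph
    seg_logits = seg_head(backbone(frames))

logits = cls_head(torch.sigmoid(seg_logits))
loss = F.cross_entropy(logits, labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```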

Loss Function

The segmentation head’s loss combines a modified Dice loss with a BCE loss. To improve the model’s attention to moving regions, I also designed an inter-frame motion difference loss, which compares inter-frame differences with the target mask. The classification head uses cross-entropy loss.

Since the business use case doesn’t need very high segmentation accuracy and only requires the segmented region to cover the moving region, the results are acceptable.
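The exact Dice modification and the motion term are not spelled out here, so the following is only a sketch under stated assumptions: a standard soft Dice loss, BCE on logits, and one possible reading of the motion difference loss in which the frame-to-frame change of the predicted masks is compared with that of the target masks. The function names and the equal loss weights are hypothetical.

```python
import torch
import torch.nn.functional as F


def soft_dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Standard soft Dice loss (not the author's modified version)."""
    probs = torch.sigmoid(logits)
    dims = (1, 2, 3)
    intersection = (probs * target).sum(dim=dims)
    total = probs.sum(dim=dims) + target.sum(dim=dims)
    return (1.0 - (2.0 * intersection + eps) / (total + eps)).mean()


def motion_difference_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Assumed form: L1 distance between predicted and target inter-frame differences."""
    probs = torch.sigmoid(logits)
    pred_diff = probs[:, 1:] - probs[:, :-1]      # change between consecutive frames
    target_diff = target[:, 1:] - target[:, :-1]
    return F.l1_loss(pred_diff, target_diff)


def segmentation_loss(logits, target, w_dice=1.0, w_bce=1.0, w_motion=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (
        w_dice * soft_dice_loss(logits, target)
        + w_bce * F.binary_cross_entropy_with_logits(logits, target)
        + w_motion * motion_difference_loss(logits, target)
    )


target = (torch.rand(1, 10, 32, 32) > 0.5).float()  # dummy per-frame masks
random_logits = torch.randn(1, 10, 32, 32)
perfect_logits = (target * 2.0 - 1.0) * 20.0         # confident and correct

print(segmentation_loss(random_logits, target).item())
print(segmentation_loss(perfect_logits, target).item())
```

The tensors use the [batch, frames, H, W] layout, so slicing along dimension 1 gives consecutive frames without any 5D data.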

Overall Effect

The overall result is as follows:

[Image: result]

Other Ideas

Actually, object detection might require less computation than segmentation. After all, the only precision we ultimately need is the position of the moving regions. However, object detection is more complex to train than segmentation, and its post-processing is more troublesome.

Could we implement it with object detection plus a tracking algorithm?

My understanding is that, given our business requirements and the complex environment in which shopping carts are used, it is not very feasible. The business needs to focus on items entering and leaving the cart, but the camera points directly at the items already in the cart.

Object detection plus tracking would mean identifying and tracking the items already in the cart, which is very difficult in an environment as complex as a shopping cart. Worse still, users also engage in inventory management behaviors.