Where is the current SOTA?
- Fastest NNs run at 30-80 fps, and the hardware to run them costs >$4000
- Image sensors produce decent-resolution video feeds (640x480 for VGA CMOS) and cost <$1
- There is a large cost imbalance between the sensor and the processing unit!
- Can we bring down NN inference cost similar to sensor unit cost?
High-level idea of NoScope
- Use pre-trained networks as reference model, and train a sequence (cascade) of smaller, cheaper models specialized to target query
- Each smaller NN is trained only on video feed and particular class of objects
- Additionally, learn a difference detector, which determines whether the object is still in the frame
- Since these models are learned, they are data-dependent. To handle this, an inference-optimized model search is used to find
a model cascade fit to the query and accuracy threshold (explained below)
- Currently only object detection queries are supported (presence/absence)
Model Specialization
Key Assumption: only interested in identifying a small number of objects
- Models are trained directly on target video
- Uses shallow NNs (!)
- Each NN output has an associated confidence level, which is used to defer to the reference model when the NN is not confident about the object
- If confidence $c < c_{low}$, NoScope thinks no object in frame. If $c > c_{high}$, NoScope thinks object in frame. So if $c \in [c_{low}, c_{high}]$, NN defers to reference model.
- NoScope has a basic automated model search (2/4 conv layers, 32/64 conv units, 32/64/128/256 dense units)
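The deferral rule above can be sketched in a few lines. This is a minimal illustration, not NoScope's implementation; the threshold values are hypothetical (in NoScope they are learned per query by the model search).

```python
# Hypothetical thresholds; NoScope learns c_low and c_high per query.
C_LOW, C_HIGH = 0.2, 0.8

def cascade_decision(confidence, reference_model):
    """Return (label, deferred) for one frame given the cheap NN's confidence."""
    if confidence < C_LOW:
        return False, False          # cheap NN is sure: no object in frame
    if confidence > C_HIGH:
        return True, False           # cheap NN is sure: object in frame
    return reference_model(), True   # ambiguous band: defer to reference NN

# Only frames falling in [C_LOW, C_HIGH] pay the reference-model cost.
label, deferred = cascade_decision(0.5, reference_model=lambda: True)
```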
Difference Detection
Two different kinds of difference detectors:
- A fixed reference image that contains no objects. This is created by averaging frames with no objects (as predicted by the reference model). If the difference from this reference is not significant, the detector returns the "no object" label.
- An earlier frame from $t_{diff}$ seconds in the past. If the differences are not significant, the detector returns the same label as the earlier frame.
Difference detectors use MSE to measure difference between frames.
- To discount regions of the frame that do not contain relevant information, the DD uses a blocked comparison: it subdivides the frame into blocks, computes the MSE on each block, and uses a logistic regression classifier to weight each block when computing the final difference estimate.
- Why is this useful? For example, traffic signals will change colors, but that change is not useful for detecting vehicle motion.
- Detector fires for some difference $\delta_{diff}$, which depends on the video and query.
- In practice, MSE and blocked MSE have similar performance, and get close to the best method
Difference detectors perform a difference check every $t_{skip}$ frames.
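The blocked comparison can be sketched as follows. This is a simplified illustration with NumPy; the grid size is arbitrary here, and the learned logistic-regression weighting of blocks is only noted in a comment, not implemented.

```python
import numpy as np

def blocked_mse(frame, ref, blocks=4):
    """Split both frames into a blocks x blocks grid and return per-block MSE.
    In NoScope, a logistic regression classifier weights these per-block
    scores to produce the final difference estimate (not shown here)."""
    h, w = frame.shape[:2]
    bh, bw = h // blocks, w // blocks
    scores = []
    for i in range(blocks):
        for j in range(blocks):
            a = frame[i*bh:(i+1)*bh, j*bw:(j+1)*bw].astype(float)
            b = ref[i*bh:(i+1)*bh, j*bw:(j+1)*bw].astype(float)
            scores.append(np.mean((a - b) ** 2))  # per-block MSE
    return np.array(scores)

# Identical frames give zero difference in every block.
frame = np.zeros((64, 64), dtype=np.uint8)
print(blocked_mse(frame, frame).sum())  # → 0.0
```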
Inference-optimized Model Search
This search method automatically identifies and learns an efficient model cascade which is able to reproduce the reference model's predictions on the video data, up to some desired accuracy.
This is formalized as the following optimization problem:
\( \max \ \mathbb{E}[\textnormal{throughput}] \quad \textnormal{s.t.} \quad \textnormal{false positive rate} < FP^*, \ \textnormal{false negative rate} < FN^* \)
It searches the complete space of combinations of difference detectors and specialized models, and estimates the inference cost of each combination. It only considers cascades with one difference detector and one specialized model, since stacking models was not observed to improve performance.
- The cost-based optimizer (CBO) can search the complete space, still at a fraction of the cost of labelling all video frames with YOLOv2!
The inference cost model is given by the following equation
\begin{align}
f_s T_{MSE} + f_s f_m T_{specialNN} + f_s f_m f_c T_{fullNN}
\end{align}
- $f_s$ is the fraction of frames left after skipping every $t_{skip}$ frames. These frames are evaluated by the MSE difference detector, which takes $T_{MSE}$ time to execute.
- $f_m$ is the fraction of those frames that pass the firing threshold of the DD, and are evaluated by the specialized NN.
- $f_c$ is the fraction of those frames whose confidence $c \in [c_{low}, c_{high}]$, and are evaluated by the reference NN.
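Plugging numbers into the cost model makes the filtering effect concrete. The fractions and per-frame times below are hypothetical (in practice they come from profiling each component on the target video); the structure matches equation (1).

```python
# Hypothetical fractions and per-frame times (seconds), for illustration only.
f_s, f_m, f_c = 0.5, 0.2, 0.1           # fraction of frames surviving each stage
T_mse, T_special, T_full = 1e-4, 1e-3, 1e-1

# Equation (1): each stage's cost is discounted by the stages before it.
cost_per_frame = (f_s * T_mse
                  + f_s * f_m * T_special
                  + f_s * f_m * f_c * T_full)
throughput = 1.0 / cost_per_frame        # expected frames per second
```

Note that the expensive reference NN ($T_{fullNN}$) is only paid for the small $f_s f_m f_c$ slice of frames, which is what makes the cascade cheap overall.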
The flow of models used in NoScope is as follows:
```mermaid
graph LR;
    DD-->specialNN;
    specialNN-->fullNN;
```
Model Search
Challenges:
- Complex, nonlinear models
- Specialized models and DDs may be dependent on each other's performance
- Large search space, since the thresholds $\delta_{diff}, c_{low}, c_{high}$ must be learned, and are continuous values.
Solutions:
- NoScope uses grid search on different NN configurations to find the best model combination
- Run each trained model on every frame of an evaluation set to determine selectivity (TNR) and sensitivity (TPR), which are used to set the thresholds $\delta_{diff}, c_{low}, c_{high}$.
- A linear sweep algorithm over feasible threshold values.
Linear sweep
- For each difference detector D, sort the frames in decreasing order of difference $\delta$.
- Compute the FPR and FNR for each candidate $\delta_{diff}$ threshold.
- For each specialNN, sort the frames by confidence and set $c_{low}, c_{high}$ to meet the desired FPR and FNR.
- Compute expected throughput using the cost model equation (1).
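The sweep over one threshold can be sketched as below. This is a simplified stand-in for NoScope's joint sweep over $\delta_{diff}$, $c_{low}$, $c_{high}$: it picks, for one detector, the lowest firing threshold whose false-positive rate stays within budget.

```python
def sweep_threshold(scores, labels, max_fpr):
    """Sort frames by score (descending) and return the lowest threshold
    whose false-positive rate stays under max_fpr. `labels` are the
    reference model's binary outputs on the evaluation frames."""
    pairs = sorted(zip(scores, labels), reverse=True)
    negatives = sum(1 for _, y in pairs if not y) or 1
    fp, best = 0, float("inf")   # inf means no feasible threshold found
    for s, y in pairs:
        if not y:
            fp += 1              # everything scoring >= s would fire
        if fp / negatives > max_fpr:
            break                # lowering the threshold further violates FP*
        best = s
    return best
```

Because FPR only grows as the threshold is lowered, a single pass over the sorted frames suffices, which is what keeps the search cheap.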
Runtime
CBO complexity is $O(n_d n_c n_t)$, where $n_d$ is the number of DD configurations, $n_c$ is the number of specialNN configurations,
and $n_t$ is the number of firing thresholds.
In practice, this is fast because $n_d,n_c,n_t$ are all less than $10^2$.
Limitations
Possible ideas for exploration:
- Only tested on fixed-angle video (viewed from a single angle)
- Does not account for data drift (if scene changes, like day/night)
- Image batching processes 100 frames at a time, so for a 30 fps video this adds a ~3.3 s delay before the first results