Today's article comes from the CAAI Transactions on Intelligence Technology journal. The authors are He et al., from Yanshan University, in China. In this paper they're showcasing a new object detector for roadside cameras. Their framework augments a standard YOLO pipeline with a second branch that extracts boundary and texture features, then fuses them with RGB features to better distinguish between objects.
DOI: 10.1049/cit2.12406
In July 2015, the University of Michigan debuted Mcity. It's a 32-acre mock town, complete with roads, intersections, traffic signals, and buildings. What do you think it's for? Is it a movie set? A theme park? No. It's a "proving ground" for connected and autonomous vehicles. A place for them to drive and communicate, and test their ability to connect and interface with new infrastructure. Over the last decade Ford has used it for their autonomous vehicle testing program, and the site has hosted dozens of other industry and academic partners. It has served as a testbed for connected vehicle safety systems, for smart intersections, and for new wireless protocols.
All of this is in service of an idea called V2I: vehicle-to-infrastructure communication. It's the idea that cars and roadside systems can exchange information with each other and the surrounding environment to create safer, more efficient and more coordinated transportation systems.
And that's just part of the promise of V2I. This technology also has the potential to reshape how we design our cities, how we coordinate emergency response, how we optimize logistics and deliveries, and how we monitor public safety. But there's still a long way to go before we get there. Right now, V2I systems are still experimental and the supporting infrastructure is still being designed and built. There are thousands of individual things (pieces of software, pieces of hardware, pieces of legislation) that need to be created to get from where we are today (isolated, vehicle-centric perception systems) to real towns that actually function like Mcity.
And that's the context for today's paper. In it, the authors are trying to check one small thing off that laundry list. Consider a scenario where a big-rig driver gets help from roadside cameras to execute a difficult turn. For that to come to fruition, the stationary cameras pointed at the truck need to be able to identify the different traffic participants and objects that appear in the scene. This may sound like a simple classification problem, but it's surprisingly complex. You see, training a computer vision classifier to distinguish between an apple and a banana is fairly easy. Those objects don't share many traits. But distinguishing between a banana and a plantain is much harder. They share a lot of physical features.
And you hit this same issue when building traffic monitoring cameras. A stationary bicycle locked to a rack shares a lot of features with a moving cyclist who's lane-splitting through traffic. And that cyclist shares a lot of features with the motorcycle coming up behind it. The person sitting on the motorcycle shares features with the person sitting at the bus stop, who shares features with the other person walking down the street, or entering the crosswalk. But get any of those IDs wrong, and you will critically misunderstand how that object is likely to move next. You accommodate an incoming cyclist very differently than a person sitting at a bus stop. And when you see the motorcycle coming, you factor in the fact that it could accelerate very quickly at any time, and adjust your plan accordingly.
In this scenario, misclassification is a real problem. And a real impediment to the reliable deployment of V2I. So what can we do about it? The authors' solution is called salient feature fusion. Their framework augments a standard YOLO-based object detector with a second branch that extracts boundary and texture features, then fuses them with RGB features to better distinguish between objects. On today's episode we'll walk through their system design, and find out how it works. Let's dive in.
When a deep learning model looks at an image, every region gets processed through a series of layers. This gradually builds up a set of learned features. A feature might represent an edge, or a texture, or a shape. And eventually the model takes all of those features together and says: that cluster of features over there looks like a pedestrian, or that one looks like a bicycle. But what happens when these features overlap? The model tries to extract features for a given object, but picks up features belonging to a neighboring object as well. Then it gets confused, and can end up outputting a false detection. It thinks it sees a cyclist when there isn't one, or it misclassifies a stationary bike as a pedestrian. In V2I, this kind of false detection could trigger an unnecessary warning to a connected vehicle, and cause it to take the wrong course of action.
Existing solutions to this kind of problem generally fall into three camps: multi-task learning, attention mechanisms, and multi-source sensor fusion.
Each of these works in certain contexts, but each also comes with practical tradeoffs: multi-task learning increases model complexity and depends heavily on loss balancing. Attention mechanisms improve feature weighting but do not directly resolve boundary ambiguity. And multi-source fusion introduces additional hardware requirements and synchronization costs that make deployment more difficult.
The authors wanted a different answer. One that works with a single camera, requires no additional sensors, and is fast enough to run in real time. Their solution is to extract a richer set of information from the same image the model was already looking at, and feed that information in through a separate non-weight-sharing branch alongside the original image. They call this additional input the "salient feature map".
You see, a standard RGB image carries more signal than a typical detection pipeline extracts. The challenge is getting that information into a form the model can act on. To build their map, the authors ran every image through a secondary pipeline before it reached the detector, and extracted two categories of information: boundary features and texture features.
After these extractions, they had a sizable pool of candidate features to work with. But feeding too many features into the model would introduce redundancy and dilute the signal with noise. So they ran a selection process to identify which features were genuinely the most informative. Features that contributed most to preserving the reconstruction were retained, and the others were discarded. Running this selection iteratively, they eventually converged on just three features: one capturing object boundaries, one measuring local textural uniformity, and one measuring the complexity or randomness of texture in a region. Together, these three features form the "salient feature map" that gets passed into the model alongside the original image.
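To make this concrete, here's a rough sketch of how a three-channel map like that could be assembled from a single frame. The specific operators (Sobel edges for boundaries, local variance for uniformity, local entropy for randomness) and the window sizes are illustrative stand-ins of my own choosing, not the authors' published pipeline.

```python
# Illustrative sketch only: builds a three-channel "salient feature map"
# (boundary strength, local uniformity, local entropy) from one RGB frame.
# The operators and window sizes here are assumptions, not the paper's.
import numpy as np
from scipy.ndimage import uniform_filter
from skimage import color, filters, img_as_ubyte
from skimage.filters.rank import entropy
from skimage.morphology import disk


def _normalize(x: np.ndarray) -> np.ndarray:
    """Rescale a 2-D map to [0, 1] so the channels are comparable."""
    return (x - x.min()) / (x.max() - x.min() + 1e-8)


def salient_feature_map(rgb: np.ndarray, window: int = 9) -> np.ndarray:
    """rgb: HxWx3 float image in [0, 1]. Returns an HxWx3 feature map."""
    gray = color.rgb2gray(rgb)

    # Channel 1: boundary strength via Sobel gradient magnitude.
    boundaries = filters.sobel(gray)

    # Channel 2: local uniformity -- high where a neighborhood is smooth.
    # Approximated here as the inverse of the local variance.
    local_mean = uniform_filter(gray, size=window)
    local_var = uniform_filter(gray ** 2, size=window) - local_mean ** 2
    uniformity = 1.0 / (1.0 + local_var)

    # Channel 3: local entropy -- high where texture is complex or random.
    randomness = entropy(img_as_ubyte(gray), disk(window // 2))

    return np.stack(
        [_normalize(boundaries), _normalize(uniformity), _normalize(randomness)],
        axis=-1,
    )
```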
Now all they needed to do was adapt their detection model to use both inputs. They ended up with a two-branch network. The first branch processes the original RGB image through a standard deep detection backbone. The second processes the salient feature map through a lighter architecture designed specifically for those features. And importantly, these two branches don't share weights. You see, when two neural network branches share weights, they're constrained to learn roughly the same transformation from their respective inputs. That makes sense when both inputs are the same kind of data, but here, they're not. The RGB image and the salient feature map carry fundamentally different types of information, and forcing a single shared set of weights to handle both would end up serving neither. Keeping the branches separate allows each one to specialize. Their outputs are then fused together at specific stages of the pipeline, within modules called ELANs (Efficient Layer Aggregation Networks). This gives the detector access to both representations simultaneously when making its predictions.
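To picture the non-weight-sharing idea, here's a toy PyTorch sketch: two separate branches, each with its own weights, whose outputs are fused by concatenation. It's a minimal stand-in for the structural idea, not the authors' YOLO-based architecture, and the layer sizes are arbitrary.

```python
# Toy sketch of a two-branch, non-weight-sharing backbone. The real system
# fuses inside ELAN modules of a YOLO-style detector; here we just show the
# structural idea: separate weights per input, concatenation to fuse.
import torch
import torch.nn as nn


def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(),
    )


class TwoBranchBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Heavier branch for the RGB image (its own weights).
        self.rgb_branch = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        # Lighter branch for the 3-channel salient feature map (separate weights).
        self.salient_branch = nn.Sequential(conv_block(3, 16), conv_block(16, 32))
        # Fusion: concatenate along channels, then mix with a 1x1 conv.
        self.fuse = nn.Conv2d(64 + 32, 64, kernel_size=1)

    def forward(self, rgb: torch.Tensor, salient: torch.Tensor) -> torch.Tensor:
        f_rgb = self.rgb_branch(rgb)          # (B, 64, H/4, W/4)
        f_sal = self.salient_branch(salient)  # (B, 32, H/4, W/4)
        return self.fuse(torch.cat([f_rgb, f_sal], dim=1))


# Usage: both inputs share spatial size, so the fused maps line up.
model = TwoBranchBackbone()
out = model(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 64, 64, 64])
```

In this sketch, fusing by concatenation (rather than, say, addition) keeps the two representations distinct until a learned layer decides how to mix them.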
On top of this, the authors also embedded an attention module called CBAM into the layers of the network that are responsible for predicting object boundaries. Why? Because by the time features reach the prediction layers, they're organized into a multi-dimensional grid where each dimension, or channel, has learned to respond to a different type of pattern. One channel might fire on smooth surfaces. Another on high-contrast edges. Another on a specific texture. CBAM looks across all of those channels and asks which ones are most relevant for this particular image. It pools the spatial information in each channel down to a few summary values, then uses those values to reweight the channels, amplifying the ones that matter and suppressing the ones that don't. Then, it shifts its focus to the spatial dimension, asking which locations in the image deserve the most attention, regardless of which channel you're looking at. The result is a feature map that has been actively recalibrated, both in terms of what types of patterns to prioritize and where in the image to look for them. This is precisely the kind of adaptive focus that helps the model avoid being misled by conflicting signals.
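For readers who think in code, here's what a CBAM-style block looks like: channel attention first, then spatial attention. This follows the generic published CBAM recipe; the reduction ratio and kernel size are common defaults, and exactly how the authors wire it into their prediction layers is detailed in the paper rather than here.

```python
# Minimal CBAM-style attention block (channel attention, then spatial
# attention), following the published CBAM recipe. Reduction ratio and
# kernel size are typical defaults, not values taken from this paper.
import torch
import torch.nn as nn


class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Channel attention: pool each channel to a single value (avg and max),
        # pass both through a shared MLP, and combine into per-channel weights.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: look across channels at each location and decide
        # which positions deserve the most weight.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # --- channel attention ---
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # --- spatial attention ---
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map = torch.amax(x, dim=1, keepdim=True)
        x = x * torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
        return x


# Usage: recalibrate a 64-channel feature map without changing its shape.
features = torch.randn(1, 64, 32, 32)
print(CBAM(64)(features).shape)  # torch.Size([1, 64, 32, 32])
```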
To test their creation the authors mounted a real camera in a real city, and collected footage across multiple lighting conditions, times of day, and combinations of categories. Then they trained and evaluated their pipeline on this dataset using metrics like precision and recall, and compared it against baseline detectors to see whether it improved anything. In the end, their new system achieved higher accuracy and better overall detection performance across the board, particularly in scenes where object classes shared overlapping features, or where objects were only partially visible to the camera. So it sure seems like it works. By explicitly guiding the model toward more informative features, they appear to have made it more reliable on the exact edge cases where standard detectors tend to fail.
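As a quick refresher on what those headline metrics capture, here's a toy calculation of precision and recall from matched detections; the counts are made up and are not numbers from the paper.

```python
# Reminder of what precision and recall measure for a detector, using
# made-up counts (not figures from the paper).
true_positives = 90   # detections matched to a ground-truth object
false_positives = 10  # detections with no matching ground-truth object
false_negatives = 15  # ground-truth objects the detector missed

precision = true_positives / (true_positives + false_positives)  # 0.900
recall = true_positives / (true_positives + false_negatives)     # ~0.857
print(f"precision={precision:.3f}, recall={recall:.3f}")
```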
If you want to go deeper, make sure you download the paper. The authors include an appendix with structural diagrams of every module in the architecture, a rundown of how the attention module slots into the prediction layers, and a visualization of the fusion module that shows, step by step, how the model's attention distribution shifts as the salient branch is activated.