Self-Driving Cars: Implementing Real-Time Traffic Light Detection and Classification in 2017

Today, basic traffic light detection is a solved problem: advances in deep learning and computer vision have produced robust algorithms for it.

These algorithms work without hand-written code to determine the color or position of the light. For example, optimized R-CNN models can reach state-of-the-art accuracy at real-time speed.

So how does it work? Let’s explore and find some traffic lights!

Quick, where are the traffic light(s)?

What the AI thinks:

The top image was one of the results from the Bosch test data set. This image was not available to the network at training time. Running time was 77 ms. It also shows that while the system has incredible performance, much more effort would be needed for a production-level car. For example, at this distance it did not detect the traffic light on the left at a confidence level greater than 50%. Code available here.
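
The 50% confidence cutoff mentioned above can be sketched as a simple filter over a detector's raw outputs. This is only an illustration, not the code linked above: the `Detection` type, its field names, and the sample values are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One hypothetical detector output: a box, a class label, and a score."""
    box: Tuple[float, float, float, float]  # (ymin, xmin, ymax, xmax), normalized to [0, 1]
    label: str
    score: float


def filter_by_confidence(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep only detections whose score is strictly greater than the threshold."""
    return [d for d in detections if d.score > threshold]


# Made-up values: one confident detection and one weak one.
raw = [
    Detection((0.40, 0.55, 0.46, 0.57), "Green", 0.97),
    Detection((0.38, 0.10, 0.42, 0.11), "Green", 0.31),
]
kept = filter_by_confidence(raw)
print([(d.label, d.score) for d in kept])
```

A light that scores below the cutoff, like the second entry here, is simply dropped, which is what happened to the light on the left in the image.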

Google’s approach, circa 2011

Image credit: Fairfield et al.

A team at Google used the approach of extracting the detected traffic light first, then running a second classifier on it. That approach provides flexibility; however, depending on the implementation it may come at the cost of added pipeline complexity and computation.

Perhaps more importantly it seems to rely on prior knowledge of expected traffic light locations. And more generally, doing classification as a second step adds a second network to train, test, etc.

Can it be done in one network, exclusively with an image and no prior information?

I started by exploring Single Shot Detection (SSD) and ended up using Faster R-CNN due to its superior performance with small objects. I somewhat painfully recreated existing implementations to teach myself how it worked.

I then switched to using the open source TensorFlow Object Detection API. This recently released toolset provides faster turnaround time for testing models and comes ready with popular pre-trained weights. It allowed me to focus more on the engineering implementation and less on the specifics of each neural network implementation.

Left: Internal network architecture of a Faster R-CNN subcomponent (screenshot, credit TensorFlow team). Right: Credit Huang, arrows added.

In this paper they discuss the performance trade-offs of different approaches. For example, SSD (similar to YOLO) is great for medium to large objects but fares significantly worse than Faster R-CNN on small objects. I confirmed this in practice: we had trouble getting SSD to converge well on the Bosch small traffic light dataset. In contrast, Faster R-CNN with ResNet got great results.

A faraway light, only a few pixels wide, being detected.

Want to see more? Check out the full videos: Test video or Train video.

Adapting Bosch data for the Udacity self-driving car

Image credit Udacity

Our team of students from around the world has been working on a limited test of a self-driving car. On a tiny closed track, the car must successfully follow a set of waypoints and identify a traffic light.

If you’re interested in the technical details, check out our team’s code.

Double-transfer learning

We leaned heavily on transfer learning due to the limited amount of data available. The pipeline looks like this:

  1. COCO pre-trained network
  2. Bosch traffic light data
  3. Udacity real data (150 samples) or sim data (260 samples)
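
The three stages above can be read as a chain in which each training run starts from the weights the previous run produced. Here is a minimal sketch of that bookkeeping. The `Stage` type, the `plan_stages` helper, and all directory names are hypothetical, and this is not our actual training script.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    dataset: str
    init_checkpoint: str  # weights this stage starts from
    output_dir: str       # where this stage writes its own checkpoint


def plan_stages(pretrained: str, datasets: List[str], root: str = "runs") -> List[Stage]:
    """Chain fine-tuning runs so each one starts from the previous run's output."""
    stages = []
    checkpoint = pretrained
    for i, dataset in enumerate(datasets, start=1):
        output_dir = f"{root}/stage{i}_{dataset}"
        stages.append(Stage(f"stage{i}", dataset, checkpoint, output_dir))
        checkpoint = output_dir  # the next stage fine-tunes from here
    return stages


# Hypothetical names: COCO weights -> Bosch data -> Udacity real data.
plan = plan_stages("coco_pretrained", ["bosch", "udacity_real"])
for stage in plan:
    print(f"{stage.name}: train on {stage.dataset} starting from {stage.init_checkpoint}")
```

In the TensorFlow Object Detection API, the starting weights for each run would typically be supplied through the `fine_tune_checkpoint` field of the training config.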

Using this method we got great results:

Results from the Udacity self-driving car test site. The model was trained on 14 classes, so it has additional classes such as Off, Red Left, etc.

Why a deep learning based approach?

Traffic lights come in many different quantities, positions, shapes, sizes, and layouts. With a deep learning based approach these differences are “easy” — simply collect examples of the types of traffic lights in the area the car will be driving.

Credit Wikipedia. The leftmost image is a traffic light that uses different shapes to aid recognition by colour-blind drivers. The center example is from New York, where vertical mounting is common, and the rightmost image is from Canada, where it’s common to see horizontally mounted traffic lights.

Motivation for high accuracy localization:

A high accuracy bounding box allows high accuracy distance estimation. The better the distance estimation, the closer we can match to other data points. For example, is a traffic light on the near side or far side of an intersection?
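
One common way to turn a bounding box into a distance is the pinhole camera model, where distance is roughly focal length times real object height divided by box height in pixels. The sketch below uses that model as an assumption; it is not something our pipeline implements, and the focal length and light height are made-up values chosen for round numbers.

```python
def estimate_distance_m(box_height_px: float, focal_length_px: float, real_height_m: float) -> float:
    """Pinhole-camera estimate: distance = focal_length * real_height / pixel_height."""
    if box_height_px <= 0:
        raise ValueError("box_height_px must be positive")
    return focal_length_px * real_height_m / box_height_px


FOCAL_LENGTH_PX = 1000.0  # assumed camera focal length, in pixels
LIGHT_HEIGHT_M = 1.0      # assumed physical height of the traffic light housing

near = estimate_distance_m(40, FOCAL_LENGTH_PX, LIGHT_HEIGHT_M)  # 25 m
far = estimate_distance_m(20, FOCAL_LENGTH_PX, LIGHT_HEIGHT_M)   # 50 m

# A 2-pixel localization error on the far light shifts the estimate by over 5 m.
far_with_error = estimate_distance_m(18, FOCAL_LENGTH_PX, LIGHT_HEIGHT_M)
print(near, far, round(far_with_error, 1))
```

Because the box height sits in the denominator, small pixel errors matter most for the smallest (farthest) lights, which is exactly where telling near side from far side is hardest.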

In this image, the nearer traffic lights are clearly detected with larger bounding boxes than those at the far intersection. A less accurate localization would be more prone to errors here. In a more sophisticated system this could be used for all manner of things, such as confirming that the light is indeed for the lane the car is in, or that the light matches expected locations based on prior mapping knowledge.

Real time performance (10+ Hz)

At first we were seeing inference times of around 220 ms. While this is fast compared to, say, a sliding window approach, I personally wouldn’t really consider 4–5 frames per second to be real time.
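
To make the relationship between inference time and frame rate explicit, here is a small sketch. The 220 ms figure and the 10 Hz target are the ones from this section; the helper names are just for illustration.

```python
def frames_per_second(latency_ms: float) -> float:
    """Frame rate achievable if each frame takes latency_ms to process."""
    return 1000.0 / latency_ms


def latency_budget_ms(target_hz: float) -> float:
    """Maximum per-frame latency that still meets a target frame rate."""
    return 1000.0 / target_hz


initial_fps = frames_per_second(220.0)  # about 4.5 frames per second
budget_ms = latency_budget_ms(10.0)     # 100 ms per frame for 10 Hz
print(round(initial_fps, 2), budget_ms)
```

In other words, hitting 10+ Hz means getting inference under 100 ms per frame.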

Based on the paper’s suggestions, we reduced the number of region proposals from the authors’ original suggestion of 300 to 50.

This gave us a ~3x speedup in inference time (~220 ms to ~80 ms) with similar accuracy.

This is while predicting traffic lights that take up less than 1% of a 1280 x 720 image. For comparison, the Google paper above used images of 2040 x 1080, or about 2.4x the number of pixels.
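
In the TensorFlow Object Detection API, the number of region proposals for Faster R-CNN is set in the pipeline config through the `first_stage_max_proposals` field. Below is a minimal sketch of changing it with a text substitution. The `set_max_proposals` helper and the abbreviated config fragment are illustrative assumptions, not our actual config file.

```python
import re


def set_max_proposals(config_text: str, n: int) -> str:
    """Return config_text with first_stage_max_proposals set to n."""
    pattern = r"(first_stage_max_proposals\s*:\s*)\d+"
    new_text, count = re.subn(pattern, rf"\g<1>{n}", config_text)
    if count == 0:
        raise ValueError("first_stage_max_proposals not found in config")
    return new_text


# Abbreviated, illustrative fragment of a pipeline config.
sample_config = """
model {
  faster_rcnn {
    num_classes: 14
    first_stage_max_proposals: 300
  }
}
"""

updated = set_max_proposals(sample_config, 50)
print(updated)
```

Fewer proposals means less work for the second stage of the network, which is where the speedup comes from.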
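
A quick back-of-the-envelope check of those image sizes, as a sketch. The two resolutions are the ones quoted above; reading "1% of the image" as a square box is my own illustration.

```python
import math


def pixel_count(width: int, height: int) -> int:
    return width * height


ours = pixel_count(1280, 720)     # 921,600 pixels
theirs = pixel_count(2040, 1080)  # 2,203,200 pixels
ratio = theirs / ours             # about 2.39

one_percent_px = ours / 100               # 9,216 pixels
square_side = math.sqrt(one_percent_px)   # 96 pixels per side if the box were square

print(ours, theirs, round(ratio, 2), one_percent_px, square_side)
```

So a light covering 1% of our frame would fit in roughly a 96 x 96 pixel box, and most of the lights here are smaller than that.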

Spectacular failure cases.

There are many examples where the system is not ready for production use. For example, here it thinks it’s a yellow light!

That said, I can see many of these cases being overcome with more data, or simply more training. For example, we trained for around 20,000 iterations, which is likely around 1/10 of what’s needed for true convergence (i.e., the best model weight values).

One last thing

During testing I accidentally ran the network trained on simulated images (left) on real images.

Somehow it worked! And it worked well enough that it took a few odd failure cases to realize what was wrong!

Check out the results in this one example below:

  • Left: Bosch trained (different style of image) = no prediction over 50% confidence
  • Center: Sim trained (image above) = correct prediction
  • Right: Real data trained (after Bosch data) = incorrect prediction

It’s a reminder of an interesting opportunity. In theory, you could simulate any situation you wished, feed it to a deep learning system, and then have it generalize to a real life situation.

* Update Feb 2019 * If you are working on a deep learning system you may like Diffgram: Plug and play for computer vision!

Special thanks to Neil Hiddink and Cahya Ong for reviewing an earlier draft of this!

This is meant as a broad introductory exploration and was not intended to be academically rigorous. Training was done without data augmentation (besides augmentation concepts inside the neural network itself and the TensorFlow Object Detection API), without dropout, etc. It ran on a GTX 1070 / Core i5. I used sloth to annotate the data.
