In the case of the autonomous driving, given an front camera view, the car needs to know where is the road. In this project, we trained a neural network to label the pixels of a road in images, by using a method named Fully Convolutional Network (FCN). In this project, FCN-VGG16 is implemented and trained with KITTI dataset for road segmentation.
(click to see the full video)
Miniconda is used for managing my dependencies.
(1) Download KITTI data (training and testing)
(2) Load pre-trained VGG
helper.py will do
it automatically for you.
(3) Run the code:
(4) Use my trained model to predict new images
training_flag = False
Then run the code by:
FCNs can be described as the above example: a pre-trained model, follow by 1-by-1 convolutions, then followed by transposed convolutions. Also, we can describe it as encoder (a pre-trained model + 1-by-1 convolutions) and decoder (transposed convolutions).
The Semantic Segmentation network provided by this paper learns to combine coarse, high layer informaiton with fine, low layer information. The pooling and prediction layers are shown as grid that reveal relative spatial coarseness, while intermediate layers are shown as vertical lines
we can approach training a FCN just like we would approach training a normal classification CNN.
In the case of a FCN, the goal is to assign each pixel to the appropriate class, and cross entropy loss is used as the loss function. We can define the loss function in tensorflow as following commands.
logits = tf.reshape(input, (-1, num_classes)) cross_entropy_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, labels))
Then, we have an end-to-end model for semantic segmentation.
There are 4833 testing images are processed with the trained models. 4543 frames from are a video and other 290 images from random places in Karlsruhe.
Some key parameters in training stage, and the traning loss and training time for each epochs are shown in the following table.
epochs = 37 batch_size = 8 learning_rate = 0.0001
With only 384 labeled training images, the FCN-VGG16 performs well to find where is the road in the testing data, and the testing speed is about 6 fps in my laptop. The model performs very well on either highway or urban driving. Some testing examples are shown as follows:
Based on my test on 4833 testing images. There are two scenarios where th current model does NOT perform well: (1) turning spot, (2) over-exposed area.
The bad performance at the turning spots might be caused by the fact of lacking training examples that from turning spots, because almost all the training images are taken when the car was driving straight or turning slightly. We might be able to improve the performance by adding more training data that are taken at the turning spots. As for the over-exposed area, it is more challenging. One possible approach is to use white-balance techniques or image restoration methods to get the correct image. The other possible approach is to add more training data with over-exposed scenarios, and let the network to learn how to segment the road even under the over-expose scenarios.