### Introduction

(If you want to dive straight into the technical details, be my guest here, or check out the source code here)

In this challenge we are given a labelled video and the speed of the vehicle at every frame. The task is to predict the speed of the vehicle at every respective frame of another unlabelled video.

Before we jump into anything ML-related or technical, let's see what we might need. From the video we can get the frame rate, i.e. the number of frames per second; in our case it is 20. We have the speed for every frame in the video, so we now need a way to relate the distance travelled between every pair of consecutive frames.
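As a quick sanity check, the distance covered between two consecutive frames follows directly from the per-frame speed and the frame rate (a small sketch; the function name is mine, and the 20 fps figure comes from the video above):

```python
FPS = 20  # frames per second for this video

def distance_between_frames(speed_m_per_s: float, fps: int = FPS) -> float:
    """Distance (in metres) the car travels between two consecutive frames,
    assuming the speed is roughly constant over that 1/fps interval."""
    return speed_m_per_s / fps

# At a highway speed of 30 m/s (~67 mph), consecutive frames are 1.5 m apart.
print(distance_between_frames(30))  # -> 1.5
```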

This problem is harder than it might seem, as Nicolo Valigi points out:

> That’s because geometry is unforgiving, and a single camera lacks the sense of depth needed to estimate speed without any assumption about the environment. It just so happens that humans subconsciously use higher-level visual clues that help them disambiguate the absolute scale of the scene (i.e. road lanes are a few metres across).

https://nicolovaligi.com/car-speed-estimation-windshield-camera.html

### A Few Approaches Worth Considering

1. Implicitly measure movement between two frames:
   1. Stack two images
      • Input: two images stacked channel-wise, i.e. 6 * W * H
      • Pass the stack to a NN, flatten it, and output the speed
   2. Use two separate inputs
      • Input: 2 images, i.e. 3 * W * H each
      • Pass both through the same NN, aka a shared feature extractor
      • Merge the extracted features (preferably using a * b instead of stacking them; reasoning below)
2. Explicitly measure movement between two frames:
   1. (Dense) Optical Flow
      • Optical flow captures the movement between two consecutive frames. It takes 2 images/frames as input and outputs 2 matrices: magnitude and angle.
      • Together, the magnitude and angle tell us how much movement happened, and in which direction, per pixel.
      • Pass this to a NN and output a speed.
   2. No neural net
      • As tried by Nicolo Valigi; the results are decent at low speeds, but deteriorate quite a lot at highway speeds.

A few things I've learnt from my experience at Johns Hopkins fiddling around with datasets. The reasoning below helped me decide on my final architecture:

• Skip connections help a lot
• I used them in my research on VAEs, and just adding skip connections boosted results tremendously. Better backpropagation is no joke.
• This motivated me to use a ResNet-inspired architecture.
• An end-to-end DL technique (point 1 above) would work great if we had a lot of data.
• Under a constrained environment, a * b performs better than np.stack([a, b])
• I don't have solid reasoning here, but I learnt this from my experimentation with the VQA2 dataset.
• Given 10 epochs with default hyperparameters, np.stack([a, b]) would require us to have more parameters, which in turn requires more training time. a * b gives a gradient on the backward pass to both a and b.
• This is not to say that a * b is always better, but under a limited, constrained environment, better gradients give better results quicker.
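To make the comparison concrete, here is a minimal sketch (class name and layer sizes are mine, not from the final model) of the two-input variant, where a shared feature extractor's outputs are merged with an elementwise product instead of stacking:

```python
import torch
import torch.nn as nn

class TwoBranchSpeedNet(nn.Module):
    """Shared feature extractor applied to both frames; features merged by a * b."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Shared extractor: both frames pass through the *same* weights.
        self.extractor = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.head = nn.Linear(feat_dim, 1)  # regress a single speed value

    def forward(self, frame1, frame2):
        a = self.extractor(frame1)
        b = self.extractor(frame2)
        return self.head(a * b)  # elementwise product instead of torch.cat([a, b])

model = TwoBranchSpeedNet()
f1 = torch.randn(4, 3, 50, 160)  # batch of 4 frame pairs
f2 = torch.randn(4, 3, 50, 160)
out = model(f1, f2)
print(out.shape)  # torch.Size([4, 1])
```

Note that with a * b the gradient flowing back into a is scaled by b (and vice versa), so both branches receive informative gradients from the very first step, without the extra parameters a concatenated input would need.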

I ran all the experiments above and found that a ResNet-style architecture with optical flow as input performed the best. I used a pretrained ResNet for weight initialization, but allowed backprop to update the weights. The reason: ResNet is trained on ImageNet, whereas the output of optical flow cannot be interpreted as a natural image, since its values are the magnitude of movement at a specific angle per pixel.

Unlike other experiments, however, I don't convert my optical flow from HSV to RGB. I pass in the raw magnitude and angle values as input. In fact, I am not convinced why one would manually convert 2 channels to RGB rather than let the NN learn for itself what it needs to extract.

I also perform a bit of data augmentation where I found it possible:

• Enhance Brightness by a random factor
• Enhance Color by a random factor
• Flip Left Right with a probability of 0.5

Looking at the plot, I split the data into three parts, not two:

• Train set: 80% randomly shuffled frames of the first 19,000 frames.
• Validation set: Inspired by Machine Learning Yearning, I decided to go a step further and split into a validation set and a heldout set (only to realise later that it is called a holdout set in the book). The reason: my model has 800k parameters, so it would be easy to overfit on the 0.8 * 19k = ~15k training frames; we need the remaining 20% to validate on. But think about it: say we have frames 1 to 10, with frames 1, 3, 4, 5, 7, 8, 9, 10 in train and frames 2 and 6 in val. Even though frames 2 and 6 are technically not in the train set, in our dataset each of them is within a few metres (1/20th of a second) of a training frame. More importantly, the environment the driver was in at frame 1 almost certainly persists into frame 2. So validation does help us see if we're clearly overfitting, but the weakness of this particular validation set is that it comes from essentially the same data distribution as the train set.
• Heldout set: To prevent the problem described above, we take the last 1,200 frames of the dataset. The rationale is that we want a stretch of video/speed that the model has never seen; if we can fit the speeds well here, it means we are doing a good job.

Some of the models where I took two inputs, passed them into a feature extractor, and merged the features later (the model in point 1 above) performed horribly on the heldout set. They fit our validation set perfectly, giving an MSE of 0.5, yet on the heldout set the loss was at best 6.5.

### Implementation Details

First we get the frames out of the video for both the train and test.

ffmpeg -i train.mp4 train_thumb%05d.jpg
ffmpeg -i test.mp4 test_thumb%05d.jpg

Then we create a training pandas DataFrame with one column for the location of the first frame, a second column for the location of the second frame, and a third column that acts as our y, i.e. the speed. Now that we have built the list, let's load it into a DataFrame and split it into our three parts.

import pandas as pd

# Read the ground-truth speeds (one value per frame).
with open("data/train.txt", 'r') as file:
    content = file.readlines()

trainset, i = [], 0
while i < len(content) - 1:
    trainset.append(["data/frames/train_thumb{:05d}.jpg".format(i + 1),
                     "data/frames/train_thumb{:05d}.jpg".format(i + 2),
                     float(content[i + 1])])
    i += 1

df = pd.DataFrame(trainset, columns=["img1", "img2", "speed"])
heldout = df.iloc[-1200:]
trainval = df.iloc[:-1200]
train = trainval.sample(frac=0.8, random_state=200)
val = trainval.drop(train.index)

test_set = []
for i in range(10798 - 1):
    test_set.append(["data/frames/test_thumb{:05d}.jpg".format(i + 1),
                     "data/frames/test_thumb{:05d}.jpg".format(i + 2),
                     None])
test = pd.DataFrame(test_set, columns=["img1", "img2", "speed"])

So now we have a DataFrame with two file locations and their respective speed. Before we create a PyTorch Dataset, we need to define an optical flow function that takes two images (W * H * 3) as input and outputs an array (W * H * 2) containing magnitude and angle.

import cv2
import numpy as np

def opticalFlowDense(image_current, image_next):
    """
    Args:
        image_current : RGB image
        image_next : RGB image
    Returns:
        optical flow magnitude and angle, stacked into one matrix
    """
    gray_current = cv2.cvtColor(image_current, cv2.COLOR_RGB2GRAY)
    gray_next = cv2.cvtColor(image_next, cv2.COLOR_RGB2GRAY)
    # Farneback returns per-pixel (dx, dy) displacements ...
    flow = cv2.calcOpticalFlowFarneback(gray_current, gray_next, None, 0.5, 1, 15, 2, 5, 1.3, 0)
    # ... which we convert to the magnitude and angle described above.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return np.dstack((mag, ang))

After that we create a PyTorch Dataset to help us load the csv we created.

import numpy as np
import pandas as pd
from PIL import Image, ImageEnhance
import torchvision.transforms.functional as TF
from torch.utils.data import Dataset

class CommaDataset(Dataset):
    """Comma AI Challenge Dataset."""

    def __init__(self, csv_file, augment=True, normalize=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            augment (boolean): whether to perform data augmentation (True for the trainset).
            normalize (callable, optional): Optional normalization to be applied
                on a sample.
        """
        self.df = pd.read_csv(csv_file)
        self.augment = augment
        self.normalize = normalize

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img1 = Image.open(row['img1']).crop((0, 170, 640, 370)).resize((160, 50))
        img2 = Image.open(row['img2']).crop((0, 170, 640, 370)).resize((160, 50))
        if self.augment:
            brightness = np.random.uniform(0.5, 1.5)
            img1 = ImageEnhance.Brightness(img1).enhance(brightness)
            img2 = ImageEnhance.Brightness(img2).enhance(brightness)

            color = np.random.uniform(0.5, 1.5)
            img1 = ImageEnhance.Color(img1).enhance(color)
            img2 = ImageEnhance.Color(img2).enhance(color)

            if np.random.rand() < 0.5:  # flip both frames left-right together
                img1 = img1.transpose(Image.FLIP_LEFT_RIGHT)
                img2 = img2.transpose(Image.FLIP_LEFT_RIGHT)

        img1 = np.asarray(img1)
        img2 = np.asarray(img2)
        flow = opticalFlowDense(img1, img2)
        # if self.normalize:
        #     img1, img2 = self.normalize(img1), self.normalize(img2)
        return TF.to_tensor(flow), row['speed']

Now we create the data loader, define our model, and train it for ~40 epochs with the Adam optimizer, a learning rate of 1e-4, and otherwise PyTorch's default settings.
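A minimal training-loop sketch with those settings (the model and data here are stand-ins; the real model is the ResNet-style network described above):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 64 optical-flow tensors (2 x 50 x 160) with scalar speeds.
flows = torch.randn(64, 2, 50, 160)
speeds = torch.randn(64, 1)
loader = DataLoader(TensorDataset(flows, speeds), batch_size=16, shuffle=True)

# Stand-in model; the real one is a ResNet-style network with ~800k parameters.
model = nn.Sequential(
    nn.Conv2d(2, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # defaults otherwise
criterion = nn.MSELoss()

for epoch in range(2):  # the post trains for ~40 epochs
    for flow_batch, speed_batch in loader:
        optimizer.zero_grad()
        loss = criterion(model(flow_batch), speed_batch)
        loss.backward()
        optimizer.step()
```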

In the following two images, you can see that as training progresses, our predictions on the heldout set improve drastically.

Since our predictions (in blue) are very volatile, one thing we can do is take a mean over the previous k frames to get a smoother output.

But why? The reason is that two frames can only capture the velocity, not the acceleration. Recall the formula from high school physics: v = u + at. We need both the initial velocity u and the acceleration a to correctly predict the velocity at a new timestep. I believe feeding an RNN an input of previous timesteps would be a good experiment, since that would allow it to map previous velocities (and the acceleration they imply) to the current one.

Taking the mean over the previous k values does a rough mathematical mapping of previous velocities to the current one. For instance, for k = 2 we report the smoothed prediction (u_t + u_{t-1}) / 2, which amounts to assuming a constant acceleration a = (u_t - u_{t-1}) / Δt across the two frames. Clearly this isn't a correct measure of a, and we would want our neural network to learn the dependencies between every u, but even this approximate measure yields decent performance.

Therefore, using this technique, after every epoch we measure the heldout loss and find the value of k (the number of previous frames in the mean) that gives the lowest value.
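The smoothing and the sweep over k can be sketched with a simple trailing mean (function and variable names are mine; the data here is synthetic):

```python
import numpy as np

def smooth(preds: np.ndarray, k: int) -> np.ndarray:
    """Replace each prediction with the mean of the previous k predictions
    (including the current one); early frames use whatever is available."""
    out = np.empty_like(preds, dtype=float)
    for t in range(len(preds)):
        out[t] = preds[max(0, t - k + 1): t + 1].mean()
    return out

def best_k(preds: np.ndarray, truth: np.ndarray, ks=range(1, 31)):
    """Return the k whose smoothed predictions give the lowest MSE, plus all losses."""
    losses = {k: float(np.mean((smooth(preds, k) - truth) ** 2)) for k in ks}
    return min(losses, key=losses.get), losses

# Noisy predictions around a slowly varying true speed:
rng = np.random.default_rng(0)
truth = np.linspace(10, 15, 1200)
preds = truth + rng.normal(0, 2, size=truth.shape)
k, losses = best_k(preds, truth)
# Smoothing with the best k should beat the raw (k = 1) predictions:
print(k, losses[k] < losses[1])
```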

When we average over 24 frames we observe an MSE loss of nearly 1.01. In real life, though, this might not be the best idea for inference, since for the first ~1.2 seconds (24 frames at 20 fps) we wouldn't have any speeds yet.

Hence I plot the predictions for different values of k to show that the real results do not differ tremendously: the loss varies from 1.01 to 1.9 as k goes from 24 down to 6.

I save the model that produces the lowest heldout loss for some value of k, and use it for inference. I got the lowest loss at epoch 26 with k = 24.

On a Titan Black I could achieve these results within 45 minutes of training. Also, at inference time, I could process close to 250 frames per second.

### Conclusion and Future Thoughts

Lately there have been improvements at the intersection of optical flow and deep learning, where DL models are used to produce the optical flow between two images. Using that for this project, however, would mean building an end-to-end system: one that is responsible for capturing the important features between two frames, computing the movement between them, and then mapping it to a speed. Again, one lesson from Machine Learning Yearning is that end-to-end models are good, given enough data.

Hence, our choice of optical flow as the input seems well justified given the dataset size.

But if there were infinite data, I'd definitely like to start with segmentation/detection techniques (U-Net, YOLO, etc.) so that our model can distinguish roads, cars, etc. Then probably try an end-to-end model, and experiment with FlowNet 2.0 on both segmented and non-segmented input. I'd also try RNNs, with the hope that they won't produce such volatile output (because they'd capture acceleration from previous frames too).
