79th Place Silver Solution in Kaggle’s Deepfake Detection Competition!
Happy to say that our team got 79th place in Kaggle’s Deepfake Detection competition! This was good for a silver medal. In this blog post I will give an overview of our solution. In essence, we tried to include as much model diversity in our ensemble as possible in order to survive a leaderboard shakeup.
For us (Koen Botermans, Kevin Delnoye and me) this competition was 3 months of blood, sweat and tears, with ups and downs and a radical switch from Keras to Pytorch halfway through. I hope the insights in this post will be helpful to other competitors.
Like many other competitors, we used a frame-by-frame classification approach on detected faces. We agonized over incorporating the audio data into our model or doing sequence modeling (e.g. LSTM cells), but ended up not experimenting with either. This freed up a lot of time to focus on creating datasets and training classification models.
The final kernel that we used for our submission can be found here:
Over the course of the competition we created several image datasets of face detections. The first baseline (ImagesFaces1) contained only the first face detected in each video, using MTCNN as our face detector. Fine-tuning EfficientNetB5 on this dataset already gave us a pretty reasonable baseline (0.23083 local validation, 0.39680 on the public leaderboard). Halfway through we switched to RetinaFace as our face detector, as it was both more accurate and faster. The first dataset after this switch (ImagesFaces3) contained 200K faces as training data. The final dataset (ImagesFaces6) featured 2M face detections from the training videos. Our final solution includes models trained on both ImagesFaces3 and ImagesFaces6. Arguably, 2M face detections is overkill, but luckily we had sufficient computational resources to experiment with it. In all datasets we took a 20% sample as validation data, split so as to avoid data leakage due to the limited number of actors. The datasets were balanced: 50% real and 50% fake faces.
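The split-and-balance logic above can be sketched as follows. This is a minimal illustration, not our actual pipeline code; the `actor_id` and `label` record fields are hypothetical names for whatever metadata identifies the actor and the real/fake label of each face crop. The key point is splitting by actor, so the same person never appears in both train and validation.

```python
import random

def actor_aware_split(face_records, val_fraction=0.2, seed=42):
    """Split face crops into train/validation by actor, so the same
    actor never appears in both sets (avoids identity leakage)."""
    actors = sorted({r["actor_id"] for r in face_records})
    rng = random.Random(seed)
    rng.shuffle(actors)
    n_val = max(1, int(len(actors) * val_fraction))
    val_actors = set(actors[:n_val])
    train = [r for r in face_records if r["actor_id"] not in val_actors]
    val = [r for r in face_records if r["actor_id"] in val_actors]
    return train, val

def balance_labels(records, seed=42):
    """Downsample the majority class to a 50/50 real/fake mix."""
    rng = random.Random(seed)
    real = [r for r in records if r["label"] == 0]
    fake = [r for r in records if r["label"] == 1]
    n = min(len(real), len(fake))
    return rng.sample(real, n) + rng.sample(fake, n)
```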
We started out with MTCNN as our face detector, but it was too slow and we believed we could get better detections. Eventually we settled on RetinaFace, which offered both better accuracy and faster inference.
The MTCNN library we used is built on Tensorflow, so naturally our classification models were based on Tensorflow/Keras. We ran experiments with Xception and EfficientNetB0-B8.
When we switched to RetinaFace we could only find a Pytorch implementation, and since both Tensorflow and Pytorch try to claim GPU memory for themselves, we felt forced to pick a single framework for both face detection and classification. With 1.5 months left, we therefore chose to go with RetinaFace and started experimenting with Pytorch models.
Models and Ensembling strategy:
Our final 79th place solution is a hotchpotch of EfficientNet models. This haphazard combination was intentional: diversity gives the ensemble a better chance of generalizing well to new data. The final models that were used:
EfficientNetB6, 200×200 resolution, ImagesFaces6
EfficientNetB5, 224×224 resolution, ImagesFaces3
EfficientNetB5, 224×224 resolution, ImagesFaces6
EfficientNetB6, 224×224 resolution, ImagesFaces6, finetuned for one epoch on our validation data.
EfficientNetB4, 224×224 resolution, ImagesFaces6, data augmentation and label smoothing.
All models are trained with a starting learning rate of 0.001 and using a “Reduce on Plateau” learning rate schedule with a patience of 2 and with a multiplier of 0.5. All models have a dropout of 0.4 before the final layer and group normalization so we don’t lose performance with small batch sizes (e.g. 20).
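The schedule described above can be sketched in plain Python. This is a minimal re-implementation of the “Reduce on Plateau” logic with our settings (start LR 0.001, patience 2, multiplier 0.5), not the actual scheduler class we used (in Pytorch that would be `torch.optim.lr_scheduler.ReduceLROnPlateau`):

```python
class ReduceOnPlateau:
    """Halve the learning rate once validation loss has failed to
    improve for more than `patience` consecutive epochs."""

    def __init__(self, lr=0.001, factor=0.5, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor  # multiplier of 0.5
                self.bad_epochs = 0
        return self.lr
```

With patience 2, the third non-improving epoch in a row triggers the halving.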
We settled on a simple mean of all model predictions for each video. We made 25 predictions per video on evenly spread frames. No weighting of models was done. The predictions were clipped to the range [0.01, 0.99].
In order to speed up the inference process we decreased the frame resolution by a factor of 2. Most original video frames were already high resolution so the face detector was still effective on a frame with a reduced resolution.
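The per-video aggregation can be sketched as below. This is an illustrative reconstruction of the procedure described above (evenly spread frame sampling, a simple mean, and clipping), with hypothetical function names rather than our actual inference code:

```python
def evenly_spread_indices(n_frames, n_samples=25):
    """Pick n_samples frame indices evenly spread across the video."""
    if n_frames <= n_samples:
        return list(range(n_frames))
    step = (n_frames - 1) / (n_samples - 1)
    return [round(i * step) for i in range(n_samples)]

def video_prediction(frame_probs, lo=0.01, hi=0.99):
    """Average the per-frame fake probabilities (already ensembled
    over models) and clip the result to [lo, hi]."""
    mean = sum(frame_probs) / len(frame_probs)
    return min(max(mean, lo), hi)
```

Clipping bounds the log-loss penalty of a confidently wrong video, which matters when a single 0.0 or 1.0 prediction can dominate the score.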
Throughout the competition we generally saw a strong correlation between our local validation scores and public leaderboard scores: a lower log loss on local validation generally meant a better leaderboard score. However, our final validation scores were suspiciously low (approx. 0.068 log loss), which suggests we most likely had some data leakage in our training data, though we never pinned down where it was coming from.
One thing I regret is that we underestimated the power of data augmentation for this competition. With only a month left we started using rotation (15 degrees) and flipping to augment the data, but looking at other public solutions we probably could have gotten a much better score with more extensive data augmentation.
What we would have liked to try but didn’t:
– LSTM cells
– Mixed precision (with NVIDIA DALI or apex).
– Include audio data and extract features using LSTM cells.
– Including ResNeXT101_wsl models. We trained a few, but the weight files were too large to add them to the ensemble.
– Stochastic Weight Averaging (SWA).
– Fine-tuning “Noisy Student” weights (The Noisy Student paper was fairly new and not implemented in the libraries we used yet.)
– Training with Mixup.
– Taking the difference between frames and including it as a channel.
What didn’t work:
– Taking the median of predictions for a video.
– Naïve postprocessing (Changing a prediction of 0.8 to 0.95, etc.)
– Clipping more than 0.01 (Sometimes there was a leaderboard improvement, but we decided it was too risky).
– Test Time Augmentation (TTA) with rotations.
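On the clipping point above: clipping to [0.01, 0.99] caps the worst-case per-video log loss at -ln(0.01) ≈ 4.61, while tighter bounds (clipping “more than 0.01”) raise the reward for correct confident predictions but also the penalty when the model is confidently wrong. A quick back-of-envelope check:

```python
import math

def log_loss_one(y_true, p):
    """Binary log loss for a single prediction p against label y_true."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A confidently wrong prediction clipped at 0.01 costs -ln(0.01) ~= 4.61;
# clipped at 0.001 instead, the same miss would cost -ln(0.001) ~= 6.91.
worst_at_001 = log_loss_one(1, 0.01)
worst_at_0001 = log_loss_one(1, 0.001)
```

That asymmetry is why we judged more aggressive clipping too risky despite occasional leaderboard gains.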
I hope you got some insights from this solution overview! Feel free to ask questions or give feedback on this approach!
Our original discussion post on Kaggle can be found here: