Article written for Craig Glastonbury’s blog

Competition Background

In this article, I will try to describe the approach we used in the Intel & MobileODT Cervical Cancer Screening Kaggle competition. The aim of the competition was to develop an algorithm that could identify a woman’s cervix type based on pictures taken during examination. This is an important cause that deserves attention, because correct identification helps select the proper treatment method, potentially saving a woman’s health or even her life.

This was a competition with 2 stages. The 1st stage test data was made public at the beginning of the competition, and Kaggle’s public leaderboard (LB) reflected participants’ models’ results on this dataset. We were able to tweak and validate our models during the 1st stage, and, according to the rules, a full pipeline for training and prediction had to be created during this stage so it could be reused in the 2nd stage, when our models had to be frozen and no further hyperparameter manipulation was possible.

Creating 2-staged competitions is essential when the dataset is pretty small, as it was here. A test set consisting of a few hundred examples can possibly be labeled by hand, or its labels can be mined using a feedback loop based on Kaggle’s public LB; a tool to mine the true labels was developed and published by Oleg Trott during Data Science Bowl 2017 in his kernel. To avoid this, the final test dataset is released during the 2nd stage of the competition, in the last week, when the pipelines are already frozen, so participants can only predict on the 2nd stage test data and submit their predictions to Kaggle. The final, private LB is based on the 2nd stage predictions.

We were given training data consisting of the single best picture chosen for each patient. In addition, there was an additional training dataset containing all pictures, so each patient’s cervix usually appeared more than once. There were three classes to differentiate, and a quick tutorial (NSFW) was provided to help us understand more about the various cervix types. The problem was that for most participants (including us) it was very hard to tell confidently which image belonged to which class. During the competition the organizers themselves made a few changes to the labels of our training set, so it appeared that differentiating between the types wasn’t hard just for us.

The pictures themselves varied significantly in quality: some were very sharp and thus contained a lot of information for a potential model to base its predictions on, while others were very blurry, possibly only introducing noise. A few files were corrupted and had to be skipped during data processing and model training.

There was no consensus among participants on whether to use the additional data: on one hand it significantly increased the size of the training set (from 1481 images in the original training set to 8430 images in the additional set alone, according to an EDA kernel), but on the other it complicated the data split, because when using the additional set, the data should be split by patient so that nearly identical images do not occur in both the training and validation sets, biasing the loss.
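A patient-wise split is straightforward with scikit-learn’s GroupShuffleSplit, treating the patient ID as the group key. A minimal sketch with made-up patient IDs (the real IDs came from the competition’s file names):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in data: 10 images belonging to 4 patients (IDs are hypothetical).
image_ids = np.arange(10)
patient_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])

# Split by patient so near-duplicate images of the same cervix
# never end up on both sides of the split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, val_idx = next(gss.split(image_ids, groups=patient_ids))

train_patients = set(patient_ids[train_idx])
val_patients = set(patient_ids[val_idx])
assert train_patients.isdisjoint(val_patients)  # no patient leaks across sets
```

With a random per-image split instead, near-identical frames of the same cervix would land on both sides and the validation loss would look deceptively good.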

With the background introduced, let’s get to the more interesting part: how did we approach the problem?

Our Approach

We were working in a team of four, so to begin with, big thanks to all my teammates - Abhishek, Craig and Florian!

Given the nature of the images, where only the part containing the cervix should matter to the model, we wanted to start by getting rid of the background, which could fool the model into learning background features and treating them as discriminative instead of the truly important cervical features. That is why we decided to approach the problem from an object detection perspective, where we had some experience from the previous NCFM competition, which Craig wrote about. After a potential ROI (region of interest) is extracted by a detection model, a CNN-based classifier can be used to output the probability of an image belonging to each class. There is also an end-to-end alternative: train a detector and use the class probabilities it assigns to each picture directly in the submission. This is feasible because most state-of-the-art object detection models also learn to distinguish between classes, so their final output contains not just the ROI but also class predictions.

To train a detector we needed bounding box coordinates. Fortunately, those were prepared by Paul and published in the discussions.

Cervix Detection

Object detection methods we used:

  • Florian’s VGG-based Bounding Box Regression Model
  • YOLOv2
  • Faster R-CNN

YOLO and Faster R-CNN were trained using Paul’s annotations, and Florian made his own bounding box coordinates for his VGG-based localization model. The biggest difference between the regression model and the other two is that frameworks such as YOLO or Faster R-CNN can find several potential objects in an image, whereas bounding box regression models usually use MAE or MSE to predict four coordinates, assuming there is only one object of interest in the image. For this problem, finding just one ROI was perfectly sufficient.
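A single-box regression model of this kind can be sketched in a few lines. This is only an illustration of the idea, not Florian’s actual code: a VGG base with a 4-unit regression head, trained with MSE. Note that `weights=None` keeps the sketch runnable offline; `weights="imagenet"` would be the realistic choice.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

# VGG backbone; weights="imagenet" in practice, None here to avoid the download.
base = VGG16(include_top=False, weights=None, input_shape=(224, 224, 3))

x = layers.Flatten()(base.output)
x = layers.Dense(256, activation="relu")(x)
# Four outputs: one (x, y, w, h) box, coordinates normalised to [0, 1],
# so a sigmoid keeps them in range. This hard-codes the single-object assumption.
box = layers.Dense(4, activation="sigmoid")(x)

model = Model(base.input, box)
model.compile(optimizer="adam", loss="mse")  # or "mae"
```

Because the head outputs exactly one box, such a model cannot rank multiple candidate regions the way YOLO or Faster R-CNN can, which is the trade-off described above.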

During NCFM I had some fun with Darknet’s YOLO, so I had an almost-ready pipeline for training and prediction with this detection model. It was one of the best object detection systems at the time (the YOLO9000 paper is worth a look), so we decided it was worth trying as the baseline approach. It gave pretty good results, accurately localizing the cervix in most of the images (more than 90%, at least), except for some blurry ones and those that could not be processed.

After some time, I found a Faster R-CNN implementation in Keras and decided it was worth trying too. Preparing the pipeline for processing and feeding the data into it proved quite easy, and within a day our next object detection model was ready to be trained (see the Faster R-CNN paper). Its ROI output seemed even more accurate than the YOLO crops, so preparing the R-CNN proved a good time investment.

Cervix Classification

We had three sets of cervix ROI images coming from our three detection models. With most of the background noise now excluded, it was time to classify.

Because the training data was small, bagging each classification model a few times and using an ensemble of models for the final submission should improve the score by removing some of the variance. Every time a deep neural network is trained, especially on smaller datasets and when using cuDNN (which is non-deterministic), it learns something slightly different, so averaging at least a few runs (bagging) should improve overall performance. Sometimes the gains are more significant, sometimes less, but when you’re fighting for a 0.001 or even 0.0001 improvement on Kaggle, the trick is worth remembering! Aware of the need to bag and ensemble models, I decided to create a convenient training pipeline in which parameters such as the crop set to train on, or the model architecture itself, could be changed easily.
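The bagging step itself is just an average over the softmax outputs of the separately trained runs. A minimal numpy sketch with random stand-in predictions (the real ones came from the trained networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical softmax outputs from 5 bagged runs of the same model:
# shape (runs, images, classes). Dirichlet samples are valid
# probability vectors, so they mimic softmax outputs.
n_runs, n_images, n_classes = 5, 4, 3
preds = rng.dirichlet(np.ones(n_classes), size=(n_runs, n_images))

# Bagging = averaging across runs; the result is still a valid
# probability distribution per image.
bagged = preds.mean(axis=0)
assert np.allclose(bagged.sum(axis=1), 1.0)
```

The same averaging applies one level up when combining the bagged models from different crop sets into the final ensemble.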

During training, Keras’ ImageDataGenerator was used to perform real-time data augmentation, which was supposed to make the model invariant to subtle data perturbations such as zoom, random channel shifts, or horizontal flips. For test predictions, we used an ImageDataGenerator with the same parameters as in training, creating 25 differently augmented test image sets and averaging the predictions across them, another way to smooth out the predictions and offset the small dataset size. We trained on 299x299 images.

The other very important data manipulation was class oversampling combined with data augmentation. I prepared a script that created as many batches of differently augmented data for a given class as were needed to make all classes almost equal in size. This enabled us to notably increase the size of our training dataset, partly offsetting the fact that we did not use the additional dataset. Models trained on oversampled data achieved much better accuracy and were much more stable. The oversampling trick was probably the most important one used during the competition.
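The bookkeeping behind the oversampling script reduces to working out how many augmented copies of each image a class needs to roughly match the largest class. A sketch with made-up class counts (the real counts came from the training set):

```python
import math

# Hypothetical class counts; the real ones came from the training set.
class_counts = {"Type_1": 250, "Type_2": 780, "Type_3": 450}

target = max(class_counts.values())

# For each class: how many *extra* augmented copies of every original
# image are needed to roughly reach the largest class. Each copy would
# be produced by a differently parameterised ImageDataGenerator pass.
augment_factor = {
    cls: math.ceil(target / n) - 1 for cls, n in class_counts.items()
}

# Resulting (approximately balanced) class sizes after oversampling.
balanced = {
    cls: n * (augment_factor[cls] + 1) for cls, n in class_counts.items()
}
print(augment_factor)  # {'Type_1': 3, 'Type_2': 0, 'Type_3': 1}
print(balanced)        # {'Type_1': 1000, 'Type_2': 780, 'Type_3': 900}
```

The ceiling means classes end up almost, not exactly, equal in size, which matches what the script aimed for.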

As for the models themselves, we used a technique named transfer learning: we initialized our models with ImageNet weights and trained them with a lower learning rate to fine-tune the networks on our cervix dataset.

Classification models were trained using the Keras framework, so we made use of Keras’ pretrained models; for our ensemble, architectures based on:

  • ResNet 50
  • Inception v3
  • Xception

were selected.
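The transfer-learning setup can be sketched as follows. This is a minimal illustration assuming tf.keras, not our exact pipeline; `weights=None` keeps the sketch runnable offline, whereas `weights="imagenet"` is what the approach actually relies on.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50

# Pretrained backbone; weights="imagenet" in practice, None here
# to avoid the weight download in this sketch.
base = ResNet50(include_top=False, weights=None, input_shape=(299, 299, 3))

# Pooling head (the variant our final ensemble kept, see below).
x = layers.GlobalAveragePooling2D()(base.output)
out = layers.Dense(3, activation="softmax")(x)  # three cervix types
model = Model(base.input, out)

# A lower learning rate than usual, so fine-tuning nudges the
# ImageNet features instead of destroying them.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```

Swapping `ResNet50` for `InceptionV3` or `Xception` from `tensorflow.keras.applications` gives the other ensemble members with the same head.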

I experimented with adding Average Pooling and Dense layers as the final layers of the networks. Models based on Average Pooling or Global Average Pooling performed much more stably, achieving losses between 0.6 and 0.7 in most cases, with some outliers on both sides: a few runs attained a loss around 0.5 and a few around 0.8. Dense-based networks were inconsistent; some converged well, to losses around 0.6-0.7, similar to the AvgPool architectures, and some couldn’t even cross the 1.0 log loss boundary. That was the reason to keep only the pooling models in the final ensemble.

The validation split for each model was random, without a fixed seed. That caused the high loss variance across runs, but it also let each model learn different features from a different subset of the data, which I deemed important given the small dataset. Six runs, each bagged 5 times, were finally used: three on Faster R-CNN crops, two on YOLO crops, and one on VGG localization network crops. The selection was based on earlier runs, in which validation loss was averaged across a few bags.
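One plausible reason for the instability of the Dense-based heads is sheer parameter count. A back-of-the-envelope comparison, assuming a hypothetical 10x10x2048 final feature map (roughly what a ResNet 50 base produces at 299x299 input; the exact size depends on the architecture):

```python
# Rough parameter counts for the two classifier-head variants,
# with an assumed 10x10x2048 feature map and 3 classes.
h, w, c, n_classes = 10, 10, 2048, 3

# Global Average Pooling collapses each channel to one number,
# so the classifier only sees a 2048-dim vector.
gap_params = (c + 1) * n_classes

# Flatten + Dense sees every spatial position separately.
dense_params = (h * w * c + 1) * n_classes

print(gap_params)    # 6147
print(dense_params)  # 614403
```

Two orders of magnitude more head parameters on a training set this small is a recipe for the erratic convergence we observed in the Dense variants.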


Our team achieved 18th place out of 848 teams, earning a silver medal and missing the gold tier by around 0.015 (0.84791 vs 0.83367). The final score could most probably have been better had we introduced models such as DenseNet or ResNet 101/152. Well, at least we will know what to do next time :). Big thanks for reading, and in case of any questions, I’ll be happy to answer them!