When we want to work on Deep Learning projects, we have quite a few frameworks to choose from nowadays. Some, like Keras, provide a higher-level API that makes experimentation very comfortable. Others, like Tensorflow or Pytorch, give the user control over almost every knob in the process of designing and training a model. There are cases where ease of use matters more, and others where we need full control over our pipeline.

Whenever a model is designed and an experiment performed, one question remains: how fast does the model train, and can it be trained faster? This aspect is especially important when we are training big models or have a large amount of data.

Personally, I Kaggle a lot, so more often than not I have to use ensembles of various models. When many models are trained, training time is key: the quicker they can be trained, the more of them can fit into my ensemble. That is why I decided to pick the three currently most popular frameworks for Deep Learning:

  • Tensorflow
  • Pytorch
  • Keras

and measure training speed of a few most widely known models using their official (or as close to official as possible) implementations.

Keras is a wrapper around Tensorflow, so I thought it would be even more interesting to compare the speed of theoretically identical models implemented differently and trained through different APIs.


The comparison is performed on Convolutional Neural Networks: reference implementations of several of the most popular models exist, and those models are used in a wide variety of tasks, so the results should be of practical use to people working with CNNs who are considering picking a specific framework for their project.

Library versions:

  • Tensorflow 1.4.0
  • Keras 2.1.1
  • Pytorch 0.2.0+f964105

Experiment setup


  • Training is performed on a single GTX1080
  • Training time is measured during the training loop itself, without validation set
  • In all cases training is performed with data loaded into memory
  • The only layer that is changed is the last dense layer, to accommodate the dataset's 120 classes


Dataset: Kaggle Dog Breed Identification

All models are trained on the exact same data, where the same method of data loading & preprocessing is applied.

  • Data is loaded into memory as RGB images using scipy.misc.imread function
  • Images are fed into models in float32 format
  • All images are normalized into the 0-1 range
  • All images are resized to (224, 224)
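The dtype and normalization steps above can be sketched in plain NumPy (the function name is illustrative; loading with scipy.misc.imread and resizing are omitted to keep the snippet self-contained):

```python
import numpy as np

def to_model_input(img_uint8):
    """Convert an RGB image (H, W, 3) from uint8 to float32 in the 0-1 range,
    matching the preprocessing described above. Resizing to (224, 224)
    happens separately in the actual pipeline."""
    return img_uint8.astype(np.float32) / 255.0

# Stand-in for an image loaded with scipy.misc.imread:
raw = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
x = to_model_input(raw)  # float32, values in [0, 1]
```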

Framework details


Pytorch

  • Models from torchvision are used
  • Images are fed in default NCHW format
  • A custom Dataset is written to load data from an np.array, with .ToTensor() as the only transformation
  • The Dataset is wrapped into a DataLoader with the following parameters:

train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=args.batch_size,
                                           shuffle=True, pin_memory=True)

  • During training torch.backends.cudnn.benchmark is set to True


Tensorflow

  • Models from Tensorflow-Slim are used
  • Images are fed in default for TF-Slim format: NHWC
  • XLA JIT is not used
  • Dataset API is used for data loading:

dataset = tf.data.Dataset.from_tensor_slices((_data_tr, _labels_tr))
dataset = dataset.repeat(None)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(batch_size * 4)
iterator_tr = dataset.make_initializable_iterator()

and the iterator is initialized with, feed_dict={_data_tr: X_tr,
                                                    _labels_tr: y_tr})

during training.

  • tf.nn.softmax_cross_entropy_with_logits_v2 is used as loss function
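For reference, the per-example loss that tf.nn.softmax_cross_entropy_with_logits_v2 computes can be sketched in plain NumPy as a numerically stable log-softmax followed by the cross-entropy sum:

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Numerically stable softmax cross-entropy, one loss value per example,
    mirroring what tf.nn.softmax_cross_entropy_with_logits_v2 computes."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # stability shift
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(labels * log_softmax).sum(axis=-1)

logits = np.array([[2.0, 1.0, 0.1]])
labels = np.array([[1.0, 0.0, 0.0]])   # one-hot over 3 classes
loss = softmax_cross_entropy(logits, labels)  # ~0.417
```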



Models

For Tensorflow and Keras, 5 models were picked:

  • VGG16
  • VGG19
  • ResNet50
  • Inception V3
  • InceptionResNet V2

For Pytorch 3 models were picked:

  • VGG16
  • VGG19
  • ResNet50

Regarding the Pytorch models:

  • Inception V3 did not work when the last layer was changed, so the model was omitted in order not to skew the results, as the reference implementation would have had to be modified
  • InceptionResNet V2 was not implemented, so the model was omitted


Training details

  • All models are trained from scratch, without ImageNet weights
  • Training is performed for 10 epochs
  • Each model is trained with Adam and with SGD, with batch size 4 and batch size 16; this results in 4 runs per model per framework
  • On plots with no grouping by batch size or optimizer, the result shown for each model is the average of those 4 runs
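The 4 runs per model per framework are just the cross product of the two optimizers and two batch sizes, and the ungrouped plots average over them; a sketch with made-up epoch times:

```python
from itertools import product

optimizers = ["Adam", "SGD"]
batch_sizes = [4, 16]

# One (optimizer, batch size) pair per run: 4 runs per model per framework.
runs = list(product(optimizers, batch_sizes))

# Illustrative epoch times in seconds for one model, one per run
# (not actual measurements):
epoch_times = {run: t for run, t in zip(runs, [20.0, 14.0, 19.0, 13.0])}
mean_time = sum(epoch_times.values()) / len(epoch_times)  # value on ungrouped plots
```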


Number of parameters

In theory, the more parameters a model has, the more operations are needed for each gradient update, so we expect training time to grow with the number of parameters.


We can see that InceptionV3 and ResNet50 have the lowest number of parameters, around 22 and 23 million respectively. InceptionResNetV2 has around 55 million parameters. Both VGG models have by far the highest number: around 135 million for VGG16 and 140 million for VGG19.
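A quick sanity check on where VGG16's parameters come from: the dense head alone, starting from the flattened 7x7x512 conv output, accounts for the bulk of the ~135 million total, which is why both VGGs dwarf the purely convolutional architectures.

```python
# Parameters of a dense layer: inputs * outputs + biases.
fc1 = 7 * 7 * 512 * 4096 + 4096   # ~102.8M: the single largest layer in VGG16
fc2 = 4096 * 4096 + 4096          # ~16.8M
fc3 = 4096 * 120 + 120            # last layer swapped for the 120 dog breeds

dense_total = fc1 + fc2 + fc3     # ~120M of the ~135M total
# The convolutional layers contribute only the remaining ~15M.
```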

We will see whether this also holds in practice.

Model training duration



InceptionResNet V2 takes the longest time per epoch; the difference is especially visible for batch size 4 (left facet). In this configuration, training takes more than 50% longer for IncResNet than for VGG19, which comes in 2nd place. More interestingly, this difference becomes much less significant when batch size is set to 16: then there is only around a 14% difference.



In Tensorflow, VGG19 trains the longest, whereas InceptionResNet seems better optimized and is quicker than both VGG16 and VGG19. There is also a much less significant difference between InceptionResNet trained with batch size 4 and 16.



In Pytorch the Inception models were not trained, so only ResNet and the VGGs are available for comparison. The recurring trend can be seen here as well: ResNet is the fastest, whereas the VGGs take longer to train.

Training time comparison

By framework


X-axis labels are omitted for clarity of presentation.

When models are grouped by framework, it is clear that Keras training duration is much higher than Tensorflow's or Pytorch's. Here, mean values over 4 runs per model are shown (Adam & SGD optimizers, batch sizes 4 & 16). ResNet50 trains around 80% faster in Tensorflow and Pytorch than in Keras. Comparing TF with Keras, big differences occur for both Inception models (V3: 11.6 vs 16.3s, IncResNetV2: 16.9 vs 33.5s). The smallest differences are in the VGG family, where the gap between Keras and the other two frameworks is below 25%.

By model


Now we group frameworks by model to see which models were fastest in which framework. For the Inception models, only TF can be compared to Keras, and in both cases Tensorflow is faster. ResNet50 achieves its lowest training time in Tensorflow. The VGG models stand in opposition to that: both train quickest in Pytorch.

First epoch vs mean training time


Finally, all model runs per framework were averaged to produce one simple plot that concludes the whole experiment. The difference between Tensorflow and Pytorch is negligible, about 1%, but when those frameworks are compared to Keras, a significant difference appears. Mean training time for TF and Pytorch is around 15s per epoch, whereas for Keras it is 22s, so models in Keras need about 50% more time than they do in TF or Pytorch.

In addition to that, every Keras user has probably noticed that first epoch during model training is usually longer, sometimes by a significant amount of time. I wanted to capture this behavior by plotting averaged time of 10 epochs versus time of just the first epoch.

Here the difference shows even more clearly: the first epoch in Keras takes more than 80% longer than in TF and more than 70% longer than in Pytorch. Tensorflow wins on first-epoch time, and for both TF and Pytorch the difference between the first and later epochs isn't very significant.


Conclusions

It seems there is no significant difference in speed between Pytorch and Tensorflow when training well-known CNNs. But there is one, and it will be felt when Keras is chosen over the other two.

As everywhere, there is a trade-off: simplicity comes at a cost. If your data is not very big, or you need to focus mostly on rapid experimentation and want a flexible framework that makes model training easy, pick Keras. In 95% of cases it will give you all the tools you need, and those few minutes or even hours of extra training time will be made up by easier experiment design and pipeline creation.

On the other hand, when you need high-performance models, which can probably be further optimized and speed is of the utmost importance, consider spending some time on developing your Tensorflow or Pytorch pipeline.

The number of parameters does in fact increase training time in most cases. VGGs need more time to train than Inception or ResNet, with the exception of InceptionResNet in Keras, which needs more time than the rest although it has fewer parameters.

Further remarks

The Pytorch and Tensorflow pipelines can probably be optimized further, so I am not claiming these numbers represent 100% of the performance that can be squeezed out of those frameworks. I wanted to show how the frameworks perform almost out-of-the-box, when most parameters, such as image data format (channel configuration) or model definitions, aren't fully tuned. For example, in the case of TF, XLA could be used, as well as the NCHW configuration, which is the recommended one for training on GPUs.
