
A Deep Learning and Medical Imaging enthusiast. Master's student in Biomedical Engineering at FH Aachen University of Applied Sciences, Germany.

Understanding Vision Transformers: Transformers for Image Recognition

Link to paper —

Nowadays, transformers have become the go-to architecture for Natural Language Processing (NLP) tasks (BERT, GPT-3, and so on). On the other hand, the use of transformers in computer vision tasks is still quite limited. For computer vision applications, most researchers use convolutional layers directly, or add attention blocks alongside the convolutional blocks (as in Xception, ResNet, EfficientNet, DenseNet, Inception, and so on). The Vision Transformer (ViT) paper implements a pure transformer model, without any convolutional blocks, on sequences of image patches to classify images. …
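To make the idea concrete, here is a minimal sketch (not the paper's full model) of how an image can be split into fixed-size patches and projected into embeddings that a standard Transformer encoder can consume. The image size, patch size, and embedding dimension below are illustrative values only.

```python
import tensorflow as tf

# Sketch of the ViT front end: split an image into non-overlapping patches,
# flatten each patch, and project it to an embedding vector.
# Sizes are illustrative, not tied to a specific ViT variant.
image_size, patch_size, embed_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches

inputs = tf.keras.Input(shape=(image_size, image_size, 3))

# A strided convolution is a common way to extract and embed patches in one step.
x = tf.keras.layers.Conv2D(filters=embed_dim,
                           kernel_size=patch_size,
                           strides=patch_size)(inputs)          # (None, 14, 14, 768)
patch_embeddings = tf.keras.layers.Reshape((num_patches, embed_dim))(x)

# In the full model, a learnable [class] token and positional embeddings are
# added here, and the sequence is fed through standard Transformer encoder blocks.
model = tf.keras.Model(inputs, patch_embeddings)
model.summary()  # final output shape: (None, 196, 768)
```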


Understanding the best and most efficient CNN model currently available — EfficientNet

When convolutional neural networks are developed, they are designed at a fixed resource cost. These networks are scaled up later to achieve better accuracy when more resources become available. For example, a ResNet-18 model can be scaled up to a ResNet-200 model by adding more layers to the original model. In most situations, this scaling technique has yielded better accuracy on most benchmarking datasets. But conventional model-scaling techniques are fairly arbitrary: some models are scaled depth-wise, some are scaled width-wise, and some simply take in higher-resolution images to get better results…
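For context, EfficientNet's key idea is compound scaling: depth, width, and input resolution are scaled together with a single coefficient. Below is a rough, back-of-the-envelope sketch; the alpha, beta, and gamma values are the ones reported in the EfficientNet paper for the B0 baseline, while the helper function itself is purely illustrative.

```python
# Rough sketch of EfficientNet's compound scaling rule.
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth, width, resolution multipliers (from the paper)

def compound_scale(phi):
    """Scale depth, width, and resolution together with one coefficient phi."""
    depth_mult = alpha ** phi        # more layers
    width_mult = beta ** phi         # more channels per layer
    resolution_mult = gamma ** phi   # larger input images
    return depth_mult, width_mult, resolution_mult

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```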


Thanks for the update. I had used TensorFlow 2.1.0, I believe, while running this code.


Learning to write custom loss using wrapper functions and OOP in python

Figure 1: Gradient descent algorithm in action (Source: Public Domain)

A neural network learns to map a set of inputs to a set of outputs from training data. It does so by using some form of optimization algorithm such as gradient descent, stochastic gradient descent, AdaGrad, AdaDelta, or more recent algorithms such as Adam, Nadam, or RMSProp. The ‘gradient’ in gradient descent refers to the error gradient. After each iteration, the network compares its predicted outputs to the true outputs and calculates the ‘error’. Typically, with neural networks, we seek to minimize this error. As such, the objective function used to minimize the error is often referred to as a…
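As a preview of the two approaches the article's title refers to, here is a minimal sketch of one custom loss written as a wrapper (closure) function and the same loss written as a class. The loss itself (a weighted mean squared error) and its names are made up for illustration, not taken from the article.

```python
import tensorflow as tf

# 1. Wrapper (closure) approach: the outer function takes extra parameters,
#    the inner function has the (y_true, y_pred) signature Keras expects.
def weighted_mse(weight):
    def loss_fn(y_true, y_pred):
        return weight * tf.reduce_mean(tf.square(y_true - y_pred))
    return loss_fn

# 2. OOP approach: subclass tf.keras.losses.Loss and implement call().
class WeightedMSE(tf.keras.losses.Loss):
    def __init__(self, weight=1.0, name="weighted_mse"):
        super().__init__(name=name)
        self.weight = weight

    def call(self, y_true, y_pred):
        return self.weight * tf.reduce_mean(tf.square(y_true - y_pred))

# Tiny model just to show how either loss is plugged into compile().
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss=weighted_mse(2.0))  # or loss=WeightedMSE(2.0)
```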


Ten years of work in less than ten months — Research, Trials, and Approval

Figure 1: The SARS-CoV-2 virus that causes COVID-19 (Source: CDC/Alissa Eckert, MS; Dan Higgins, MAM. The media comes from the Centers for Disease Control and Prevention's Public Health Image Library (PHIL), identification number #23312. This file was derived from SARS-CoV-2 (CDC-23312).png; Public Domain)

Making a vaccine is typically a very long process and can take up to 10 years from the start of research to actually distributing it to the public. The process involves various steps:

  1. Research
  2. Pre-Clinical Trials
  3. Phase 1 trials
  4. Phase 2 trials
  5. Phase 3 trials
  6. Manufacturing
  7. Approval by the governing body
  8. Distribution to public

These steps ensure that the vaccine is safe (has no or minimal side effects) and effective.

More than 72 million people have already been infected, and more than 1.6 million people have died in the pandemic. A vaccine was the only hope humanity had. And scientists around…


Understanding the Inception (GoogLeNet) Architecture

Figure 1. GoogLeNet (Inception) architecture (Source: Image from the original paper)

The LeNet architecture used 5x5 convolutions, AlexNet used a mix of 11x11, 5x5, and 3x3 convolutions, and VGG used only 3x3 convolutions. The question deep learning researchers wrestled with was which combination of convolutions to use on different datasets to get the best results.

For example, if we pick 5x5 convolutions, we end up with far more parameters and many more multiplications, which makes the network slow, but on the other hand, it is very expressive. If we pick 1x1 convolutions instead, it is much…
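To put rough numbers on that trade-off, here is a small comparison of parameter counts for a 5x5 versus a 1x1 convolution on the same input. The input size (28x28x192) and filter count are only illustrative choices.

```python
import tensorflow as tf

# Compare parameter counts of a 5x5 and a 1x1 convolution producing
# the same number of output channels from the same input tensor.
inputs = tf.keras.Input(shape=(28, 28, 192))

conv5 = tf.keras.layers.Conv2D(32, kernel_size=5, padding="same")(inputs)
conv1 = tf.keras.layers.Conv2D(32, kernel_size=1, padding="same")(inputs)

m5 = tf.keras.Model(inputs, conv5)
m1 = tf.keras.Model(inputs, conv1)
print("5x5 parameters:", m5.count_params())  # 5*5*192*32 + 32 = 153,632
print("1x1 parameters:", m1.count_params())  # 1*1*192*32 + 32 =   6,176
```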


We will see how to implement VGG16 from scratch using TensorFlow 2.0

Figure 1. VGG 16 architecture (Source: Image created by author)

LeNet-5 was one of the earliest convolutional neural network architectures, designed by Yann LeCun in 1998 to recognize handwritten digits. It used 5x5 filters, average pooling, and no padding. By modern standards, it was a very small neural network, with only about 60 thousand parameters. Nowadays, we see networks with anywhere from 10 million to a few billion parameters. The next big convolutional neural network, which revolutionized the use of convolutional networks, was AlexNet, with approximately 60 million parameters. The first layer of AlexNet uses 96 filters with kernel size 11x11, with…
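As a taste of what the implementation looks like, here is a minimal sketch of the first two VGG16 blocks in tf.keras, not the article's complete code: stacked 3x3 convolutions with 'same' padding, each block followed by 2x2 max pooling.

```python
import tensorflow as tf
from tensorflow.keras import layers

# First two of the five VGG16 convolutional blocks; the remaining blocks
# and the dense classifier head would follow the same pattern.
model = tf.keras.Sequential([
    layers.Input(shape=(224, 224, 3)),
    # Block 1: two 3x3 convolutions with 64 filters, then max pooling
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2, strides=2),
    # Block 2: two 3x3 convolutions with 128 filters, then max pooling
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2, strides=2),
])
model.summary()
```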


Even better than Inception

Figure 1. Xception architecture (Source: Image from the original paper)

Convolutional Neural Networks (CNNs) have come a long way: from the LeNet-style, AlexNet, and VGG models, which used simple stacks of convolutional layers for feature extraction and max-pooling layers for spatial sub-sampling, stacked one after the other, to Inception and ResNet networks, which use skip connections and multiple convolutional and max-pooling blocks in each layer. Since its introduction, the Inception network has been one of the best networks in computer vision. The Inception model uses a stack of modules, each containing a bunch of feature extractors, which allow it to learn richer representations with fewer parameters.

Xception paper —
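The preview above cuts off before Xception's own building block, but its central idea is to replace regular convolutions with depthwise separable convolutions: a spatial depthwise convolution followed by a 1x1 pointwise convolution. A small, illustrative parameter comparison in tf.keras (the input size and filter count are example values):

```python
import tensorflow as tf

# Compare a regular 3x3 convolution with a depthwise separable 3x3 convolution
# producing the same number of output channels from the same input tensor.
inputs = tf.keras.Input(shape=(28, 28, 256))

regular = tf.keras.layers.Conv2D(256, 3, padding="same")(inputs)
separable = tf.keras.layers.SeparableConv2D(256, 3, padding="same")(inputs)

print(tf.keras.Model(inputs, regular).count_params())    # 3*3*256*256 + 256 = 590,080
print(tf.keras.Model(inputs, separable).count_params())  # 3*3*256 + 256*256 + 256 = 68,096
```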


We will see how to implement ResNet50 from scratch using TensorFlow 2.0

Figure 1. Residual Blocks and Skip Connections (Source: Image created by author)

Deeper neural networks often perform better than shallow ones. But deep neural networks face a common problem, known as the ‘vanishing/exploding gradient problem’. The ResNet architecture was proposed to overcome this problem.

Link to original paper —

Residual Blocks:

ResNets contain residual blocks. As seen in Figure 1, there is an activation a[l], followed by a linear layer with a ReLU non-linearity that produces a[l+1]. That is followed by another linear layer, with another non-linearity, producing a[l+2]. This is what a normal or plain neural network looks like. What ResNet adds to this is the skip connection. In ResNet, the…
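A minimal sketch of such an identity residual block in tf.keras is shown below; it only illustrates the skip-connection idea, whereas ResNet50 itself uses 1x1 bottleneck convolutions inside its blocks, and the filter count and input size here are example values.

```python
import tensorflow as tf
from tensorflow.keras import layers

def identity_residual_block(x, filters):
    """Two conv layers whose output is added back to the input (skip connection)."""
    shortcut = x  # a[l], carried forward unchanged by the skip connection
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)          # a[l+1]
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, shortcut])           # add a[l] back before the activation
    return layers.Activation("relu")(x)       # a[l+2]

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = identity_residual_block(inputs, 64)
model = tf.keras.Model(inputs, outputs)
model.summary()
```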


This is Part 5 of the series on applying deep learning to X-ray imaging. Here the focus will be on various ways to implement data augmentation.

We saw in the previous part — Part 4 — how to tackle the class imbalance problem. In this section, we will focus on image normalization and data augmentation.

After the class imbalance problem is taken care of, we next look at ways to improve the performance of the neural network and also make it faster. We already have a similar number of images in each of the three classes in the training data: 1. Normal (no infection), 2. Bacterial Pneumonia, 3. Viral Pneumonia.

Bar chart of the number of images in each class, from Part 4 (Source: Image created by author)

Image Scaling/Normalization:

Neural networks work best when all the features are on the same scale. Similarly, optimization algorithms…
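A minimal sketch of how pixel scaling and simple augmentation might be set up with Keras' ImageDataGenerator follows; the directory name and the augmentation ranges below are illustrative assumptions, not values taken from the article.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Scale pixel values to [0, 1] and apply light, random augmentations on the fly.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # normalization: [0, 255] -> [0, 1]
    rotation_range=10,        # small random rotations (degrees)
    width_shift_range=0.1,    # random horizontal shifts
    height_shift_range=0.1,   # random vertical shifts
    zoom_range=0.1,           # random zoom in/out
)

train_generator = train_datagen.flow_from_directory(
    "chest_xray/train",       # hypothetical folder with one subfolder per class
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
```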

Arjun Sarkar
