The Discriminator Networks
Basic Idea
CycleGAN was introduced in the paper Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks.
Note: If you are not familiar with GANs in general, please refer to this post for the technical background. Also, don’t forget to check out our previous blog posts.
The CycleGAN paper uses the $70 \times 70$ PatchGAN architecture introduced in the paper Image-to-Image Translation with Conditional Adversarial Networks for its discriminator networks. The experimental results show that PatchGANs can produce high-quality results even with a relatively small patch size.
PatchGANs
The idea of a PatchGAN is to split the input image into small local patches, run a discriminator convolutionally over every patch, and average all the responses to obtain the final output indicating whether the input image is real or fake.
The main difference between a PatchGAN and a regular GAN discriminator is that the regular discriminator maps an input image to a single scalar in the range $[0, 1]$, indicating the probability of the image being real or fake, while a PatchGAN outputs an array in which each entry signifies whether its corresponding patch is real or fake.
According to the paper Image-to-Image Translation with Conditional Adversarial Networks, a PatchGAN is sufficient because the blurry images caused by failures at high frequencies, such as edges and fine details, can be alleviated by restricting the GAN discriminator to model only high-frequency structure, which is exactly what the PatchGAN is designed to do.
The CycleGAN paper adopts a PatchGAN as its discriminator because it has fewer parameters than a full-image discriminator, runs faster, and can be applied in a fully convolutional fashion to arbitrarily large images.
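To make the "average all the responses" step concrete, here is a minimal sketch (not taken from our implementation) of how a map of per-patch scores can be reduced to a single real/fake loss. The placeholder names and the $30 \times 30$ output size are assumptions for illustration; the least-squares objective is the one used in the CycleGAN paper.

import tensorflow as tf

# Hypothetical per-patch score maps, e.g. of shape [batch, 30, 30, 1],
# as produced by a PatchGAN discriminator on real and generated images.
patch_scores_real = tf.placeholder(tf.float32, [None, 30, 30, 1])
patch_scores_fake = tf.placeholder(tf.float32, [None, 30, 30, 1])

# Least-squares GAN objective: every patch is pushed towards 1 for real
# images and 0 for fake ones, and the per-patch terms are simply averaged
# into a single scalar loss for the discriminator.
d_loss_real = tf.reduce_mean(tf.squared_difference(patch_scores_real, 1.0))
d_loss_fake = tf.reduce_mean(tf.square(patch_scores_fake))
d_loss = 0.5 * (d_loss_real + d_loss_fake)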
Our Implementation
At this point we have implemented a simplified CycleGAN discriminator, a network of five convolution layers (Figure 1), including:
- four layers that extract features from the image, and
- one layer that produces the output (whether the image is real or fake).
We haven’t included the PatchGAN structure at this point; we plan to add it after testing the performance of this simplified version. To further understand how the PatchGAN works, we may use our current implementation as a baseline and evaluate it on more datasets if time allows.
Hyperparameters
The main hyperparameters of the discriminator are the number of output filters, the kernel size, and the stride. A basic configuration is shown in Table 1; further tuning will be needed when training the model.
Layer | Number of output filters | Kernel size | Stride |
---|---|---|---|
1 | 64 | 4 × 4 | 2 |
2 | 64 × 2 = 128 | 4 × 4 | 2 |
3 | 64 × 4 = 256 | 4 × 4 | 2 |
4 | 64 × 8 = 512 | 4 × 4 | 1 |
5 | 1 | 4 × 4 | 1 |
We also use padding so that pixels on the boundary of the image still contribute to the output.
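As a side note, the spatial output size of a convolution layer follows the standard formula below. With a $4 \times 4$ kernel, stride 2, and (as an assumption here) a padding of 1 pixel per side, an input of width 256 is halved to 128, which matches the first row of the network layout in the next section.

$$W_{\text{out}} = \left\lfloor \frac{W_{\text{in}} + 2p - k}{s} \right\rfloor + 1, \qquad \text{e.g. } \left\lfloor \frac{256 + 2 \cdot 1 - 4}{2} \right\rfloor + 1 = 128.$$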
Network Layout
The following is a layer-by-layer description of the overall discriminator network; a short sanity check of the listed dimensions follows the table. This is the exact model that we have implemented in our discriminator code.
Layer Number | Layer Type | Kernel Size | Stride | Dimension (in → out) | Channels (in → out) |
---|---|---|---|---|---|
1 | Conv2d | 4 | 2 | 256 → 128 | 3 → 64 |
2 | LReLU | - | - | 256 → 128 | 3 → 64 |
3 | Conv2d | 4 | 2 | 128 → 64 | 64 → 128 |
4 | BatchNorm2d | - | - | 128 → 64 | 64 → 128 |
5 | LReLU | - | - | 128 → 64 | 64 → 128 |
6 | Conv2d | 4 | 2 | 64 → 32 | 128 → 256 |
7 | BatchNorm2d | - | - | 64 → 32 | 128 → 256 |
8 | LReLU | - | - | 64 → 32 | 128 → 256 |
9 | Conv2d | 4 | 1 | 32 → 31 | 256 → 512 |
10 | BatchNorm2d | - | - | 32 → 31 | 256 → 512 |
11 | LReLU | - | - | 32 → 31 | 256 → 512 |
12 | Conv2d | 4 | 1 | 31 → 30 | 512 → 1 |
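As a sanity check, the short Python sketch below (assuming an explicit padding of 1 pixel per side, which reproduces the dimension column above) traces the spatial size through the five convolution layers and also computes the receptive field of a single output entry, which comes out to $70 \times 70$, the patch size quoted from the PatchGAN paper.

# Sanity check for the layer dimensions above (assumes 4x4 kernels and padding 1 per side).
layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]  # (kernel, stride) per conv layer

size = 256          # input width/height
rf, jump = 1, 1     # receptive field and cumulative stride ("jump")
for i, (k, s) in enumerate(layers, start=1):
    size = (size + 2 * 1 - k) // s + 1   # output spatial size with padding 1
    rf += (k - 1) * jump                 # grow the receptive field
    jump *= s
    print("layer %d: spatial size %d" % (i, size))

print("receptive field of one output entry: %d x %d" % (rf, rf))
# -> 128, 64, 32, 31, 30 and a 70 x 70 receptive field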
Code Snippet
A code snippet for our simplified discriminator is shown below.
Note: The code is inspired by the excellent implementations of leehomyc and hardikbansal.
import tensorflow as tf


def build_gen_discriminator(input_images, num_filters=64, scope="discriminator"):
    """Build model: a simplified discriminator.

    Args:
        input_images: [batch_size, img_width, img_height, img_channel],
            where img_channel refers to channels like R, G, B.
        num_filters: Number of output filters for the very first layer.
            Deeper layers use multiples (2x, 4x, 8x) of num_filters.
        scope: Change it according to the role of the discriminator.

    Returns:
        layer5: Decision map for the input images.
    """
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        filter_size = 4
        # Four feature-extraction layers followed by one output layer.
        layer1 = _conv2d_layer(input_images, num_filters, filter_size, filter_size, 2, 2, 0.02,
                               "SAME", "conv1", do_norm=False, do_relu=True, relu_alpha=0.2)
        layer2 = _conv2d_layer(layer1, num_filters * 2, filter_size, filter_size, 2, 2, 0.02,
                               "SAME", "conv2", do_norm=True, do_relu=True, relu_alpha=0.2)
        layer3 = _conv2d_layer(layer2, num_filters * 4, filter_size, filter_size, 2, 2, 0.02,
                               "SAME", "conv3", do_norm=True, do_relu=True, relu_alpha=0.2)
        layer4 = _conv2d_layer(layer3, num_filters * 8, filter_size, filter_size, 1, 1, 0.02,
                               "SAME", "conv4", do_norm=True, do_relu=True, relu_alpha=0.2)
        layer5 = _conv2d_layer(layer4, 1, filter_size, filter_size, 1, 1, 0.02,
                               "SAME", "conv5", do_norm=False, do_relu=False)
        return layer5


def _conv2d_layer(input_conv, num_filter=64, filter_h=4, filter_w=4, stride_h=1, stride_w=1, stddev=0.02,
                  padding="VALID", name="conv2d", do_norm=True, do_relu=True, relu_alpha=0):
    """Convolution layer for the discriminator.

    Supports instance normalization and (leaky) ReLU activation.

    Note:
        relu_alpha: Slope when x < 0, used in max(x, alpha * x).
    """
    with tf.variable_scope(name):
        conv = tf.contrib.layers.conv2d(input_conv, num_filter, [filter_h, filter_w], [stride_h, stride_w],
                                        padding, activation_fn=None,
                                        weights_initializer=tf.truncated_normal_initializer(stddev=stddev),
                                        biases_initializer=tf.constant_initializer(0.0))
        if do_norm:
            conv = _normalization(conv)
        if do_relu:
            if relu_alpha == 0:
                conv = tf.nn.relu(conv, "relu")
            else:
                conv = _leaky_relu(conv, relu_alpha, "leaky_relu")
        return conv


def _normalization(x):
    """Instance normalization, adapted from https://github.com/hardikbansal/CycleGAN."""
    with tf.variable_scope("instance_norm"):
        epsilon = 1e-5
        # Normalize over the spatial dimensions of each sample independently.
        mean, var = tf.nn.moments(x, [1, 2], keep_dims=True)
        scale = tf.get_variable('scale', [x.get_shape()[-1]],
                                initializer=tf.truncated_normal_initializer(mean=1.0, stddev=0.02))
        offset = tf.get_variable('offset', [x.get_shape()[-1]],
                                 initializer=tf.constant_initializer(0.0))
        out = scale * tf.div(x - mean, tf.sqrt(var + epsilon)) + offset
        return out


def _leaky_relu(x, relu_alpha, name="leaky_relu"):
    """Leaky ReLU: max(x, relu_alpha * x)."""
    with tf.variable_scope(name):
        return tf.maximum(x, relu_alpha * x)
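As a quick usage sketch (TensorFlow 1.x, assuming the functions and the tensorflow import above are in scope, and that the scope name is chosen freely), the discriminator can be instantiated on a 256 × 256 RGB input and its output shape inspected at graph-construction time.

# Build the graph on a 256 x 256 RGB input and inspect the decision map.
images = tf.placeholder(tf.float32, [1, 256, 256, 3], name="input_images")
decision = build_gen_discriminator(images, num_filters=64, scope="discriminator_A")
print(decision.get_shape())  # a spatial map of per-region scores with one channel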
This completes our implementation of the discriminator network.
Feel free to reuse our discriminator code, and of course keep an eye on our blog. Comments, corrections, and feedback are welcome.
Sources
- Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
- Image-to-Image Translation with Conditional Adversarial Networks
- Image-to-Image Translation with Conditional Adversarial Nets (UPC Reading Group)
- Notes on the Pix2Pix (pixel-level image-to-image translation) Arxiv paper
- Convolutional Neural Networks (CNNs / ConvNets)