Anomaly Detection Using a Variational Autoencoder, Part II

Source: OpenAI DALL·E 2, prompt: "A dog in a bottleneck".

Part II: Implementation

Anomaly detection is one of the most widespread use cases for unsupervised machine learning, especially in industrial applications. In Part I, we motivated the use of variational autoencoders for this task and provided intuitive explanations of how to do it, with references to dogs and sheep. In Part II, we walk through a concrete implementation of this idea, with code extracts.

Both parts are based on a notebook published on Kaggle. Head there to look at the complete code and run it.

Prerequisites: Basic data science principles, a general understanding of probability distributions, and some experience with neural networks.

Let's start with a brief summary of the main ideas discussed in Part I. Industrial applications of anomaly detection are too many to list: fraud detection in banking, preventive maintenance in heavy industries, threat detection in cybersecurity, etc. In all these problems, defining outliers explicitly can be very challenging. Variational autoencoders (VAEs) automatically learn the general structure of the training data to isolate only its discriminative features, which are summarised in a compact latent vector. The latent vector constitutes an information bottleneck that forces the model to be very selective about what to encode. We train an encoder to produce the latent vector and a decoder to reconstruct the original data from the latent vector as faithfully as possible. When presented with an out-of-distribution sample (an outlier), the system will not be able to make an accurate reconstruction. By detecting inaccurate reconstructions, we can tell which examples are outliers. Please refer to Part I for a more in-depth and intuitive discussion and to [1] for more information about VAEs.

In this second part, we will apply these principles to build an anomaly detection machine based on a VAE. I will not describe the entire code in detail but provide the most relevant portions. You can see the rest and play with it here.

Overview

For this project, we will use one of the public datasets available on Kaggle: https://www.kaggle.com/boltzmannbrain/nab. It contains many datasets for anomaly detection. In this example, we use machine_temperature_system_failure.csv, a one-dimensional time series tracking the internal temperature of some sizeable industrial machine about which we know nothing. We will engineer features and make the data multi-dimensional to make things more interesting and relevant.

We will proceed in four steps:

  • Gather and preprocess data, including train/test split,
  • Build a VAE and train it on the training set,
  • Pass test samples to the VAE and record the reconstruction loss for each,
  • Identify test samples with a high reconstruction loss and flag them as anomalies.

Data Preprocessing and Feature Engineering

The entire time series is shown below, with anomalies circled in red:

Temperature tracking of a machine component. The anomalies are shown in red.

The paper that introduces this time series [2] explains:

The first anomaly was a planned shutdown. The third anomaly was a catastrophic system failure. The second anomaly, a subtle but observable change in the behavior, indicated the onset of the problem that led to the eventual system failure.

The data looks like this:

Original data. `value` corresponds to the temperature readings.

We do not have much information about what kind of machine or industry we are dealing with, which is an issue when developing helpful features. We will therefore have to make assumptions.

The timestamps cover the Christmas and New Year holidays. Since we are dealing with an industrial machine, it stands to reason that its workload might be affected by holidays, and perhaps even by the proximity (in time) to a holiday. Without additional information, we will assume that the applicable holidays are those typical in Europe and the Americas, i.e. Christmas and New Year's Day. Likewise, the day of the week and the hour of the day might matter, and we have to assume that weekends fall on Saturday and Sunday. After some manipulation, we get the following data frame:

Dataset after feature engineering. `holiday` is equal to 1 for Christmas and New Year's Day. `hour_min` is a float representation of the time of day. `gap_holiday` is the distance in days to the nearest holiday. `t` is the integer representation of the timestamp divided by 10¹¹.
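The notebook contains the full preprocessing code; as a rough sketch of how such features could be derived with pandas (the holiday dates, the file path, and the helper logic are assumptions):

import pandas as pd

df = pd.read_csv(
    'machine_temperature_system_failure.csv', parse_dates=['timestamp']
)
# Assumed holidays: Christmas and New Year's Day
# (adjust the years to the data's actual range)
holidays = pd.to_datetime(['2013-12-25', '2014-01-01'])

df['day'] = df['timestamp'].dt.day
df['month'] = df['timestamp'].dt.month
df['day_of_week'] = df['timestamp'].dt.dayofweek  # 0 = Monday
df['hour_min'] = df['timestamp'].dt.hour + df['timestamp'].dt.minute / 60.
df['holiday'] = df['timestamp'].dt.normalize().isin(holidays).astype(int)
# Distance in days to the nearest holiday
df['gap_holiday'] = df['timestamp'].apply(
    lambda ts: min(abs((ts - h).days) for h in holidays)
)
# Integer representation of the timestamp, scaled down by 10^11
df['t'] = df['timestamp'].astype('int64') / 1e11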

Next, we split the data into a training and a test set. Since we are dealing with a time series, we take the last 30% of the observations (in chronological order) as our test set. The test period represents around 6,800 timestamps, leaving about 15,900 for training.
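A minimal version of this split, assuming the data frame df built above:

# Chronological split: the last 30% of observations form the test set
split_idx = int(len(df) * 0.7)
train_df = df.iloc[:split_idx].copy()
test_df = df.iloc[split_idx:].copy()

We can now move on to the encoding of the categorical variables.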

We have four categorical variables: day, month, day_of_week, holiday. We need to make sure that no value of a categorical variable is present in the test set but absent from the training set. Since we split the data chronologically, this is very likely to happen with the variables day and month. We could try to address that through further feature engineering, but here, we will simply ignore these two features when feeding the data to the model during training. We encode day_of_week from 0 to 6, and holiday as 0 or 1. We pass these categorical variables as embeddings, which we will detail later.

When using a neural network model, it is essential to normalise the continuous variables: The weights of a neural network are randomly initialised from a shared distribution, so they all tend to have similar scales initially. Variables with values spread across different orders of magnitude would therefore cause severe difficulties in learning the optimal weights. The test data must be normalised using statistics observed on the train set to avoid leaking information from the test set into the model.
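A minimal sketch of this normalisation, assuming the train_df/test_df split above:

cont_vars = ['value', 'hour_min', 'gap_holiday', 't']
# Compute mean/std on the training set only...
stats = train_df[cont_vars].agg(['mean', 'std'])
for c in cont_vars:
    # ...and apply the same transformation to both sets
    train_df[c] = (train_df[c] - stats.loc['mean', c]) / stats.loc['std', c]
    test_df[c] = (test_df[c] - stats.loc['mean', c]) / stats.loc['std', c]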

The preprocessed training set.

A custom PyTorch Dataset class will serve the data. We use the categorical features day_of_week, holiday and the continuous features gap_holiday, t, value. As previously mentioned, we drop day, month when passing the data to the model.
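The full class is in the notebook; below is a minimal sketch of what such a Dataset could look like. The constructor signature matches how it is called later, but the body is an assumption (lbl_as_feat=True simply means the temperature value is served as an ordinary feature, since the task is unsupervised):

import torch
from torch.utils.data import Dataset

class TSDataset(Dataset):
    '''A sketch of the custom Dataset serving the preprocessed data.'''
    def __init__(self, split, cont_vars=None, cat_vars=None,
                 lbl_as_feat=True):
        super().__init__()
        df = train_df if split == 'train' else test_df
        # Continuous features as float32, categorical codes as int64
        self.cont = torch.tensor(df[cont_vars].values, dtype=torch.float32)
        self.cat = torch.tensor(df[cat_vars].values, dtype=torch.int64)

    def __len__(self):
        return self.cont.shape[0]

    def __getitem__(self, idx):
        return self.cont[idx], self.cat[idx]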

VAE Model

A reminder of the general layout of a VAE and its loss function. X is the input from the dataset, which the encoder turns into a compact latent vector z, which the decoder tries to reconstruct into (an approximation of) the original input. The model is trained by minimising the difference between the original and the reconstruction, plus the "distance" between the distribution of z and a standard normal distribution.
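In symbols, writing β for the weighting coefficient (set later via the hyperparameter kld_beta):

loss = reconstruction_loss(X, X̂) + β · KL( q(z | X) ‖ 𝒩(0, I) )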

Now we move on to the fun part: the model itself. It comprises two parts, the encoder and the decoder, which we define as mirror images of one another for simplicity. The core of each neural network is a sequence of fully connected layers whose number and sizes are specified by the hyperparameter layer_sizes (a comma-separated string of integers). After some experimentation, I settled on 64,128,256,128,64, i.e. five layers. Each layer can be batch-normalised. The encoder's first layer receives a concatenation of the continuous variables and the embedding vectors encoding the categorical variables.

Embedding vectors are heavily used in Natural Language Processing and, more generally, when working with discrete data. They are learnable vectors (of dimension 16 in our experiments) that express some non-explicit features of the variables. For instance, the variable day_of_week can take seven values (one for each day), each associated with its own 16-dimensional vector. Depending on the value of this variable for each sample, the corresponding vector is retrieved from a lookup table, passed to the network, and updated during backpropagation. After training, the vector for day_of_week == 0 might, for example, capture the fact that the machine's activity is lower on Sundays. The important thing is that this learning is done without our intervention, and there is no easy way to interpret the learned embeddings.

Representation of embeddings for categorical variables. Each embedding vector is 16-dimensional (i.e., embedding dimension = 16). For each sample and each categorical variable, the vector corresponding to the variable's value is selected. These vectors are learned during training.
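In PyTorch, each lookup table is an nn.Embedding module. As a tiny illustration (the variable names here are just for demonstration):

import torch
import torch.nn as nn

# Seven possible values for day_of_week, each mapped to a 16-dim vector
day_of_week_emb = nn.Embedding(num_embeddings=7, embedding_dim=16)
vec = day_of_week_emb(torch.tensor([0]))  # shape: (1, 16)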

Let's first define a layer as a sequence of {fully-connected unit, batch normalisation, leaky-ReLU activation}:

import torch
import torch.nn as nn

class Layer(nn.Module):
    '''
    A single fully connected layer with optional batch
    normalisation and activation.
    '''
    def __init__(self, in_dim, out_dim, bn=True):
        super().__init__()
        layers = [nn.Linear(in_dim, out_dim)]
        if bn:
            layers.append(nn.BatchNorm1d(out_dim))
        layers.append(nn.LeakyReLU(0.1, inplace=True))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

We can use this object to build the encoder:

from argparse import Namespace

class Encoder(nn.Module):
    '''
    The encoder part of our VAE. Takes a data sample and returns the
    mean and the log-variance of the latent vector's distribution.
    '''
    def __init__(self, **hparams):
        super().__init__()
        self.hparams = Namespace(**hparams)
        # One embedding table per categorical variable
        self.embeds = nn.ModuleList([
            nn.Embedding(n_cats, emb_size)
            for n_cats, emb_size in self.hparams.embedding_sizes
        ])
        # The input to the first layer is the concatenation
        # of all embedding vectors and continuous values
        in_dim = sum(emb.embedding_dim for emb in self.embeds) \
            + len(self.hparams.cont_vars)
        layer_dims = [in_dim] \
            + [int(s) for s in self.hparams.layer_sizes.split(',')]
        bn = self.hparams.batch_norm
        self.layers = nn.Sequential(
            *[Layer(layer_dims[i], layer_dims[i + 1], bn)
              for i in range(len(layer_dims) - 1)]
        )
        # Two heads: one for the mean, one for the log-variance
        self.mu = nn.Linear(layer_dims[-1], self.hparams.latent_dim)
        self.logvar = nn.Linear(layer_dims[-1], self.hparams.latent_dim)

    def forward(self, x_cont, x_cat):
        x_embed = [e(x_cat[:, i]) for i, e in enumerate(self.embeds)]
        x_embed = torch.cat(x_embed, dim=1)
        x = torch.cat((x_embed, x_cont), dim=1)
        h = self.layers(x)
        mu_ = self.mu(h)
        logvar_ = self.logvar(h)
        # We also return the concatenated input vector for use in the loss
        return mu_, logvar_, x

… and the decoder:

class Decoder(nn.Module):
    '''
    The decoder part of our VAE. Takes a latent vector (sampled from
    the distribution learned by the encoder) and converts it back
    to a reconstructed data sample.
    '''
    def __init__(self, **hparams):
        super().__init__()
        self.hparams = Namespace(**hparams)
        # Mirror image of the encoder's layer sizes
        hidden_dims = [self.hparams.latent_dim] \
            + [int(s) for s in
               reversed(self.hparams.layer_sizes.split(','))]
        out_dim = sum(
            emb_size for _, emb_size in self.hparams.embedding_sizes
        ) + len(self.hparams.cont_vars)
        bn = self.hparams.batch_norm
        self.layers = nn.Sequential(
            *[Layer(hidden_dims[i], hidden_dims[i + 1], bn)
              for i in range(len(hidden_dims) - 1)]
        )
        self.reconstructed = nn.Linear(hidden_dims[-1], out_dim)

    def forward(self, z):
        h = self.layers(z)
        return self.reconstructed(h)

To make the code easier to write and read, I define the VAE as a PyTorch Lightning LightningModule. PyTorch Lightning is a handy layer on top of PyTorch that removes the need for much of the boilerplate code associated with training models while allowing for plenty of flexibility. I strongly recommend taking a look if you don't know it.

import pytorch_lightning as pl
import torch.nn.functional as F
from torch.utils.data import DataLoader

class VAE(pl.LightningModule):
    def __init__(self, **hparams):
        super().__init__()
        self.save_hyperparameters()
        self.encoder = Encoder(**hparams)
        self.decoder = Decoder(**hparams)

    def reparameterize(self, mu, logvar):
        '''
        The reparameterisation trick allows us to backpropagate
        through the encoder.
        '''
        if self.training:
            std = torch.exp(logvar / 2.)
            eps = torch.randn_like(std) * self.hparams.stdev
            return eps * std + mu
        else:
            return mu

    def forward(self, batch):
        x_cont, x_cat = batch
        assert x_cat.dtype == torch.int64
        mu, logvar, x = self.encoder(x_cont, x_cat)
        z = self.reparameterize(mu, logvar)
        recon = self.decoder(z)
        return recon, mu, logvar, x

    def loss_function(self, obs, recon, mu, logvar):
        recon_loss = F.smooth_l1_loss(recon, obs, reduction='mean')
        kld = -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp())
        return recon_loss, kld

    def training_step(self, batch, batch_idx):
        '''
        Executed with each batch of data during training
        '''
        recon, mu, logvar, x = self.forward(batch)
        # The loss function compares the concatenated input vector
        # (including embeddings) to the reconstructed vector
        recon_loss, kld = self.loss_function(x, recon, mu, logvar)
        loss = recon_loss + self.hparams.kld_beta * kld
        # We log some values to monitor the training process
        self.log(
            'total_loss', loss,
            on_step=True, prog_bar=True, logger=True,
        )
        self.log(
            'recon_loss', recon_loss,
            on_step=True, prog_bar=True, logger=True,
        )
        self.log(
            'kld', kld,
            on_step=True, prog_bar=True, logger=True,
        )
        return loss

    def test_step(self, batch, batch_idx):
        '''
        Executed with each batch of data during testing
        '''
        recon, mu, logvar, x = self.forward(batch)
        recon_loss, kld = self.loss_function(x, recon, mu, logvar)
        loss = recon_loss + self.hparams.kld_beta * kld
        self.log('test_loss', loss)
        return loss

    def configure_optimizers(self):
        # Define the AdamW optimiser:
        opt = torch.optim.AdamW(
            self.parameters(), lr=self.hparams.lr,
            weight_decay=self.hparams.weight_decay, eps=1e-4,
        )
        # Set up a cosine annealing schedule for the learning rate
        sch = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
            opt, T_0=25, T_mult=1, eta_min=1e-9, last_epoch=-1,
        )
        return [opt], [sch]

    # The next two methods create the training and test data loaders
    # based on the custom Dataset class.
    def train_dataloader(self):
        dataset = TSDataset(
            'train', cont_vars=self.hparams.cont_vars,
            cat_vars=self.hparams.cat_vars, lbl_as_feat=True,
        )
        return DataLoader(
            dataset, batch_size=self.hparams.batch_size,
            num_workers=2, pin_memory=True, shuffle=True,
        )

    def test_dataloader(self):
        dataset = TSDataset(
            'test', cont_vars=self.hparams.cont_vars,
            cat_vars=self.hparams.cat_vars, lbl_as_feat=True,
        )
        return DataLoader(
            dataset, batch_size=self.hparams.batch_size,
            num_workers=2, pin_memory=True,
        )

A word on the method reparameterize(): As mentioned in Part I, we repeatedly sample from the intermediate distribution during training to learn a distribution rather than individual points. The problem is that the sampling operation is not differentiable, so we cannot backpropagate through it to train the network. Instead, we employ the "reparameterisation trick", which is made possible by an essential property of the normal distribution:
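z ∼ 𝒩(𝜇, 𝜎²)  ⟺  z = 𝜇 + 𝜎 · 𝜖,  with 𝜖 ∼ 𝒩(0, 1)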

In English, this means that sampling from a normal distribution of variance 𝜎² centred around mean 𝜇 is strictly equivalent to taking a sample from the standard normal distribution (mean 0 and variance 1), multiplying it by 𝜎, and adding 𝜇.

So we sample an independent noise ϵ, which we don't need to backpropagate through, and backpropagate through the deterministic process producing 𝜇 and 𝜎² [in reality, log(𝜎²), but this is a minor technical detail]. Again, read [1] for further insights.

We need to set some hyperparameters. stdev (the standard deviation of the normal distribution from which we sample 𝜖 in the reparameterize() method) would typically be 1; however, several papers have found improvements with smaller values, such as 0.1, which is the value we will use. Another hyperparameter worth mentioning is kld_beta, the coefficient applied to the KLD term in the total loss. This factor helps because the reconstruction loss is typically harder to improve than the KLD; if both were weighted equally, the model would start by optimising the KLD loss before improving the reconstruction loss substantially. By setting this coefficient to a value < 1, we try to ensure that both loss terms improve at a similar rate. Below is the whole hyperparameter setup, which the model receives as a dictionary:

from collections import OrderedDict

# embed_cats holds the number of categories of each categorical
# variable, computed from the training set
hparams = OrderedDict(
    run='vae_anomaly_v1',
    cont_vars=['value', 'hour_min', 'gap_holiday', 't'],
    # Remember to remove 'day' and 'month':
    cat_vars=['day_of_week', 'holiday'],
    embedding_sizes=[
        (embed_cats[i], 16) for i in range(len(embed_cats))
    ],
    latent_dim=16,
    layer_sizes='64,128,256,128,64',
    batch_norm=True,
    stdev=0.1,
    kld_beta=0.05,
    lr=0.001,
    weight_decay=1e-5,
    batch_size=128,
    epochs=60,
)

Model Training

With PyTorch Lightning, implementing logging, checkpointing, and training is very easy:

from pytorch_lightning.loggers import WandbLogger

model = VAE(**hparams)
logger = WandbLogger(
    name=hparams['run'], project='VAE_Anomaly',
    version=hparams['run'], save_dir='kaggle/working/checkpoints',
)
ckpt_callback = pl.callbacks.ModelCheckpoint(
    dirpath='.', filename='vae_weights'
)
trainer = pl.Trainer(
    accelerator='gpu', devices=1, logger=logger,
    max_epochs=hparams['epochs'], auto_lr_find=False,
    callbacks=[ckpt_callback], gradient_clip_val=10.,
    enable_model_summary=True,
)
trainer.fit(model)

We are ready to fit our model! Training is pretty quick on Kaggle's P100 GPUs. The total loss almost goes to zero:

Training curve (total loss = reconstruction loss + 0.05 * KL divergence)

The loss value on the test set is higher but still relatively low.

Evaluation on the Test Period

Passing the instances of the test period to the model, we get back individual loss values that we can append to the data:

Test data with the loss values of each example after passing through the VAE.
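One way to obtain these per-sample values is to recompute the two loss terms with reduction='none', so that each test example keeps its own score. This is a sketch; the notebook's exact code may differ:

model = model.cpu()  # score on CPU so batches and weights share a device
model.eval()         # reparameterize() now returns mu directly

losses = []
with torch.no_grad():
    for x_cont, x_cat in model.test_dataloader():
        recon, mu, logvar, x = model((x_cont, x_cat))
        # Per-sample reconstruction loss: mean over feature dimensions
        recon_loss = F.smooth_l1_loss(
            recon, x, reduction='none'
        ).mean(dim=1)
        # Per-sample KL divergence: mean over latent dimensions
        kld = -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp(), dim=1)
        losses.append(recon_loss + model.hparams.kld_beta * kld)

test_df['loss'] = torch.cat(losses).numpy()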

At this point, we need to make one more decision: how do we set the acceptance threshold for what constitutes an anomaly? Here, we set it based on a quantile of the loss distribution. We choose a very high quantile, 0.999 in this case (which corresponds to a loss value of 0.1563), to ensure we capture only the tail of that distribution.
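In code, this comes down to a single quantile computation on the loss column (assuming the test_df frame from the scoring sketch above):

import numpy as np

# Flag the top 0.1% of the loss distribution as anomalies
threshold = np.quantile(test_df['loss'], 0.999)  # ≈ 0.1563 in our run
test_df['anomaly'] = test_df['loss'] > threshold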

We can then flag any loss value above that threshold as an anomaly and plot the loss distribution:

Histogram of the loss values for the test set examples (200 bins). The last 0.1% of the observations are considered outliers.

What does it mean in terms of the distribution of the temperature values? After reversing the transformations applied to the data to put each feature back into the value ranges of the raw dataset, we can plot the temperature distribution:

The temperature distribution of the test set. The anomalies appear in orange as small segments in the circled area.
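For reference, undoing the normalisation is just the inverse of the scaling sketched earlier, using the train-set statistics stored in stats:

# Map the normalised temperature back to its original units
test_df['value_orig'] = (
    test_df['value'] * stats.loc['std', 'value']
    + stats.loc['mean', 'value']
)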

It's a little hard to tell on this figure, but the instances flagged as abnormal are not at the lower end of the temperature spectrum, as they would have been if we had used a more naive approach based only on the temperature distribution. Let's see these flagged examples on the time series:

Time series for the test period in the original units. The detected anomalies are shown as red dots.

With a threshold at the 99.9th centile, our model correctly detects the anomaly present in the test period, albeit a bit late, and does not generate any false positives. The threshold we defined can be adjusted to favour precision (reducing the number of false positives) or recall (reducing the number of false negatives), depending on the business problem.

Undoubtedly, one could improve the results presented here with more sophisticated feature engineering and hyperparameter optimisation. The goal of this article was only to illustrate some valuable concepts, and I hope it was achieved!

Conclusion

The VAE's job is to reduce the data to its discriminative features, which allows us to detect data points that do not fit this learned structure because they result in a worse reconstruction. The beauty of using a VAE for this task is that we do not need to be explicit about how such examples would deviate from the norm. As a result, we can hope to obtain an anomaly detection model with a subtler definition of an outlier.

In Part I, I tried to give you an intuitive understanding of VAEs and how they can address the problem of anomaly detection. In this article, we saw a concrete implementation and how to use it to discover new anomalies. I hope that you have a better understanding of how these ideas translate concretely into an effective learning model.

PS: If you found this article interesting, spam the clap icon and share the link!

[1] Doersch, C., 2016. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
[2] Ahmad, S., Lavin, A., Purdy, S. and Agha, Z., 2017. Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 262, pp.134–147.
