Providing a Vision Of The Future Using Spatio-Temporal Machine Learning Models on Weather Images


Current weather prediction in the industry relies on measurements such as humidity, temperature, and wind speed that are fed into physics equations to compute forecasts. While reliable, this is a computationally expensive process that can take days to compute, which means the predictions cannot take more recent weather data into account and are limited in how far ahead they can usefully reach.

The goal of this project is to test current ML research in video frame prediction on weather prediction. In particular, we feed consecutive radar images into a model that extracts weather patterns. The model then generates future radar images, with the hope that these can be reverse-analyzed into meaningful weather predictions. Our focus is primarily on the image prediction and generation aspect.

The hope is to explore ways to update current weather prediction models so that they are closer to real-time and more accurate. Particularly in cases such as flash floods, being able to predict these weather patterns where we otherwise could not is potentially life-saving. To do this, we test several models on different video prediction datasets before showing our results on weather radar images.


We used a few datasets, the preliminary one being Moving MNIST. Moving MNIST consists of 10,000 sequences of 2 digits moving in a 64x64 frame. The intuition behind learning Moving MNIST is that the model learns how to represent the input sequence and is able to use that representation to predict future frames of the sequence. Moving MNIST is a synthetically generated dataset with no added noise, which keeps the data as simple as possible while remaining non-trivial. Each digit has a random velocity, and the digits bounce off the walls and overlap each other. So even though this dataset appears quite simple, there is a lot of information that our model must learn to represent. Because the dataset is generated, we could also theoretically make it as large or as varied as we wanted in order to avoid overfitting.
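As a rough illustration of how such sequences can be generated, here is a minimal NumPy sketch. It uses plain white squares as stand-ins for the MNIST digits (an assumption, so the example needs no dataset download), and the function name, velocity range, and default parameters are hypothetical choices rather than the settings of the published dataset.

```python
import numpy as np

def make_bouncing_sequence(num_frames=20, frame_size=64, patch_size=28,
                           num_objects=2, seed=0):
    """Generate one Moving-MNIST-style sequence of bouncing patches.

    White squares stand in for digit images (an assumption for illustration).
    """
    rng = np.random.default_rng(seed)
    limit = frame_size - patch_size  # largest valid top-left coordinate
    patch = np.ones((patch_size, patch_size), dtype=np.float32)

    # Random start position and velocity (pixels per frame) for each object.
    pos = rng.uniform(0, limit, size=(num_objects, 2))
    vel = rng.uniform(-4, 4, size=(num_objects, 2))

    frames = np.zeros((num_frames, frame_size, frame_size), dtype=np.float32)
    for t in range(num_frames):
        for k in range(num_objects):
            y, x = np.round(pos[k]).astype(int)
            region = frames[t, y:y + patch_size, x:x + patch_size]
            # Overlapping objects are merged with a pixel-wise maximum.
            np.maximum(region, patch, out=region)
        # Advance positions and reflect position and velocity at the walls.
        pos += vel
        for axis in range(2):
            low = pos[:, axis] < 0
            high = pos[:, axis] > limit
            pos[low, axis] *= -1
            pos[high, axis] = 2 * limit - pos[high, axis]
            vel[low | high, axis] *= -1
    return frames

sequence = make_bouncing_sequence()
print(sequence.shape)
```

Changing `seed` yields a new sequence each time, which is what makes it possible to grow the dataset arbitrarily.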

The Caltech Pedestrian Dataset has a lot more variance. In particular, there is now change across all parts of each frame, plus a third axis of movement. The level of detail, the images being RGB instead of grayscale, and the variable settings also make the generation task significantly more difficult. However, most of the data shows the car moving forward, with little change in the environment between individual frames. As such, we needed to adjust how we handled the data, as detailed further below. The motivation behind using this dataset is the same: a non-trivial dataset with a lot of variance in what the model has to be able to predict.

The weather dataset we decided to use was the one from the CIKM AnalytiCup 2017 Precipitation Forecasting Challenge. Unlike the previous datasets, where it is obvious to the human eye what the next frame should look like, here there really isn’t much to go on. Weather seems to be very random, with portions of clouds disappearing and reappearing at seemingly random times. This is a dataset where we see the usefulness of our model: it attempts a task that humans struggle to do from images alone.

Initial Model


The base model we used, called a ConvLSTM, has a similar intuition to the basic LSTM model used in a lot of text processing tasks. However, rather than using basic matrix multiplications at the entrance of each gate, it uses learnable convolution filters whose outputs keep the same spatial dimensions as the input. This can be visualized with the image below.

One way to see this is to think of the matrix multiplication in a standard LSTM cell as a dense neural network layer. By using convolution layers instead, our hope is that the gates learn complex spatio-temporal features from sequential image inputs. We essentially take a standard LSTM model, as used for one-dimensional sequence learning such as text prediction, and replace its cells with ConvLSTM cells.
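For reference, a common way to write the ConvLSTM update is given below, where * denotes convolution and \circ the element-wise product. This is the variant without peephole connections, and the symbol names are notation chosen for illustration, so treat it as a sketch of the idea rather than the exact cell we trained.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
g_t &= \tanh\left(W_{xg} * X_t + W_{hg} * H_{t-1} + b_g\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ g_t \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```

Here X_t is the input frame, H_t the hidden state, and C_t the cell state, all of which are image-shaped tensors. Replacing every convolution with a matrix product recovers the standard LSTM.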

The first model we built was one for learning the Moving MNIST dataset. Our base model was a ConvLSTM, but we decided to stack 3 of them on top of each other for better depth and performance, with BatchNormalization between each layer. You can find the model below. We feed ten input frames into the ConvLSTM cells and then pass the output into new cells that predict the next frame.

Choosing Our Loss Functions

Image generation tasks pose a significantly harder question of how to measure accuracy than standard classification and/or prediction tasks. While humans can often judge the quality of a model’s output perceptually, it is somewhat challenging to quantify this in a way that a model can learn from. We experimented with multiple approaches to find a suitable loss function.

We paired our initial model with mean squared error loss. We take the squared error between each respective pixel of the output image and the ground truth, intuitively measuring the difference between the two images. We also used a tanh activation for the last layer with image pixel values scaled between -1 and 1. The results we had with this are below. The first row of images is the predicted 11th frame and the bottom row is the ground truth 11th frame.

As we can see, all the images are much more blurry than the ground truth is even though the general spatial accuracy is great. This is a general issue we’ve found with ConvLSTMs that we will discuss later in this article too. However, we wanted to find a way to fix this without re-engineering the ConvLSTM itself.
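A common explanation for this kind of blur, which we did not verify for our model specifically, is that MSE is minimized by the average of all plausible next frames. The toy sketch below illustrates this with two equally likely outcomes; the helper names and the tiny 4x8 frames are made up for the example.

```python
import numpy as np

def scale_to_tanh_range(images_uint8):
    """Map 8-bit pixel values [0, 255] to [-1, 1], the output range of tanh."""
    return images_uint8.astype(np.float32) / 127.5 - 1.0

def mse_loss(predicted, target):
    """Mean of the squared per-pixel differences."""
    return float(np.mean((predicted - target) ** 2))

# Two equally plausible next frames: a bright blob on the left or on the right.
left = np.zeros((4, 8), dtype=np.uint8)
left[1:3, 1:3] = 255
right = np.zeros((4, 8), dtype=np.uint8)
right[1:3, 5:7] = 255
left_s, right_s = scale_to_tanh_range(left), scale_to_tanh_range(right)

sharp_guess = left_s                     # commits to one outcome
blurry_guess = (left_s + right_s) / 2.0  # averages both outcomes

def expected_mse(guess):
    """Average MSE when each outcome happens half of the time."""
    return 0.5 * mse_loss(guess, left_s) + 0.5 * mse_loss(guess, right_s)

print("sharp :", expected_mse(sharp_guess))
print("blurry:", expected_mse(blurry_guess))
```

The averaged, washed-out guess gets the lower expected error, so a model trained purely on MSE has an incentive to produce it.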

This is when we came across the SSIM (structural similarity) loss function. While MSE compares pixel values to each other, the general intuition behind SSIM is to compare the means, variances, and covariance of pixel values instead. This would, for example, give a higher similarity score to two images that are identical except for lighter or darker lighting. The motivation is that these comparisons capture perceptual similarities that matter to the human eye. A comparison of SSIM and MSE is shown below.

The graphic below shows a reference image as well as 6 compared images, all of which have the same MSE relative to the reference image. However, SSIM rates the images at the top of the circle as most similar and those at the bottom as least similar. This suggests that SSIM is a better measure of whether two images look alike. However, it is important to note that no mathematical measure is perfect at this. Even SSIM has its flaws in certain cases.
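For reference, the standard definition of SSIM between two image windows x and y is given below. The constants K_1 = 0.01 and K_2 = 0.03 are the commonly used defaults, stated here as an assumption rather than as the values in our code.

```latex
\mathrm{SSIM}(x, y) =
  \frac{\left(2\mu_x \mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}
       {\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)},
\qquad C_1 = (K_1 L)^2, \quad C_2 = (K_2 L)^2
```

Here \mu_x and \mu_y are the window means, \sigma_x^2 and \sigma_y^2 the variances, \sigma_{xy} the covariance, and L the dynamic range of the pixel values. The score is computed over small local windows and averaged across the image. A uniform brightness shift changes only the means, leaving the variance and covariance terms untouched, which is why SSIM penalizes it less than structural distortions such as noise.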

We also experimented with variations of SSIM including MS-SSIM (Multiscale Structural Similarity) but received results that were very similar between the two.

The results below were achieved with MSE + SSIM/8 as our loss function.
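The sketch below shows one way such a combined loss could be written. Since SSIM is a similarity (higher is better), it assumes that the term being minimized is the dissimilarity 1 - SSIM. It also assumes pixel values in [0, 1] and computes SSIM over the whole image as a single window, which is a simplification; the function names are hypothetical.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM computed over the whole image as one window.

    Real implementations average SSIM over small local windows; a single
    global window keeps this sketch short.
    """
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

def combined_loss(pred, target, ssim_weight=1.0 / 8.0):
    """MSE plus a weighted SSIM dissimilarity (assumed to be 1 - SSIM)."""
    mse = float(np.mean((pred - target) ** 2))
    return mse + ssim_weight * (1.0 - global_ssim(pred, target))

# Two distortions with the same MSE: a brightness shift and additive noise.
rng = np.random.default_rng(0)
target = rng.uniform(0.2, 0.8, size=(64, 64))
brighter = target + 0.1
noise = rng.normal(size=target.shape)
noise = 0.1 * (noise - noise.mean()) / noise.std()  # zero mean, std 0.1
noisy = target + noise

print("loss, brighter:", combined_loss(brighter, target))
print("loss, noisy   :", combined_loss(noisy, target))
```

Both distorted images have the same MSE, but the SSIM term makes the noisy one more expensive, which pushes the model toward structurally faithful outputs.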

After reducing blurriness significantly on Moving MNIST, we tried the same model on our Pedestrian Dataset and got the results below. As mentioned before, the dataset is fairly static between consecutive frames. That is, there is very little variation between the 10th frame and the 11th frame, making it hard for a model to learn useful predictions. Therefore, we took every 5th frame in order to get more movement in each sequence. The results of this skipping are seen below. The 2nd and 4th rows are ground truth images, while the 1st and 3rd rows are the respective predictions of the 11th frame in each sequence.
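The frame skipping described above can be sketched as follows. The use of overlapping sliding windows and the function name are assumptions made for the example; only the step of 5 and the 10-in, 1-out layout come from the text.

```python
import numpy as np

def make_strided_sequences(video, frame_step=5, seq_len=11):
    """Keep every `frame_step`-th frame, then cut sliding windows of `seq_len`.

    Each window holds 10 input frames followed by the 11th frame to predict.
    Returns (inputs, targets).
    """
    subsampled = video[::frame_step]
    num_windows = len(subsampled) - seq_len + 1
    if num_windows <= 0:
        raise ValueError("video is too short for the requested step and length")
    windows = np.stack([subsampled[i:i + seq_len] for i in range(num_windows)])
    return windows[:, :-1], windows[:, -1]

# Toy "video" whose pixel value equals the frame index, to make skipping visible.
video = np.arange(100, dtype=np.float32).reshape(100, 1, 1)
inputs, targets = make_strided_sequences(video)
print(inputs.shape, targets.shape)
print(inputs[0, :, 0, 0], targets[0, 0, 0])
```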

Changes to the Single Frame Model and Predicting on Weather

After having some success on the two previous datasets, we finally moved on to predicting the weather. Initially, we tried using the same models as before. However, we found that the results were always blurrier than the ground truth images. This led us to experiment with two variations of the same model that add 2D and 3D Conv layers, as shown in the figure below.

The intuition behind a model such as this is that, though ConvLSTMs are great at producing spatially accurate results, they aren’t as good at producing perceptually accurate results. This may be because converting the gates of LSTMs to use convolutional transformations primarily helps in extracting spatio-temporal features. However, it likely does not replace the large amount of feature transformation that image generation tasks often need, which usually comes from multiple, large convolutional layers. Thus, 2D and 3D Conv layers should help with encoding and decoding spatial information into a more perceptually representative output.

As seen below, our single frame prediction performs pretty well, though the prediction (top) still tends to be a little blurrier than the ground truth.

Multiframe Prediction

Though results for single frame prediction fare pretty well, the real end goal is to predict over multiple frames and get long-term output. The recursive approach of feeding predicted frames back into our single frame model to generate multi-frame predictions resulted in more and more noise over time, as seen below.
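The recursive approach can be sketched as a sliding window over the input frames. Here `predict_next` is a placeholder for the trained single frame model, and `drifting_model` is a made-up stand-in whose small constant error shows how mistakes accumulate once predictions are fed back in.

```python
import numpy as np

def recursive_rollout(predict_next, input_frames, num_future=5):
    """Predict `num_future` frames by feeding predictions back as inputs.

    `predict_next` is any callable mapping a (T, H, W) window to one (H, W)
    frame; it stands in for the trained single frame model.
    """
    window = np.asarray(input_frames, dtype=np.float32)
    predictions = []
    for _ in range(num_future):
        next_frame = predict_next(window)
        predictions.append(next_frame)
        # Slide the window: drop the oldest frame, append the prediction.
        window = np.concatenate([window[1:], next_frame[None]], axis=0)
    return np.stack(predictions)

def drifting_model(window):
    """Toy model: copies the last frame but is off by a small constant."""
    return window[-1] + 0.01

frames = np.zeros((10, 4, 4), dtype=np.float32)
predictions = recursive_rollout(drifting_model, frames)
print(predictions[:, 0, 0])  # the per-step error grows with each prediction
```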

Thus, we decided to try an Encoder-Forecaster approach to solving this issue. An encoder-forecaster model uses LSTM layers to first generate an encoding of the input frames and then to recursively generate new frames. Because it carries a cell state from one prediction to the next, it works better than recursively applying a single frame model, which introduces more noise each time.

Encoder-Forecaster Architecture

When using an encoder-forecaster model, it is possible to train the encoder and forecaster separately. First, one can train an encoder-decoder structure to recreate the ten input frames, compressing the input into a lower-dimensional space and ensuring the initial input is properly encoded. Then, one can run the encoder on the dataset and train only a forecaster to predict future frames from the previously generated encodings. The encoding would thus be forced to capture higher-level features of the inputs and would be less prone to noise. However, in our experiments, we did not find a significant improvement from training separately, because there wasn’t a large need to remember the original frames when predicting simple weather patterns.

For our final architecture, we stacked multiple ConvLSTM encoder-forecasters to increase the effective complexity of our model. We set our task to be predicting the next 5 frames of a sequence given its first 10 frames. For each input, we copy the final cell and hidden states from the encoder part of the model to initialize the decoder ConvLSTM stack. We take the 10th input frame and run it through the decoder to generate our 11th frame prediction. Then, we take the predicted 11th frame and use it to predict the 12th frame, and so on recursively, each time with updated cell and hidden states in the ConvLSTMs. We used SELU activation to avoid the dead neuron problem that often occurs with ReLU activation, where a neuron’s input is always negative, so its output is always zero and it receives no gradient updates.
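The control flow described above can be sketched as follows. `ToyRecurrentCell` is a hypothetical stand-in for a ConvLSTM stack with a deliberately trivial update rule; only the handling of the states and the feedback of predictions mirrors the description, not the learned model itself.

```python
import numpy as np

class ToyRecurrentCell:
    """Stand-in for a ConvLSTM stack: keeps a hidden and a cell state.

    The update rule is trivial on purpose (an assumption for illustration);
    a real model would use learned convolutional gates here.
    """

    def __init__(self, decay=0.5):
        self.decay = decay

    def initial_state(self, frame_shape):
        zeros = np.zeros(frame_shape, dtype=np.float32)
        return zeros.copy(), zeros.copy()  # (hidden, cell)

    def step(self, frame, state):
        _, cell = state
        cell = self.decay * cell + (1.0 - self.decay) * frame
        hidden = np.tanh(cell)
        return hidden, (hidden, cell)

def encode_then_forecast(encoder, forecaster, input_frames, num_future=5):
    # Encoder pass: read all input frames, keeping only the final states.
    state = encoder.initial_state(input_frames[0].shape)
    for frame in input_frames:
        _, state = encoder.step(frame, state)

    # Forecaster pass: start from the encoder's states and the last input
    # frame, then feed each prediction back in while carrying the states.
    frame, predictions = input_frames[-1], []
    for _ in range(num_future):
        frame, state = forecaster.step(frame, state)
        predictions.append(frame)
    return np.stack(predictions)

rng = np.random.default_rng(0)
frames = rng.uniform(0, 1, size=(10, 8, 8)).astype(np.float32)
predictions = encode_then_forecast(ToyRecurrentCell(), ToyRecurrentCell(), frames)
print(predictions.shape)
```

Unlike a plain sliding-window rollout, the forecaster here never re-reads the input frames: everything it knows about them arrives through the copied states.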

Additionally, to combat the noise inherent to ConvLSTMs, we added a final 2D CNN layer distributed across all the outputs of the ConvLSTMs to act as a transformation from the ConvLSTM representation to the final output — a weather image.
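"Distributed" here means that the same convolution weights are applied to every time step. The sketch below shows the idea with a 1x1 convolution; the kernel size, the shapes, and the function name are assumptions made for illustration, since a larger kernel works the same way.

```python
import numpy as np

def time_distributed_conv1x1(features, weights, bias):
    """Apply the same 1x1 convolution to every time step.

    features: (T, H, W, C) outputs of the ConvLSTM stack
    weights:  (C, K) kernel shared across time, mapping C channels to K
    bias:     (K,)
    returns:  (T, H, W, K)
    """
    return np.einsum("thwc,ck->thwk", features, weights) + bias

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 16, 16, 32))  # 5 predicted steps, 32 channels
weights = 0.1 * rng.normal(size=(32, 1))     # collapse to a single channel
bias = np.zeros(1)
frames = time_distributed_conv1x1(features, weights, bias)
print(frames.shape)
```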

Results and Conclusion

Key: The first ten images are the input, the 5 top right images are the ground truth, and the 5 bottom right images are our model’s predictions.

After training for ~100 epochs, we were able to obtain solid results that did more than just copy the 10th frame over 5 times. As you can see in the first and second examples, our model was able to pick up on the general movement of the weather within the frame. However, we were not able to eliminate noise from our predictions, as the final outputs looked a lot more blurry than the true pictures. Another thing that makes this task very difficult is the lack of information available to our model. In the last example, we see that the dense block is followed by a gap on the left in the sequence. However, since our model doesn’t see that gap in the 10 input frames, it predicts the block to stretch out much farther than it actually does. In this case, the model simply does not have enough information for a better prediction.

The primary purpose of this project has largely been met: we feel we have shown that updating weather prediction models using ML is highly feasible. However, many improvements still have to be made before something like this can be used in the industry.

The most valuable next step would be to cross-correlate the time and location of radar images with other atmospheric measurements, such as wind, pressure, humidity, and precipitation. This becomes possible if weather stations can easily store such data in sync with their available radar images. Making such a dataset easily accessible, which would add relatively little cost to the plethora of information already shared by, say, NOAA, could encourage a host of ML researchers to fully explore this territory. We advise you to keep an eye on the research in this area over the next decade. There may be tremendous and beneficial changes to how you receive your weather reports each day.

A technology club at UC Berkeley that fosters a community of creative and passionate engineers to tackle real world problems using machine learning.
