Guest Column | February 21, 2020

Amazon Forecast: Best Practices

By Rob Whelan, 2nd Watch


In part one of this article, we offered an overview of Amazon Forecast and how to use it. In part two, we get into Amazon Forecast best practices:

Know Your Business Goal

In our data and analytics practice, business value comes first. We want to know and clarify use cases before we talk about technology. Using Amazon Forecast is no different. When creating a forecast, do you want to make sure you always have enough inventory on hand? Or do you want to make sure that all your inventory gets used all the time? The answer will drive which “quantile” you look at.

Each quantile - the defaults are 10 percent, 50 percent, and 90 percent - is important for its own reasons, and all three should be looked at together to give a range. What is the 50 percent quantile? It is the forecast value with a 50-50 chance of being too high or too low: the actual value has a 50 percent chance of coming in above it and a 50 percent chance of coming in below it. The forecast at the 90 percent quantile has a 90 percent chance of being higher than the actual value, while the forecast at the 10 percent quantile has only a 10 percent chance of being higher. So, if you want to make sure you sell all your inventory, use the 10 percent quantile forecast.
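To make the quantile arithmetic concrete, here is a minimal sketch (the demand numbers are invented for illustration) using NumPy’s percentile function on a set of simulated demand outcomes:

```python
import numpy as np

# Hypothetical distribution of next week's demand (units), e.g. drawn from
# historical residuals or a Monte Carlo simulation.
simulated_demand = np.array([80, 95, 100, 102, 105, 110, 112, 120, 130, 150])

p10, p50, p90 = np.percentile(simulated_demand, [10, 50, 90])

# p10 is a conservative (low) estimate: actual demand exceeds it most of the
# time, so stocking to p10 means you almost always sell everything you stock.
# p90 is a high estimate: stocking to p90 means you rarely run out.
print(f"p10={p10:.1f}, p50={p50:.1f}, p90={p90:.1f}")
```

The p10-p90 spread is the range the forecast considers likely; the decision is which end of it your business goal cares about.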

Use Related Time Series

Amazon has made related time series in Forecast so easy to use that you have nothing to lose by adding one to make your forecast more robust. All you have to do is make the related series’ time units match those of your target time series.

One way to create a related dataset is to use categorical or binary data whose future values are already known - for example, whether the future time is on a weekend or a holiday or there is a concert playing - anything that is on a schedule that you can rely on.
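As a sketch of what such a related dataset might look like, the following builds a daily weekend/holiday series with pandas (the item id, date range, and holiday date are all invented for illustration):

```python
import pandas as pd

# Daily timestamps matching the target series' frequency (Forecast requires
# the related series to use the same time unit as the target).
dates = pd.date_range("2020-03-01", "2020-03-14", freq="D")

related = pd.DataFrame({
    "timestamp": dates,
    "item_id": "store_42",  # illustrative item id
    "is_weekend": (dates.dayofweek >= 5).astype(int),
    # Invented holiday date, purely for illustration.
    "is_holiday": dates.isin(pd.to_datetime(["2020-03-08"])).astype(int),
})

# Forecast expects CSV with no pandas index column.
related.to_csv("related_time_series.csv", index=False)
```

Because weekends and holidays are known arbitrarily far in advance, these future values can be supplied for the whole forecast horizon.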

Even if you don’t know whether something will happen, you can create multiple forecasts in which you vary the future values. For example, if you want to forecast attendance at a baseball game this Sunday and you want to model the impact of weather, you could create a feature, “is_raining,” and try one forecast with “yes, it’s raining” and another with “no, it’s not raining.”

Look At A Range Of Forecasted Values, Not A Singular Forecasted Value

Don’t expect the numbers to be precise. One of the biggest benefits of a forecast is knowing the likely range of actual values. Then, take some time to analyze what drives that range. Can it be made narrower (more precise) with more related data? If so, can you control any of that related data?

Visualize The Results

Show historical and forecast values on one chart. This will give you a sense of how the forecast is trending. You can backfill the chart with actuals as they come in, so you can learn more about your forecast’s accuracy.
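A minimal way to do this with matplotlib (all the numbers below are invented for illustration) is to plot actuals and the p50 forecast on one axis and shade the p10-p90 band:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented weekly history and forecast, purely for illustration.
history = pd.Series([100, 110, 105, 120, 125],
                    index=pd.date_range("2020-01-01", periods=5, freq="W"))
forecast_p50 = pd.Series([128, 131, 135],
                         index=pd.date_range("2020-02-05", periods=3, freq="W"))
forecast_p10 = forecast_p50 * 0.85
forecast_p90 = forecast_p50 * 1.15

fig, ax = plt.subplots()
ax.plot(history.index, history.values, label="actuals")
ax.plot(forecast_p50.index, forecast_p50.values, label="forecast (p50)")
ax.fill_between(forecast_p50.index, forecast_p10, forecast_p90,
                alpha=0.2, label="p10-p90 range")
ax.legend()
fig.savefig("forecast_vs_actuals.png")
```

As actuals arrive, append them to the history series and re-plot; drift outside the shaded band is an early warning about forecast accuracy.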

Choose A “Medium-Term” Time Horizon

Your time horizon - how far into the future your forecast looks - can be at most 500 timesteps or one-third of your time-series data, whichever is smaller. We recommend starting with a horizon of up to 10 percent of your series length. This will give you enough forward-looking forecasts to evaluate the usefulness of your results without taking too long.
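The rule above can be written as a small helper (the function name is ours, not part of any Forecast API):

```python
def choose_horizon(n_observations: int) -> int:
    """Pick a starting forecast horizon from the length of the target series."""
    # Hard ceiling: the lesser of 500 timesteps or one-third of the series.
    max_allowed = min(500, n_observations // 3)
    # "Medium-term" starting point: roughly 10 percent of the series.
    suggested = max(1, n_observations // 10)
    return min(suggested, max_allowed)

print(choose_horizon(730))  # two years of daily data -> horizon of 73
```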

Save Your Data Prep Code

Save the code you use to stage your data for the forecast, because you will be doing this again and don’t want to repeat yourself. An efficient way to do this is to use PySpark code inside a SageMaker notebook. If you end up using your forecast in production, you will eventually place that code into a Glue ETL pipeline (which also uses PySpark), so it is best to write it in PySpark from the start.

Another advantage of PySpark is that its utilities for loading and writing CSV-formatted data to and from S3 are dead simple - and you will be working with CSV throughout your Forecast work.

Interpret The Results!

Amazon publishes a guide to interpreting results, but admittedly it is a little dense if you are not a statistician. One easy metric to look at, especially if you use multiple algorithms, is Root Mean Squared Error (RMSE). You want this to be as low as possible, and, in fact, Amazon Forecast chooses its winning algorithm largely on this value.
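RMSE itself is easy to compute by hand, which helps when comparing algorithms side by side. A minimal sketch with invented numbers:

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Squared Error: penalizes large misses more than small ones."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Illustrative values: what was forecast vs. what actually happened.
print(rmse([100, 120, 130], [110, 115, 125]))  # ~7.07
```

Because the errors are squared before averaging, one badly missed period moves RMSE much more than several small misses - worth remembering when deciding whether a high RMSE is a systemic problem or a single outlier.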

It Will Take Some Time

How long will it take? If you select AutoML, expect model training to take a while - at least 20 minutes, even for the smallest datasets. If your dataset is large, it can take an hour or several hours. The same is true when you generate the actual forecast. So, start it at the beginning of the day so you can work with it before lunch, or near the end of your day so you can look at it in the morning.

Data Prep Details (For Your Data Engineer)

  • Match the 'forecast frequency' to the frequency of your observation timestamps.
  • Set the demand datatype to a float prior to import (it might be an integer).
  • Get comfortable with `strptime` and `strftime` - you have only two options for timestamp format.
  • Assume all data are from the same time zone. If they are not, make them that way. Use Python’s datetime methods.
  • Split out a validation set like this: https://github.com/aws-samples/amazon-forecast-samples/blob/master/notebooks/1.Getting_Data_Ready.ipynb
  • If using pandas dataframes, do not use the index when writing to csv.
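Pulling a few of these checklist items together, here is a hedged pandas sketch (file names, column names, and values are all hypothetical):

```python
import pandas as pd

# Hypothetical raw export: integer demand and string timestamps.
raw = pd.DataFrame({
    "timestamp": ["2020-01-01 09:00:00", "2020-01-02 09:00:00",
                  "2020-01-03 09:00:00"],
    "item_id": ["sku_1", "sku_1", "sku_1"],
    "demand": [12, 15, 9],
})

# Normalize the timestamp into one of the two accepted formats.
raw["timestamp"] = pd.to_datetime(raw["timestamp"]).dt.strftime("%Y-%m-%d %H:%M:%S")

# Cast the target to a float before import.
raw["demand"] = raw["demand"].astype(float)

# Hold out the last observation per item as a simple validation split.
train = raw.groupby("item_id").head(-1)

# index=False: Forecast expects no pandas index column in the CSV.
train.to_csv("target_time_series.csv", index=False)
```

The holdout split here is deliberately simple; the AWS sample notebook linked above shows a fuller approach.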

Conclusion

If you’re ever asked to produce a forecast or predict some number in the future, you now have a robust method at your fingertips to get there. With Amazon Forecast, you have access to Amazon.com’s optimized algorithms for time series forecasting. If you can get your target data into CSV format, then you can use a forecast. Before you start, have a business goal in mind - it is essential to think about ranges of possibilities rather than a discrete number. And be sure to keep in mind our best practices for creating a forecast, such as using a “medium-term” time horizon, visualizing the results, and saving your data preparation code.

About The Author

Rob Whelan is Practice Director, Data & Analytics, 2nd Watch.