Costly Lessons of Time Series Forecasting

Assessing the ideas that defined a competition model and two attempts at cloud deployment

The following article describes what began as a fun challenge and expression of creativity, then unraveled into a development quagmire that took two lengthy attempts to resolve. The end product is a near-zero cost cloud architecture capable of automatically retraining a specially designed forecasting model, scripted via AWS CDK.

The Premise: Kaggle

Kaggle is a data science platform owned by Google. It's a popular starting point for practitioners of data science, with tutorials, community members, and data science competitions of all difficulty levels. One of Kaggle's flagship tutorial competitions is Store Sales - Time Series Forecasting, where participants compete using ~4 years of sales data to accurately predict the next 2 weeks.

In particular, there are 54 stores selling up to 33 families of items, with each combination of store and item family constituting its own time series. Additionally, the dataset includes exogenous data such as oil prices and holidays.

Model Architecture

The architecture I developed depended on one central idea. Conventionally, one single model could incorporate all training data to make predictions. Alternatively, each combination of store and item family, totaling 1,782 in number, could train its own “personal” time series model. I took a third route, in which each store and each item family possessed its own model, and their predictions would factor into a unique weighted sum for each combination; this totaled 87 time series models prior to weighting.
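The counts above can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic, not code from the competition notebooks: the store and family labels, the combine function, and the example weights are all hypothetical placeholders.

```python
from itertools import product

N_STORES = 54     # stores in the competition data
N_FAMILIES = 33   # item families

# Hypothetical labels, used only to enumerate the combinations.
stores = [f"store_{i}" for i in range(1, N_STORES + 1)]
families = [f"family_{j}" for j in range(1, N_FAMILIES + 1)]

# A "personal" model per (store, family) pair would need one model per combination.
combinations = list(product(stores, families))

# The third route: one model per store plus one model per item family.
n_base_models = len(stores) + len(families)


def combine(store_pred: float, family_pred: float,
            w_store: float, w_family: float) -> float:
    """Weighted sum of the two base predictions for one combination.

    The weights are placeholders; each combination would have its own pair.
    """
    return w_store * store_pred + w_family * family_pred


print(len(combinations))  # 1782
print(n_base_models)      # 87
print(combine(100.0, 80.0, w_store=0.25, w_family=0.75))  # 85.0
```

The appeal of the third route shows up in the numbers: 87 base models instead of 1,782, with the per-combination detail pushed into the weights.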

The architecture employed ARIMA-based models to simulate each time series, and an XGBoost-based model to incorporate ARIMA predictions and remaining information into the weighted sums. Dataset preprocessing also contained several engineered features, statistical transforms, and ARIMA parameters chosen using econometric methods. It was an unusual foray, but the model worked well, with the competition entry placing in the top 18%.
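To make the two-stage idea concrete, here is a dependency-free sketch of how second-stage weights might be learned from first-stage predictions. It makes two loud assumptions: ordinary least squares stands in for the XGBoost-based model purely to avoid external libraries, and the prediction values are made up rather than produced by ARIMA. The function name fit_two_weights is hypothetical.

```python
def fit_two_weights(store_preds, family_preds, actuals):
    """Least-squares weights (no intercept) for two base predictions.

    Stand-in for the second-stage model: solves the 2x2 normal
    equations in closed form instead of training XGBoost.
    """
    saa = sum(a * a for a in store_preds)
    sbb = sum(b * b for b in family_preds)
    sab = sum(a * b for a, b in zip(store_preds, family_preds))
    say = sum(a * y for a, y in zip(store_preds, actuals))
    sby = sum(b * y for b, y in zip(family_preds, actuals))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("base predictions are collinear; weights are not unique")
    w_store = (say * sbb - sby * sab) / det
    w_family = (sby * saa - say * sab) / det
    return w_store, w_family


# Made-up first-stage predictions and observed sales for one combination.
store_preds = [100.0, 120.0, 90.0]
family_preds = [80.0, 70.0, 110.0]
actuals = [85.0, 82.5, 105.0]

w_store, w_family = fit_two_weights(store_preds, family_preds, actuals)
print(round(w_store, 6), round(w_family, 6))  # 0.25 0.75
```

A tree-based model can also use the remaining features (holidays, oil prices, engineered features) and capture nonlinear effects, which a plain linear fit like this one cannot.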

“You Can Script It, But Can You Deploy It?”

I may rarely or never be given the leeway to design such unorthodox models for commercial use, but I felt it was only fair that I try my hand at simulating practical usage of this model. I had three goals:

Host the model architecture on AWS
Simulate periodic data ingestion and retraining
Be professional

The last goal, while seemingly innocuous, would shape the looming catastrophe of the first deployment attempt. Despite incessant effort and genuine interest, I ended up with only bits and pieces of an untested, high-capacity architecture. An autopsy follows:

Tool choice obsession & unfamiliarity: I explicitly chose Apache Spark for processing and Apache Kafka for message delivery because they are industry standards. I was thus recommended AWS EMR and MSK. These are incomprehensibly scalable provisioned compute clusters that quickly generated hundreds of dollars in expenses. That is money I still wouldn't mind having back!

Development process unfamiliarity: Most online AWS tutorials, for clarity, show direct provisioning of resources and IAM grants through the AWS console. I had simply never heard of infrastructure as code at the time, and decided to ignore lingering concerns of infrastructure management and reproducibility. This left me with a functionless codebase and arbitrarily provisioned resources.

ChatGPT affirmation: Even throughout the second attempt, large language models essentially never gave explicit disapproval of proposed architectures. Multiple ideas with fundamental conflicts were greenlit with, at most, moderate tweaks suggested. Cloud architecture design depends on human domain knowledge.

This deployment effort was shelved for more than a year. In the back of my mind, though, the question remained. It was of course possible to deploy the competition model, and I had a responsibility to learn how. With new knowledge of AWS services, data engineering, and the development process, I returned.

Successful Second Attempt

The second attempt was defined by IaC-validated development, deeper knowledge of the AWS and data engineering ecosystems, and stricter attention to cost and professionalism. Linked below is the finished product; the store-sales-aws GitHub repository requires the installation of only a few tools, can be deployed in a matter of minutes, can flexibly simulate data ingestion and automatic retraining, and contains thorough commit history, documentation, and continuous integration. It even has a simple dashboard for monitoring accuracy.

Although the project’s design decisions were refined gradually, there were a few particular circumstances that epitomized the lessons that directed them:

Planning data processing infrastructure: Data storage was essentially restricted to S3, with processing restricted to ECS Fargate tasks and Lambda functions. More code would be lifted directly from the competition notebooks than I anticipated. AWS’s myriad data engineering services (and RDBMSs) proved to be suited to solutions that were more standardized, adaptable, and business-oriented than what the competition model could offer.

Cyclical architecture prevention: Cyclical dependencies between components deceptively presented themselves as single runtime errors on cdk synth or cdk deploy, but in fact needed a wide, conceptual review. EventBridge was crucial in decoupling components that had to communicate with one another, and I couldn’t have devised the execution queue architecture on my own. Avoiding tunnel vision and keeping an eye on big-picture solutions was key.

LLM code generation and quality management: LLM-boosted code development was invaluable, but without standards of abstraction and error handling established early, code quality was only moderately kept from spiraling out of control. This had to be addressed through an extra refactoring branch after release of the MVP. Fortunately, a major step up had still been made with the adoption of IaC and finer usage of agents. Here, raw background knowledge showed its importance.
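The cyclical architecture point can be illustrated with a small dependency graph. This sketch assumes the cycles were dependency cycles between components, which is my reading rather than something stated above, and it uses Python's standard graphlib rather than anything CDK does internally. The component names and the deploy_order helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def deploy_order(dependencies):
    """Return one valid deployment order, or None if the graph has a cycle.

    `dependencies` maps each component to the set of components it depends on.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError:
        return None


# Hypothetical: two components that reference each other directly.
coupled = {
    "compute": {"orchestration"},
    "orchestration": {"compute"},
}

# Hypothetical: both sides depend only on a shared event bus instead.
decoupled = {
    "event_bus": set(),
    "compute": {"event_bus"},
    "orchestration": {"event_bus"},
}

print(deploy_order(coupled))        # None
print(deploy_order(decoupled)[0])   # event_bus
```

The error surfaces at one edge of the graph, but the fix is structural: route communication through an intermediary so that neither component needs a direct reference to the other.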

Below is an architectural diagram with an explanation following:

The CloudFormation stacks are as follows:

Storage: Consisting of S3 buckets for data and serialized model storage, and DynamoDB tables for metadata.

Compute: Consisting of 2 ECS Fargate tasks that separate training between the SARIMAX and XGBoost phases, an ECS Fargate task for evaluation of the latest model on new data, and a Lambda function for data preprocessing.

Orchestration: Containing an AWS Step Functions state machine to run compute tasks in a DAG (directed acyclic graph) workflow, along with supporting infrastructure to queue and trigger jobs on every 15th/16th day of accumulated data.

Monitoring: Containing only a CloudWatch dashboard to observe evolving XGBoost/StackingRegressor model quality.
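The Orchestration stack's DAG workflow and retraining trigger can be sketched in plain Python. This is not the Step Functions definition from the repository: the task names, the edges between them, and the 16-day window are assumptions made for illustration, and graphlib stands in for the state machine.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each maps to the set of tasks it depends on.
# The edges are assumed, not taken from the repository.
WORKFLOW = {
    "preprocess": set(),
    "evaluate_latest_model": {"preprocess"},
    "train_sarimax": {"preprocess"},
    "train_xgboost": {"train_sarimax"},
}

RETRAIN_WINDOW_DAYS = 16  # assumed window size


def should_trigger(days_accumulated: int, window: int = RETRAIN_WINDOW_DAYS) -> bool:
    """True when a whole number of windows of new data has accumulated."""
    return days_accumulated > 0 and days_accumulated % window == 0


def run_in_stages(workflow: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into stages; tasks within a stage have no unmet dependencies."""
    sorter = TopologicalSorter(workflow)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable display order
        stages.append(ready)
        sorter.done(*ready)
    return stages


if should_trigger(16):
    for stage in run_in_stages(WORKFLOW):
        print(stage)
```

Under these assumed edges, evaluation of the latest model and SARIMAX training land in the same stage, since both depend only on preprocessing.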

There are no fancy distributed compute services, and frankly it was disappointing to omit many of the core AWS data engineering services, but managing expenses and integration is paramount. This architecture costs about 1/1000th as much as the last.

Conclusion

The events in this article spanned two years and came in three distinct phases. While the data science portion of the store sales project was largely carefree, the warning regarding translation from notebook code to deployment rang all too true. It took careful persistence and trust in textbook methods to transform the store sales project from financial ruin into a reliable, reproducible application. Next, I’ll be filling in some of the knowledge gaps that remain, such as the industry-standard Terraform, and perhaps basic Databricks or Snowflake. Regardless, this project closed an open question about my development skills and gave me good exposure to cloud engineering on AWS.

GitHub link: https://github.com/zani-t/store-sales-aws

@zanit