Run Pretrained Foundation Models on Amazon SageMaker
Foundation models are large pretrained models that generate predictions zero-shot on new data. Because they’re trained on massive, diverse datasets, they generalize to unseen data out of the box — no dataset-specific fitting required.
That makes the workflow much simpler than training your own time series predictor, which requires you to first fit a predictor on your data and then manage the trained artifact. With foundation models you skip the fit step entirely and go straight to deploying an endpoint or running batch predictions.
AutoGluon-Cloud exposes this workflow through TimeSeriesFoundationModel. For now it covers time series forecasting only, with models like Chronos-2 available out of the box.
Attention
SageMaker compute and S3 storage are billed to your AWS account. AutoGluon-Cloud is a free wrapper, but it’s your responsibility to monitor usage and delete endpoints when no longer needed.
Create the model
A TimeSeriesFoundationModel needs an IAM execution role (so SageMaker can run jobs on your behalf) and an S3 bucket (to stage data and store outputs). There are two ways to supply them:
Use a saved config (recommended). Save the role and bucket once to ~/.autogluon/cloud.yaml (see Setup), and subsequent constructor calls will pick them up automatically:
from autogluon.cloud import TimeSeriesFoundationModel
model = TimeSeriesFoundationModel(model_id="chronos-2")
Pass them at construction. Useful when you need different roles or buckets per call:
model = TimeSeriesFoundationModel(
    model_id="chronos-2",
    role="arn:aws:iam::222222222222:role/MyAutoGluonRole",
    cloud_output_path="s3://my-autogluon-bucket/ag-foundation-model",
)
The examples in the rest of this tutorial reuse a single model object created this way.
Available models
The following model_id values are currently supported. Chronos-2 models natively support covariates and cross-learning across items, while Chronos-Bolt is univariate-only.
| Model ID | Documentation | Weights |
|---|---|---|
chronos-2 is the recommended model — it supports covariates, cross-learning across items, and context lengths up to 8192 time steps. For background on Chronos models, see the Forecasting with Chronos-2 tutorial.
Data
The examples use a retail sales dataset with weekly sales for 1,115 stores. Load the historical observations:
import pandas as pd
data = pd.read_parquet("https://autogluon.s3.amazonaws.com/datasets/timeseries/retail_sales/train.parquet")
data.head()
| | id | timestamp | Sales | Open | Promo | SchoolHoliday | StateHoliday | Customers |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2013-01-13 | 32952.0 | 0.857143 | 0.714286 | 5.0 | 0.0 | 3918.0 |
| 1 | 1 | 2013-01-20 | 25978.0 | 0.857143 | 0.000000 | 0.0 | 0.0 | 3417.0 |
| 2 | 1 | 2013-01-27 | 33071.0 | 0.857143 | 0.714286 | 0.0 | 0.0 | 3862.0 |
| 3 | 1 | 2013-02-03 | 28693.0 | 0.857143 | 0.000000 | 0.0 | 0.0 | 3561.0 |
| 4 | 1 | 2013-02-10 | 35771.0 | 0.857143 | 0.714286 | 0.0 | 0.0 | 4094.0 |
At a minimum, the input must contain three columns: an item ID, a timestamp, and the target value to forecast — here, id, timestamp, and Sales. The remaining columns (Open, Promo, SchoolHoliday, StateHoliday, Customers) are covariates, used by models that support them like Chronos-2 and ignored by univariate-only models like Chronos-Bolt. See the Time Series Quick Start for the long-format schema.
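Before launching a job, it can help to confirm that the three required columns are present and to see which columns will be treated as covariates. The following is a minimal sketch: `missing_required_columns` is a hypothetical helper, not part of AutoGluon-Cloud, and it accepts any iterable of column names (for example `data.columns`).

```python
from typing import Iterable, List


def missing_required_columns(
    columns: Iterable[str],
    id_column: str = "id",
    timestamp_column: str = "timestamp",
    target: str = "Sales",
) -> List[str]:
    """Return the required column names that are absent from `columns`.

    Hypothetical helper for illustration; not part of AutoGluon-Cloud.
    """
    present = set(columns)
    required = [id_column, timestamp_column, target]
    return [name for name in required if name not in present]


# Column names taken from the retail sales dataset shown above.
columns = ["id", "timestamp", "Sales", "Open", "Promo",
           "SchoolHoliday", "StateHoliday", "Customers"]
missing = missing_required_columns(columns)
# Everything that is not an ID, timestamp, or target is a covariate.
covariates = [c for c in columns if c not in ("id", "timestamp", "Sales")]
print(missing)     # []
print(covariates)  # ['Open', 'Promo', 'SchoolHoliday', 'StateHoliday', 'Customers']
```

With a real DataFrame, the call would be `missing_required_columns(data.columns)`.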
Chronos-2 can optionally use future values of covariates known ahead of time (e.g. holidays or planned promotions). The test split contains those future values — drop Sales (the target) since it’s what we want to predict:
known_covariates = (
    pd.read_parquet("https://autogluon.s3.amazonaws.com/datasets/timeseries/retail_sales/test.parquet")
    .drop(columns=["Sales"])
)
known_covariates.head()
| | id | timestamp | Open | Promo | SchoolHoliday | StateHoliday |
|---|---|---|---|---|---|---|
| 0 | 1 | 2015-05-03 | 0.714286 | 0.714286 | 0.0 | 1.0 |
| 1 | 1 | 2015-05-10 | 0.857143 | 0.714286 | 0.0 | 0.0 |
| 2 | 1 | 2015-05-17 | 0.714286 | 0.000000 | 0.0 | 1.0 |
| 3 | 1 | 2015-05-24 | 0.857143 | 0.714286 | 0.0 | 0.0 |
| 4 | 1 | 2015-05-31 | 0.714286 | 0.000000 | 0.0 | 1.0 |
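A quick sanity check on the future covariates can catch horizon mismatches before a job is submitted. The sketch below assumes the cloud wrapper follows the same convention as AutoGluon's local time series predictor, where known covariates must cover every forecast step for every item. That is an assumption about this API, and `items_with_short_horizon` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def items_with_short_horizon(
    item_ids: Iterable[Hashable], prediction_length: int
) -> Dict[Hashable, int]:
    """Map each item ID to its row count when it has fewer than
    `prediction_length` future rows.

    Hypothetical helper for illustration; not part of AutoGluon-Cloud.
    """
    counts = Counter(item_ids)
    return {item: n for item, n in counts.items() if n < prediction_length}


# Toy IDs standing in for known_covariates["id"]:
# item 1 has 3 future rows, item 2 has only 2.
toy_ids = [1, 1, 1, 2, 2]
print(items_with_short_horizon(toy_ids, prediction_length=3))  # {2: 2}
```

With the DataFrame loaded above, the call would be `items_with_short_horizon(known_covariates["id"], prediction_length=13)`; an empty result means every item has enough future rows.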
Inference modes
TimeSeriesFoundationModel supports three inference modes on SageMaker. The right choice depends on how often you need predictions and how much latency you can tolerate:
Batch prediction — launch a one-off SageMaker job that scores a dataset and writes the results to S3. Compute spins up, runs, and shuts down automatically. Best for offline forecasting on larger datasets where minutes of startup latency are fine.
Real-time inference — deploy the model to a long-running SageMaker endpoint and send requests over HTTPS. Lowest per-request latency, supports GPU instances. You pay for the endpoint as long as it’s up, so best when you need predictions on demand and have steady traffic.
Serverless inference — deploy to a SageMaker Serverless Inference endpoint that scales to zero between requests. You only pay for active inference time. Best for intermittent or unpredictable traffic. Trade-offs: CPU only, cold-start latency on the first request after idle, and an extra setup step to bundle weights into a single artifact.
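The trade-offs above can be condensed into a small decision helper. This is only a rough sketch of the guidance in this section: `suggest_inference_mode` and its `traffic` labels are hypothetical, and the returned strings are descriptive names rather than arguments to any AutoGluon-Cloud call.

```python
def suggest_inference_mode(traffic: str, needs_gpu: bool = False) -> str:
    """Suggest an inference mode from a coarse traffic profile.

    `traffic` is one of "offline", "steady", or "intermittent"
    (hypothetical labels chosen for this illustration).
    """
    if traffic not in ("offline", "steady", "intermittent"):
        raise ValueError(f"unknown traffic profile: {traffic!r}")
    if traffic == "offline":
        # One-off scoring where minutes of startup latency are acceptable.
        return "batch"
    if traffic == "steady" or needs_gpu:
        # Serverless endpoints are CPU only, so GPU needs imply real-time.
        return "real-time"
    # Intermittent traffic on CPU: pay only for active inference time.
    return "serverless"


print(suggest_inference_mode("offline"))                       # batch
print(suggest_inference_mode("steady"))                        # real-time
print(suggest_inference_mode("intermittent"))                  # serverless
print(suggest_inference_mode("intermittent", needs_gpu=True))  # real-time
```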
The examples below all reuse the data and known_covariates DataFrames loaded above.
Batch prediction
Use predict() to score a dataset as a one-off job. It returns a DataFrame of forecasts:
predictions = model.predict(
    data=data,
    target="Sales",
    id_column="id",
    timestamp_column="timestamp",
    prediction_length=13,
    known_covariates=known_covariates,  # optional
)
The job also writes the forecasts to S3 as a CSV. By default they land at {cloud_output_path}/{job_name}/predictions.csv; pass predictions_path to choose an explicit destination:
predictions = model.predict(
    data=data,
    target="Sales",
    id_column="id",
    timestamp_column="timestamp",
    prediction_length=13,
    predictions_path="s3://my-bucket/forecasts/2026-06-02.csv",
)
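To locate the output of a job that did not set predictions_path, the default layout described above can be composed by hand. The helper below is a hypothetical illustration of that `{cloud_output_path}/{job_name}/predictions.csv` pattern. Stripping a trailing slash is an assumption made here for tidiness, not documented library behavior, and the job name is made up.

```python
def default_predictions_path(cloud_output_path: str, job_name: str) -> str:
    """Compose {cloud_output_path}/{job_name}/predictions.csv.

    Hypothetical helper mirroring the default layout described above;
    not part of AutoGluon-Cloud.
    """
    base = cloud_output_path.rstrip("/")  # avoid a double slash in the URI
    return f"{base}/{job_name}/predictions.csv"


# "ag-example-job" is a made-up job name for illustration.
path = default_predictions_path(
    "s3://my-autogluon-bucket/ag-foundation-model/", "ag-example-job"
)
print(path)
# s3://my-autogluon-bucket/ag-foundation-model/ag-example-job/predictions.csv
```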
For long-running jobs you can return immediately with wait=False. predict() then returns a JobPredictionFuture you can poll with .status() and resolve with .result():
future = model.predict(
    data=data,
    target="Sales",
    id_column="id",
    timestamp_column="timestamp",
    prediction_length=13,
    wait=False,
)
print(future.job_name, future.status()) # 'ag-...', 'InProgress'
predictions = future.result() # blocks until the job finishes, returns a DataFrame
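If you want to poll with your own interval and timeout instead of blocking in .result(), a small loop around .status() is enough. The sketch below relies only on the `job_name`, `status()`, and `result()` members shown above. `wait_for_result` is a hypothetical helper, and `FakeFuture` (including its "Completed" status string) is a stand-in so the example runs without AWS.

```python
import time
from typing import Any


def wait_for_result(
    future: Any, poll_seconds: float = 30.0, timeout_seconds: float = 3600.0
) -> Any:
    """Poll future.status() until it is no longer 'InProgress',
    then return future.result().

    Assumes 'InProgress' (as printed above) marks a running job;
    handling of failed jobs is left to result().
    """
    deadline = time.monotonic() + timeout_seconds
    while future.status() == "InProgress":
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {future.job_name} still running after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
    return future.result()


class FakeFuture:
    """Stand-in for a prediction future so the sketch runs without AWS."""

    job_name = "ag-example-job"  # made-up job name

    def __init__(self, polls_until_done: int) -> None:
        self._remaining = polls_until_done

    def status(self) -> str:
        if self._remaining > 0:
            self._remaining -= 1
            return "InProgress"
        return "Completed"  # assumed terminal status for this fake

    def result(self) -> str:
        return "forecasts"


print(wait_for_result(FakeFuture(polls_until_done=2), poll_seconds=0.01))  # forecasts
```

With a real job, the call would be `predictions = wait_for_result(future, poll_seconds=60)`.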
Real-time inference
Deploy the model to a SageMaker endpoint with deploy(), then send requests through the returned endpoint. Pick an instance_type based on cost and latency requirements (defaults to ml.g5.xlarge):
endpoint = model.deploy(instance_type="ml.g5.xlarge") # takes a few minutes
predictions = endpoint.predict(
    data=data,
    target="Sales",
    id_column="id",
    timestamp_column="timestamp",
    prediction_length=13,
    known_covariates=known_covariates,  # optional
)
The endpoint stays active — and billed — until you delete it:
endpoint.delete_endpoint()
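Because a forgotten endpoint keeps accruing charges, one option is to wrap deployment in a context manager so that cleanup runs even when a request fails. This is a minimal sketch: `temporary_endpoint` is a hypothetical helper that relies only on the deploy() and delete_endpoint() calls used above, and the fake classes exist solely so the example runs without AWS.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def temporary_endpoint(model: Any, **deploy_kwargs: Any) -> Iterator[Any]:
    """Deploy an endpoint and delete it on exit, even if prediction fails.

    Hypothetical helper; not part of AutoGluon-Cloud.
    """
    endpoint = model.deploy(**deploy_kwargs)
    try:
        yield endpoint
    finally:
        endpoint.delete_endpoint()


class FakeEndpoint:
    """Stand-in endpoint that records whether it was deleted."""

    def __init__(self) -> None:
        self.deleted = False

    def delete_endpoint(self) -> None:
        self.deleted = True


class FakeModel:
    """Stand-in model whose deploy() returns a FakeEndpoint."""

    def __init__(self) -> None:
        self.endpoint = FakeEndpoint()

    def deploy(self, **kwargs: Any) -> FakeEndpoint:
        return self.endpoint


fake_model = FakeModel()
try:
    with temporary_endpoint(fake_model, instance_type="ml.g5.xlarge") as endpoint:
        raise RuntimeError("simulated prediction failure")
except RuntimeError:
    pass
print(fake_model.endpoint.deleted)  # True
```

With the real model object, the body of the `with` block would hold the `endpoint.predict(...)` call.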
Serverless inference
Serverless endpoints scale to zero between requests, so you only pay for active inference time. They run network-isolated, which means the model weights have to be bundled into a single model.tar.gz ahead of time rather than downloaded from HuggingFace at deploy time. Use cache_model_artifact() to do this once, then deploy with inference_mode="serverless":
cached_model = model.cache_model_artifact("s3://YOUR-BUCKET/fm-cache")
print(cached_model.model_artifact_uri) # 's3://YOUR-BUCKET/fm-cache/chronos-2/model.tar.gz'
endpoint = cached_model.deploy(inference_mode="serverless")
predictions = endpoint.predict(
    data=data,
    target="Sales",
    id_column="id",
    timestamp_column="timestamp",
    prediction_length=13,
    known_covariates=known_covariates,  # optional
)
endpoint.delete_endpoint()
Subsequent runs can skip cache_model_artifact by passing the bundled artifact straight to the constructor:
model = TimeSeriesFoundationModel(
    model_id="chronos-2",
    model_artifact_uri="s3://YOUR-BUCKET/fm-cache/chronos-2/model.tar.gz",
)
endpoint = model.deploy(inference_mode="serverless")