If you log an artifact that does not track external files, W&B saves the artifact's files to W&B servers. This is the default behavior when you log artifacts with the W&B Python SDK. See the Artifacts quickstart for information on how to save files and directories to W&B servers instead.
Track an artifact in an external bucket
Use the W&B Python SDK to track references to files stored outside of W&B:

1. Initialize a run with `wandb.init()`.
2. Create an artifact object with `wandb.Artifact()`.
3. Specify the reference to the bucket path with the artifact object's `add_reference()` method.
4. Log the artifact's metadata with `run.log_artifact()`.
Suppose the `datasets/mnist/` directory in your bucket contains a collection of images. Track the directory as a dataset with `wandb.Artifact.add_reference()`. The following code sample creates a reference artifact `mnist:latest` using the artifact object's `add_reference()` method:
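A minimal sketch of that sample, assuming a placeholder bucket named `my-bucket`:

```python
import wandb

run = wandb.init()
artifact = wandb.Artifact("mnist", type="dataset")
# Track the bucket prefix by reference; only metadata is stored, no files are uploaded to W&B
artifact.add_reference("s3://my-bucket/datasets/mnist")
run.log_artifact(artifact)
```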
W&B Artifacts support any Amazon S3 compatible interface, including CoreWeave Storage and MinIO. The scripts described below work as-is with both providers when you set the `AWS_S3_ENDPOINT_URL` environment variable to point at your CoreWeave Storage or MinIO server.

By default, W&B imposes a 10,000 object limit when adding an object prefix. You can adjust this limit by specifying `max_objects=` when you call `add_reference()`.
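For example, a sketch combining both settings (the endpoint URL and bucket name are placeholders):

```python
import os

import wandb

# Point W&B's S3 client at a CoreWeave Storage or MinIO endpoint (placeholder URL)
os.environ["AWS_S3_ENDPOINT_URL"] = "https://storage.example.com"

run = wandb.init()
artifact = wandb.Artifact("mnist", type="dataset")
# Raise the default 10,000 object limit for a large prefix
artifact.add_reference("s3://my-bucket/datasets/mnist", max_objects=100_000)
run.log_artifact(artifact)
```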
Download an artifact from an external bucket

When W&B downloads a reference artifact, it retrieves the files from the underlying bucket using the metadata recorded when the artifact was logged. If your bucket has object versioning enabled, W&B retrieves the object version that corresponds to the state of each file at the time the artifact was logged. As you evolve the contents of your bucket, you can always point to the exact version of the data a given model was trained on, because the artifact serves as a snapshot of your bucket during the training run.

W&B recommends that you enable 'Object Versioning' on your storage buckets if you overwrite files as part of your workflow. With versioning enabled on your buckets, artifacts with references to files that have been overwritten remain intact because the older object versions are retained. Based on your use case, read the instructions to enable object versioning: AWS, GCP, Azure.

The following code sample shows how to download a reference artifact. The APIs for downloading artifacts are the same for both reference and non-reference artifacts:
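A minimal sketch, assuming the `mnist:latest` artifact from the earlier example:

```python
import wandb

run = wandb.init()
# Download works the same way for reference and non-reference artifacts
artifact = run.use_artifact("mnist:latest", type="dataset")
artifact_dir = artifact.download()
```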
Add and download an external reference example
The following code sample uploads a dataset to an Amazon S3 bucket, tracks it with a reference artifact, then downloads it. (For GCP or Azure, see the corresponding W&B reports for an end-to-end walkthrough of tracking artifacts by reference.)
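A sketch of that flow, assuming boto3 is installed; the bucket name and project name are placeholders:

```python
from pathlib import Path

import boto3
import wandb

# Upload a local dataset directory to S3 (bucket name is a placeholder)
s3 = boto3.client("s3")
for path in Path("datasets/mnist").glob("*"):
    s3.upload_file(str(path), "my-bucket", f"datasets/mnist/{path.name}")

# Track the uploaded prefix with a reference artifact
with wandb.init(project="reference-artifacts") as run:
    artifact = wandb.Artifact("mnist", type="dataset")
    artifact.add_reference("s3://my-bucket/datasets/mnist")
    run.log_artifact(artifact)

# Download the referenced files back from the bucket
with wandb.init(project="reference-artifacts") as run:
    artifact = run.use_artifact("mnist:latest")
    local_dir = artifact.download()
```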
Cloud storage credentials
W&B uses the default mechanism to look for credentials based on the cloud provider you use. Read the documentation from your cloud provider to learn more about the credentials used:

Cloud provider | Credentials documentation |
---|---|
CoreWeave AI Object Storage | CoreWeave AI Object Storage documentation |
AWS | Boto3 documentation |
GCP | Google Cloud documentation |
Azure | Azure documentation |
For AWS, if the bucket is not in your configured default region, set the `AWS_REGION` environment variable to match the bucket region.
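For example, one way to set it from Python (the region value is a placeholder):

```python
import os

# Match the region of the bucket that holds the referenced objects
os.environ["AWS_REGION"] = "us-east-1"
```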
Rich media such as images, audio, video, and point clouds may fail to render in the App UI depending on the CORS configuration of your bucket. If rich media does not render, ensure that `app.wandb.ai` is allowlisted in your bucket's CORS policy.

Track an artifact in a filesystem
Another common pattern for fast access to datasets is to expose an NFS mount point to a remote filesystem on all machines running training jobs. This can be an even simpler solution than a cloud storage bucket because, from the perspective of the training script, the files look just like they are sitting on your local filesystem. Luckily, that ease of use extends to using Artifacts to track references to filesystems, whether they are mounted or not. Suppose you have a filesystem mounted at `/mount` with the following structure:
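For illustration, a layout consistent with the paths used below (the exact tree is an assumption; only `datasets/mnist/` is required by the examples):

```
mount/
└── datasets/
    └── mnist/
        ├── 0.png
        ├── 1.png
        └── ...
```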
`mnist/` is a dataset, a collection of images. You can track it with an artifact:
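A minimal sketch of that call:

```python
import wandb

run = wandb.init()
artifact = wandb.Artifact("mnist", type="dataset")
# The file:// scheme marks this as a filesystem reference
artifact.add_reference("file:///mount/datasets/mnist/")
run.log_artifact(artifact)
```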
By default, W&B imposes a 10,000 file limit when adding a reference to a directory. You can adjust this limit by specifying `max_objects=` when you call `add_reference()`.

Note the triple slash in the reference URL. The first component is the `file://` prefix that denotes the use of filesystem references. The second component is the path to the dataset, `/mount/datasets/mnist/`.
The resulting artifact `mnist:latest` looks and acts like a regular artifact. The only difference is that the artifact consists only of metadata about the files, such as their sizes and MD5 checksums. The files themselves never leave your system.
You can interact with this artifact just as you would a normal artifact. In the UI, you can browse the contents of the reference artifact using the file browser, explore the full dependency graph, and scan through the versioned history of your artifact. However, the UI cannot render rich media such as images and audio, because the data itself is not contained within the artifact.
Downloading a reference artifact:
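A minimal sketch, matching the artifact created above:

```python
import wandb

run = wandb.init()
artifact = run.use_artifact("mnist:latest", type="dataset")
# Copies the files from the referenced filesystem paths
artifact_dir = artifact.download()
```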
The `download()` operation copies the files from the referenced paths to construct the artifact directory. In the above example, the contents of `/mount/datasets/mnist` are copied into the directory `artifacts/mnist:v0/`. If an artifact contains a reference to a file that was overwritten, then `download()` throws an error because the artifact can no longer be reconstructed.
Putting it all together, you can use the following code to track a dataset under a mounted filesystem that feeds into a training job:
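A sketch of that combined flow (the project name is a placeholder):

```python
import wandb

run = wandb.init(project="mnist-training")

artifact = wandb.Artifact("mnist", type="dataset")
artifact.add_reference("file:///mount/datasets/mnist/")

# Track the artifact and mark it as an input to this run in one step;
# a new version is only logged if the referenced files changed
artifact = run.use_artifact(artifact)

# Materialize the referenced files for training
artifact_dir = artifact.download()
# ... train on the files in artifact_dir ...
```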