How we delivered a production-grade AI platform on Google Cloud with 99.4% uptime in 12 weeks

7 min read
Maksym Bohdan
August 14, 2025

When an early-stage AI startup in retail analytics approached us, they had a clear vision but a big challenge ahead. Their team was building impressive machine learning models in Jupyter notebooks and serving business logic through lightweight Python microservices—great for prototyping, but far from ready for enterprise-scale production.

With plans to onboard more than 20 new customers in just six months, they needed a complete transformation: a secure, automated infrastructure on Google Cloud, a reliable CI/CD process for all backend and ML services, and a lean MLOps layer to manage models in production.

In this case study, we’ll walk through how we built their platform from the ground up, including Terraform-provisioned infrastructure, GitOps deployments, Vertex AI pipelines, model versioning, and real-time monitoring.

About the client

Our client* is an early-stage AI startup building advanced analytics solutions for large retail chains. Their platform combines sales, inventory, and customer behavior data to generate actionable insights, relying on machine learning models to predict demand, optimize pricing, and improve store operations.

At the time they approached us, the team’s workflow was optimized for speed of experimentation: data scientists worked in Jupyter notebooks, while backend developers served business logic via lightweight Python microservices. 

*At this time, we cannot disclose the company’s name due to NDA restrictions, but all technical details and results presented here are based on the actual project.

The challenge before scaling

To onboard more than 20 new enterprise customers within six months, the startup needed to transform its prototype into a robust, production-ready cloud platform. The priorities were clear:

  • Migrate to a secure and scalable cloud environment that could handle multiple customers without significant operational overhead.
  • Establish a DevOps foundation with Infrastructure as Code, CI/CD, and GitOps deployment workflows.
  • Implement a lean but effective MLOps layer for model versioning, experimentation tracking, and automated deployment.
  • Introduce monitoring and alerting to ensure high availability and quickly detect issues in both infrastructure and models.

At the time, deployments were manual and time-consuming, with no automated testing or release process. Models existed as files passed between team members, making it difficult to track versions or roll back changes. 

Any scaling effort without fixing these gaps would have risked outages, inconsistent model performance, and long lead times for new customer onboarding.

Building the DevOps foundation on Google Cloud

We began by defining every infrastructure component in Terraform, ensuring the entire setup is reproducible and version-controlled. The client’s requirements demanded three fully isolated environments—development, staging, and production—each deployed in its own GCP project for strict separation of workloads and permissions.

For each environment, Terraform provisions:

  • Artifact Registry for storing and versioning Docker images of both microservices and ML components.
  • Private Google Kubernetes Engine (GKE) cluster with no public control plane access. This is essential for reducing the attack surface and enforcing network-level isolation.
  • Google Cloud NAT to allow outbound internet connectivity for pulling dependencies without exposing workloads to the public internet.
  • Google Secret Manager for centrally managing sensitive credentials and API keys with fine-grained IAM policies.
  • Workload Identity to map Kubernetes service accounts to GCP IAM service accounts, eliminating long-lived credentials inside containers.

The infrastructure was built with modular Terraform stacks so that changes to one environment don’t cascade unexpectedly to others. Every change is peer-reviewed via GitHub Pull Requests before being applied through CI/CD.
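
To make the Workload Identity and Secret Manager pieces concrete, here is a minimal sketch of how a service in the cluster can read a credential without any mounted key file; the project and secret names are hypothetical, and authentication comes from the pod's Kubernetes service account mapped to a GCP IAM service account.

```python
# Minimal sketch: reading a secret via Workload Identity (no key file in the pod).
# Requires google-cloud-secret-manager; project and secret names are hypothetical.
from google.cloud import secretmanager


def get_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    # The client uses Application Default Credentials, which inside GKE resolve
    # to the IAM service account bound to the pod through Workload Identity.
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")


if __name__ == "__main__":
    api_key = get_secret("retail-analytics-prod", "pricing-api-key")
    print("Secret loaded, length:", len(api_key))
```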

GitOps-driven delivery for all services

To keep deployments predictable and auditable, we implemented FluxCD as the GitOps controller. It continuously reconciles Kubernetes manifests in Git with the actual state of the cluster. 

This means:

  • Any change in the Git repository (new image tag, updated configuration) is automatically rolled out to the right environment.
  • Rollbacks are as simple as reverting a Git commit.
  • Every deployment is traceable to a specific code change, image digest, and commit hash.

GitHub Actions handles the build and delivery pipeline:

  1. Build Docker images for microservices and ML components.
  2. Run automated unit and integration tests.
  3. Push versioned images to Artifact Registry.
  4. Update Kubernetes manifests in the GitOps repository.
  5. FluxCD detects changes and deploys them automatically.

This approach reduced deployment times from 2–3 days to under 15 minutes per service.
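
Step 4 is the small piece of glue between CI and GitOps: the workflow rewrites the image reference in the GitOps repository, and FluxCD reconciles the rest. A minimal sketch of what that step could look like, assuming PyYAML; the manifest path, layout, and image name are hypothetical.

```python
# Sketch of the "update Kubernetes manifests" CI step: bump the image tag in a
# deployment manifest stored in the GitOps repository, then let the workflow
# commit and push the change for FluxCD to pick up.
# Assumes PyYAML; manifest layout, paths, and image names are hypothetical.
import sys
import yaml


def bump_image(manifest_path: str, new_image: str) -> None:
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)

    # Replace the image of the first container in the Deployment spec.
    containers = manifest["spec"]["template"]["spec"]["containers"]
    containers[0]["image"] = new_image

    with open(manifest_path, "w") as f:
        yaml.safe_dump(manifest, f, sort_keys=False)


if __name__ == "__main__":
    # e.g. python bump_image.py apps/prod/pricing-api/deployment.yaml \
    #      europe-docker.pkg.dev/acme-prod/services/pricing-api:sha-1a2b3c4
    bump_image(sys.argv[1], sys.argv[2])
```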

Network and security hardening

Security was a non-negotiable requirement due to the retail domain’s sensitivity. We implemented:

  • Private GKE clusters with VPC-native networking.
  • Pod Security Policies and Network Policies to control which services can communicate internally.
  • IAM least privilege for every service account.
  • Shielded GKE Nodes to protect against boot-level malware.
  • Binary Authorization to ensure only signed and verified container images can be deployed.

Observability from the ground up

We integrated a monitoring stack using VictoriaMetrics for metrics storage and Grafana for visualization. Custom dashboards were created for:

  • Kubernetes cluster health (CPU, memory, pod restarts).
  • Microservice performance (latency, error rates).
  • Vertex AI model endpoint metrics (latency, request throughput).

Alerting rules are defined in Grafana and routed to Slack via Alertmanager. For example:

  • P95 latency above 500 ms on any API endpoint triggers an immediate alert.
  • More than three pod restarts within five minutes notifies the DevOps team.

This setup ensures that both infrastructure and application-level issues are detected before they impact end users.
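
On the application side, the latency and error-rate panels depend on the microservices exposing metrics that VictoriaMetrics can scrape. A minimal sketch of how a Python service might do this with the prometheus_client library; metric names, labels, and the port are hypothetical.

```python
# Sketch: exposing request latency and error counts from a Python microservice
# so VictoriaMetrics can scrape them and Grafana can plot P95 latency.
# Metric names, labels, and the port are hypothetical placeholders.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["endpoint"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)
REQUEST_ERRORS = Counter(
    "http_request_errors_total", "Failed requests", ["endpoint"]
)


def handle_request(endpoint: str) -> None:
    # Time the request and count failures; real handlers do real work here.
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.3))  # stand-in for request handling
        if random.random() < 0.01:
            REQUEST_ERRORS.labels(endpoint=endpoint).inc()


if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for scraping
    while True:
        handle_request("/v1/forecast")
```

A Grafana panel can then derive P95 latency from the histogram buckets with a histogram_quantile expression, which is what the 500 ms alert rule keys off.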

Scaling for multi-tenant workloads

Given the upcoming 20+ enterprise clients, the architecture needed to support multi-tenancy without duplicating infrastructure. We used namespace-based isolation in GKE combined with Helm charts for templating deployments per customer. Resource quotas ensure that a single tenant cannot monopolize cluster resources.
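
The quota objects themselves are templated by the per-tenant Helm chart, but the shape of the guardrail is simple. Here is a sketch of an equivalent quota created with the Kubernetes Python client; the tenant namespace and limits are hypothetical.

```python
# Sketch of the per-tenant guardrail: a ResourceQuota in the tenant's namespace
# so one customer cannot monopolize cluster resources. In the real setup this
# object is templated by the per-tenant Helm chart; names and limits here are
# hypothetical placeholders.
from kubernetes import client, config


def create_tenant_quota(namespace: str) -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside GKE
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name=f"{namespace}-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={
                "requests.cpu": "8",
                "requests.memory": "16Gi",
                "limits.cpu": "16",
                "limits.memory": "32Gi",
                "pods": "50",
            }
        ),
    )
    client.CoreV1Api().create_namespaced_resource_quota(namespace, quota)


if __name__ == "__main__":
    create_tenant_quota("tenant-acme-retail")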

Continuous cost and performance tuning

GKE node pools were optimized with:

  • Preemptible nodes for non-critical workloads (e.g., nightly batch jobs)—reducing costs by ~70% compared to on-demand.
  • Autoscaling at both node and pod levels to handle unpredictable AI workload spikes.
  • Separation of high-CPU and high-memory node pools for ML training vs. API serving.

We also enabled VPA (Vertical Pod Autoscaler) recommendations for fine-tuning container resource requests over time.

Introducing MLOps with Vertex AI

Once the DevOps foundation was in place, the focus shifted to automating the client’s machine learning workflows to the same level of reproducibility and transparency. We chose Vertex AI as the central MLOps platform, integrating it tightly with the existing CI/CD setup so that deploying a model would feel no different from deploying an application.

GCP Vertex AI covers the full spectrum of ML workflows, including training, evaluation, inference/prediction, and model versioning.

The first step was building a modular ML pipeline with Vertex AI Pipelines and the Kubeflow Pipelines SDK. This pipeline automated the full model lifecycle, encompassing data ingestion and preprocessing, model training and evaluation, registration in the Model Registry, and deployment to production endpoints. Every step was containerized and parameterized, allowing the same workflow to be reused across multiple models and datasets. The entire process was triggered automatically from GitHub Actions, ensuring that any change in model code or configuration led to a fresh pipeline run.
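
In code, the pipeline is a standard Kubeflow Pipelines definition compiled and submitted to Vertex AI Pipelines. Below is a heavily simplified sketch of its shape, assuming the kfp and google-cloud-aiplatform SDKs; the component bodies, project, region, and bucket names are hypothetical, and the registration and deployment steps are omitted for brevity.

```python
# Heavily simplified sketch of the Vertex AI pipeline: each stage is a
# containerized KFP component, and GitHub Actions submits the compiled
# pipeline as a Vertex AI PipelineJob. Project, region, and bucket names
# are hypothetical placeholders.
from kfp import compiler, dsl
from google.cloud import aiplatform


@dsl.component(base_image="python:3.11")
def preprocess(raw_data_uri: str) -> str:
    # Real component: read raw sales/inventory data, write features to GCS.
    return raw_data_uri + "/features"


@dsl.component(base_image="python:3.11")
def train_and_evaluate(features_uri: str) -> str:
    # Real component: train the demand model, log metrics, export artifacts.
    return features_uri + "/model"


@dsl.pipeline(name="demand-forecast-training")
def training_pipeline(raw_data_uri: str):
    features = preprocess(raw_data_uri=raw_data_uri)
    train_and_evaluate(features_uri=features.output)


if __name__ == "__main__":
    compiler.Compiler().compile(training_pipeline, "pipeline.json")
    aiplatform.init(project="retail-analytics-prod", location="europe-west1")
    job = aiplatform.PipelineJob(
        display_name="demand-forecast-training",
        template_path="pipeline.json",
        parameter_values={"raw_data_uri": "gs://acme-retail-raw/2025-08"},
    )
    job.submit()  # triggered from GitHub Actions in the real setup
```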

Experiment tracking was a critical improvement. Previously, model versions were shared as files on local machines, often without clear lineage. With Vertex AI Experiments, every pipeline execution was logged with its parameters, datasets, and metrics. Within the first month, over seventy experiments were recorded, giving the team a transparent history of how each model evolved and why certain versions outperformed others. This level of traceability proved essential when validating models for enterprise customers.
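
A minimal sketch of what that logging looks like with the Vertex AI Experiments SDK; the experiment name, run name, parameters, and metrics are hypothetical.

```python
# Sketch of experiment tracking with Vertex AI Experiments: every pipeline run
# logs its parameters and evaluation metrics so model versions have a clear
# lineage. Names and values below are hypothetical placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="retail-analytics-prod",
    location="europe-west1",
    experiment="demand-forecast",
)

aiplatform.start_run("train-2025-08-14-01")
aiplatform.log_params({"model_type": "xgboost", "max_depth": 8, "eta": 0.1})
aiplatform.log_metrics({"rmse": 12.4, "mape": 0.087})
aiplatform.end_run()
```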

Model registration became the backbone of deployment governance. Each successful pipeline run resulted in a model entry in the Vertex AI Model Registry, complete with metadata on input/output schema, evaluation metrics, and data lineage. Deployments to Vertex AI Endpoints were directly tied to these registry entries, allowing the use of canary releases to gradually roll out new models. This approach minimized risk: if a new version degraded performance or accuracy, traffic could be rolled back within minutes without impacting all users.
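
In SDK terms, the registry entry and the canary are a model upload followed by a deployment that receives only a slice of endpoint traffic. A rough sketch, with hypothetical artifact paths, container image, endpoint name, and machine type:

```python
# Sketch of registry-backed deployment with a canary rollout: upload the trained
# model to the Vertex AI Model Registry, then deploy it to an existing endpoint
# with a small share of traffic. URIs, names, and machine type are hypothetical.
from google.cloud import aiplatform

aiplatform.init(project="retail-analytics-prod", location="europe-west1")

model = aiplatform.Model.upload(
    display_name="demand-forecast",
    artifact_uri="gs://acme-retail-models/demand-forecast/v42",
    serving_container_image_uri=(
        "europe-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-7:latest"
    ),
)

endpoint = aiplatform.Endpoint.list(
    filter='display_name="demand-forecast-endpoint"'
)[0]

# Canary: send 10% of traffic to the new version, keep 90% on the current one.
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    traffic_percentage=10,
)
# Rolling back is a matter of shifting the traffic split back to the previous model.
```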

Monitoring in production was designed to go beyond basic uptime checks. Vertex AI Model Monitoring was configured to track feature drift, prediction drift, and endpoint latency. Baselines were computed from training data, and daily monitoring jobs compared live inputs against these baselines. Two drift events were detected in the first month, both of which triggered retraining workflows before the issues could affect customers. Alerts were delivered to Slack through Cloud Functions, giving the data science team immediate visibility into anomalies.
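
A rough sketch of how such a monitoring job can be configured with the SDK's model_monitoring helpers; feature names, thresholds, dataset URIs, and the alert email are hypothetical, and Slack delivery in this setup goes through a separate Cloud Function.

```python
# Rough sketch of Vertex AI Model Monitoring on the serving endpoint: training
# data provides the skew baseline, per-feature thresholds define what counts as
# drift, and alerts fan out from there. All names, thresholds, and URIs below
# are hypothetical placeholders.
from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

aiplatform.init(project="retail-analytics-prod", location="europe-west1")
endpoint = aiplatform.Endpoint.list(
    filter='display_name="demand-forecast-endpoint"'
)[0]

skew_config = model_monitoring.SkewDetectionConfig(
    data_source="gs://acme-retail-features/training/2025-08.csv",
    target_field="units_sold",
    skew_thresholds={"price": 0.3, "promo_flag": 0.3},
    data_format="csv",
)
drift_config = model_monitoring.DriftDetectionConfig(
    drift_thresholds={"price": 0.3, "store_traffic": 0.3},
)
objective_config = model_monitoring.ObjectiveConfig(skew_config, drift_config)

job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="demand-forecast-monitoring",
    endpoint=endpoint,
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.8),
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=24),  # daily
    alert_config=model_monitoring.EmailAlertConfig(
        user_emails=["ml-alerts@example.com"], enable_logging=True
    ),
    objective_configs=objective_config,
)
```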

Results and impact

After twelve weeks of focused work, the transformation was visible at every level—from infrastructure resilience to the way models were deployed and monitored. What began as a set of local notebooks and loosely connected microservices became a fully automated, enterprise-ready AI platform running on Google Cloud.

DevOps achievements:

  • All infrastructure provisioned via Terraform, enabling full reproducibility and disaster recovery.
  • Deployment time reduced from several days to under 15 minutes per service with GitHub Actions and FluxCD.
  • Uptime improved from 92% to 99.4% through better observability and proactive alerting.
  • GKE autoscaling and dedicated node pools absorbed workload spikes without manual intervention.
  • Use of preemptible nodes for batch jobs cut non-critical workload costs by ~70%.

MLOps achievements:

  • One reusable pipeline in Vertex AI managing the entire ML lifecycle.
  • Over 70 experiments tracked in the first month with full parameter and metric history.
  • Model Registry governance with safe canary deployments and rollback options.
  • Two drift events detected and resolved before they impacted customers.
  • Inference latency under 200ms at P95 even during peak load.

Operational impact:

  • Onboarding a new enterprise customer reduced from weeks to a few hours.
  • Multi-tenant isolation in GKE namespaces prevents cross-customer interference.
  • Helm-based deployment templates make new environment provisioning a configuration task rather than a manual build.

In practical terms, the client now has a platform that scales without scaling the operational burden. Developers focus on building features and improving models; DevOps maintains a healthy, secure, and cost-efficient infrastructure—a combination that enables fast onboarding, consistent performance, and continuous improvement.

Final thoughts: from local experiments to a scalable AI platform

By building a solid DevOps foundation on Google Cloud and a lean MLOps layer with Vertex AI, we transformed the client’s local prototypes into a secure, automated, and scalable production platform. Infrastructure, microservices, and ML models now share the same delivery pipelines, monitoring stack, and security controls.

The result: onboarding new enterprise customers in hours, deploying models with canary rollouts, maintaining sub-200ms inference latency, and catching drift events before they impact users. The platform is ready to scale to 20+ clients without adding operational complexity.

If you’re facing similar scaling challenges with your AI infrastructure, our team can help you design and deliver a platform that’s ready for growth from day one.

Maksym Bohdan
Writer at Dysnix
Author, Web3 enthusiast, and innovator in new technologies