I'm a platform engineer migrating our legacy monolithic application to a microservices architecture, and we've standardized on Kubernetes for orchestration. Our current hurdle is designing a persistent storage strategy for stateful services like our user session cache and document processing queue; the default dynamic provisioning on our cloud provider is leading to unpredictable performance and cost spikes. I'm evaluating whether to implement a dedicated storage operator or redesign these services to be stateless, but both paths seem fraught with complexity.
Sounds tricky. Start by pinning down RPO/RTO and latency targets before exploring solutions.
On a previous project, we split the data by access pattern: hot data (session caches) went to fast in-cluster storage with replication, and queues stayed on a durable, operator-managed volume. Concretely, we ran a Redis cluster with AOF persistence for the session cache and a Kafka cluster backed by Ceph (via Rook) for reliability. It added operational overhead, but latency stabilized and costs became predictable once we had tuned it.
Consider a two-track decision framework: 1) a dedicated storage operator (e.g., OpenEBS, Portworx, or Ceph via Rook) that provides a CSI driver, data replication, and backups; 2) a stateless redesign in which session data and queues live in external services (managed Redis, managed Kafka), with idempotent processing and eventual consistency. Build a table comparing latency, failure domains, migration cost, and ops burden for each track, then run a 4-week pilot under representative load.
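A minimal sketch of that comparison table in Python, in case it helps to make the trade-off explicit. The weights and the 1-5 scores are placeholder assumptions, not measurements or a recommendation; the option names and function names are hypothetical too. Replace all of them with numbers from your own pilot.

```python
# Weighted decision matrix for the two tracks.
# ASSUMPTION: every weight and score below is a placeholder, not a measurement.

CRITERIA_WEIGHTS = {
    "latency": 0.35,
    "failure_domains": 0.25,
    "migration_cost": 0.20,
    "ops_burden": 0.20,
}

# Scores run from 1 (worst) to 5 (best) for each criterion.
OPTIONS = {
    "storage_operator": {
        "latency": 4, "failure_domains": 3, "migration_cost": 3, "ops_burden": 2,
    },
    "external_managed": {
        "latency": 3, "failure_domains": 4, "migration_cost": 2, "ops_burden": 4,
    },
}


def weighted_score(scores, weights):
    """Return the weighted sum of the criterion scores for one option."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[c] * scores[c] for c in weights)


def rank_options(options, weights):
    """Return (name, score) pairs sorted from best to worst."""
    ranked = [
        (name, round(weighted_score(scores, weights), 2))
        for name, scores in options.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def format_table(options, weights):
    """Render the matrix as a plain-text table, best option first."""
    criteria = list(weights)
    rows = [["option", *criteria, "weighted"]]
    for name, score in rank_options(options, weights):
        cells = [str(options[name][c]) for c in criteria]
        rows.append([name, *cells, f"{score:.2f}"])
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths))
        for row in rows
    )


if __name__ == "__main__":
    print(format_table(OPTIONS, CRITERIA_WEIGHTS))
```

The ranking is only as good as the weights, so it is worth agreeing on them with the team before the pilot produces any scores.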
What are your data gravity and access patterns? Are session tokens ephemeral or long-lived? Do queues require at-least-once delivery? If you answer these, you can decide if storage should be centralized or distributed.
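If the answer on the queue side is "at-least-once", consumers will see duplicates and have to tolerate them. Here is a rough Python sketch of an idempotent consumer. The class name and the in-memory set are assumptions for illustration only; a real service would keep the processed IDs in a durable store and ideally record the ID atomically with the side effect.

```python
class IdempotentProcessor:
    """Skip messages that were already handled, so redelivery is harmless."""

    def __init__(self, handler):
        self._handler = handler
        # ASSUMPTION: an in-memory set stands in for a durable dedup store.
        self._seen = set()

    def process(self, message_id, payload):
        """Run the handler unless this message ID was already handled.

        Returns True if the handler ran, False for a duplicate delivery.
        """
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a failed
        # attempt is retried when the queue redelivers the message.
        self._seen.add(message_id)
        return True


if __name__ == "__main__":
    processed = []
    processor = IdempotentProcessor(processed.append)
    processor.process("msg-1", "document-a")
    processor.process("msg-1", "document-a")  # duplicate delivery, ignored
    print(processed)
```

Note the gap between running the handler and recording the ID: a crash in between means the handler runs again on redelivery, so the handler itself should also be safe to repeat.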
I’d lean toward a hybrid rather than going fully stateless. Some components have strong locality needs or require deterministic persistence, and forcing statelessness on them tends to add complexity rather than remove it. Externalize state where possible, but keep fast-path data in-cluster with robust replication.
Propose a phased plan. Phase 0: a spike to compare the two most viable options. Phase 1: implement CSI-based storage for one service. Phase 2: roll out auto-migration tooling and run DR tests. Keep monitoring throughout, and include a cost model and a rollback plan.
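For the cost model, even a very small one makes the comparison concrete. The sketch below is one possible shape in Python, not a definitive model: the class names, the fields, and the cost terms (replicated storage plus ops labor versus a base fee plus per-GiB pricing) are all assumptions, and it deliberately contains no prices. Plug in figures from your provider's price list and your own time tracking.

```python
from dataclasses import dataclass


@dataclass
class SelfManagedStorage:
    """In-cluster, operator-managed storage (hypothetical cost terms)."""
    size_gib: float
    replicas: int                 # every replica stores a full copy
    price_per_gib_month: float
    ops_hours_per_month: float    # engineer time spent running the operator
    ops_hourly_rate: float

    def monthly_cost(self):
        storage = self.size_gib * self.replicas * self.price_per_gib_month
        labor = self.ops_hours_per_month * self.ops_hourly_rate
        return storage + labor


@dataclass
class ManagedService:
    """Externally managed service (hypothetical cost terms)."""
    size_gib: float
    base_fee_month: float
    price_per_gib_month: float

    def monthly_cost(self):
        return self.base_fee_month + self.size_gib * self.price_per_gib_month


def cheaper_option(self_managed, managed):
    """Return the name of the cheaper option and the monthly difference."""
    a = self_managed.monthly_cost()
    b = managed.monthly_cost()
    name = "self_managed" if a < b else "managed"
    return name, abs(a - b)
```

Running it once per phase with updated inputs gives a simple trend line, and a sudden swing in the result is a reasonable trigger for the rollback discussion.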