MultiHub Forum

How to balance governance and flexibility in ML model-training orchestration?
I'm the technical lead for a team building a new machine learning platform intended to serve hundreds of data scientists across our organization. We're in the final design phase for the model training orchestration layer, and we're stuck on a fundamental architectural choice.

The debate centers on whether to build a highly opinionated, domain-specific framework that enforces best practices and standardized workflows, or to create a more flexible, low-level toolkit that provides primitives and lets individual teams assemble their own pipelines. The opinionated approach promises better governance, reproducibility, and easier onboarding for new hires, but we fear it could stifle innovation and be too rigid for advanced research projects. The flexible approach empowers our experts but risks creating a "wild west" of incompatible scripts and technical debt. We've seen other companies struggle with both extremes.

For engineers who have built or adopted similar internal platforms, what guided your decision on this spectrum between flexibility and control? How did you validate that your chosen level of abstraction was correct before committing significant engineering resources, and what mechanisms did you put in place to allow the system to evolve if your initial assumptions were wrong?
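
To make the two ends of the spectrum concrete, here is a minimal Python sketch. Every name in it (`run_steps`, `StandardTrainingPipeline`, `required_keys`, the metadata keys) is hypothetical and made up for illustration; it is not taken from any real library or platform. It assumes a pipeline can be modeled as functions that take and return a context dictionary. It only shows one way the two options might differ in shape, not a recommended design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A step takes a context dict and returns an updated one (an assumption of this sketch).
Step = Callable[[Dict], Dict]


def run_steps(steps: List[Step], context: Dict) -> Dict:
    """Flexible end: a bare primitive that runs whatever steps a team assembles."""
    for step in steps:
        context = step(context)
    return context


@dataclass
class StandardTrainingPipeline:
    """Opinionated end: fixed stage order plus required metadata for reproducibility."""
    load_data: Step
    train: Step
    evaluate: Step
    # Hypothetical governance rule: these keys must be present before anything runs.
    required_keys: Tuple[str, ...] = ("dataset_version", "seed")

    def run(self, context: Dict) -> Dict:
        missing = [key for key in self.required_keys if key not in context]
        if missing:
            raise ValueError(f"missing required metadata: {missing}")
        # The stage order is fixed here; teams only supply the stage bodies.
        return run_steps([self.load_data, self.train, self.evaluate], context)


if __name__ == "__main__":
    pipeline = StandardTrainingPipeline(
        load_data=lambda ctx: {**ctx, "rows_loaded": True},
        train=lambda ctx: {**ctx, "model": "toy-model"},
        evaluate=lambda ctx: {**ctx, "evaluated": True},
    )
    print(pipeline.run({"dataset_version": "v1", "seed": 0}))
```

In this sketch the opinionated class is built on top of the same primitive the flexible option exposes, so a research team could call `run_steps` directly when the fixed stage order does not fit, at the cost of skipping the metadata check.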