How much time should a data scientist spend on MLOps?
Someone recently asked me this:
How much time should a data scientist spend on MLOps and Data Engineering? Everything feels important and relevant.
I'm sure this isn't a universal answer, but here is how I think about it:
MLOps is a means to an end. That end varies: reproducibility, low latency, high throughput, or simply faster experimentation and release cycles.
Pick your end goal(s), and build the tooling around that.
It's okay for your first $$-generating pipeline to be simply a Jupyter notebook that writes its outputs to a CSV, which you then upload to a dashboard manually.
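For concreteness, that can be as small as the sketch below: fit a model, score, dump a CSV. The dataset, model, and file name are placeholders for illustration, not anything from a real pipeline.

```python
# A minimal "the notebook is the pipeline" sketch: train, score, write a CSV,
# then upload that file to the dashboard by hand.
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

# Placeholder data and a "good enough" baseline model
X, y = load_diabetes(return_X_y=True, as_frame=True)
model = Ridge().fit(X, y)

# Score and dump the outputs; this file is the entire handoff
scores = pd.DataFrame({"prediction": model.predict(X)})
scores.to_csv("predictions.csv", index=False)
```

Crude, yes, but it delivers value from day one; the tooling can catch up once you know which end goal matters.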
To give another example: in my most recent role, reproducing our experiments took 2-3 hours at most, so we didn't bother saving model artifacts or building a feature store.
But we spent a lot more energy on reducing the release time for new models to ~2 hours, i.e. we could take new data processing code, model weights, and model serving code from code merge to staging in about 2 hours.
This had some amazing second-order effects: customers were happy with how agile we were, the team was motivated since they could see their impact almost every week, and internal stakeholders like the PM were happy since we could give them a more reliable timeline for when we'd go to prod.