Don’t forget to understand the at-inference usage profile

If you are deploying your algorithm as a microservice endpoint, it's worth thinking about how often and when it will be called. For typical software applications you may well expect a steady request rate. For many machine learning applications, however, the endpoint is called as part of a large batch process, leading to bursty volumes: there may be no requests for five days and then a need to handle 5 million inferences at once. A nice thing about using a walking skeleton (Create a Walking Skeleton / Steel Thread) is that you get an early understanding of the demand profile and can set up load balancing and provision capacity appropriately.
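As a rough illustration of why the demand profile matters for provisioning, here is a back-of-the-envelope sketch in plain Python. The per-replica throughput and the deadline for draining the batch are illustrative assumptions, not figures from any particular system; only the 5-million-inference burst and the five quiet days come from the scenario above.

```python
import math

# Back-of-the-envelope capacity comparison: the same 5 million inferences
# arriving as a steady stream vs. all at once. PER_REPLICA_RPS and
# BATCH_DEADLINE_HOURS are assumed values for illustration only.

TOTAL_INFERENCES = 5_000_000
QUIET_DAYS = 5               # no traffic at all between batch runs
PER_REPLICA_RPS = 50         # assumed throughput of a single replica
BATCH_DEADLINE_HOURS = 2     # assumed SLA for draining the burst

# If the volume arrived as a steady stream over the whole window:
steady_rps = TOTAL_INFERENCES / (QUIET_DAYS * 24 * 3600)
steady_replicas = max(1, math.ceil(steady_rps / PER_REPLICA_RPS))

# If it all lands at once and must clear within the deadline:
burst_rps = TOTAL_INFERENCES / (BATCH_DEADLINE_HOURS * 3600)
burst_replicas = max(1, math.ceil(burst_rps / PER_REPLICA_RPS))

print(f"steady: {steady_rps:6.1f} req/s -> {steady_replicas} replica(s)")
print(f"burst:  {burst_rps:6.1f} req/s -> {burst_replicas} replica(s)")
```

Under these assumptions the steady profile needs only a single replica (about 11.6 req/s), while the burst needs roughly 14, an order of magnitude more. Surfacing that gap early is exactly what exercising the walking skeleton under a realistic demand profile buys you.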
