“Getting time estimates right is one of the most difficult things in the world, in my experience,” says Alexander Samuelsson, CTO and co-founder at Imagimob and author of this blog.
The most important factor in determining the budget for your ML project is the amount of data needed to create a model fulfilling your performance requirements.
If the required data doesn’t already exist, and believe me, it almost never does, getting hold of and refining this data will account for most of your work. Hopefully you can automate a lot of the collection process, but you still need calendar time to collect the data, and you always need to clean, analyze and annotate it.
Figuring out exactly what data you need, and how much of it, before a project starts is almost impossible if you haven’t built a similar model in a similar domain before.
However, there are some simple rules and frameworks we can use to figure out the ballpark, or order of magnitude, of the data we will be dealing with.
You can think of a good ML model as a model that classifies real-world data well enough to fulfill your accuracy requirements. The issue the model faces is that it has only ever seen its training set.
This training set is a subset of whatever data the model will face outside the lab, in the real world. If this subset accurately captures the properties of most of the real-world data, the model will perform well.
Here lies the key to estimating and budgeting for your data collection!
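One rough way to make this idea concrete is to list the conditions you expect the model to meet in the field and check how many of them show up in the training set at all. The sketch below is only an illustration: the `coverage` helper, the way a condition is described as an (environment, background noise) pair, and the example values are all assumptions, not part of any particular tool or method.

```python
def coverage(training_conditions, deployment_conditions):
    """Fraction of expected deployment conditions seen at least once in training.

    Hypothetical helper: a "condition" is any hashable description,
    here an (environment, background_noise) tuple.
    """
    deployment = set(deployment_conditions)
    if not deployment:
        raise ValueError("deployment_conditions must not be empty")
    return len(deployment & set(training_conditions)) / len(deployment)


# Made-up example conditions, purely for illustration.
training = [("office", "quiet"), ("office", "music"), ("kitchen", "quiet")]
expected = [
    ("office", "quiet"),
    ("office", "music"),
    ("kitchen", "quiet"),
    ("kitchen", "music"),
    ("car", "traffic"),
]

print(coverage(training, expected))  # 0.6: two expected conditions are missing
```

A low number here does not tell you how much data to collect, but it does point at the conditions the training set has never seen.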
Model A, a wakeword detector. Photo by Lazar Gugleta on Unsplash
Let’s consider two example ML models.
Model A is a wakeword detector. It is constantly listening for the phrase “Hey Alexa!” and wakes up to receive further commands if it captures this phrase.
Such a model is always on and targets the general consumer. This means that it will be subject to a huge variance in real-world data.
It will pick up audio from different acoustic environments, with different background noises, and it has to understand many different kinds of voices and accents.
Model A will be very expensive to build if you have to collect the data yourself. You will need to collect data from thousands of people in many different environments, and you also need a huge dataset of background noise, other utterances and normal conversation to separate from the “Hey Alexa!” phrase we are looking for.
Consider instead, model B.
Model B listens to determine whether the assembly of two parts in a factory is correct. Model B lives in a very different reality. It is deployed in one or a few factories. It is placed in a known location in that factory. Let’s say that it can even be protected from a lot of background noise through some clever placement and shielding.
The variety of sounds and soundscapes that Model B will experience is minuscule in comparison to Model A.
This snowflake is minuscule, like the data needed by Model B. Photo by Aaron Burden on Unsplash
Collecting sufficient data to build Model B will be orders of magnitude faster and cheaper.
When budgeting for an ML project, reasoning about the “life” of the model like this will seriously help to place you in the right ballpark for your budget estimate…
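To show what this kind of reasoning can look like as back-of-the-envelope arithmetic, here is a minimal sketch. It assumes, very crudely, that you want a fixed number of samples for every combination of the conditions the model will meet. The `ballpark_samples` function, the axes and every number in it are made up for illustration; a real project would not need the full cross product, so treat the result as an order of magnitude at best.

```python
from math import prod


def ballpark_samples(variation_axes, samples_per_condition):
    """Crude order-of-magnitude estimate of how many samples to collect.

    Assumption: one batch of samples for every combination of conditions.
    variation_axes maps an axis name to the number of distinct values
    you expect the model to encounter along that axis.
    """
    combinations = prod(variation_axes.values())
    return combinations * samples_per_condition


# Hypothetical numbers, not measurements from any real project.
model_a = {"speakers": 1000, "acoustic_environments": 10, "background_noise_types": 5}
model_b = {"factories": 2, "sensor_positions": 1, "background_noise_types": 3}

a = ballpark_samples(model_a, samples_per_condition=20)
b = ballpark_samples(model_b, samples_per_condition=20)

print(f"Model A: ~{a:,} samples")        # Model A: ~1,000,000 samples
print(f"Model B: ~{b:,} samples")        # Model B: ~120 samples
print(f"Ratio A/B: ~{a // b:,}x")        # Ratio A/B: ~8,333x
```

The exact figures are not the point. What matters is that listing the axes of variation forces you to write down the “life” of the model, and the gap between the two estimates is what drives the difference in budget.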
Happy machine learning!
This blog was originally posted on https://alexsamuelsson.com/2022/09/15/ml-project-time-estimates/
Please contact me at alex@imagimob.com if you have any thoughts or experience on this subject!