Context
Given that Animl supports a growing number of ML models (both object detection and classification models), and especially now that we've integrated SpeciesNet, we need better, more efficient workflows to try to answer questions such as: What model(s) will work best on my data and for my particular use case? How well has the model I've chosen been working? Are my confidence thresholds optimally set?
These are legitimately hard questions to answer, but Animl should in theory be well suited to addressing them. Users can upload their data, get predictions, correct those predictions to create a ground-truthed labeled dataset, and from there we have everything we need to evaluate model performance (in theory). In practice, I've found this gets fairly complicated and thorny very fast.
Current solution
Currently, to understand how an ML model is performing in production (i.e. on new images, unseen by the model during training or testing), a user would have to:
- Get predictions: upload a bunch of their images (if they're not already in Animl), with Automation Rules configured to run whatever model(s) and confidence threshold settings they are hoping to evaluate
- Review and correct those predictions: we need human-verified, ground-truth labels to compare the predictions to, so users must manually review the data using the Animl UI (if they haven't already)
- Run analysis scripts: ask me to run the analyzeMLObjectLevel.js script (for Object-level performance metrics) or the analyzeMLSequenceLevel.js script (for sequence-level analysis).
- Interpret results: I would then send over a CSV version of the output to the user. An Excel example of what the output looks like (with some color and formatting) can be found here: sci_biosecurity_2023-4-28--2024-5-29_sequence-level_2024-06-10T1815Z.xlsx. Interpreting the evaluation metrics is also not terribly straightforward, and it's highly use-case dependent (for example, the trade-off between prioritizing recall vs. precision will depend on the use case and the user's tolerance for missed detections vs. dealing with false positives)
This is not a bad start, IMO. The output gives you the number of truePositives, falsePositives, and falseNegatives (the numbers from which all evaluation metrics are derived) as well as the precision, recall, and f1 score (the evaluation metrics) of each TARGET_CLASS (the classes that the ML model predicts). It further breaks out all of these metrics for all of the target classes by Deployment, because model performance can vary widely from deployment to deployment. And you can configure additional filters (project, start date, end date) in the analysis config file.
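For reference, here is how those evaluation metrics are derived from the raw counts. The snippet below is a minimal sketch for illustration only; the helper name is hypothetical and not taken from the actual scripts:
// Derive evaluation metrics from raw counts for a single target class.
// Hypothetical helper for illustration -- not the actual function in analyzeMLObjectLevel.js.
function deriveMetrics({ truePositives: tp, falsePositives: fp, falseNegatives: fn }) {
  // precision: of everything predicted as this class, what fraction was correct
  const precision = tp + fp > 0 ? tp / (tp + fp) : 0;
  // recall: of everything that actually was this class, what fraction did we catch
  const recall = tp + fn > 0 ? tp / (tp + fn) : 0;
  // f1: harmonic mean of precision and recall
  const f1 = precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}
// e.g. 90 true positives, 10 false positives, 30 false negatives
// => { precision: 0.9, recall: 0.75, f1: ~0.818 }
console.log(deriveMetrics({ truePositives: 90, falsePositives: 10, falseNegatives: 30 }));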
If your question is, "if a [some species] is in an image, how likely is it that it will be detected and classified correctly?", the scripts also allow you to answer that by performing the analysis at the object level. If your question is "if a [some species] is in a sequence of images, how likely is it that it will be detected and classified correctly in at least one of the images in the sequence?", you can perform the analysis at the sequence/burst level.
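The sequence-level rollup is conceptually just "did we get it right in at least one image of the burst". A minimal sketch, assuming the Image shape shown in the example record at the bottom of this doc (the burst grouping itself is assumed, not shown):
// Hypothetical sequence-level rollup -- not the actual analyzeMLSequenceLevel.js logic.
// A target class counts as detected for a burst if at least one image in the
// burst has an ML label for that class.
const detectedInSequence = (burstImages, targetClass) =>
  burstImages.some((img) =>
    img.objects.some((obj) =>
      obj.labels.some((l) => l.type === 'ml' && l.labelId === targetClass)
    )
  );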
Lastly, you can use the scripts to assess JUST the object detection (MegaDetector) step, or an entire ML pipeline (the performance of the object detection and classification steps together). So if you're interested in assessing how a pipeline is performing end-to-end, i.e. to answer questions like, "If a rodent trips a camera, how likely is it that both my object detector and classifier model will work as expected and I will receive an alert?", the current scripts can answer that.
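To make those knobs concrete, here is a rough sketch of what the analysis configuration could look like. The field names and shape below are illustrative assumptions, not necessarily how the real analysisConfig.js is structured:
// analysisConfig.js -- illustrative sketch only; field names are assumptions,
// not necessarily the real config shape.
module.exports = {
  PROJECT_ID: 'sci_biosecurity',
  START_DATE: '2023-04-28',      // optional date-range filters
  END_DATE: '2024-05-29',
  ANALYSIS_LEVEL: 'sequence',    // 'object' or 'sequence' (burst-level rollup)
  DETECTOR_ONLY: false,          // true = score MegaDetector alone; false = detector + classifier together
  // For each predicted class being evaluated, the human-applied labels that count as validating it:
  TARGET_CLASSES: [
    { predicted_id: 'rodent', validation_ids: ['rodent', 'mouse', 'rat'] },
    { predicted_id: 'fox', validation_ids: ['fox'] },
  ],
};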
Limitations and open questions
However, there are also serious limitations and room for improvement in the current workflow. For example:
- Only works for one model at a time: The workflow described above only produces evaluation metrics for one model pipeline with one particular set of confidence thresholds at a time. If you wanted to compare the performance of two different model pipelines, or if you wanted to compare the performance of the same model but with different confidence threshold settings, you would have to repeat all of the steps above (including the manual prediction review). In an ideal world, given some set of human-labeled images, we could test out a bunch of models at once on that same benchmark dataset without having to re-upload and re-label all of the data each time. The re-labeling part is the most time-consuming step and feels pretty redundant if you've already done it once.
- The scripts were not designed for large classifiers (SpeciesNet) with 1000s of classes at every taxonomic level: when I wrote these scripts our classifiers had at most a couple dozen classes, all at more or less the same taxonomic level, and both how we configure the scripts and how we perform the analysis may have to change to accommodate SpeciesNet. For example, how do we configure the TARGET_CLASSES array in the analysisConfig.js file to scale to 1000s of classes? How do we treat predictions that are close, but not exactly right (e.g. when a "mule deer" is predicted to be a "white-tailed deer"), or correct but not specific enough / not at the taxonomic rank the user cares about (e.g. the model predicts "rodent" but what the user really wants is "brown rat")?
Key to how we currently deal with some of this is the concept of the predicted class (the class the ML model produces and we're evaluating) and the validation classes (the classes/labels a person may have applied that count as "validating" the predicted class). So if a model's predicted class is "rodent", validation classes might be "mouse", "rat", "rodent", etc. That approach assumed that users would be adding more granular, specific labels that could be considered a validation of higher-level predicted classes, but what happens when the reverse is also possible (SpeciesNet provides species-level predicted classes like "white-tailed deer" and "mule deer" but users correct them to "deer")? This may require rethinking and restructuring of our config and analysis scripts (a rough sketch of one possible taxonomy-aware approach is included at the end of this section).
- The scripts don't evaluate classifier performance independently: The current scripts take some shortcuts and simplify things a bit. For example, they don't exactly follow the approach you'd use to evaluate an object detector in a model training context; specifically, we aren't evaluating how accurate the locations of the bounding boxes are using intersection-over-union or anything like that. Instead, we're ignoring bounding box accuracy and evaluating it more like a classifier (i.e., if MegaDetector predicted an "animal" and a user either validated it as an "animal" or added some more specific species label on top of it, the scripts would count that as a true positive). Additionally, while the scripts can currently analyze either (a) an object detector independently or (b) the results of using an object detector and a classifier together, they cannot analyze the performance of a classifier independently. For now I don't think this is a high-priority limitation to address, but it's worth being aware of.
- Is Animl even the right environment to be doing this in? Perhaps a set of offline, local inference scripts would be more appropriate?
- Are these even the right questions to be asking? Should we take more of a UX-efficiency/time-saving-based approach? Different mistakes by models can slow down manual image review in different ways. For example, a false negative by an object detector means that the user has to stop and draw their own bounding box around a missed object, then add a label to it, which is incredibly slow and much, much slower than invalidating a false positive. Similarly, if users ultimately have to label all deer as "deer", from a time perspective it doesn't matter whether the model predicted "giraffe" or "white-tailed deer" or "ungulate" - all would need to be corrected and all would take the same amount of time. That said, there's an argument to be made that close predictions are helpful, even if wrong, because then users can filter on that incorrect class (e.g. "white-tailed deer") and bulk-change all of them to "deer" with the bulk-selection tool.
All of that is to say that if the question is "how much time will this model save me", that's different from asking "how accurate is this model", and maybe the time-saving question is the more important one.
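As referenced above, here is one rough sketch of what taxonomy-aware matching of predicted and validation classes could look like. This is a thought experiment rather than existing Animl code; the taxonomy structure and function names are assumptions made for illustration:
// Hypothetical taxonomy-aware matching -- not existing Animl code.
// Map each class to its parent so we can walk up the tree in either direction.
const PARENT = {
  'white-tailed deer': 'deer',
  'mule deer': 'deer',
  'deer': 'ungulate',
  'ungulate': 'animal',
  'brown rat': 'rodent',
  'mouse': 'rodent',
  'rodent': 'animal',
};

// Walk from a class up to the root, collecting its ancestors.
function ancestors(cls) {
  const out = [];
  for (let cur = PARENT[cls]; cur; cur = PARENT[cur]) out.push(cur);
  return out;
}

// A prediction "matches" a human label if they are equal, or if one is an ancestor
// of the other (e.g. predicted "white-tailed deer" vs. human "deer", or predicted
// "rodent" vs. human "brown rat"). Siblings like "mule deer" vs. "white-tailed deer"
// do not match.
function isValidated(predicted, humanLabel) {
  return (
    predicted === humanLabel ||
    ancestors(predicted).includes(humanLabel) ||
    ancestors(humanLabel).includes(predicted)
  );
}

console.log(isValidated('white-tailed deer', 'deer'));      // true (prediction more specific than the human label)
console.log(isValidated('rodent', 'brown rat'));            // true (prediction less specific than the human label)
console.log(isValidated('mule deer', 'white-tailed deer')); // false (sibling species)
Whether an up-the-tree match (a prediction less specific than the human label) should really count as a true positive, and at what taxonomic rank it stops counting, is exactly the kind of policy decision that would need to live in the config.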
Resources and prerequisite concepts
If anyone is interested in trying to address these questions and limitations, I'd recommend familiarizing yourself with the following:
- The difference and tradeoffs between precision and recall
- Computer Vision for Ecology lecture on evaluating ML Models
- The types and sequencing of ML models used in a typical camera trap processing pipeline (object detection first, classification second)
- Familiarity with MegaDetector, the most commonly used object detection model for camera trap data, and SpeciesNet, the most common classifier (though Animl supports more than just these two models).
- Animl's data model, specifically the concept of Objects and Labels, Labels' not-validated/validated/invalidated states, Objects' locked/unlocked states, and Images' isReviewed state, and the concept of the "most representative label". I will try to get some better documentation on this together and link it here, but for now there's an example of an Animl image record you can reference below.
- Understand how the current analysis scripts work, specifically the predicted vs validation class concept, and how actuals, true positives, false positives, and false negatives are calculated from Animl data (a rough sketch of that calculation follows the example record below).
// example Animl Image record
// This Image has one Object, which MegaDetector correctly detected and labeled as an "animal"
// and an additional "rodent" Label predicted by the MIRAv2 classifier, which was validated as being correct by a user.
// In doing so the Object became "locked", the Image became "reviewed"
// and the "rodent" became the Object's "most representative label"
{
"_id": "sci_biosecurity:41240fc6d05477747a49c5f1b9bb83c5",
"bucket": "animl-images-serving-prod",
"batchId": null,
"fileTypeExtension": "jpg",
"dateAdded": {
"$date": "2025-08-07T12:37:16.565Z"
},
"dateTimeOriginal": {
"$date": "2025-08-07T12:36:07.000Z"
},
"timezone": "America/Los_Angeles",
"make": "BuckEyeCam",
"cameraId": "X811492D",
"deploymentId": {
"$oid": "6164c9f7599481769b50cc2a"
},
"projectId": "sci_biosecurity",
"originalFileName": "p_130372.jpg",
"imageWidth": 1280,
"imageHeight": 960,
"imageBytes": 160599,
"mimeType": "image/jpeg",
"userSetData": {
"TEXT1": "#E5 #G20",
"TEXT2": "Centinela"
},
"model": "X80",
"location": {
"_id": {
"$oid": "68949dfc7ff9639c6068c137"
},
"geometry": {
"type": "Point",
"coordinates": [
119.799469444444,
34.0176694444444
],
"_id": {
"$oid": "68949dfc7ff9639c6068c138"
}
}
},
"triggerSource": "Burst #2",
"tags": [],
"objects": [
{
"_id": {
"$oid": "68949e007ff9639c6068c15a"
},
"bbox": [
0.4603290557861328,
0.4360730051994324,
0.5325396060943604,
0.5756036639213562
],
"locked": true,
"labels": [
{
"_id": {
"$oid": "68949e017ff9639c6068c17a"
},
"type": "ml",
"labelId": "rodent",
"conf": 1,
"bbox": [
0.4603290557861328,
0.4360730051994324,
0.5325396060943604,
0.5756036639213562
],
"labeledDate": {
"$date": "2025-08-07T12:37:21.592Z"
},
"mlModel": "mirav2",
"mlModelVersion": "v2.0",
"validation": {
"validated": true,
"userId": "<email_address>",
"_id": {
"$oid": "68949e3620445ceeb29dbc34"
},
"validationDate": {
"$date": "2025-08-07T12:38:14.467Z"
}
}
},
{
"_id": {
"$oid": "68949e007ff9639c6068c158"
},
"type": "ml",
"labelId": "1",
"conf": 0.9110135436058044,
"bbox": [
0.4603290557861328,
0.4360730051994324,
0.5325396060943604,
0.5756036639213562
],
"labeledDate": {
"$date": "2025-08-07T12:37:20.587Z"
},
"mlModel": "megadetector_v5a",
"mlModelVersion": "v5.0a"
}
]
}
],
"comments": [],
"__v": 2,
"reviewed": true
}
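Building on the last bullet in the resources list, here is a simplified sketch of how predictions on an Object like the one above could be bucketed into true positives, false positives, and false negatives. This is not the actual logic in analyzeMLObjectLevel.js (it ignores confidence thresholds, the "most representative label", and sequence-level rollups, among other things), and the helper names and the 'manual' label type are assumptions:
// Simplified sketch -- not the actual analyzeMLObjectLevel.js logic.
// Assumes an isValidated(predictedClass, humanLabel) helper (or the
// validation-classes config) decides whether a human-applied label counts
// as confirming a prediction.
function scoreObject(object, targetClass, isValidated) {
  // Did the ML pipeline predict the target class on this object?
  const prediction = object.labels.find(
    (l) => l.type === 'ml' && l.labelId === targetClass
  );

  // Ground truth: labels a user validated (or, assumed here, added manually)
  const groundTruth = object.labels.filter(
    (l) => l.validation?.validated === true || l.type === 'manual'
  );
  const actuallyPresent = groundTruth.some((l) => isValidated(targetClass, l.labelId));

  if (prediction && actuallyPresent) return 'truePositive';   // predicted it, and it was there
  if (prediction && !actuallyPresent) return 'falsePositive'; // predicted it, but it wasn't there
  if (!prediction && actuallyPresent) return 'falseNegative'; // it was there, but the model missed it
  return null; // class neither predicted nor present for this object
}
Run against the example Object above with targetClass set to 'rodent', this would return 'truePositive': the mirav2 "rodent" prediction is present, and the user's validation of that label provides the ground truth that a rodent really was there.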