Accelerating science with 'off-the-shelf' models

analyzing small datasets with models trained on big datasets

13 min read
March 10, 2025

Melanie Segado

Machine learning/Movement/Healthcare

Not long ago, extracting insights from video data required building custom computer vision models. This was a time-consuming task that demanded large datasets, hours of researcher time spent hand-annotating data, and lots of computing power. For the small datasets that are prevalent in many areas of science and medicine, training custom models was simply not an option. This mismatch between the needs of machine learning and the size of available datasets left many interesting questions unanswered. The goal of my research is to bridge this gap, finding strategies to make the most of small, specialized datasets by analyzing them with foundation models.

Why do small datasets matter?

Let’s consider a concrete example of where this can be useful. Subtle patterns in infant movement during the first months of life can help predict neurodevelopmental disorders such as cerebral palsy – a lifelong condition affecting balance and motor control. While clinicians excel at picking up these patterns, they don’t always get to see infants until they’re much older, and so infants miss out on a critical window of neuroplasticity where rehabilitation would be most helpful. However, videos of infants can be easily recorded using everyday tools like cell-phone cameras and their movements can be analyzed using computer vision, paving the way for the development of low-cost clinical tools that can be implemented at scale.

A big barrier to developing these tools has been the fact that infants are really difficult to analyze from a computer’s perspective. They tend to bunch up into complicated shapes (occluding their own limbs in the process), and they are often wearing clothes that look a lot like the blankets they’re lying on. To overcome these difficulties, researchers either had to train custom models for infant pose tracking or fine-tune existing algorithms, both of which required lots of time spent annotating videos. But look at how much things have improved over the past 7 years!

Take a look at the video below and move the slider back and forth to compare the performance of an algorithm from the not-so-distant past (2017) with one from 2023, on a fully AI-generated video of an infant.

[Interactive slider: 2017 model (left) vs. 2023 model (right) on the same AI-generated infant video]

On the bottom layer (when the slider is all the way to the left) is a model called OpenPose, which was a game-changer in the field when it was released. The specific model shown here was pre-trained on 64K images and fine-tuned on 47K annotated frames of infant video. While it performs very well when the infant’s limbs are clearly visible, it fails in spots where the relevant parts of the image are covered by objects like the crib. This is because it relies on finding parts of the image that look like specific joints (e.g., knees); when those are occluded, the algorithm fails.

The model overlaid on top (when the slider is all the way to the right) is ViTPose-H, a model pre-trained on vast amounts of image data (300M labelled images), fine-tuned on a much smaller dataset of human poses (250K), and not fine-tuned at all on infant data. As you can see, it does much better at capturing the overall shape of the infant’s pose, even when parts of it are occluded, despite never having been trained on infants. Not only has it been trained on more data, it also uses a modern architecture called a Vision Transformer that enables it to learn not just what specific joints look like, but also their spatial relationship to other joints (e.g., hips and ankles) and to other parts of the image.
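To make the “spatial relationship” point concrete, here is a minimal conceptual sketch (in PyTorch, with made-up sizes; this is not ViTPose’s actual code) of what a Vision Transformer does differently: the image is cut into patches, and self-attention lets every patch draw on context from every other patch, which is what allows an occluded knee to be inferred from visible hips and ankles.

```python
# Conceptual sketch only (not ViTPose's implementation): a Vision Transformer
# splits an image into patches and lets every patch attend to every other patch,
# so joints can be placed using surrounding context even when occluded.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 256, 192)                          # one RGB frame (H=256, W=192)
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)   # 16x16 patches -> 768-d tokens

tokens = patch_embed(image)                # (1, 768, 16, 12)
tokens = tokens.flatten(2).transpose(1, 2) # (1, 192, 768): 192 patch tokens

# Self-attention: each patch token aggregates information from all other patches,
# which is what lets hip/shoulder context inform the position of a hidden knee.
attention = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
contextual_tokens, _ = attention(tokens, tokens, tokens)
print(contextual_tokens.shape)             # torch.Size([1, 192, 768])
```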

Why would these models help “accelerate science”?

Foundation models, pre-trained on massive datasets, have transformed AI applications, from large language models to computer vision. Platforms like HuggingFace that host pre-trained models, and user-friendly toolkits like OpenMMLab’s MMPose, make these capabilities easily accessible. By fine-tuning pre-trained models with domain-specific data, or even using them straight off the shelf, researchers can achieve meaningful insights with far less effort and fewer resources. It’s hard to overstate just how rapidly this landscape has evolved. To give an example, the first year of my postdoc was spent optimizing algorithms for infant pose estimation by carefully curating a database of difficult-to-detect poses, annotating them by hand, and training an algorithm to improve its performance. While I made progress, ViTPose performed better off the shelf than any of the custom models I had been working on. The ability to get precise pose tracking without the need to train models significantly lowers the barrier to entry for research groups with interesting questions and small, specialized datasets.
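To illustrate what “straight off the shelf” looks like in practice, here is a minimal sketch using MMPose’s high-level inference API (MMPose 1.x). The video path, output directory, and model alias are illustrative placeholders, not details from our actual pipeline.

```python
# Minimal sketch: running an off-the-shelf 2D pose model with MMPose's
# high-level inference API (MMPose 1.x). File paths and the model alias
# below are illustrative placeholders.
from mmpose.apis import MMPoseInferencer

# 'human' selects a default pre-trained body-pose model; other aliases or an
# explicit config/checkpoint (e.g., a ViTPose checkpoint) can be passed instead.
inferencer = MMPoseInferencer(pose2d='human')

# Iterate over the frames of a video; no training or fine-tuning required.
for frame_result in inferencer('infant_video.mp4', vis_out_dir='vis_results'):
    # Each frame yields a list of detected people, each with per-joint
    # (x, y) keypoints and confidence scores.
    for person in frame_result['predictions'][0]:
        print(person['keypoints'], person['keypoint_scores'])
```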

How much data is in a “massive dataset”?

To get a sense of just how much data goes into pre-training a foundation model that can then be quickly fine-tuned for specialized applications, let’s take a look at the relative scale of a few of the datasets that went into training the models used for pose tracking in the video above.

[Interactive chart of relative dataset sizes]
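As a rough back-of-the-envelope comparison using the counts mentioned above (approximate figures, just to convey relative scale):

```python
# Approximate dataset sizes mentioned in this post, compared against the
# hand-annotated infant dataset.
dataset_sizes = {
    "ViTPose pre-training images": 300_000_000,
    "Human pose fine-tuning annotations (ViTPose)": 250_000,
    "OpenPose pre-training images": 64_000,
    "Annotated infant frames (OpenPose fine-tuning)": 47_000,
}

smallest = min(dataset_sizes.values())
for name, count in dataset_sizes.items():
    print(f"{name}: {count:>12,}  (~{count / smallest:,.0f}x the infant dataset)")
```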

Impact and Future Directions

The rise of off-the-shelf AI models marks a significant shift in the ease with which researchers can integrate state-of-the-art tools into their research. Vision transformers for movement analysis are just one example of how accessible AI tools can help push the boundaries of disease detection and treatment. As these resources become even more widely available, they open the door to earlier detection and intervention for patients worldwide.

If you want to see how we’ve been able to put this approach to use, check out our pre-prints:

Data-Driven Early Prediction of Cerebral Palsy Using AutoML and interpretable kinematic features (medRxiv)

Assessing infant risk of cerebral palsy with video-based motion tracking (medRxiv)

About this post

Check our source code here
