The challenges and opportunities of gesture recognition

Wojciech Marusarz
November 10, 2020

If you’ve ever watched Minority Report, you probably remember the scene where agent John Anderton (Tom Cruise) interacts with a massive transparent computer screen using a pair of futuristic three-finger gloves. He swiftly navigates the interface – controls video playback, zooms in and out of images, selects individual elements appearing on the screen and moves them around the interface. This fluid and seamless man-computer interaction is only possible due to the fact that his hand gestures are perfectly recognized by the machine.

When Minority Report launched in 2002, the scene was science fiction at its purest. But what then seemed like pie in the sky is entirely achievable with today’s gesture recognition technology. Augmented reality devices allow you to put on special gloves and use your hands as input devices. In fact, we’re already beyond that. The new challenge in gesture recognition is how to get rid of the gloves altogether. Technological developments in machine learning are opening new opportunities to finally make it happen.

The evolution of man-computer interaction #

Communication of humans with computers has always been a challenge. From the early mediums such as perforated cards, people have spent more than half of the past century experimenting with various ways to interact with computers – in pursuit of more efficient and intuitive interfaces. Keyboard and mouse slowly became the de-facto industry standard for user input.

Many of the novel input devices never really caught on, making the interaction with personal computers somewhat limited. Dan O’Sullivan and Tom Igoe’s model aptly illustrates this, showing how badly imbalanced the use of individual body parts in computer interaction is (see below). The majority of our interaction with computers involves fingers and eyes, while other parts of our body, such as legs, arms and mouth, are criminally underused or not used for interaction at all. This is debilitating, much like typing emails with just one finger.

How the computer sees us — Dan O’Sullivan & Tom Igoe’s model, 2004. The image illustrates how imbalanced is the use of individual body parts in man-computer interaction.

The invention of the touch screen brought some major improvements to the man-computer interaction, making it more natural and tactile. Soon, users could use not just one finger, but multiple fingers simultaneously – multi-touch technology enabled us to use more than one point of contact with the surface (i.e. a trackpad or touchscreen) at the same time. This prompted the emergence of multi-finger gestures allowing the user to zoom in, zoom out, scroll, rotate, and toggle between the elements of the user interface.

Much progress has been made, but we believe that people can communicate even more effectively when given a richer lexicon of gestures. Hand gestures can add another dimension to communication. Just like verbal communication involves enriching voice with subtle facial expressions and gestures, so does the machine interface benefit from an additional layer of interactivity provided by gestures.

Much progress has been made in voice recognition technology over the years. Machine learning is already used to recognize facial expressions. But gesture recognition has always been more challenging and only recently started to gain popularity. Powered by machine learning, gesture recognition is now not only possible, but also more accurate than ever.

Enter machine learning #

Reading gestures involves the recognition of images or recordings captured by the camera. Each gesture, when identified, is translated to a specific command in the controlled application.

Because standard image processing techniques do not yield great results, gesture recognition needs to be backed by machine learning to unfold its full potential.

The challenges of gesture recognition #

Ideally, gesture recognition should be based on a photo of a still hand showing only a single gesture against a clear background in well-lit conditions. But real-life conditions are hardly ever like that. We don’t always get the comfort to use solid, clear backgrounds when presenting gestures.

The role of machine learning in gesture recognition is, in part, to overcome some of the main technical issues associated with proper identification of gesture images.

1. Non-standard backgrounds

Gesture recognition should bring great results no matter the background: it should work whether you’re in the car, at home, or walking down the street. Machine learning gives you a way to teach the machine to tell the hand apart from the background consistently.

2. Movement

Common sense suggests that a gesture is a movement rather than a static image. Gesture recognition should thus be able to recognize patterns: instead of recognizing just an image of an open palm, we could recognize a waving movement and identify it, e.g., as a command to close the currently used application.

3. Combination of movements

What is more, gestures can consist of several movements, so we need to provide some context and recognize combined patterns. For example, moving fingers clockwise and then showing a thumb could be used to mark a limited number of files or a limited area.

4. Diversity of gestures

There is much diversity in how humans perform specific gestures. While humans have a high tolerance for errors, this inconsistency may make the detection and classification of gestures more difficult for machines. This is also where machine learning helps.

5. Fighting the lag

The gesture detection system must be designed to eliminate the lag between performing a gesture and its classification. Adoption of hand gestures can only be encouraged by showing how consistent and instantaneous it can be. There is really no other reason to start using gestures if they don’t make your interaction faster and more convenient. Ideally, we’re looking for a negative lag (classification even before the gesture is finished) to truly instantaneously give the feedback to the user.

If you want to read more about gesture classification, there is a great article on our blog on Classifying unknown gestures.

Machine learning data sets #

One of the most common challenges in applying machine learning in gesture recognition projects is the lack of a rich and meaningful data set. For machine learning to work, you need to feed it with data to train your models. The data set has to be adjusted to your needs, but some existing data sets can be very helpful:

MNIST Dataset #

The original MNIST image dataset of handwritten digits is a popular benchmark for image-based machine learning methods. Researchers have since developed drop-in replacements that are more challenging for computer vision and more relevant to real-world applications. One such replacement is Fashion-MNIST from Zalando Research. Each image in it is 28 pixels in height and 28 pixels in width, for a total of 784 pixels, and represents a piece of clothing. The Zalando researchers quoted the startling claim that “Most pairs of MNIST digits (i.e. the 784 total pixels per sample) can be distinguished pretty well by just one pixel”.

To stimulate the community to develop more drop-in replacements, the Sign Language MNIST follows the same CSV format with labels and pixel values in single rows. The American Sign Language letter database of hand gestures represents a multi-class problem with 24 classes of letters (excluding J and Z, which require motion).
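As a rough illustration of that row format, the sketch below parses one CSV row into a letter and a 28×28 grid. The column names (`label`, `pixel1` … `pixel784`) and the label mapping (0 = A … 25 = Z, with 9 and 25 unused) are assumptions about the published file layout, and the row itself is synthetic.

```python
import csv
import io
import string

IMAGE_SIDE = 28  # 28 x 28 = 784 pixels per sample


def label_to_letter(label: int) -> str:
    """Map a numeric label to a letter; assumes 0 = A ... 25 = Z."""
    return string.ascii_uppercase[label]


def parse_row(row: dict) -> tuple[str, list[list[int]]]:
    """Turn one CSV row into (letter, 28x28 grid of grayscale values)."""
    letter = label_to_letter(int(row["label"]))
    pixels = [int(row[f"pixel{i}"]) for i in range(1, IMAGE_SIDE * IMAGE_SIDE + 1)]
    # Cut the flat list of 784 values into 28 rows of 28 pixels each.
    image = [pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE] for r in range(IMAGE_SIDE)]
    return letter, image


# Synthetic one-row file standing in for the real CSV (pixel values are made up).
header = ["label"] + [f"pixel{i}" for i in range(1, 785)]
values = ["2"] + [str(i % 256) for i in range(784)]
fake_csv = io.StringIO(",".join(header) + "\n" + ",".join(values) + "\n")

for row in csv.DictReader(fake_csv):
    letter, image = parse_row(row)
    print(letter, len(image), len(image[0]))
```

With the real data set, `fake_csv` would be replaced by an opened training or test file.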

20BN-JESTER Dataset #

The 20BN-JESTER dataset is a large collection of densely-labeled video clips that show humans performing pre-defined hand gestures in front of a laptop camera or webcam. The dataset was created by a large number of people. It allows for training robust machine learning models to recognize human hand gestures. It is available free of charge for academic research, but commercial licenses are also available upon request.

LeapGestRecog Dataset #

LeapGestRecog is a hand gesture recognition database composed of a set of near-infrared images acquired by the Leap Motion sensor. The database contains 10 different hand gestures performed by 10 different subjects (5 men and 5 women).

The EgoGesture database #

EgoGesture is a multi-modal large scale dataset for egocentric hand gesture recognition. This dataset provides the test-bed not only for gesture classification in segmented data but also for gesture detection in continuous data. The dataset contains 2,081 RGB-D videos, 24,161 gesture samples and 2,953,224 frames from 50 distinct subjects. EgoGesture came up with 83 classes of static or dynamic gestures focused on interaction with wearable devices, as shown in the table below.

Kaggle Hand Gesture Recognition Database #

The Hand Gesture Recognition Database is a collection of near-infrared images of ten distinct hand gestures. The collection is broken down into 10 folders labeled 00 to 09, each containing images from a given subject. In each folder, there are subfolders for each gesture. When working with this data, it helps to build a lookup dictionary that stores the names of the gestures we need to identify and gives each gesture a numerical identifier, together with a reverse lookup that tells us which gesture is associated with a given identifier.
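A minimal sketch of such a lookup and reverse lookup is shown here. The gesture folder names are hypothetical examples; in practice they would be read from the subfolders of one subject's directory.

```python
def build_lookups(gesture_names):
    """Return (name -> identifier, identifier -> name) dictionaries."""
    lookup = {}
    reverse_lookup = {}
    # Sorting keeps the numbering stable no matter how the folders were listed.
    for identifier, name in enumerate(sorted(gesture_names)):
        lookup[name] = identifier
        reverse_lookup[identifier] = name
    return lookup, reverse_lookup


# Hypothetical subfolder names, used only for illustration.
names = ["03_fist", "01_palm", "02_l"]
lookup, reverse_lookup = build_lookups(names)

print(lookup["03_fist"])   # numeric identifier used as the training label
print(reverse_lookup[0])   # gesture name recovered from a predicted identifier
```

The forward dictionary is used when labeling training images, and the reverse one when turning a model's numeric prediction back into a readable gesture name.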

NVIDIA Dynamic Hand Gesture dataset #

As a solution to compensate for the lag in detection and classification of dynamic hand gestures from unsegmented multi-modal input streams, Nvidia proposes the idea of negative lag (classification even before the gesture is finished). This can make feedback truly instantaneous.

Nvidia published a paper in which they introduce a recurrent three-dimensional convolutional neural network which performs simultaneous detection and classification of dynamic hand gestures from unsegmented multi-modal input streams. This significantly reduces lag and makes the recognition near-instant. Connectionist temporal classification is introduced to train the network to predict class labels from in-progress gestures in unsegmented input streams.
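The idea of classifying a gesture before it is finished can be illustrated with a much simpler stand-in than Nvidia's network. The sketch below is not their method: it assumes some per-frame classifier already outputs class probabilities, smooths them over time, and commits to a label as soon as the smoothed confidence crosses a threshold. The function name, threshold, and probability values are all made up for illustration.

```python
def classify_early(frame_probs, threshold=0.8, smoothing=0.5):
    """Return (label, frame_index) as soon as a smoothed score passes threshold.

    frame_probs: list of dicts mapping gesture label -> probability, one per frame.
    Returns (None, None) if no label ever becomes confident enough.
    """
    smoothed = {}
    for index, probs in enumerate(frame_probs):
        for label, p in probs.items():
            previous = smoothed.get(label, 0.0)
            # Exponential moving average damps single-frame spikes.
            smoothed[label] = smoothing * previous + (1 - smoothing) * p
        best = max(smoothed, key=smoothed.get)
        if smoothed[best] >= threshold:
            return best, index
    return None, None


# Made-up per-frame probabilities for a gesture that lasts five frames.
stream = [
    {"swipe_left": 0.40, "wave": 0.60},
    {"swipe_left": 0.20, "wave": 0.80},
    {"swipe_left": 0.10, "wave": 0.90},
    {"swipe_left": 0.05, "wave": 0.95},
    {"swipe_left": 0.05, "wave": 0.95},
]
label, decided_at = classify_early(stream)
print(label, decided_at, "of", len(stream) - 1)
```

A lower threshold gives faster feedback at the cost of more wrong early guesses, which is the trade-off any early-classification scheme has to tune.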

To validate the method, they introduce a new challenging multi-modal dynamic hand gesture dataset captured with depth, color, and stereo-IR sensors. On this challenging dataset, Nvidia’s gesture recognition system achieves an accuracy of 83.8%, outperforming many competing state-of-the-art algorithms, and approaches human accuracy (88.4%).

If none of the above databases fits your needs, you will need to create a data set on your own, and you can extend it using the so-called Data Augmentation technique.

Recognizing user intentions #

Research on users shows that there is a common, intuitive sense for hand gesture directions and commands. For example, moving your hand upwards means increasing, and downwards means decreasing. Likewise, moving right means next, moving left means previous.

Well, that’s the theory. In practice, when users were asked to decrease the volume using gestures, some moved their hands downward, others made rotating moves with their fingers, dragged invisible sliders from right to left, or even pushed invisible buttons, as if they were using a remote control. A lot of research still needs to be done to work out a common set of gestures that could be used by all applications.

People #

Controlling devices at home seems natural, but people tend to avoid using gestures in public places: gestures reveal their intentions, and that embarrasses them. We will probably need to handle two sets of gestures, one for public places and one for places where we can be alone.

Performance #

To enable gesture recognition at mass scale, computations need to be done on smartphones, TVs, or on-board computers in cars. That requires developing efficient algorithms or even dedicated microchips. Right now, more advanced computations are performed on remote servers, which puts our privacy at risk.

Machine Learning? Challenge accepted! #

The modern approach relies heavily on Deep Learning algorithms and Computer Vision technologies, and aims to eliminate any additional hardware devices from the process (e.g. the Minority Report gloves).

The most popular approach to gesture detection uses Convolutional Neural Networks (CNNs), and the pipeline typically consists of the following parts:

Hand Detection #

The camera captures hand movements, and a machine learning algorithm segments the image to find hand edges and positions. This is a difficult part, but there are ready-to-use solutions, e.g. MediaPipe from Google.

Movement Tracking #

The camera captures each frame, and the system detects movement sequences for further analysis.

It is important not only to detect hands but also to detect how the hand position has changed in 3D space, because this change carries part of the gesture’s meaning.
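To make the tracking step more concrete, here is a simplified 2D sketch (depth is ignored). It assumes a hand detector has already produced one (x, y) centroid per frame in pixel coordinates, and it maps the overall displacement to one of the intuitive commands discussed earlier (up means increase, right means next). The function name and the `min_distance` value are illustrative assumptions.

```python
# Intuitive direction-to-command mapping described in the article.
COMMANDS = {"up": "increase", "down": "decrease", "right": "next", "left": "previous"}


def swipe_direction(centroids, min_distance=50):
    """Classify the overall hand motion from per-frame (x, y) centroids."""
    (x0, y0), (x1, y1) = centroids[0], centroids[-1]
    dx, dy = x1 - x0, y1 - y0
    # Ignore small jitter that is unlikely to be an intentional gesture.
    if max(abs(dx), abs(dy)) < min_distance:
        return "none"
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    # In image coordinates the y axis grows downward.
    return "down" if dy > 0 else "up"


# Made-up centroid track of a hand moving towards the top of the frame.
track = [(100, 300), (110, 200), (105, 120)]
direction = swipe_direction(track)
print(direction, "->", COMMANDS.get(direction, "no command"))
```

A real system would use the whole trajectory, and ideally depth, rather than only the first and last positions.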

Gesture Recognition #

An image classifier takes a photograph or video and assigns it to one of the provided categories, which it has learned from training data.

The data sets described above can be used to train a neural network. Alternatively, custom software can be written to capture gestures and assign them to specific categories. Data Augmentation can then be used to generate extra training samples, for example by rotating images or by increasing or decreasing their size.
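The augmentation operations mentioned here (rotation and resizing) can be sketched in plain Python on a tiny grayscale image stored as a list of rows. This is only an illustration of the idea; real projects typically rely on an image library, and the helper names are hypothetical.

```python
def rotate_90(image):
    """Rotate a 2D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def resize_nearest(image, new_height, new_width):
    """Scale an image up or down using nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // new_height][c * width // new_width]
         for c in range(new_width)]
        for r in range(new_height)
    ]


def augment(image):
    """Return the original image plus rotated, enlarged, and shrunken variants."""
    height, width = len(image), len(image[0])
    return [
        image,
        rotate_90(image),
        resize_nearest(image, height * 2, width * 2),
        resize_nearest(image, max(1, height // 2), max(1, width // 2)),
    ]


sample = [[1, 2],
          [3, 4]]
for variant in augment(sample):
    print(variant)
```

Each variant keeps the label of the original image, so a single captured gesture yields several training samples.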

Convolutional Neural Networks consist of several layers, each passing data to the next one. Each layer has its own neurons – nodes performing mathematical operations.

To use Convolutional Neural Networks, you need to define classes of data sets. The number of classes corresponds to the number of neurons in the last layer.
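The link between classes and the last layer can be shown with a toy output layer: one neuron (one row of weights) per gesture class, followed by a softmax that turns the scores into probabilities. The class names, weights, and input features below are invented for illustration and do not come from a trained network.

```python
import math

GESTURE_CLASSES = ["palm", "fist", "thumb"]  # hypothetical classes


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(features, weights, biases):
    """Dense layer with one neuron (one weight row) per class."""
    scores = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Three neurons, one per class, each looking at two input features.
weights = [[0.9, -0.2], [-0.4, 0.8], [0.1, 0.1]]
biases = [0.0, 0.0, 0.0]

probs = output_layer([1.0, 0.0], weights, biases)
predicted = GESTURE_CLASSES[probs.index(max(probs))]
print(len(probs), predicted)
```

Adding a fourth gesture class would mean adding a fourth row of weights, that is, a fourth neuron in the last layer.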

To avoid overfitting caused by a lack of data, a Dropout Rate is usually introduced. It randomly disables some neurons from the input or hidden layers during training, so the network learns to handle newly provided data better, not only the data used for training.
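A minimal sketch of dropout applied to a list of activations follows. It uses the common "inverted dropout" variant, which is an assumption rather than something stated in the article: surviving activations are scaled up during training so that nothing needs to change at prediction time.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate` (inverted dropout)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in the range [0, 1)")
    if not training or rate == 0.0:
        # At prediction time every neuron stays active.
        return list(activations)
    keep = 1.0 - rate
    # Scale the survivors by 1 / keep so the expected sum stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [1.0] * 10
print(dropout(hidden, rate=0.5, rng=rng))
print(dropout(hidden, rate=0.5, training=False))
```

Because a different random subset of neurons is silenced on every training step, the network cannot rely on any single neuron and tends to generalize better.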

To train a network, some tuning is required, and there is no golden rule on how to do it. Depending on the provided data, images can be converted, pooling can be introduced to remove unimportant details from images, and the number of epochs (how many times the training data passes through the neural network) can be adjusted.
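Pooling can be illustrated with a small example. The sketch below implements 2×2 max pooling, one common pooling variant chosen here as an assumption, on an image stored as a list of rows: each 2×2 block is replaced by its largest value, which halves the width and height while keeping the strongest responses.

```python
def max_pool_2x2(image):
    """Downsample a 2D grid by keeping the largest value in each 2x2 block."""
    # Ignore a trailing odd row or column so every block is complete.
    height = len(image) // 2 * 2
    width = len(image[0]) // 2 * 2
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(feature_map))  # [[4, 2], [2, 8]]
```

Small shifts of the hand inside a block no longer change the pooled output, which is one reason pooling makes the classifier less sensitive to unimportant detail.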

After training, we get a trained model that can be used in a gesture recognition system to recognize gestures.

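The quality of a trained CNN is commonly summarized with a confusion matrix, which shows how often each gesture class is predicted correctly or confused with another class, and from which the prediction accuracy can be calculated.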
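As a small illustration of how a confusion matrix and the accuracy derived from it can be computed, consider the sketch below. The gesture labels and predictions are invented. Rows correspond to the actual class and columns to the predicted class.

```python
def confusion_matrix(actual, predicted, labels):
    """Count how often each actual class (row) was predicted as each class (column)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


labels = ["palm", "fist", "thumb"]
actual = ["palm", "palm", "fist", "fist", "thumb", "thumb"]
predicted = ["palm", "fist", "fist", "fist", "thumb", "palm"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", round(accuracy(matrix), 2))
```

Off-diagonal cells show which gestures the model mixes up, which is often more useful for tuning than the single accuracy number.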

Who is using it? #

According to Grand View Research, based on research in China, the gesture recognition market is growing, offering new use cases and practical applications.

In 2020 alone, the market size was 11.6 billion USD, and it is forecast to reach an eye-watering 30.6 billion USD by 2025.

Consumer electronics #

Consumer electronics is what first springs to mind when talking about gesture recognition. Smartphones or TVs with embedded cameras allow us to use our hands to interact with media applications – control the playback of songs or movies. Gestures can be used to play/pause content, increase and decrease volume, mute or unmute sound, or to skip to the next song. To improve the accuracy of gesture recognition, data from your smartphone can be combined with data from wearable devices like smartwatches or dedicated devices with cameras.

Home assistants are another category of consumer electronics that could benefit from hand gestures. This could involve, for example, turning the lights on and off or dimming them. Gesture recognition can never get as advanced as voice commands, but it is very intuitive and certainly allows for the implementation of some basic commands.

Automotive #

In automotive, hand gestures are mostly used for infotainment systems to control in-car music players and phone systems. But gestures can also be used for lights control and GPS navigation. The benefit here is mainly improved convenience and user experience, as the driver no longer has to touch around the dashboard trying to find a button to switch radio stations or answer a phone call through the loudspeaker system.

Healthcare #

Gesture recognition can help to keep surgical wards sterile. By reviewing medical documentation or controlling the camera without touching the screen, the surgeon can reduce the risk of infection.

Typical implementations here would involve using simple gestures to control cameras: zooming in, zooming out, rotating, and moving cameras (see GestSure).

Entertainment #

Virtual Reality is another beneficiary of gesture recognition. Most game consoles require controllers, but Kinect proved that they are not required. Using full-body movements can make your whole body a game controller.

Last words: The future is now #

Predictions show that the market for gesture recognition technologies is growing, and there are some interesting projects that already use it.

Kinect, developed by Microsoft, was originally intended as a system that can track whole-body movements, but gesture recognition developers embraced it as an inexpensive way to build an HGR (Hand Gesture Recognition) setup. It can be adjusted not only to track hands but also to recognize hand gestures.

KinTrans Hands Can Talk is a project which uses AI to learn and process the body movements of sign language.

GestSure allows surgeons to navigate through MRI and CT scans without touching a screen.

Audi and BMW have already implemented a system that allows drivers to use gestures to control the infotainment system inside the car.

There are also numerous open-source projects for hand gesture recognition, like Real-time-GesRec based on PyTorch.
