The challenges and opportunities of gesture recognition

Wojciech Marusarz - November 10, 2020

If you’ve ever watched Minority Report, you probably remember the scene where agent John Anderton (Tom Cruise) interacts with a massive transparent computer screen using a pair of futuristic three-finger gloves. He swiftly navigates the interface – controls video playback, zooms in and out of images, selects individual elements appearing on the screen and moves them around the interface. This fluid and seamless man-computer interaction is only possible due to the fact that his hand gestures are perfectly recognized by the machine.

When Minority Report launched in 2002, the scene was science fiction at its purest. But what then seemed like pie in the sky is entirely achievable with today’s gesture recognition technology. Augmented reality devices allow you to put on special gloves and use your hands as input devices. In fact, we’re already beyond that. The new challenge in gesture recognition is how to get rid of gloves altogether. Technological developments in machine learning are opening new opportunities to finally make it happen.

The evolution of man-computer interaction

Communication of humans with computers has always been a challenge. From the early mediums such as perforated cards, people have spent more than half of the past century experimenting with various ways to interact with computers – in pursuit of more efficient and intuitive interfaces. Keyboard and mouse slowly became the de-facto industry standard for user input.

Many of the novel input devices never really caught on, making the interaction with personal computers somewhat limited. Dan O’Sullivan and Tom Igoe’s model illustrates this aptly, showing how badly imbalanced the use of individual body parts in computer interaction is (see below). A majority of our interaction with computers involves fingers and eyes, but some parts of our body, like legs, arms and mouth, are criminally underused or not used for interaction at all. This is debilitating, pretty much like typing emails with just one finger.

How the computer sees us — Dan O’Sullivan & Tom Igoe’s model, 2004. The image illustrates how imbalanced is the use of individual body parts in man-computer interaction.

The invention of the touch screen brought some major improvements to the man-computer interaction, making it more natural and tactile. Soon, users could use not just one finger, but multiple fingers simultaneously – multi-touch technology enabled us to use more than one point of contact with the surface (i.e. a trackpad or touchscreen) at the same time. This prompted the emergence of multi-finger gestures allowing the user to zoom in, zoom out, scroll, rotate, and toggle between the elements of the user interface.

Much progress has been made, but we believe that people can communicate even more effectively when given a richer lexicon of gestures. Hand gestures can add another dimension to communication. Just like verbal communication involves enriching voice with subtle facial expressions and gestures, so does the machine interface benefit from an additional layer of interactivity provided by gestures.

Voice recognition technology has also advanced considerably over the years, and machine learning is already used to recognize facial expressions. But gesture recognition has always been more challenging and has only recently started to gain popularity. Powered by machine learning, gesture recognition is now not only possible, but also more accurate than ever.

Enter machine learning

Reading gestures involves the recognition of images or recordings captured by the camera. Each gesture, when identified, is translated to a specific command in the controlled application.

Because standard image processing techniques do not yield great results, gesture recognition needs to be backed by machine learning to unfold its full potential.

The challenges of gesture recognition

Ideally, gesture recognition should be based on a photo of a still hand showing only a single gesture against a clear background in well-lit conditions. But real-life conditions are hardly ever like that. We don’t always have the luxury of a solid, clear background when presenting gestures.

The role of machine learning in gesture recognition is, in part, to overcome some of the main technical issues associated with proper identification of gesture images.

1. Non-standard backgrounds

Gesture recognition should bring great results no matter the background: it should work whether you’re in the car, at home, or walking down the street. Machine learning gives you a way to teach the machine to tell the hand apart from the background consistently.

2. Movement

Common sense suggests that a gesture is a movement rather than a static image. Gesture recognition should thus be able to recognize patterns: instead of recognizing just an image of an open palm, we could recognize a waving movement and identify it, for example, as a command to close the currently used application.

3. Combination of movements

What is more, a gesture can consist of several movements, so we need to provide some context and recognize patterns. For example, moving the fingers clockwise and then showing a thumb could be used to mark a limited number of files or a limited area.

4. Diversity of gestures

There is much diversity in how humans perform specific gestures. While humans have a high tolerance for errors, this inconsistency may make the detection and classification of gestures more difficult for machines. This is also where machine learning helps.

5. Fighting the lag

The gesture detection system must be designed to eliminate the lag between performing a gesture and its classification. Adoption of hand gestures can only be encouraged by showing how consistent and instantaneous it can be. There is really no other reason to start using gestures if they don’t make your interaction faster and more convenient. Ideally, we’re looking for a negative lag (classification even before the gesture is finished) to truly instantaneously give the feedback to the user.
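The idea of classifying a gesture before it is finished can be sketched as follows. This is a simplified illustration under stated assumptions, not a production detector: the per-frame class probabilities are assumed to come from some upstream model, and the `early_classify` function, its `threshold` parameter and the sample data are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def early_classify(
    frame_probs: Sequence[Sequence[float]],
    labels: Sequence[str],
    threshold: float = 0.8,
) -> Tuple[Optional[str], int]:
    """Return (label, frames_consumed) as soon as the running average
    probability of one class reaches `threshold`.

    Returns (None, total_frames) if no class ever becomes confident enough.
    """
    totals = [0.0] * len(labels)
    for i, probs in enumerate(frame_probs, start=1):
        for k, p in enumerate(probs):
            totals[k] += p
        averages = [t / i for t in totals]
        best = max(range(len(labels)), key=lambda k: averages[k])
        if averages[best] >= threshold:
            return labels[best], i
    return None, len(frame_probs)


# Hypothetical per-frame outputs for a 6-frame "swipe_left" gesture.
LABELS = ["swipe_left", "swipe_right"]
FRAMES = [
    [0.55, 0.45],
    [0.85, 0.15],
    [0.90, 0.10],
    [0.95, 0.05],
    [0.97, 0.03],
    [0.99, 0.01],
]

label, used = early_classify(FRAMES, LABELS, threshold=0.75)
print(label, used)  # the decision is made after 3 of the 6 frames
```

A lower threshold gives faster feedback at the cost of more misfires, so the value would need to be tuned for each application.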

If you want to read more about gestures classification, there is a great article on our blog on Classifying unknown gestures.

Machine learning data sets

One of the most common challenges in applying machine learning in gesture recognition projects is the lack of a rich and meaningful data set. For machine learning to work, you need to feed it with data to train your models. The data set has to be adjusted to your needs, but some existing data sets can be very helpful:

MNIST Dataset

The original MNIST image dataset of handwritten digits is a popular benchmark for image-based machine learning methods. Researchers have since developed drop-in replacements that are more challenging for computer vision and better suited to real-world applications. One recent drop-in replacement is Fashion-MNIST, created by Zalando researchers. Each image in it is 28 pixels high and 28 pixels wide, for a total of 784 pixels, and represents a piece of clothing. The Zalando researchers quoted the startling claim that “Most pairs of MNIST digits (i.e. the 784 total pixels per sample) can be distinguished pretty well by just one pixel”.

To stimulate the community to develop more drop-in replacements, the Sign Language MNIST follows the same CSV format with labels and pixel values in single rows. The American Sign Language letter database of hand gestures represents a multi-class problem with 24 classes of letters (excluding J and Z, which require motion).
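The row layout described above (one label followed by 784 pixel values) can be read with a few lines of standard-library Python. This is a minimal sketch: the header names (`label`, `pixel1` … `pixel784`) and the in-memory sample row are assumptions made for illustration, and `parse_row` is a hypothetical helper, not part of any dataset tooling.

```python
import csv
import io
import string
from typing import List, Tuple

# 24 static letters: the alphabet without J and Z, which require motion.
STATIC_LETTERS = [c for c in string.ascii_uppercase if c not in "JZ"]


def parse_row(row: List[str], side: int = 28) -> Tuple[int, List[List[int]]]:
    """Split one CSV row into (label, side x side grid of pixel values)."""
    label = int(row[0])
    pixels = [int(v) for v in row[1:]]
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(pixels)}")
    image = [pixels[r * side:(r + 1) * side] for r in range(side)]
    return label, image


# Hypothetical in-memory sample standing in for a real CSV file.
header = ["label"] + [f"pixel{i}" for i in range(1, 785)]
sample = [3] + [i % 256 for i in range(784)]
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(sample)
buffer.seek(0)

reader = csv.reader(buffer)
next(reader)  # skip the header row
label, image = parse_row(next(reader))
print(label, len(image), len(image[0]), len(STATIC_LETTERS))
```

For a real file, the `io.StringIO` buffer would be replaced with `open(path, newline="")`.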

20BN-Jester Dataset

The 20BN-JESTER dataset is a large collection of densely-labeled video clips that show humans performing pre-defined hand gestures in front of a laptop camera or webcam. The dataset was created by a large number of people. It allows for training robust machine learning models to recognize human hand gestures. It is available free of charge for academic research, but commercial licenses are also available upon request.

LeapGestRecog Dataset

LeapGestRecog is a hand gesture recognition database composed of a set of near-infrared images acquired by the Leap Motion sensor. The database contains 10 different hand gestures performed by 10 different subjects (5 men and 5 women).

The EgoGesture database

EgoGesture is a multi-modal large scale dataset for egocentric hand gesture recognition. This dataset provides the test-bed not only for gesture classification in segmented data but also for gesture detection in continuous data. The dataset contains 2,081 RGB-D videos, 24,161 gesture samples and 2,953,224 frames from 50 distinct subjects. EgoGesture came up with 83 classes of static or dynamic gestures focused on interaction with wearable devices, as shown in the table below.

Kaggle Hand Gesture Recognition Database

The Hand Gesture Recognition Database is a collection of near-infrared images of ten distinct hand gestures. The collection is split into 10 folders labeled 00 to 09, each containing the images from one subject. Each folder has a subfolder for every gesture. When working with this data set, a common first step is to build a dictionary lookup that stores the names of the gestures to identify and gives each gesture a numerical identifier, together with a reverse lookup that tells us which gesture is associated with a given identifier.
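The lookup and reverse lookup mentioned above could look roughly like this. It is only a sketch: the gesture names in `GESTURES` are illustrative placeholders rather than the data set's actual folder names, and `build_lookups` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def build_lookups(names: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Map each gesture name to an integer id and build the reverse map."""
    lookup = {name: idx for idx, name in enumerate(names)}
    reverse = {idx: name for name, idx in lookup.items()}
    return lookup, reverse


# Illustrative gesture names (assumed, not taken from the data set).
GESTURES = ["palm", "fist", "thumb", "index", "ok"]
lookup, reverse_lookup = build_lookups(GESTURES)

print(lookup["thumb"])    # 2
print(reverse_lookup[2])  # thumb
```

The numeric identifiers are what the network is trained on, while the reverse lookup turns a predicted identifier back into a readable gesture name.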

NVIDIA Dynamic Hand Gesture dataset

As a solution to compensate for the lag in detection and classification of dynamic hand gestures from unsegmented multi-modal input streams, Nvidia proposes the idea of negative lag (classification even before the gesture is finished). This can make feedback truly instantaneous.

Nvidia published a paper in which they introduce a recurrent three-dimensional convolutional neural network which performs simultaneous detection and classification of dynamic hand gestures from unsegmented multi-modal input streams. This significantly reduces lag and makes the recognition near-instant. Connectionist temporal classification is introduced to train the network to predict class labels from in-progress gestures in unsegmented input streams.

To validate the method, they introduce a new challenging multi-modal dynamic hand gesture dataset captured with depth, color, and stereo-IR sensors. On this challenging dataset, Nvidia’s gesture recognition system achieves an accuracy of 83.8%, outperforming many competing state-of-the-art algorithms, and approaches human accuracy (88.4%).

If none of the above databases fits your needs, you will need to create a data set of your own, which you can then enlarge using a technique called Data Augmentation.

Recognizing user intentions

Research on users shows that there is a common, intuitive sense for hand gesture directions and commands. For example, moving your hand upwards means increasing, and downwards means decreasing. Likewise, moving right means next, moving left means previous.
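Under this intuitive convention, translating a recognized movement direction into a command is a simple table lookup. The sketch below assumes the direction has already been recognized; the `Direction` enum and the command names are hypothetical and only mirror the examples given above.

```python
from enum import Enum


class Direction(Enum):
    UP = "up"
    DOWN = "down"
    LEFT = "left"
    RIGHT = "right"


# Mapping taken from the intuitive convention described in the text.
COMMANDS = {
    Direction.UP: "increase",
    Direction.DOWN: "decrease",
    Direction.RIGHT: "next",
    Direction.LEFT: "previous",
}


def command_for(direction: Direction) -> str:
    """Return the command associated with a recognized hand movement."""
    return COMMANDS[direction]


print(command_for(Direction.UP))    # increase
print(command_for(Direction.LEFT))  # previous
```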

Well, that’s the theory. In practice, users who were asked to decrease the volume using gestures moved their hands downward, made rotating moves with their fingers, dragged invisible sliders from right to left, or even pushed invisible buttons, as if they were using a remote control. A lot of research needs to be done to work out a common set of gestures that could be used by all applications.

People

Controlling devices at home seems natural, but people tend to avoid using their gestures in public places – it shows their intentions, and it embarrasses them. We will probably need to handle two sets of gestures – one for public places and one for places where we can be alone.

Performance

To enable gesture recognition at scale, computations need to run on smartphones, TVs, or on-board computers in cars. That requires developing efficient algorithms or even dedicated microchips. Right now, more advanced computations are performed on remote servers, which puts our privacy at risk.

Machine Learning? Challenge accepted!

Modern approaches rely heavily on deep learning and computer vision, and aim to eliminate any additional hardware from the process (e.g. the Minority Report gloves).

The most popular approach to detecting gestures uses Convolutional Neural Networks (CNN), and it mostly consists of the following parts:

Hand Detection

The camera detects hand movements, and a machine learning algorithm segments the image to find hand edges and positions. This is a difficult part, but there are ready-to-use solutions, e.g. MediaPipe from Google.

Movement Tracking

The camera captures each frame, and the system detects movement sequences for further analysis.

It is important not only to detect hands but also to detect how the hand position has changed in 3D space, because that change carries part of the gesture’s meaning.
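As a rough illustration of tracking a change of position in 3D space, the sketch below compares the first and last hand positions of a sequence and names the dominant direction of travel. The coordinate convention (x to the right, y upwards, z toward the camera), the `min_distance` cut-off and the `dominant_motion` function are all assumptions made for this example; a real tracker would work on landmarks supplied by a hand detector.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def dominant_motion(track: List[Point3D], min_distance: float = 0.1) -> str:
    """Name the dominant axis and sign of the net hand displacement."""
    if len(track) < 2:
        return "none"
    # Net displacement between the first and the last tracked position.
    deltas = [end - start for start, end in zip(track[0], track[-1])]
    axis = max(range(3), key=lambda i: abs(deltas[i]))
    if abs(deltas[axis]) < min_distance:
        return "none"  # too small to count as a deliberate movement
    names = (("left", "right"), ("down", "up"), ("away", "toward"))
    return names[axis][1] if deltas[axis] > 0 else names[axis][0]


# Hypothetical hand positions over three consecutive frames.
swipe = [(0.0, 0.0, 0.5), (0.1, 0.02, 0.5), (0.3, 0.03, 0.52)]
print(dominant_motion(swipe))  # right
```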

Gesture Recognition

An image classifier takes a photograph or video and learns how to classify it into one of the provided categories.

The data sets described above can be used to train a neural network. Alternatively, custom software can be written to capture gestures and assign them to specific categories. Data Augmentation can be used to rotate images or to increase or decrease their size.
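The rotation and resizing steps of data augmentation can be sketched with plain Python lists. This is a deliberately simplified example: images are assumed to be grayscale grids of integers, rotation is limited to 90 degrees, and resizing uses nearest-neighbour sampling. The function names are hypothetical; real projects would typically rely on an image library and use small random angles and scales.

```python
from typing import List

Image = List[List[int]]


def rotate90(image: Image) -> Image:
    """Rotate a grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def resize_nearest(image: Image, new_h: int, new_w: int) -> Image:
    """Scale an image up or down with nearest-neighbour sampling."""
    h, w = len(image), len(image[0])
    return [
        [image[r * h // new_h][c * w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def augment(image: Image) -> List[Image]:
    """Return the original image plus rotated, enlarged and reduced copies."""
    h, w = len(image), len(image[0])
    return [
        image,
        rotate90(image),
        resize_nearest(image, h * 2, w * 2),
        resize_nearest(image, max(1, h // 2), max(1, w // 2)),
    ]


tiny = [[1, 2], [3, 4]]
variants = augment(tiny)
print(variants[1])  # [[3, 1], [4, 2]]
```

Each variant keeps the label of the original image, which is how a small data set can be stretched into a larger one.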

Convolutional Neural Networks consist of several layers, each passing data to the next one. Each layer has its own neurons – nodes performing mathematical operations.

To use Convolutional Neural Networks, you need to define the classes the data will be sorted into. The number of classes corresponds to the number of neurons in the last layer.
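The link between classes and the last layer can be shown with a tiny fully connected output layer followed by softmax, a common choice for turning the outputs into class probabilities. The class names, feature values, weights and biases below are made-up numbers for illustration only; in a real network they would be learned during training.

```python
import math
from typing import List


def dense(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """One output neuron per weight row: weighted sum of inputs plus bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits: List[float]) -> List[float]:
    """Turn raw neuron outputs into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ["palm", "fist", "thumb"]               # three classes ...
features = [0.2, 0.9]                             # hypothetical output of earlier layers
weights = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.3]]  # ... so three rows (neurons)
biases = [0.0, 0.1, 0.0]

probs = softmax(dense(features, weights, biases))
predicted = CLASSES[probs.index(max(probs))]
print(predicted)  # fist
```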

To avoid overfitting caused by a lack of data, a dropout rate has to be introduced. During training, dropout randomly deactivates some neurons in the input or hidden layers, so the network learns to handle newly provided data better, not only the data used for training.
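A minimal sketch of dropout on a list of activations is shown below. It uses the common "inverted dropout" variant, in which the surviving activations are scaled up so that their expected sum stays the same; that choice, the `dropout` function and its parameters are assumptions made for this example rather than a description of any particular framework.

```python
import random
from typing import List


def dropout(
    activations: List[float],
    rate: float,
    rng: random.Random,
    training: bool = True,
) -> List[float]:
    """Zero each activation with probability `rate` during training.

    Survivors are divided by (1 - rate) so the expected total is unchanged.
    At inference time the activations pass through untouched.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [1.0] * 1000
dropped = dropout(hidden, rate=0.5, rng=rng)
print(sum(1 for v in dropped if v == 0.0), "of", len(hidden), "neurons dropped")
```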

Training a network requires some tuning, and there is no golden rule for how to do it. Depending on the provided data, images can be converted, pooling can be introduced to remove unimportant details from images, and the number of epochs (how many times the whole training set passes through the neural network) can be adjusted.
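Pooling, mentioned above as a way to discard unimportant detail, can be illustrated with 2x2 max pooling on a small grid. This sketch assumes a grayscale image stored as a list of rows and silently drops a trailing row or column when a dimension is odd; `max_pool_2x2` is a hypothetical helper name.

```python
from typing import List

Image = List[List[int]]


def max_pool_2x2(image: Image) -> Image:
    """Keep only the largest value in each non-overlapping 2x2 block."""
    h, w = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


grid = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(grid))  # [[4, 2], [2, 8]]
```

The output has a quarter of the original pixels, which reduces computation in later layers while keeping the strongest responses.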

After training, you get a trained model that can be used in a gesture recognition system to recognize gestures.

Evaluating the trained CNN on test data produces a confusion matrix, which shows how accurate the predictions are for each gesture. Basing on the matrix,
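A confusion matrix and the overall accuracy derived from it can be computed as in the sketch below. Rows stand for the actual gesture and columns for the predicted one; the labels and the sample predictions are invented for illustration, and the helper names are hypothetical.

```python
from typing import List, Sequence


def confusion_matrix(
    actual: Sequence[str], predicted: Sequence[str], labels: Sequence[str]
) -> List[List[int]]:
    """Count predictions: rows are actual classes, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix: List[List[int]]) -> float:
    """Share of samples on the diagonal, i.e. classified correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


LABELS = ["palm", "fist", "thumb"]
actual = ["palm", "palm", "fist", "fist", "thumb", "thumb"]
predicted = ["palm", "fist", "fist", "fist", "thumb", "palm"]

matrix = confusion_matrix(actual, predicted, LABELS)
print(matrix)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(round(accuracy(matrix), 3))  # 0.667
```

Off-diagonal cells show which gestures get mistaken for each other, which is often more useful for tuning than the single accuracy figure.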

Who is using it?

According to Grand View Research, based on research in China, the gesture recognition market is growing, offering new use cases and practical applications.

In 2020 alone, the market was worth 11.6 billion USD, and it is forecast to reach an eye-watering 30.6 billion USD by 2025.

Consumer electronics

Consumer electronics is what first springs to mind when talking about gesture recognition. Smartphones or TVs with embedded cameras allow us to use our hands to interact with media applications – control the playback of songs or movies. Gestures can be used to play/pause content, increase and decrease volume, mute or unmute sound, or to skip to the next song. To improve the accuracy of gesture recognition, data from your smartphone can be combined with data from wearable devices like smartwatches or dedicated devices with cameras.

Home assistants are another category of consumer electronics that could benefit from hand gestures. This could involve, for example, turning the lights on and off, or dimming them. Gesture recognition can never get as advanced as voice commands, but it is very intuitive and certainly allows for the implementation of some basic commands.

Automotive

In automotive, hand gestures are mostly used for infotainment systems to control in-car music players and phone systems. But gestures can also be used for lights control and GPS navigation. The benefit here is mainly improved convenience and user experience, as the driver no longer has to touch around the dashboard trying to find a button to switch radio stations or answer a phone call through the loudspeaker system.

Healthcare

Gesture recognition can help to keep surgical wards sterile. By reviewing medical documentation or controlling the camera without touching the screen, the surgeon can reduce the risk of infection.

Typical implementations here would involve using simple gestures to control cameras: zooming in, zooming out, rotating, and moving cameras (see GestSure).

Entertainment

Virtual Reality is another beneficiary of gesture recognition. Most game consoles require controllers, but Kinect proved that they are not necessary: full-body movements can make your whole body a game controller.

Last words: The future is now

Predictions show that the market for gesture recognition technologies is growing, and there are some interesting projects that already use it.

Kinect, developed by Microsoft, was originally intended as a system for tracking whole-body movements, but gesture recognition developers embraced it as an inexpensive way to build an HGR (Hand Gesture Recognition) setup. It can be adjusted not only to track hands but also to recognize hand gestures.

KinTrans Hands Can Talk is a project which uses AI to learn and process the body movements of sign language.

GestSure allows surgeons to navigate through MRI and CT scans without touching a screen.

Audi and BMW have already implemented a system that allows drivers to use gestures to control the infotainment system inside the car.

There are also numerous open-source projects for hand gesture recognition, like Real-time-GesRec based on PyTorch.

Additional reading:
https://ltu.diva-portal.org/smash/get/diva2:1299000/FULLTEXT01
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6928637/
https://cs.stanford.edu/people/ssrinath/pubs/MPI-I-2016-4-002.pdf

About the author

Wojciech Marusarz

Software Engineer

Wojciech enjoys working with small teams where the quality of the code and the project's direction are essential. In the long run, this allows him to have a broad understanding of the subject, develop personally and look for challenges. He deals with programming in Java and Kotlin. Additionally, Wojciech is interested in Big Data tools, making him a perfect candidate for various Data-Intensive Application implementations.
