Video Classification for Drowsiness Detection

Wipro Tech Blogs
8 min read · Jan 23, 2023


By Sourish Sarkar, Vijay Prakash Saini, Nabarun Barua, Vinutha B N

Multiple studies show that driver fatigue is one of the major causes of fatal road accidents globally. More than 1 in 4 drivers (29.4%) reported having driven while so tired that they had a hard time keeping their eyes open. One in five (19.8%) reported having done this more than once, and 2.4% reported having done this often or regularly. NHTSA estimates that in 2015, over 72,000 police-reported crashes involved drowsy drivers; these crashes led to 41,000 injuries and more than 800 deaths [6].

To avoid such scenarios, we have developed a system using deep learning techniques. The system detects the state of a driver's consciousness and alerts the driver if they are found to be drowsy. We employ two approaches to perform binary classification on a video of a driver and determine whether the driver is drowsy. The first approach uses the Transformer model, originally developed for natural language processing: features are extracted from the video frames by a vision transformer (ViT) and then used to train a model to detect drowsiness. In the second approach, a conventional sequential model (an LSTM) is trained on features extracted with Google's Mediapipe API.

For training our model, we used the Sivas University of Science and Technology Drowsiness Dataset (SUST-DDD) [7]. This dataset consists of videos recorded by drivers' cell phone cameras during actual driving.

Approach 1

We trained a model on features extracted by a Vision transformer (ViT) [1] from a set of training videos. A pre-trained vision transformer is used to extract features from each frame, positional encoding is injected into extracted features, and the feature vectors are combined into a single feature vector that represents the entire video. The process of feature extraction is illustrated below:

Figure 1: Feature Extraction Architecture

Vision transformer:

The ‘Transformer’ architecture was introduced in a seminal work by Vaswani et al. in 2017 [2]. Transformers have gained huge success in the Natural Language Processing (NLP) domain. However, it was only in 2020 that they started establishing their presence in the computer vision community. This was mainly due to the proposal of the vision transformer [1] architecture, which showed that, given sufficiently large amounts of data, transformers can even outperform Convolutional Neural Networks (CNNs)!

Figure 2: ViT architecture (Image credits: An image is worth 16 x 16 words: Transformers for image recognition at scale, Dosovitskiy et al., ICLR 2021)

In this work, we used a pre-trained vision transformer as a feature extractor. Specifically, given an image, we extracted the output of the transformer encoder corresponding to the `CLS` token, which yields a 768-dimensional feature vector for a single frame. Stacked in order, these per-frame vectors represent the entire video. Here, one might wonder why we chose the output of the `CLS` token and not the other tokens. What makes it so special? It is because this token represents the entire input, as the authors note in the vision transformer paper.

Figure 3: Use of class token (Image credits: An image is worth 16 x 16 words: Transformers for image recognition at scale, Dosovitskiy et al., ICLR 2021)

We used the Hugging Face [4] library, which allows several transformer architectures to be used straight out of the box, as illustrated by the following code snippet:
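A minimal sketch of the per-frame extraction with the Hugging Face `transformers` library is shown below. The checkpoint name is a representative choice, not necessarily the one used in our experiments:

```python
# Per-frame feature extraction with a pre-trained ViT (Hugging Face).
# The checkpoint name here is an assumption; any ViT-Base model with a
# 768-dimensional hidden size works the same way.
import torch
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
model.eval()

def extract_frame_feature(frame):
    """frame: an RGB image (H x W x 3 uint8 array). Returns a 768-d tensor."""
    inputs = processor(images=frame, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Token 0 of the encoder output is the `CLS` token.
    return outputs.last_hidden_state[0, 0]
```

Running `extract_frame_feature` on every frame of a video produces the stack of 768-dimensional vectors described above.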

There is one issue we need to consider before we can use these features to train a simple classifier (we used an MLP with no hidden layers, which basically boils down to logistic regression). The frames in a video are stacked in a particular order. Although the transformer uses positional encodings for its own inputs (the patches extracted from an image), there is no positional encoding to account for the ordering of the frames. To overcome this constraint, we used the positional-encodings library. Injecting positional encoding is shown below:

Lastly, summing the feature vectors across the frames resulted in a single 768-dimensional vector representing the entire video, which was used for training a simple classifier. We tried both 1D and 2D positional encodings, both of which are available in the positional-encodings library; we refer the interested reader to the project's homepage for further details.

Figure 4: Drowsiness alert based on ViT

Test results using Approach 1:

We trained a classifier containing a single fully connected layer, mapping the 768-dimensional inputs to a single value. Note that with a sigmoid activation function, this is our good old logistic regression classifier. Finally, we trained the classifier by minimizing the binary cross-entropy loss. We tried 1D, 2D, and no positional encodings and found the following:

  1. Model with 1D positional encoding achieves an accuracy of 80.74% on the test dataset.
  2. Model with 2D positional encoding achieves an accuracy of 75.4% on the test dataset.
  3. Model with no positional encoding achieves an accuracy of 80.58% on the test dataset.
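The classifier and one training step can be sketched as follows; the layer sizes match the text, while the optimizer and learning rate are assumptions:

```python
# Single fully connected layer over the 768-d video vectors, trained with
# binary cross-entropy. BCEWithLogitsLoss folds in the sigmoid activation.
import torch
import torch.nn as nn

classifier = nn.Linear(768, 1)           # 768-d video feature -> 1 logit
criterion = nn.BCEWithLogitsLoss()       # sigmoid + binary cross-entropy
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def train_step(video_vectors, labels):
    """video_vectors: (batch, 768); labels: (batch, 1) of 0/1 floats."""
    optimizer.zero_grad()
    logits = classifier(video_vectors)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```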

Quite interestingly, however, the second model does better on some challenging test videos that the other models almost always misclassified.

Approach 2:

In the second approach, we trained an LSTM model that took as input facial landmarks extracted from the same training set of videos. For a fair comparison, the test set was also the same as in Approach 1.

Facial landmarks — Features used for training an LSTM:

Face landmarks are key points on the face, for example on the eyes, nose, and lips. They act as important identifying features: think of how a person can recognize another person's face even when it is partially occluded, for example by a face mask. The Mediapipe [3] library provides a fast and easy solution to this problem, returning the 3D coordinates of the various face landmarks. The library extracts 478 landmarks in all; the first 468 are facial landmarks, while the last 10 correspond to the irises. Below is the code to extract these landmarks from an image of a face into a numpy array:

Figure 5: Attention Mesh: overview of model architecture
Figure 6: Facial landmarks extracted using Mediapipe


Recurrent neural networks (RNNs) have long been the architecture of choice for handling sequential data. However, a major problem with RNNs is their inability to do well on long sequences. Long short-term memory (LSTM) [5] alleviates this drawback by providing both short-term and long-term memory, allowing it to model sequential data efficiently. Here is what an LSTM looks like:

Figure 7: LSTM Architecture

We train the LSTM on the facial landmarks extracted from the frames of our videos. To understand the dimensions of the features (the extracted facial landmarks), consider, for example, video sequences containing 125 frames. As explained above, we extract 3D coordinates for the 478 facial landmarks in each frame. This gives a 1434-dimensional (478 × 3) feature vector for every frame. Hence, for a single video we obtain 125 such 1434-dimensional feature vectors, i.e., a matrix of dimensions 125 × 1434 per video, as illustrated in the diagram below:

Figure 8: Model architecture for the second approach
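The shapes described above can be sketched as a small PyTorch model; the input dimensions follow the text, while the hidden size and the use of the final hidden state for classification are assumptions:

```python
# LSTM classifier over landmark sequences: each video is a
# (125 frames x 1434 coordinates) matrix, mapped to a single logit.
import torch
import torch.nn as nn

class DrowsinessLSTM(nn.Module):
    def __init__(self, input_dim=1434, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, 125, 1434); use the last hidden state as the
        # summary of the whole sequence.
        _, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1])          # one logit per video

model = DrowsinessLSTM()
logits = model(torch.randn(2, 125, 1434))  # 2 videos -> (2, 1) logits
```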

Test results using Approach 2:

Akin to Approach 1, we trained the LSTM model by minimizing the binary cross-entropy loss. We obtained an accuracy of 67.25% on the test set using this model.


Two significant issues encountered during these experiments were:

1. Slow performance on videos with high resolution.

2. Presence of noise (and/or various other forms of degradation) in videos.

For the first issue, we spatially downsampled the videos by a factor of four. We also exploited the large amount of temporal redundancy present in the videos; by temporal redundancy, we mean that adjacent frames in a video are highly correlated and therefore largely redundant. Hence, we trimmed each video by discarding alternate frames, which did not seem to hurt the model's performance.

The second issue, however, requires much more careful treatment, since video restoration is an active topic of research. We applied video denoising before presenting the videos to our model for testing. That, however, is the topic of another blog!


We discussed video classification approaches for developing a system that can alert drivers, or raise an alarm, based on the driver's behavior and symptoms of drowsiness. We delved into deep learning architectures and implemented two approaches: one used an LSTM, a sequential model that keeps track of history to improve its predictions, while the other used a more complex network of Transformers. While Transformer models have become the standard for Natural Language Processing, we observed that for classifying video they also outperformed the LSTM model, with a clear increase in accuracy (>75% for the Transformer models vs. 67.25% for the LSTM model).


[1] Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).

[2] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).



[5] Gers, F. A., Schraudolph, N. N., and Schmidhuber, J. “Learning Precise Timing with LSTM Recurrent Networks.” Journal of Machine Learning Research 3 (2002): 115–143.


[7] Kavalcı Yılmaz, Esra, and M. Akcayol. “SUST-DDD: A Real-Drive Dataset for Driver Drowsiness Detection.” Proceedings of the Conference of Open Innovations Association FRUCT, no. 31.