Spatiotemporal Convolutions and Video Vision Transformers for Signer-Independent Sign Language Recognition
- Authors: Marais, Marc , Brown, Dane L , Connan, James , Boby, Alden
- Date: 2023
- Subjects: To be catalogued
- Language: English
- Type: text , article
- Identifier: http://hdl.handle.net/10962/463478 , vital:76412 , xlink:href="https://ieeexplore.ieee.org/abstract/document/10220534"
- Description: Sign language is a vital communication tool for individuals who are deaf or hard of hearing. Sign language recognition (SLR) technology can assist in bridging the communication gap between deaf and hearing individuals. However, existing SLR systems are typically signer-dependent, requiring training data from the specific signer for accurate recognition. This presents a significant challenge for practical use, as collecting data from every possible signer is not feasible. This research focuses on developing a signer-independent isolated SLR system to address this challenge. The system implements two model variants on the signer-independent datasets: an R(2+1)D spatiotemporal convolutional block and a Video Vision Transformer (ViViT). These models learn to extract features from raw sign language videos from the LSA64 dataset and classify signs without needing handcrafted features, explicit segmentation or pose estimation. Overall, the R(2+1)D model architecture significantly outperformed the ViViT architecture for signer-independent SLR on the LSA64 dataset. The R(2+1)D model achieved a near-perfect accuracy of 99.53% on the unseen test set, while the ViViT model yielded an accuracy of 72.19%, demonstrating that spatiotemporal convolutions are effective for signer-independent SLR. A minimal sketch of the R(2+1)D factorisation follows this record.
- Full Text:
- Date Issued: 2023
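The abstract above hinges on the R(2+1)D idea of factorising a full 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution. The following is a minimal sketch of such a block in PyTorch; it follows the published (2+1)D factorisation rather than the authors' exact implementation, and the kernel sizes, batch-normalisation placement and intermediate channel formula are illustrative assumptions.

```python
import torch
import torch.nn as nn

class R2Plus1DBlock(nn.Module):
    """(2+1)D factorisation: a t x d x d 3D convolution is replaced by a
    1 x d x d spatial convolution followed by a t x 1 x 1 temporal
    convolution, with a non-linearity in between."""

    def __init__(self, in_channels, out_channels, spatial_kernel=3, temporal_kernel=3):
        super().__init__()
        # Intermediate channel count chosen so the factorised block has
        # roughly the same parameter count as the full 3D convolution.
        mid_channels = (
            temporal_kernel * spatial_kernel ** 2 * in_channels * out_channels
        ) // (spatial_kernel ** 2 * in_channels + temporal_kernel * out_channels)

        self.spatial = nn.Conv3d(
            in_channels, mid_channels,
            kernel_size=(1, spatial_kernel, spatial_kernel),
            padding=(0, spatial_kernel // 2, spatial_kernel // 2),
            bias=False,
        )
        self.temporal = nn.Conv3d(
            mid_channels, out_channels,
            kernel_size=(temporal_kernel, 1, 1),
            padding=(temporal_kernel // 2, 0, 0),
            bias=False,
        )
        self.bn1 = nn.BatchNorm3d(mid_channels)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        x = self.relu(self.bn1(self.spatial(x)))
        return self.relu(self.bn2(self.temporal(x)))


if __name__ == "__main__":
    clip = torch.randn(2, 3, 16, 112, 112)  # a 16-frame RGB clip
    block = R2Plus1DBlock(3, 64)
    print(block(clip).shape)  # torch.Size([2, 64, 16, 112, 112])
```

The extra non-linearity between the spatial and temporal convolutions is what gives a (2+1)D block more representational flexibility than a single 3D convolution with a comparable parameter budget.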
An evaluation of hand-based algorithms for sign language recognition
- Authors: Marais, Marc , Brown, Dane L , Connan, James , Boby, Alden
- Date: 2022
- Subjects: To be catalogued
- Language: English
- Type: text , article
- Identifier: http://hdl.handle.net/10962/465124 , vital:76575 , xlink:href="https://ieeexplore.ieee.org/abstract/document/9856310"
- Description: Sign language recognition is an evolving research field in computer vision that assists communication between hearing-disabled and hearing individuals. Hand gestures carry the majority of the information when signing, so focusing feature extraction on the information contained in hand data may improve classification accuracy. Pose estimation is a popular method for extracting body and hand landmarks. We implement and compare different feature extraction and segmentation algorithms on the LSA64 dataset, focusing only on the hands. MediaPipe Holistic is applied to the sign images to extract hand landmark coordinates. Classification is performed using popular CNN architectures, namely ResNet and a Pruned VGG network, while a separate 1D-CNN is utilised to classify the hand landmark coordinates extracted using MediaPipe. The best performance was achieved on the unprocessed raw images using the Pruned VGG network, with an accuracy of 95.50%; however, the more computationally efficient model using the hand landmark data and a 1D-CNN achieved an accuracy of 94.91%. A sketch of the MediaPipe-based landmark extraction follows this record.
- Full Text:
- Date Issued: 2022
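The landmark-based branch of the abstract above relies on MediaPipe Holistic to extract 21 three-dimensional landmarks per hand, which are then fed to a 1D-CNN. The snippet below is a minimal sketch of that extraction step using the public MediaPipe Python API; the fixed frame cap and the zero-filling of undetected hands are assumptions for illustration, not the authors' stated preprocessing.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_holistic = mp.solutions.holistic

def extract_hand_landmarks(video_path, max_frames=60):
    """Return a (frames, 126) array: 21 landmarks x (x, y, z) for each of
    the two hands, zero-filled whenever a hand is not detected."""
    cap = cv2.VideoCapture(video_path)
    rows = []
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        while len(rows) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV reads frames as BGR.
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            row = []
            for hand in (results.left_hand_landmarks, results.right_hand_landmarks):
                if hand is not None:
                    row.extend(v for lm in hand.landmark for v in (lm.x, lm.y, lm.z))
                else:
                    row.extend([0.0] * 21 * 3)
            rows.append(row)
    cap.release()
    return np.array(rows, dtype=np.float32)
```

A 1D-CNN can then treat the 126 coordinate values as channels of a signal over time, which is far cheaper than convolving over full video frames and is why the landmark model is described as the more computationally efficient option above.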
Investigating the Effects of Image Correction Through Affine Transformations on Licence Plate Recognition
- Authors: Boby, Alden , Brown, Dane L , Connan, James , Marais, Marc
- Date: 2022
- Subjects: To be catalogued
- Language: English
- Type: text , article
- Identifier: http://hdl.handle.net/10962/465190 , vital:76581 , xlink:href="https://ieeexplore.ieee.org/abstract/document/9856380"
- Description: Licence plate recognition has many real-world applications, which fall under security and surveillance. In recent years, deep learning has been adopted for licence plate recognition to improve on existing image-based processing techniques. Object detectors, all of which are some form of convolutional neural network, are a popular choice for this task; the You Only Look Once (YOLO) framework and Region-Based Convolutional Neural Networks are popular models within this field. The Warped Planar Object Detector, a recent architecture by Zou et al., takes inspiration from YOLO and Spatial Transformer Networks. This paper compares the performance of the Warped Planar Object Detector and YOLO on licence plate recognition: both models are trained on the same data, their output is passed to an Enhanced Super-Resolution Generative Adversarial Network to upscale the detected plate image, and an Optical Character Recognition engine then classifies the characters detected in the images. A sketch of this detection, upscaling and OCR pipeline follows this record.
- Full Text:
- Date Issued: 2022
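The abstract above describes a three-stage pipeline: detect the plate, upscale the crop with a super-resolution network, then run OCR on the result. The sketch below shows only the pipeline wiring; detect_plate and upscale are hypothetical placeholders for the trained detector (WPOD or YOLO) and the ESRGAN model, neither of which is specified in the abstract, and the OCR step uses Tesseract via pytesseract as an assumed stand-in for the unnamed OCR engine.

```python
import cv2
import pytesseract

def read_plate(frame_bgr, detect_plate, upscale):
    """Detection -> crop -> super-resolution -> OCR pipeline sketch.

    detect_plate: callable returning an (x, y, w, h) plate bounding box.
    upscale: callable returning a super-resolved BGR image (e.g. 4x ESRGAN output).
    Both are hypothetical placeholders, not APIs from the paper.
    """
    x, y, w, h = detect_plate(frame_bgr)
    crop = frame_bgr[y:y + h, x:x + w]
    sr = upscale(crop)
    gray = cv2.cvtColor(sr, cv2.COLOR_BGR2GRAY)
    # Treat the plate as a single line of text and restrict the character set.
    text = pytesseract.image_to_string(
        gray,
        config="--psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
    )
    return text.strip()
```

OCR accuracy degrades quickly on small, low-resolution plate crops, so upscaling the crop before character classification gives the OCR engine a cleaner input regardless of which detector produced the crop.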