Meet the NORCICS PhDs and PostDocs – Touseef Sadiq

Multimodal Machine Learning: Learning multimodal intermediate video and language representations in deep networks for descriptive object identification and tracking in urban environments.

Meet Touseef Sadiq, a PhD researcher at the Centre for Artificial Intelligence Research (CAIR) at the University of Agder, Norway. Touseef's current research explores deep multimodal learning for descriptive object identification and tracking in urban environments. His work falls under NORCICS Task 3.4, "Humanized Deep Learning & Big Data Analytics".

Human learning encompasses various modalities; we read, watch, and listen, processing diverse sensory inputs. Computers, too, can learn from multiple data types, termed multimodal data, to address intricate challenges. In smart city contexts, the integration of visual data and textual data is essential for unleashing the complete potential of Multimodal Machine Learning (MML). MML's mission is to bridge data barriers, allowing visual information and human language to coexist harmoniously.

Today's smart cities are brimming with diverse data sources, including surveillance video. Our research delves into integrating features from these sources to enhance real-world applications, with a specific focus on combining video and text data to advance intelligent transportation systems. We work with the CityFlow-NL dataset, which pairs tracked vehicles with natural language descriptions and serves as a benchmark for evaluating natural language-based tracked-vehicle retrieval systems in intelligent traffic contexts.
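To make the retrieval task concrete, the sketch below ranks candidate vehicle tracks against a single language query by cosine similarity in a shared embedding space. It is a minimal illustration, not the actual CityFlow-NL evaluation protocol; the embeddings are random placeholders standing in for real encoder outputs.

```python
# Minimal sketch of natural language-based tracked-vehicle retrieval: given an
# embedding of a text query and embeddings of candidate vehicle tracks (both assumed
# to live in a shared space), rank the tracks by cosine similarity.
import torch
import torch.nn.functional as F

query_embedding = F.normalize(torch.randn(1, 256), dim=-1)     # one text description
track_embeddings = F.normalize(torch.randn(100, 256), dim=-1)  # 100 candidate tracks

scores = (query_embedding @ track_embeddings.T).squeeze(0)     # cosine similarities
ranked_track_ids = scores.argsort(descending=True)             # best match first
print(ranked_track_ids[:5])
```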

In Multimodal Machine Learning (MML), deep learning models are the engines that drive the fusion of different data types. MML relies on neural networks, loosely inspired by the human brain, to process inputs spanning visual processing and text understanding. To extract meaningful features from videos, Convolutional Neural Networks (CNNs) such as ReXNET50 and EfficientNetB0 are employed, capturing visual details through convolutional layers. These architectures autonomously learn hierarchical representations from raw pixels and excel at tasks like object detection. On the textual side, Bidirectional Encoder Representations from Transformers (BERT) and its variants, such as RoBERTa base and RoBERTa large, are used to encode text. BERT's ability to capture complex interactions between words through pre-training and fine-tuning makes it a powerful tool for MML.
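As a rough illustration of this two-encoder setup, the sketch below extracts pooled frame features with a torchvision EfficientNet-B0 backbone and a sentence embedding with a pre-trained BERT model from Hugging Face. The specific checkpoints, the temporal average pooling, and the feature dimensions are assumptions for illustration, not the project's actual configuration.

```python
# Minimal sketch (not the project's pipeline): video features from EfficientNet-B0,
# text features from BERT.
import torch
import torchvision.models as models
from transformers import BertTokenizer, BertModel

# Visual encoder: EfficientNet-B0 with the classification head removed.
cnn = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
cnn.classifier = torch.nn.Identity()   # keep the 1280-d pooled feature
cnn.eval()

# Text encoder: pre-trained BERT; the [CLS] token embedding summarises the sentence.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

with torch.no_grad():
    # A batch of 8 frames, e.g. sampled from one tracked vehicle (random placeholders).
    frames = torch.randn(8, 3, 224, 224)
    frame_features = cnn(frames)                   # (8, 1280)
    video_feature = frame_features.mean(dim=0)     # simple temporal average pooling

    # A natural-language description of the target vehicle (illustrative).
    text = "A blue SUV turns left at the intersection."
    tokens = tokenizer(text, return_tensors="pt")
    text_feature = bert(**tokens).last_hidden_state[:, 0, :]  # (1, 768) [CLS] vector

print(video_feature.shape, text_feature.shape)
```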

The alignment challenge revolves around mapping features from diverse modalities into a common representation space so they can be learned jointly. This step ensures that data from different sources and formats can be integrated and processed together. The focus is on identifying common features and framing them in a shared embedding space, which is where Multimodal Machine Learning excels. Two approaches are explored for bridging the gap between visual and textual input features: similarity learning, which uses Siamese Neural Networks for feature alignment, and contrastive approaches, which employ objectives such as InfoNCE loss and circle loss to measure similarity between multimodal features and strengthen the alignment model. This highlights the pivotal role of feature extraction from the vision and language modalities and of model selection, including CNNs for visual encoding and transformer-based models like BERT for text processing. A minimal sketch of the contrastive idea follows below.
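The following sketch shows the contrastive alignment idea with a symmetric InfoNCE objective: modality-specific features are projected into a shared space, and matching video-text pairs are pulled together while mismatched pairs are pushed apart. The projection dimensions and temperature are illustrative assumptions, not the project's actual settings.

```python
# Minimal sketch of contrastive video-text alignment with a symmetric InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    """Maps modality-specific features into a common embedding space."""
    def __init__(self, video_dim=1280, text_dim=768, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def info_nce_loss(v, t, temperature=0.07):
    """Symmetric InfoNCE: each video matches its paired description and vice versa."""
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(v.size(0))       # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random features standing in for CNN / BERT outputs.
projector = SharedSpaceProjector()
video_feats, text_feats = torch.randn(16, 1280), torch.randn(16, 768)
v, t = projector(video_feats, text_feats)
loss = info_nce_loss(v, t)
print(loss.item())
```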

In pursuit of feature alignment within a common latent space, we have explored similarity-based and contrastive methods to narrow the "modality gap" between features. Despite these efforts, performance losses in downstream tasks persist and are often attributed to the next module in the pipeline. Within the modality-specific latent space, we have assessed how modality-specific feature representations influence downstream performance, yet an "information gap" remains even with these advanced models. To address this challenge, we consider regularization techniques such as deep feature loss and inter- and intra-modality losses. Moreover, progress is hindered by the pervasive problem of data scarcity in machine learning, prompting our exploration of data synthesis methods including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and data augmentation.
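As one hedged illustration of combining such regularizers, the sketch below adds a simple intra-modality decorrelation term to an inter-modality alignment term. The specific loss forms and weighting are assumptions chosen for clarity, not the losses used in the project.

```python
# Illustrative combination of inter- and intra-modality regularization terms.
import torch
import torch.nn.functional as F

def intra_modality_loss(feats):
    """Discourage feature collapse within one modality by pushing normalised
    off-diagonal pairwise similarities towards zero."""
    z = F.normalize(feats, dim=-1)
    sim = z @ z.T
    off_diag = sim - torch.diag(torch.diag(sim))
    return (off_diag ** 2).mean()

def inter_modality_loss(v, t):
    """Pull paired video and text embeddings together (simple alignment term)."""
    return (1 - F.cosine_similarity(v, t, dim=-1)).mean()

v = torch.randn(16, 256, requires_grad=True)   # video embeddings (placeholder)
t = torch.randn(16, 256, requires_grad=True)   # text embeddings (placeholder)

# Weighting of 0.1 is an arbitrary illustrative choice.
total_loss = inter_modality_loss(v, t) + 0.1 * (intra_modality_loss(v) + intra_modality_loss(t))
total_loss.backward()
print(total_loss.item())
```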
