Khoa Vo | Publications

2023

CVPRW

DNA: Deformable Neural Articulations Network for Template-Free Dynamic 3D Human Reconstruction From Monocular RGB-D Video

Khoa Vo, Trong-Thang Pham, Kashu Yamazaki, Minh Tran, and Ngan Le

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops Jun 2023

AAAI

VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, and Ngan Le

In AAAI Conference on Artificial Intelligence Jun 2023
IJCV

AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation

Khoa Vo, Sang Truong, Kashu Yamazaki, Bhiksha Raj, Minh-Triet Tran, and Ngan Le

International Journal of Computer Vision Jan 2023

Temporal action proposal generation (TAPG) is a challenging task, which requires localizing action intervals in an untrimmed video. Intuitively, we as humans perceive an action through the interactions between actors, relevant objects, and the surrounding environment. Despite the significant progress of TAPG, a vast majority of existing methods ignore this principle of the human perceiving process by applying a backbone network to a given video as a black box. In this paper, we propose to model these interactions with a multi-modal representation network, namely, Actors-Objects-Environment Interaction Network (AOE-Net). Our AOE-Net consists of two modules, i.e., perception-based multi-modal representation (PMR) and boundary-matching module (BMM). Additionally, we introduce an adaptive attention mechanism (AAM) in PMR to focus only on main actors (or relevant objects) and model the relationships among them. The PMR module represents each video snippet by a visual-linguistic feature, in which main actors and surrounding environment are represented by visual information, whereas relevant objects are depicted by linguistic features through an image-text model. The BMM module processes the sequence of visual-linguistic features as its input and generates action proposals. Comprehensive experiments and extensive ablation studies on the ActivityNet-1.3 and THUMOS-14 datasets show that our proposed AOE-Net outperforms previous state-of-the-art methods with remarkable performance and generalization for both TAPG and temporal action detection. To prove the robustness and effectiveness of AOE-Net, we further conduct an ablation study on egocentric videos, i.e., the EPIC-KITCHENS 100 dataset. Our source code is publicly available at https://github.com/UARK-AICV/AOE-Net.
Book Chap.

Chapter 19: Neural Architecture Search for Medical Image Applications

Viet-Khoa Vo-Ho, Kashu Yamazaki, Hieu Hoang, Minh-Triet Tran, and Ngan Le

In Meta Learning With Medical Imaging and Health Informatics Applications Jan 2023

Deep learning methods have been successful in solving tasks in machine learning and have made breakthroughs in many sectors owing to their ability to automatically extract features from unstructured data. However, their performance relies on manual trial-and-error processes for selecting an appropriate network architecture, training hyperparameters, and pre- and post-processing procedures. Although network architecture has been shown to play a critical role in the feature representation of data and in the final performance, designing a network architecture is computationally intensive and relies heavily on researchers’ experience. Automated machine learning (AutoML) and its advanced techniques, i.e., Neural Architecture Search (NAS), have been promoted to address those limitations. Beyond general computer vision tasks, NAS has motivated various applications in multiple areas, including medical imaging. In medical imaging, NAS has made significant progress in improving the accuracy of image classification, segmentation, reconstruction, and more. In this book chapter, we first review the background of NAS by surveying well-known approaches to search space, search strategy, and evaluation strategy. We then introduce various NAS approaches in medical imaging for different applications such as classification, segmentation, detection, and reconstruction. Finally, we describe several open problems in NAS.

2022

BMVC

AISFormer: Amodal Instance Segmentation with Transformer

Minh Tran, Khoa Vo, Kashu Yamazaki, Arthur Fernandes, Michael Kidd, and Ngan Le

In Proceedings of the British Machine Vision Conference (BMVC) Jan 2022

Amodal Instance Segmentation (AIS) aims to segment the region of both the visible and possibly occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level feature coherence due to their limited receptive field. The most recent transformer-based models show impressive performance on vision tasks, even better than Convolutional Neural Networks (CNNs). In this work, we present AISFormer, an AIS framework with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object’s regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract the ROI and learn both short-range and long-range visual features; (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings with a transformer decoder; (iii) invisible mask embedding: model the coherence between the amodal and visible masks; and (iv) mask predicting: estimate the output masks, including occluder, visible, amodal, and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks, i.e., KINS, D2SA, and COCOA-cls, to evaluate the effectiveness of AISFormer.
Brain Sci.

Spiking Neural Networks and Their Applications: A Review

Kashu Yamazaki, Khoa Vo, Darshan Bulsara, and Ngan Le

Brain Sciences Jan 2022

The past decade has witnessed the great success of deep neural networks in various domains. However, deep neural networks are very resource-intensive in terms of energy consumption, data requirements, and computational costs. With the recent increasing need for the autonomy of machines in the real world, e.g., self-driving vehicles, drones, and collaborative robots, exploitation of deep neural networks in those applications has been actively investigated. In those applications, energy and computational efficiencies are especially important because of the need for real-time responses and the limited energy supply. A promising solution to these previously infeasible applications has recently been given by biologically plausible spiking neural networks. Spiking neural networks aim to bridge the gap between neuroscience and machine learning, using biologically realistic models of neurons to carry out the computation. Due to their functional similarity to the biological neural network, spiking neural networks can embrace the sparsity found in biology and are highly compatible with temporal code. Our contributions in this work are: (i) we give a comprehensive review of theories of biological neurons; (ii) we present various existing spike-based neuron models, which have been studied in neuroscience; (iii) we detail synapse models; (iv) we provide a review of artificial neural networks; (v) we provide detailed guidance on how to train spike-based neuron models; (vi) we review available spike-based neuron frameworks that have been developed to support implementing spiking neural networks; (vii) finally, we cover existing spiking neural network applications in computer vision and robotics domains. The paper concludes with discussions of future perspectives.
ICIP

VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

Kashu Yamazaki, Sang Truong, Khoa Vo, Michael Kidd, Chase Rainwater, Khoa Luu, and Ngan Le

In 2022 IEEE International Conference on Image Processing (ICIP) Jan 2022

In this paper, we leverage the human perceiving process, which involves vision and language interaction, to generate a coherent paragraph description of untrimmed videos. We propose vision-language (VL) features consisting of two modalities, i.e., (i) a vision modality to capture the global visual content of the entire scene and (ii) a language modality to extract scene element descriptions of both human and non-human objects (e.g., animals, vehicles, etc.) and visual and non-visual elements (e.g., relations, activities, etc.). Furthermore, we propose to train VLCap under a contrastive learning VL loss. The experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that our VLCap outperforms existing SOTA methods on both accuracy and diversity metrics.

2021

BMVC
Oral Session

AEI: Actors-Environment Interaction with Adaptive Attention for Temporal Action Proposals Generation

Khoa Vo, Hyekang Joo, Kashu Yamazaki, Sang Truong, Kris Kitani, Minh-Triet Tran, and Ngan Le

In Proceedings of the British Machine Vision Conference (BMVC) Jan 2021

Humans typically perceive the establishment of an action in a video through the interaction between an actor and the surrounding environment. Despite the great progress in temporal action proposal generation, most existing works ignore this fact and leave their model to learn to propose actions as a black box. In this paper, we attempt to simulate this human ability by proposing the Actor Environment Interaction (AEI) network to learn video visual representations for temporal action proposal generation. AEI contains two modules, i.e., perception-based visual representation (PVR) and boundary matching module (BMM). PVR represents each video snippet by taking human-human relations and humans-environment relations into consideration using the proposed adaptive attention mechanism. Then, the video representation is taken by BMM to generate action proposals. AEI is comprehensively evaluated on the ActivityNet-1.3 and THUMOS-14 datasets, on temporal action proposal and detection tasks, with two boundary matching architectures (i.e., CNN-based and GCN-based) and two classifiers (i.e., Unet and P-GCN). Our AEI shows significant improvement when human logical thinking is taken into account to extract spatio-temporal visual representations. Our AEI robustly outperforms SOTA methods with remarkable performance and generalization for both temporal action proposal generation and temporal action detection.
IEEE Access

ABN: Agent-Aware Boundary Networks for Temporal Action Proposal Generation

Khoa Vo, Kashu Yamazaki, Sang Truong, Minh-Triet Tran, Akihiro Sugimoto, and Ngan Le

IEEE Access Jan 2021

Temporal action proposal generation (TAPG) aims to estimate temporal intervals of actions in untrimmed videos, which is challenging yet plays an important role in many tasks of video analysis and understanding. Despite the great achievement in TAPG, most existing works ignore the human perception of interaction between agents and the surrounding environment by applying a deep learning model as a black box to the untrimmed videos to extract video visual representation. Therefore, capturing these interactions between agents and the environment is beneficial and can potentially improve the performance of TAPG. In this paper, we propose a novel framework named Agent-Aware Boundary Network (ABN), which consists of two sub-networks: (1) an Agent-Aware Representation Network to obtain both agent-agent and agents-environment relationships in the video representation; and (2) a Boundary Generation Network to estimate the confidence score of temporal intervals. In the Agent-Aware Representation Network, the interactions between agents are expressed through a local pathway, which operates at a local level to focus on the motions of agents, whereas the overall perception of the surroundings is expressed through a global pathway, which operates at a global level to perceive the effects of agents-environment. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks (i.e., C3D, SlowFast, and Two-Stream) show that our proposed ABN robustly outperforms state-of-the-art methods regardless of the employed backbone network on TAPG. We further examine the proposal quality by feeding proposals generated by our method into temporal action detection (TAD) frameworks and evaluating their detection performance.
ICASSP

Agent-Environment Network for Temporal Action Proposal Generation

Khoa Vo, Ngan Le, Kashu Yamazaki, Akihiro Sugimoto, and Minh-Triet Tran

In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing Jan 2021

Temporal action proposal generation is an essential and challenging task that aims at localizing temporal intervals containing human actions in untrimmed videos. Most existing approaches are unable to follow the human cognitive process of understanding the video context because they lack an attention mechanism to express the concept of an action, an agent who performs the action, or the interaction between the agent and the environment. Based on the action definition that a human, known as an agent, interacts with the environment and performs an action that affects the environment, we propose a contextual Agent-Environment Network (AEN). Our proposed contextual AEN involves (i) an agent pathway, operating at a local level to tell which humans/agents are acting, and (ii) an environment pathway, operating at a global level to tell how the agents interact with the environment. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e., C3D and SlowFast, show that our method robustly outperforms state-of-the-art methods regardless of the employed backbone network.
Diagnostics

Narrow Band Active Contour Attention Model for Medical Segmentation

Ngan Le, Toan Bui, Khoa Vo, Kashu Yamazaki, and Khoa Luu

Diagnostics Jan 2021

Medical image segmentation is one of the most challenging tasks in medical image analysis and is widely developed for many clinical applications. While deep learning-based approaches have achieved impressive performance in semantic segmentation, they are limited to pixel-wise settings with imbalanced-class data problems and weak boundary object segmentation in medical images. In this paper, we tackle those limitations by developing a new two-branch deep network architecture which takes both higher level features and lower level features into account. The first branch extracts higher level features as region information by a common encoder-decoder network structure such as Unet and FCN, whereas the second branch focuses on lower level features as support information around the boundary and runs in parallel with the first branch. Our key contribution is the second branch, named the Narrow Band Active Contour (NB-AC) attention model, which treats the object contour as a hyperplane and all data inside a narrow band as support information that influences the position and orientation of the hyperplane. Our proposed NB-AC attention model incorporates the contour length with the region energy involving a fixed-width band around the curve or surface. The proposed network loss contains two fitting terms: (i) a higher level feature (i.e., region) fitting term from the first branch; (ii) a lower level feature (i.e., contour) fitting term from the second branch, including the (ii1) length of the object contour and (ii2) regional energy functional formed by the homogeneity criterion of both the inner band and outer band neighboring the evolving curve or surface. The proposed NB-AC loss can be incorporated into both 2D and 3D deep network architectures. The proposed network has been evaluated on different challenging medical image datasets, including DRIVE, iSeg17, MRBrainS18 and Brats18. The experimental results have shown that the proposed NB-AC loss outperforms other mainstream loss functions (Cross Entropy, Dice, and Focal) on two common segmentation frameworks, Unet and FCN. Our 3D network, which is built upon the proposed NB-AC loss and the 3DUnet framework, achieved state-of-the-art results on multiple volumetric datasets.

2019

App. Sci.

A Smart System for Text-Lifelog Generation from Wearable Cameras in Smart Environment Using Concept-Augmented Image Captioning with Modified Beam Search Strategy

Khoa Vo, Quoc-An Luong, Duy-Tam Nguyen, Mai-Khiem Tran, and Minh-Triet Tran

Applied Sciences Jan 2019

During a lifetime, a person can have many wonderful and memorable moments that he/she wants to keep. With the development of technology, people can now store a massive amount of lifelog information via images, videos, or texts. Inspired by this, we develop a system to automatically generate captions from lifelog pictures taken by wearable cameras. Following up on our previous method introduced at the SoICT 2018 conference, we propose two improvements in our captioning method. We trained and tested the model on the baseline MSCOCO dataset and evaluated it on different metrics. The results show better performance compared to our previous model and to some other image captioning methods. Our system also shows effectiveness in retrieving relevant data from captions and achieves a high rank in the ImageCLEF 2018 retrieval challenge.
CVPRW

Vehicle Re-identification with Learned Representation and Spatial Verification and Abnormality Detection with Multi-Adaptive Vehicle Detectors for Traffic Video Analysis

Khac-Tuan Nguyen, Trung-Hieu Hoang, Minh-Triet Tran, Trung-Nghia Le, Ngoc-Minh Bui, Trong-Le Do, Khoa Vo, Quoc-An Luong, Mai-Khiem Tran, Thanh-An Nguyen, Thanh-Dat Truong, Vinh-Tiep Nguyen, and Minh N. Do

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops Jun 2019

2018

SoICT

Personal Diary Generation from Wearable Cameras with Concept Augmented Image Captioning and Wide Trail Strategy

Khoa Vo, Quoc-An Luong, Duy-Tam Nguyen, Mai-Khiem Tran, and Minh-Triet Tran

In Proceedings of the Ninth International Symposium on Information and Communication Technology Jun 2018

Writing a diary is not only a hobby but also provides a personal lifelog for better analysis and understanding of a user’s daily activities and events. However, in a busy society, people may not have enough time to record all their social interactions in a diary. This motivates our proposal to develop a ubiquitous system to automatically generate a daily text diary using our novel method for image captioning from photos taken periodically by wearable cameras. We propose to incorporate common visual concepts extracted from a photo to enhance the details of the image description. We also propose a wide trail beam search strategy to enhance the naturalness of the text caption. Our captioning method improves the results on the MSCOCO dataset on four metrics: BLEU, METEOR, ROUGE-L, and CIDEr. Compared to the method proposed by Xu et al. and Neuraltalk of Karpathy, our model has better performance on all four metrics. We also develop smart glasses and a prototype smart workplace in which people can have their personal diary generated from photos taken by smart glasses. Furthermore, we apply a transformer machine translation model to translate captions into Vietnamese. The results are promising and can be used by Vietnamese speakers.