Table of Links
Related Work
Methodology
3.1. Preliminaries and Notations
3.2. Relations between Attention-based VPG and MIL
3.3. MIVPG for Multiple Visual Inputs
3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios
Experiments and 4.1. General Setup
4.2. Scenario 1: Samples with Single Image
4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding
Supplementary Material
A. Detailed Architecture of QFormer
2.2. Multiple Instance Learning
Traditionally, Multiple Instance Learning (MIL) [6, 28] methods fall into two main categories: (1) The instance-level approach [5, 10, 14, 17], in which bag-level predictions are derived directly from the set of instance-level predictions. (2) The embedding-level approach [16, 20, 26, 34], in which bag-level predictions are generated from a bag-level embedding that represents multiple instances. The former often relies on hand-crafted pooling operators such as mean pooling or max pooling. In practical applications, however, these hand-crafted operators often yield limited performance. Hence, the majority of current research is grounded in the latter approach.
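To make the distinction concrete, the following is a minimal sketch, not taken from the paper, contrasting the two families. The function names, the toy linear classifier, and the use of mean/max pooling at the embedding level are illustrative assumptions.

```python
from statistics import fmean


def instance_level_predict(instance_scores, pool="max"):
    """Instance-level MIL: pool per-instance scores directly into a bag score."""
    if pool == "max":
        return max(instance_scores)
    if pool == "mean":
        return fmean(instance_scores)
    raise ValueError(f"unknown pooling operator: {pool}")


def embedding_level_pool(instance_features, pool="mean"):
    """Embedding-level MIL: pool instance feature vectors, dimension by
    dimension, into a single bag-level embedding."""
    columns = list(zip(*instance_features))  # one tuple per feature dimension
    if pool == "mean":
        return [fmean(col) for col in columns]
    if pool == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling operator: {pool}")


def linear_bag_classifier(bag_embedding, weights, bias=0.0):
    """Hypothetical bag-level classifier applied to the pooled embedding."""
    return sum(w * x for w, x in zip(weights, bag_embedding)) + bias


# Instance-level: the bag score comes straight from instance scores.
bag_score = instance_level_predict([0.1, 0.9, 0.2], pool="max")

# Embedding-level: pool features first, then classify the bag embedding.
bag_embedding = embedding_level_pool([[1.0, 2.0], [3.0, 6.0]], pool="mean")
bag_logit = linear_bag_classifier(bag_embedding, weights=[0.5, 0.25])
```

In the instance-level path the classifier runs before pooling, whereas in the embedding-level path it runs after pooling, which is where learned pooling operators can be substituted for the fixed mean or max.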
Aggregating instance features to form bag-level features typically leads to better outcomes but requires more complex pooling operations. Recent research has applied neural networks to the pooling process in MIL. For instance, MI-Net [40] utilizes a fully connected layer in MIL. Furthermore, AB-MIL [16] employs attention during the pooling process, allowing different instances to be weighted differently. Another category of methods [34] models the relationships between different instances using the self-attention mechanism. Moreover, DS-MIL [20] employs attention to consider not only instance-to-instance relationships but also instance-to-bag relationships, and DTFD-MIL [46] incorporates the Grad-CAM [33] mechanism into MIL. While these approaches concentrate on a single modality, the extension of MIL to multimodal applications is scarcely explored [39].
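As an illustration of attention-based pooling in the spirit of AB-MIL [16], the sketch below scores each instance with a small tanh network, normalizes the scores with a softmax, and returns the weighted sum as the bag embedding. The parameter names `V` and `w`, their shapes, and the hand-picked values are assumptions for illustration; in practice they would be learned.

```python
import math


def attention_pool(instances, V, w):
    """Attention pooling over a bag of instance feature vectors.

    instances: list of K feature vectors, each of length D
    V:         hypothetical projection matrix, L rows of length D
    w:         hypothetical scoring vector of length L
    Returns (bag_embedding, attention_weights).
    """
    # Unnormalized score per instance: w^T tanh(V h_k)
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * zi for wi, zi in zip(w, hidden)))

    # Numerically stable softmax over the K instances
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Bag embedding: attention-weighted sum of instance features
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights


# Toy bag with two 2-dimensional instances and untrained parameters.
bag, weights = attention_pool(
    instances=[[1.0, 0.0], [0.0, 1.0]],
    V=[[1.0, 0.0]],
    w=[5.0],
)
```

With a zero scoring vector the weights become uniform and the result reduces to mean pooling, which is one way to see attention pooling as a learned generalization of the hand-crafted operators.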
Authors:
(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);
(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);
(3) Qi Li, Amazon (qlimz@amazon.com);
(4) Rob Barton, Amazon (rab@amazon.com);
(5) Boxin Du, Amazon (boxin@amazon.com);
(6) Shioulin Sam, Amazon (shioulin@amazon.com);
(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);
(8) Ismail Tutar, Amazon (ismailt@amazon.com);
(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).
This paper is
