Samsung Research at ECCV 2018

Samsung Research presented papers, including one on a state-of-the-art image retrieval method, at the European Conference on Computer Vision 2018 (ECCV 2018). ECCV is a biennial research conference considered one of the most important in computer vision. This year, it was held in Munich, Germany, from September 8 to 14. Of the 2,439 submissions the conference committee received, 776 papers were accepted for presentation, an acceptance rate of 31.8%.

Accepted Papers

"Attention-Based Ensemble for Deep Metric Learning"

Wonsik Kim, Bhavya Goyal, Kunal Chawla, Jungmin Lee, and Keunjoo Kwon

This paper proposes a machine-learning algorithm for an ensemble. An ensemble trains multiple learners and aggregates their predictions to achieve higher accuracy. The paper introduces an attention mechanism designed so that each learner focuses on a different part of the object in input images (e.g., the roof, windows, or headlights in the case of cars).
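As a rough illustration of the idea (this is not the authors' implementation; all names, shapes, and weights below are made-up assumptions), each learner can be given its own spatial attention over a shared feature map, so different learners pool features from different parts of the object before their embeddings are combined:

```python
# Toy sketch of an attention-based ensemble for embeddings.
# Illustrative only: shapes, weights, and function names are assumptions,
# not the paper's architecture.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_ensemble(feature_map, attn_weights, proj_weights):
    """feature_map: (H*W, C) spatial features from a shared backbone.
    Each learner m has its own attention and projection weights, so it
    can attend to a different spatial part of the object."""
    embeddings = []
    for W_a, W_p in zip(attn_weights, proj_weights):
        scores = softmax(feature_map @ W_a, axis=0)    # (H*W, 1) spatial attention
        attended = (feature_map * scores).sum(axis=0)  # (C,) attention-pooled features
        embeddings.append(attended @ W_p)              # (D,) one learner's embedding
    return np.concatenate(embeddings)                  # concatenated ensemble embedding

C, D, M = 8, 4, 3                                      # channels, embed dim, learners
feature_map = rng.normal(size=(49, C))                 # e.g. a flattened 7x7 feature map
attn = [rng.normal(size=(C, 1)) for _ in range(M)]
proj = [rng.normal(size=(C, D)) for _ in range(M)]
emb = attention_ensemble(feature_map, attn, proj)
print(emb.shape)                                       # M learners x D dims each
```

Because each learner sees a differently attended view of the same features, the learners tend to diversify, which is what makes aggregating them worthwhile.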

The proposed method is applied to the image retrieval task: given a query image, the aim is to retrieve from a database the images containing the same object as the query. For example, given an image of a car of a specific brand, it retrieves images of cars of the same brand. The paper shows state-of-the-art performance on several datasets, covering car brands, bird species, clothes, and online products.
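Once embeddings are learned, retrieval itself is typically a nearest-neighbor search in the embedding space. A minimal sketch, using made-up toy vectors rather than outputs of the paper's model:

```python
# Minimal sketch of embedding-based image retrieval: rank database images
# by cosine similarity to the query embedding. The 2-D embeddings here are
# toy values for illustration, not real model outputs.
import numpy as np

def retrieve(query_emb, db_embs, k=2):
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every database image
    return np.argsort(-sims)[:k]       # indices of the k most similar images

db = np.array([[1.0, 0.0],             # e.g. two images of brand A ...
               [0.9, 0.1],
               [0.0, 1.0]])            # ... and one image of brand B
query = np.array([1.0, 0.05])          # a query resembling brand A
print(retrieve(query, db))             # -> [0 1]: the two brand-A images rank first
```

A good metric-learning model makes same-object images cluster in this space, so the top-ranked neighbors share the query's object.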

"Pivot Correlational Neural Network for Multimodal Video Categorization"

Sunghun Kang, Junyeong Kim, Hyunsoo Choi, Sungjin Kim and Chang D. Yoo


This paper proposes an architecture for multimodal video categorization referred to as the Pivot Correlational Neural Network (Pivot CorrNN). The architecture is trained to maximize the correlation between the hidden states, as well as the predictions, of the modal-agnostic pivot stream and the modal-specific streams in the network. Here, the modal-agnostic pivot hidden state considers all modal inputs without distinction, while a modal-specific hidden state is dedicated exclusively to one modal input. The Pivot CorrNN consists of three modules: (1) a maximizing pivot-correlation module, which attempts to maximally correlate the modal-agnostic and modal-specific hidden states as well as their predictions; (2) a contextual Gated Recurrent Unit (cGRU) module, which extends a generic GRU to take multimodal inputs when updating the pivot hidden state; and (3) an adaptive aggregation module, which aggregates all modal-specific predictions, together with the modal-agnostic pivot prediction, into one final prediction.

The authors evaluate Pivot CorrNN on two publicly available large-scale multimodal video categorization datasets, FCVID and YouTube-8M. In their experiments, Pivot CorrNN achieves the best performance on FCVID and performance comparable to the state of the art on YouTube-8M.
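Two of these ideas can be sketched in a few lines. The sketch below is an illustration, not the authors' implementation: it assumes (hypothetically) a Pearson-style correlation objective for the pivot-correlation module and a softmax-gated mixture for adaptive aggregation, and all hidden states and predictions are made-up toy values:

```python
# Illustrative sketch of two Pivot CorrNN ideas (not the authors' code):
# (1) a pivot-correlation loss rewarding correlation between the modal-agnostic
#     pivot hidden state and each modal-specific hidden state, and
# (2) adaptive aggregation mixing per-stream predictions with gated weights.
import numpy as np

def pearson_corr(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pivot_correlation_loss(pivot_state, modal_states):
    # Minimizing this loss maximizes correlation with every modal stream.
    return -sum(pearson_corr(pivot_state, m) for m in modal_states)

def adaptive_aggregate(predictions, gate_logits):
    # predictions: per-stream class-score vectors (pivot + each modality);
    # gate_logits: one scalar per stream, softmax-normalized into weights.
    w = np.exp(gate_logits - np.max(gate_logits))
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, predictions))

pivot = np.array([0.2, 0.5, 0.1, 0.9])            # toy pivot hidden state
video_state = np.array([0.25, 0.45, 0.15, 0.8])   # toy modal-specific states
audio_state = np.array([0.1, 0.6, 0.0, 1.0])
loss = pivot_correlation_loss(pivot, [video_state, audio_state])
# Equal gate logits -> equal mixing of the three streams' predictions.
final = adaptive_aggregate(
    [np.array([0.7, 0.3]), np.array([0.6, 0.4]), np.array([0.2, 0.8])],
    np.array([0.0, 0.0, 0.0]),
)
print(loss < 0, final)                             # high correlation -> negative loss
```

In training, the gate logits would themselves be produced by the network, letting it learn per-input how much to trust each modality versus the pivot stream.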