While discriminative visual element mining has been introduced before, in this paper we present an approach that requires minimal annotation in both training and test time. Given only a bounding box localization of the foreground objects, our approach automatically transforms the input images into a roughly-aligned pose space and discovers the most discriminative visual fragments for each category.
These fragments are then used to learn robust classifiers that discriminate between very similar categories under challenging conditions such as large variations in pose or habitats. The minimal required input, is a critical characteristic that enables our approach to generalize over visual domains where expert knowledge is not readily available.
Moreover, our approach takes advantage of deep networks that are targeted towards fine-grained classification.It learns mid-level representations that are specific to a category and generalize well across the category instances at the same time.
Our evaluations demonstrate that the automatically learned representation based on discriminative fragments, significantly outperforms globally extracted deep features in classification accuracy.
With the rapid proliferation of smartphones and tablet computers, search has moved beyond text to other modalities like images and voice. For many applications like Fashion, visual search offers a compelling interface that can capture stylistic visual elements beyond color and pattern that cannot be as easily described using text.
However, extracting and matching such attributes remains an extremely challenging task due to high variability and deformability of clothing items. In this paper, we propose a fine-grained learning model and multimedia retrieval framework to address this problem.
First, an attribute vocabulary is constructed using human annotations obtained on a novel fine-grained clothing dataset. This vocabulary is then used to train a fine-grained visual recognition system for clothing styles.
We report benchmark recognition and retrieval results on Women's Fashion Coat Dataset and illustrate potential mobile applications for attribute-based multimedia retrieval of clothing items and image annotation.