As the amount of user-generated content on the internet grows, it becomes ever more important to come up with vision systems that learn directly from weakly annotated and noisy data. We leverage a large-scale collection of user-generated content comprising images, tags and titles/captions of furniture inventory from an e-commerce website to discover and categorize learnable visual attributes. Furniture categories have long been the quintessential example of why computer vision is hard, and we make one of the first attempts to understand them through a large-scale weakly annotated dataset. We focus on a handful of furniture categories that are associated with a large number of fine-grained attributes. We propose a set of localized feature representations built on top of state-of-the-art computer vision representations originally designed for fine-grained object categorization. We report a thorough empirical characterization of the visual identifiability of various fine-grained attributes using these representations and show encouraging results on finding iconic images and on multi-attribute prediction.
In image classification, visual separability between different object categories is highly uneven, and some categories are more difficult to distinguish than others. Such difficult categories demand more dedicated classifiers. However, existing deep convolutional neural networks (CNN) are trained as flat N-way classifiers, and few efforts have been made to leverage the hierarchical structure of categories.
In this paper, we introduce hierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a category hierarchy. An HD-CNN separates easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers. During HD-CNN training, component-wise pretraining is followed by global finetuning with a multinomial logistic loss regularized by a coarse category consistency term.
In addition, conditional execution of fine category classifiers and layer parameter compression make HD-CNNs scalable for large-scale visual recognition. We achieve state-of-the-art results on both CIFAR100 and large-scale ImageNet 1000-class benchmark datasets. In our experiments, we build three different HD-CNNs, which lower the top-1 error of the standard CNNs by 2.65%, 3.1% and 1.1%, respectively.
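The interplay between the coarse category classifier and the fine category classifiers can be illustrated with a minimal sketch. The abstract does not state how the two levels are combined, so the weighting rule below (each fine classifier's output weighted by its coarse probability) is an assumption, as are the function name `combine_predictions` and the `threshold` parameter used to mimic conditional execution.

```python
from typing import List


def combine_predictions(coarse_probs: List[float],
                        fine_probs: List[List[float]],
                        threshold: float = 0.0) -> List[float]:
    """Weight each fine classifier's class distribution by its coarse probability.

    coarse_probs[k]  : probability the coarse classifier assigns to coarse category k.
    fine_probs[k][j] : probability fine classifier k assigns to fine class j.
    threshold        : hypothetical cut-off; fine classifiers whose coarse
                       probability falls below it are skipped, and the
                       remaining weights are renormalized.
    """
    # Assumption: conditional execution means only these classifiers are run.
    active = [k for k, w in enumerate(coarse_probs) if w >= threshold]
    if not active:
        raise ValueError("threshold excludes every fine classifier")

    total = sum(coarse_probs[k] for k in active)
    n_classes = len(fine_probs[0])
    combined = [0.0] * n_classes
    for k in active:
        weight = coarse_probs[k] / total
        for j in range(n_classes):
            combined[j] += weight * fine_probs[k][j]
    return combined


if __name__ == "__main__":
    coarse = [0.75, 0.25]
    fine = [[0.5, 0.5, 0.0, 0.0],
            [0.0, 0.0, 0.5, 0.5]]
    print(combine_predictions(coarse, fine))                 # all fine classifiers
    print(combine_predictions(coarse, fine, threshold=0.5))  # only the first one
```

With a threshold of zero every fine classifier contributes; raising it skips fine classifiers for unlikely coarse categories, which is where any saving in computation would come from under this assumption.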
Text is ubiquitous in the artificial world and easily attainable when it comes to book titles and author names. Using the images from the book cover set from the Stanford Mobile Visual Search dataset and additional book covers and metadata from openlibrary.org, we construct a large-scale book cover retrieval dataset, complete with 100K distractor covers and title and author strings for each.
Because our query images are poorly conditioned for clean text extraction, we propose a method for extracting noisy and erroneous OCR readings and matching them against clean author and book title strings in a standard document look-up problem setup.
Finally, we demonstrate how to use this text-matching as a feature in conjunction with popular retrieval features such as VLAD using a simple learning setup to achieve significant improvements in retrieval accuracy over that of either VLAD or the text alone.
Discovering visual knowledge from weakly labeled data is crucial to scaling up computer vision recognition systems, since it is expensive to obtain fully labeled data for a large number of concept categories, while weakly labeled data can be collected from the Internet cheaply and in large quantities.
In this paper we propose a scalable approach to discovering visual concepts from weakly labeled image collections, with thousands of visual concept detectors learned. We then show that the learned detectors can be applied to recognize concepts at the image level and to detect concepts at the image-region level accurately.
Under domain-selected supervision, we further evaluate the learned concepts for scene recognition on the SUN database and for object detection on Pascal VOC 2007. The approach shows promising performance compared to fully supervised and weakly supervised methods.
We describe a completely automated large scale visual recommendation system for fashion. Our focus is to efficiently harness the availability of large quantities of online fashion images and their rich meta-data.
Specifically, we propose two classes of data-driven models, Deterministic Fashion Recommenders (DFR) and Stochastic Fashion Recommenders (SFR), for solving this problem. We analyze the relative merits and pitfalls of these algorithms through extensive experimentation on a large-scale data set and baseline them against existing ideas from color science.
We also illustrate key fashion insights learned through these experiments and show how they can be employed to design better recommendation systems.
The industrial applicability of the proposed models is in the context of mobile fashion shopping. Finally, we also outline a large-scale annotated data set of fashion images (Fashion-136K) that can be exploited for future research in data-driven visual fashion.
In online peer-to-peer marketplaces, where physical examination of the goods is infeasible, textual descriptions, product images, and the reputation of the participants play key roles. Images are a powerful channel for conveying crucial information to e-shoppers and influencing their choices.
In this paper, we investigate a well-known online marketplace where millions of products change hands and most are described with the help of one or more images. We present a systematic data mining and knowledge discovery approach that aims to quantitatively dissect the role of images in e-commerce in great detail. Our goal is two-fold.
First, we aim to get a thorough understanding of the impact of images across various dimensions: product categories, user segments, and conversion rate. We present a quantitative evaluation of the influence of images and show how to leverage different image aspects, such as quantity and quality, to effectively raise sales. Second, we study the interaction of image data with other selling dimensions by jointly modeling them with user behavior data.
Results suggest that "watch" behavior encodes complex signals combining both attention and hesitation from the buyer, in which images still play an important role compared to other selling variables, especially for products for which appearance is important. We conclude with how these findings can benefit sellers in a highly competitive online e-commerce market.
Recent advances in consumer depth sensors have created many opportunities for human body measurement and modeling. Estimation of 3D body shape is particularly useful for fashion e-commerce applications such as virtual try-on or fit personalization.
In this paper, we propose a method for capturing accurate human body shape and anthropometrics from a single consumer grade depth sensor. We first generate a large dataset of synthetic 3D human body models using real-world body size distributions.
Next, we estimate key body measurements from a single monocular depth image. We combine body measurement estimates with local geometry features around key joint positions to form a robust multi-dimensional feature vector.
This allows us to conduct a fast nearest-neighbor search over every sample in the dataset and return the closest one. Compared to existing methods, our approach is able to predict accurate full body parameters from a partial view using measurement parameters learned from the synthetic dataset.
Furthermore, our system is capable of generating 3D human mesh models in real-time, which is significantly faster than methods which attempt to model shape and pose deformations.
To validate the efficiency and applicability of our system, we collected a dataset that contains frontal and back scans of 83 clothed people with ground truth height and weight. Experiments on a real-world dataset show that the proposed method achieves real-time performance with competitive results, with an average error of 1.9 cm in estimated measurements.
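The nearest-neighbor lookup over feature vectors described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Euclidean metric, the exhaustive linear scan, the function name `nearest_body_model`, and the three-component feature vectors in the usage example are all assumptions made for the sake of the example.

```python
import math
from typing import Sequence, Tuple


def nearest_body_model(query: Sequence[float],
                       dataset: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, distance) of the dataset entry closest to `query`.

    Each entry stands in for the feature vector of one synthetic body model
    (measurement estimates plus local geometry features in the abstract).
    Assumption: distances are Euclidean and every sample is compared in turn.
    """
    if not dataset:
        raise ValueError("dataset must contain at least one feature vector")

    best_index, best_distance = -1, math.inf
    for index, features in enumerate(dataset):
        distance = math.dist(query, features)  # Euclidean distance
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


if __name__ == "__main__":
    # Hypothetical feature vectors, e.g. three body measurements in cm.
    synthetic_models = [
        [170.0, 90.0, 80.0],
        [180.0, 100.0, 90.0],
        [160.0, 85.0, 70.0],
    ]
    estimate = [178.0, 99.0, 88.0]
    print(nearest_body_model(estimate, synthetic_models))
```

A linear scan is the simplest choice; for a large synthetic dataset a spatial index or an approximate nearest-neighbor structure would be the usual replacement, though the abstract does not say which is used.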
Fashion, and especially apparel, is the fastest-growing category in online shopping. As consumers require a sensory experience, especially for apparel goods for which appearance matters most, images play a key role not only in conveying crucial information that is hard to express in text, but also in affecting consumers' attitudes and emotions towards the product.
However, research on e-commerce product images has mostly focused on quality at the perceptual level, not on the quality of the content or the way it is presented. This study examines how effective different types of images are at showcasing fashion apparel in terms of attractiveness, i.e. the ability to draw consumers' attention and interest, and in turn their engagement.
We apply advanced vision techniques to quantify attractiveness for three common display types in the fashion field, i.e. human model, mannequin, and flat. We perform a two-stage study, starting with large-scale behavior data from a real online market and then moving to a well-designed user experiment to deepen our understanding of the reasoning behind consumers' actions.
We propose a user choice model based on Fisher's noncentral hypergeometric distribution to quantitatively evaluate users' preferences. Further, we investigate the potential of leveraging visual impact for a better search that caters to users' preferences. A visual attractiveness based re-ranking model that incorporates both presentation efficacy and user preference is proposed. We show quantitative improvement by promoting visual attractiveness into search on top of relevance.
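For readers unfamiliar with Fisher's noncentral hypergeometric distribution, the sketch below computes its probability mass function in the two-group (univariate) case. It is an illustration under stated assumptions, not the paper's model: the abstract covers three display types, which would call for the multivariate form, and the mapping of the parameters to display types and user picks, as well as the function name `fisher_nchg_pmf`, is hypothetical.

```python
from math import comb


def fisher_nchg_pmf(x: int, m1: int, m2: int, n: int, omega: float) -> float:
    """P(X = x) under Fisher's noncentral hypergeometric distribution.

    Hypothetical reading for a choice model:
      m1    : number of displayed items of one type (e.g. human model)
      m2    : number of displayed items of the other type
      n     : number of items the user picks in total
      x     : how many of the picks are of the first type
      omega : odds ratio; omega > 1 means the first type is preferred,
              omega == 1 recovers the ordinary hypergeometric distribution.
    """
    lowest, highest = max(0, n - m2), min(n, m1)
    if not lowest <= x <= highest:
        return 0.0

    def weight(y: int) -> float:
        # Unnormalized mass: ways to pick y of type one and n - y of type two,
        # tilted by the odds ratio raised to the number of type-one picks.
        return comb(m1, y) * comb(m2, n - y) * omega ** y

    normalizer = sum(weight(y) for y in range(lowest, highest + 1))
    return weight(x) / normalizer


if __name__ == "__main__":
    for odds in (1.0, 2.0):
        masses = [fisher_nchg_pmf(x, m1=3, m2=2, n=2, omega=odds) for x in range(3)]
        print(odds, masses)
```

In a model of this kind the odds ratio is the quantity of interest: fitting it to observed picks gives a single number that summarizes how strongly one display type is favored over the other.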