arXiv, November, 2014
Discovering visual knowledge from weakly labeled data is crucial to scaling up computer vision recognition systems, since it is expensive to obtain fully labeled data for a large number of concept categories. In this paper, we propose ConceptLearner, a scalable approach to discovering visual concepts from weakly labeled image collections. Thousands of visual concept detectors are learned automatically, without a human in the loop for additional annotation. We show that these learned detectors can be applied to recognize concepts accurately at the image level and to detect concepts at the image-region level. Under domain-specific supervision, we further evaluate the learned concepts for scene recognition on the SUN database and for object detection on Pascal VOC 2007. ConceptLearner shows promising performance compared to fully supervised and weakly supervised methods.
arXiv, October, 2014
Existing deep convolutional neural network (CNN) architectures are trained as N-way classifiers to distinguish between N output classes. This paper builds on the intuition that not all classes are equally difficult to distinguish from the true class label. To this end, we introduce hierarchical branching CNNs, called Hierarchical Deep CNNs (HD-CNNs), wherein classes that are easy to distinguish are classified by a higher-layer coarse-category CNN, while the most difficult classifications are handled by lower-layer fine-category CNNs. We propose utilizing a multinomial logistic loss and a novel temporal sparsity penalty for HD-CNN training. Together, they ensure that each branching component deals with a subset of categories that are confusable with each other. This new network architecture adopts a coarse-to-fine classification strategy and a modular design principle. The proposed model achieves superior performance over standard models, and we demonstrate state-of-the-art results on the CIFAR100 benchmark.
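The coarse-to-fine routing described in the abstract can be illustrated with a minimal NumPy sketch: a coarse classifier picks a branch, and that branch's fine classifier produces the final label. The linear classifiers, branch layout, and dimensions below are all hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def coarse_to_fine_predict(x, W_coarse, fine_classifiers, branch_classes):
    """Route a feature vector x through a coarse classifier, then through
    the selected branch's fine classifier (illustrative linear models)."""
    coarse_probs = softmax(W_coarse @ x)
    branch = int(np.argmax(coarse_probs))
    fine_probs = softmax(fine_classifiers[branch] @ x)
    # Map the branch-local prediction back to a global class label.
    return branch_classes[branch][int(np.argmax(fine_probs))]

rng = np.random.default_rng(0)
d = 8
W_coarse = rng.standard_normal((2, d))            # 2 coarse groups
fine_classifiers = [rng.standard_normal((3, d)),  # 3 fine classes per branch
                    rng.standard_normal((3, d))]
branch_classes = [[0, 1, 2], [3, 4, 5]]           # global labels per branch
label = coarse_to_fine_predict(rng.standard_normal(d),
                               W_coarse, fine_classifiers, branch_classes)
```

The design point the sketch captures is that each fine classifier only ever has to separate the small subset of mutually confusable classes assigned to its branch.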
Categories: Machine Learning, Vision
arXiv, November, 2014
In this work, we propose and address a new computer vision task, which we call fashion item detection, where the aim is to detect the various fashion items a person in an image is wearing or carrying. The types of fashion items we consider include hats, glasses, bags, pants, shoes, and so on. Detecting fashion items can be an important first step for various e-commerce applications in the fashion industry. Our method builds on a state-of-the-art object detection approach that combines object proposal methods with a deep convolutional neural network. Since the locations of fashion items are strongly correlated with the locations of body joints, we propose a hybrid discriminative-generative model that incorporates contextual information from body poses in order to improve detection performance. Through experiments, we demonstrate that our algorithm outperforms baseline methods by a large margin.
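One simple way to picture how pose context can sharpen detection, as the abstract suggests, is to reweight a detector's score by a spatial prior tied to a relevant body joint. This is only a sketch of the general idea, not the paper's hybrid model; the Gaussian prior and `sigma` value are assumptions.

```python
import numpy as np

def pose_reweighted_score(det_score, det_center, joint_pos, sigma=30.0):
    """Illustrative fusion: multiply a detector score by a Gaussian prior
    on the proposal's distance to a relevant body joint (e.g. a hat
    proposal should lie near the head joint). sigma is in pixels."""
    d2 = float(np.sum((np.asarray(det_center) - np.asarray(joint_pos)) ** 2))
    prior = np.exp(-d2 / (2.0 * sigma ** 2))
    return det_score * prior

near = pose_reweighted_score(0.9, (100, 50), (105, 55))   # near the joint
far = pose_reweighted_score(0.9, (100, 50), (400, 300))   # far from it
```

A confident detection far from the expected joint is suppressed, while one near it is left largely intact.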
Im2Fit: Fast 3D Model Fitting and Anthropometrics using Single Consumer Depth Camera and Synthetic Data
arXiv, October, 2014
Recent advances in consumer depth sensors have created many opportunities for human body measurement and modeling. Estimation of 3D body shape is particularly useful for fashion e-commerce applications such as virtual try-on or fit personalization. In this paper, we propose a method for capturing accurate human body shape and anthropometrics from a single consumer-grade depth sensor. We first generate a large dataset of synthetic 3D human body models using real-world body size distributions. Next, we estimate key body measurements from a single monocular depth image. We combine body measurement estimates with local geometry features around key joint positions to form a robust multi-dimensional feature vector. This allows us to conduct a fast nearest-neighbor search against every sample in the dataset and return the closest one. Compared to existing methods, our approach is able to predict accurate full-body parameters from a partial view using measurement parameters learned from the synthetic dataset. Furthermore, our system is capable of generating 3D human mesh models in real time, which is significantly faster than methods that attempt to model shape and pose deformations. To validate the efficiency and applicability of our system, we collected a dataset containing frontal and back scans of 83 clothed people with ground-truth height and weight. Experiments on this real-world dataset show that the proposed method achieves real-time performance with competitive results, attaining an average error of 1.9 cm in the estimated measurements.
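The nearest-neighbor step the abstract describes can be sketched in a few lines: given a feature vector of measurement estimates, find the closest synthetic body model. The feature layout, dataset size, and noise model below are illustrative assumptions only.

```python
import numpy as np

def nearest_body_model(query_feat, dataset_feats):
    """Return the index of the synthetic body model whose feature vector
    is closest (Euclidean) to the query's measurement/geometry features."""
    dists = np.linalg.norm(dataset_feats - query_feat, axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(1)
# Hypothetical 5-D measurement features (e.g. height, chest, waist, ...)
dataset = rng.uniform(140, 200, size=(1000, 5))
# A noisy partial-view estimate of one of the synthetic bodies
query = dataset[42] + rng.normal(0.0, 0.5, size=5)
idx = nearest_body_model(query, dataset)
```

A linear scan like this is already fast at this scale; a k-d tree or similar index would be the natural next step for larger synthetic datasets.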
Categories: Incubations, Vision
arXiv, November, 2014
Text is ubiquitous in the man-made world and easily attainable when it comes to book titles and author names. Using images from the book cover set of the Stanford Mobile Visual Search dataset, plus additional book covers and metadata from openlibrary.org, we construct a large-scale book cover retrieval dataset, complete with 100K distractor covers and title and author strings for each. Because our query images are poorly conditioned for clean text extraction, we propose a method for extracting noisy, error-prone OCR readings and matching them against clean author and book title strings in a standard document look-up setup. Finally, we demonstrate how to use this text matching as a feature in conjunction with popular retrieval features such as VLAD, using a simple learning setup to achieve significant improvements in retrieval accuracy over either VLAD or the text alone.
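Matching a noisy OCR reading against clean catalog strings is essentially fuzzy string lookup; a minimal sketch using the standard library's `difflib` gives the flavor (the toy catalog and the similarity measure are assumptions, not the paper's matcher).

```python
import difflib

def match_ocr_to_catalog(ocr_text, catalog):
    """Score a noisy OCR reading against clean 'title author' strings and
    return the best-matching catalog entry (illustrative fuzzy lookup)."""
    def score(entry):
        return difflib.SequenceMatcher(None, ocr_text.lower(),
                                       entry.lower()).ratio()
    return max(catalog, key=score)

catalog = ["The Great Gatsby F. Scott Fitzgerald",
           "Moby Dick Herman Melville",
           "Pride and Prejudice Jane Austen"]
# An OCR reading with character-level errors still matches correctly.
best = match_ocr_to_catalog("the grcat gatsbv f scott f1tzgerald", catalog)
```

In the retrieval system such a match score would then serve as one feature alongside visual features like VLAD, rather than as the sole ranking signal.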
SUNw: Scene Understanding Workshop, CVPR 2014
IEEE International Conference on Image Processing (ICIP), 2014
We present a new feature representation method for the scene text recognition problem, focusing particularly on improving scene character recognition. Many existing methods rely on Histogram of Oriented Gradients (HOG) or part-based models, which do not span the feature space well for characters in natural scene images, especially given the large variation in fonts and cluttered backgrounds. In this work, we propose a discriminative feature pooling method that automatically learns the most informative sub-regions of each scene character within a multi-class classification framework, where each sub-region seamlessly integrates a set of low-level image features through integral images. The proposed feature representation is compact, computationally efficient, and able to effectively model the distinctive spatial structure of each individual character class. Extensive experiments on challenging datasets (Chars74K, ICDAR’03, ICDAR’11, SVT) show that our method significantly outperforms existing methods on scene character classification and scene text recognition tasks.
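The integral-image trick that makes pooling over arbitrary sub-regions cheap is standard and worth a short sketch: after one cumulative-sum pass, any rectangle's feature sum costs four lookups. This shows the generic technique, not the paper's learned sub-region selection.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, padded with a zero row/column
    so any rectangle sum needs only four lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def region_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] in O(1) via the integral image."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
s = region_sum(ii, 1, 1, 3, 3)  # sum over img[1:3, 1:3]
```

Because each candidate sub-region is evaluated in constant time, searching over many sub-regions per character class stays computationally cheap.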
arXiv, June, 2014
Fashion, and especially apparel, is the fastest-growing category in online shopping. Because consumers require a sensory experience, especially for apparel goods whose appearance matters most, images play a key role not only in conveying crucial information that is hard to express in text, but also in shaping consumers' attitudes and emotions towards a product. However, research on e-commerce product images has mostly focused on quality at the perceptual level, not on the quality of content or the manner of presentation. This study addresses how effectively different image types showcase fashion apparel in terms of attractiveness, i.e. the ability to draw consumers' attention and interest and, in turn, their engagement. We apply advanced vision techniques to quantify attractiveness across the three display types common in the fashion field: human model, mannequin, and flat. We conduct a two-stage study, starting with large-scale behavioral data from a real online marketplace and then moving to a carefully designed user experiment to deepen our understanding of the reasoning behind consumers' actions. We propose a user choice model based on the Fisher noncentral hypergeometric distribution to quantitatively evaluate user preference. Further, we investigate the potential of leveraging visual impact for better search that caters to user preference. We propose a visual-attractiveness-based re-ranking model that incorporates both presentation efficacy and user preference, and show quantitative improvement from promoting visual attractiveness in search on top of relevance.
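The re-ranking idea mentioned at the end of the abstract can be pictured as blending a relevance score with an attractiveness score. The linear blend, the weight `alpha`, and the example scores below are all hypothetical; the paper's actual model is more involved.

```python
def rerank_by_attractiveness(items, alpha=0.3):
    """Illustrative re-ranking: blend a relevance score with a visual
    attractiveness score (both assumed in [0, 1]). alpha controls how
    much attractiveness is promoted on top of relevance."""
    return sorted(items,
                  key=lambda it: (1 - alpha) * it["relevance"]
                                 + alpha * it["attractiveness"],
                  reverse=True)

# Toy listings using the three display types discussed in the study.
items = [{"id": "flat",      "relevance": 0.80, "attractiveness": 0.20},
         {"id": "mannequin", "relevance": 0.78, "attractiveness": 0.50},
         {"id": "model",     "relevance": 0.75, "attractiveness": 0.90}]
ranked = rerank_by_attractiveness(items)
```

With a nonzero `alpha`, a slightly less relevant listing with a far more attractive image can move ahead, which is exactly the trade-off the study quantifies.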
Categories: Human Computer Interaction, Vision
ICML 2014 workshop on New Learning Models and Frameworks for BigData
We present a novel compact image descriptor for large-scale image search. Our proposed descriptor, Geometric VLAD (gVLAD), is an extension of VLAD (Vector of Locally Aggregated Descriptors) that incorporates weak geometric information into the VLAD framework. The proposed geometry cues are derived as a membership function over keypoint angles, which carry evident and informative cues yet are often discarded. A principled technique for learning the membership function by clustering angles is also presented. Further, to address the overhead of iterative codebook training over real-time datasets, a novel codebook adaptation strategy is outlined. Finally, we demonstrate the efficacy of the proposed gVLAD-based retrieval framework, achieving more than a 15% improvement in mAP over existing benchmarks.
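The core aggregation idea, keeping residuals to each visual word separated by keypoint orientation, can be sketched with hard angular bins standing in for the paper's learned membership function. Codebook size, descriptor dimension, and bin count below are arbitrary choices for illustration.

```python
import numpy as np

def gvlad_sketch(descs, angles, centroids, n_angle_bins=4):
    """Toy gVLAD-style aggregation: residuals to the nearest visual word
    are accumulated separately per keypoint-angle bin, then the whole
    vector is L2-normalized. (Hard binning replaces the paper's learned
    membership function over angles.)"""
    k, d = centroids.shape
    out = np.zeros((k, n_angle_bins, d))
    bins = (np.asarray(angles) % (2 * np.pi)) // (2 * np.pi / n_angle_bins)
    for x, b in zip(descs, bins.astype(int)):
        w = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        out[w, b] += x - centroids[w]  # residual to the assigned word
    v = out.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(2)
descs = rng.standard_normal((50, 16))      # 50 local descriptors
angles = rng.uniform(0, 2 * np.pi, 50)     # keypoint orientations
centroids = rng.standard_normal((8, 16))   # 8-word codebook
v = gvlad_sketch(descs, angles, centroids)
```

Relative to plain VLAD, the descriptor grows by a factor equal to the number of angle bins, which is the price paid for retaining the weak geometric signal.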