Machine Learning and Data Science

On the forefront of practical machine learning research.

The Machine Learning and Data Science team builds a variety of descriptive and predictive models from the massive volumes of behavioral and descriptive data generated by eBay's many buyers, sellers, and products. The data at eBay is among the largest, most challenging, and most diverse in the world, with applications requiring the fusion of massive amounts of behavioral log, text, and image data, putting us on the forefront of practical machine learning research.

We work closely with our partners in product groups, with a special focus on developing data-driven models that improve the user experience, including product search. We are a collection of machine learning, systems engineering, and domain experts who research and develop cutting-edge methods that scale to large e-commerce data involving hundreds of millions of active users and billions of transactions, and that leverage a variety of modern computing platforms, including both distributed (Hadoop) and massive-core (GPU) computing. In addition to routinely targeting and shipping our work to improve the eBay site, our researchers actively engage external research communities through service, publication, and talks at top research conferences, including SIGIR, WWW, KDD, ICML, and NIPS.

Publications
Washington, DC, 27-30 Oct. 2014

Astro: A Predictive Model for Anomaly Detection and Feedback-based Scheduling on Hadoop

Chaitali Gupta, Mayank Bansal, Tzu-Cheng Chuang, Ranjan Sinha, Sami Ben-Romdhane

The sheer growth in data volume and Hadoop cluster size makes it a significant challenge to diagnose and locate problems in a production-level cluster environment efficiently and within a short period of time. Oftentimes, distributed monitoring systems are not capable of detecting a problem well in advance, when a large-scale Hadoop cluster starts to deteriorate in performance or becomes unavailable. Thus, incoming workloads, scheduled between the time when the cluster starts to deteriorate and the time when the problem is identified, suffer from longer execution times. As a result, both the reliability and the throughput of the cluster decrease significantly. In this paper, we address this problem by proposing a system called Astro, which consists of a predictive model and an extension to the Hadoop scheduler. The predictive model in Astro takes into account a rich set of cluster behavioral information that is collected by monitoring processes, and models it using machine learning algorithms to predict the future behavior of the cluster. The Astro predictive model detects anomalies in the cluster and also identifies a ranked set of metrics that have contributed the most to the problem. The Astro scheduler uses the prediction outcome and the list of metrics to decide whether it needs to move workloads away from the problematic cluster nodes or prevent additional workload allocations to them, in order to improve both the throughput and the reliability of the cluster. The results demonstrate that the Astro scheduler improves the usage of cluster compute resources by 64.23% compared to traditional Hadoop. Furthermore, the runtime of the benchmark application was reduced by 26.68% during the time of anomaly, thus improving cluster throughput.
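The abstract leaves the model internals to the paper. Purely as a toy illustration of anomaly scoring with ranked metric contributions (a z-score baseline, not the Astro model; the metric names are hypothetical):

```python
import numpy as np

def fit_baseline(history):
    """Per-metric mean/std estimated from healthy-period monitoring
    data (rows = time steps, columns = cluster metrics)."""
    return history.mean(axis=0), history.std(axis=0) + 1e-9

def anomaly_report(sample, mean, std, names, threshold=3.0):
    """Flag an anomalous sample and rank the metrics contributing
    most, echoing the ranked-metrics output described above."""
    z = np.abs((sample - mean) / std)                  # per-metric deviation
    ranked = sorted(zip(names, z), key=lambda t: -t[1])
    return bool(z.max() > threshold), ranked

# Hypothetical metric names; a real deployment would use Hadoop counters.
names = ["cpu_wait", "disk_io", "gc_time", "net_drops"]
hist = np.random.default_rng(0).normal(1.0, 0.1, size=(500, 4))
mean, std = fit_baseline(hist)
print(anomaly_report(np.array([1.0, 1.9, 1.1, 1.0]), mean, std, names))
```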

Santa Clara, Oct. 29 - Nov. 1, 2015

Eagle: User Profile-based Anomaly Detection for Securing Hadoop Clusters

Chaitali Gupta, Ranjan Sinha, Yong Zhang

Existing big data analytics platforms, such as Hadoop, lack support for user activity monitoring. Several diagnostic tools, such as Ganglia, Ambari, and Cloudera Manager, are available to monitor the health of a cluster; however, they do not provide algorithms to detect security threats or perform user activity monitoring. Hence, there is a need for a scalable system that can detect malicious user activities, especially in real time, so that appropriate actions can be taken against the user. At eBay, we developed such a system, named Eagle, which collects audit logs from Hadoop clusters and the applications running on them, analyzes user behavior, generates a profile per user of the system, and predicts anomalous user activities based on their prior profiles. Eagle is a highly scalable system, capable of monitoring multiple eBay clusters in real time. It includes machine-learning algorithms that create user profiles based on each user's history of activities. As far as we know, this is the first activity monitoring system in the Hadoop ecosystem to detect intrusion-related activities using behavior-based user profiles. When a user performs any operation in the cluster, Eagle matches the current action against the user's prior activity pattern and raises an alarm if it suspects an anomalous action. We investigate two machine-learning algorithms: density estimation and principal component analysis (PCA). In this paper, we introduce the Eagle system, discuss the algorithms in detail, and show performance results. We demonstrate that the sensitivity of the density estimation algorithm is 93%, but that the sensitivity of our system increases by 4.94% (on average) to approximately 98% by using an ensemble of the two algorithms during anomaly detection.
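As background on the density-estimation idea, a minimal per-user sketch with scipy's KDE and synthetic activity features (not eBay's Eagle implementation):

```python
import numpy as np
from scipy.stats import gaussian_kde

class UserProfile:
    """Toy per-user density profile in the spirit of Eagle's
    density-estimation detector (not eBay's implementation)."""
    def __init__(self, past_actions):
        # past_actions: (n_features, n_samples) historical activity
        # features, e.g. bytes read, files touched, hour of day.
        self.kde = gaussian_kde(past_actions)

    def is_anomalous(self, action, quantile=0.01):
        # Flag an action whose density falls below the 1% quantile
        # of densities over the user's own history.
        cutoff = np.quantile(self.kde(self.kde.dataset), quantile)
        return self.kde(action)[0] < cutoff

rng = np.random.default_rng(1)
profile = UserProfile(rng.normal(size=(3, 400)))               # synthetic history
print(profile.is_anomalous(np.array([[8.0], [8.0], [8.0]])))   # -> True
```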

Proceedings of NAACL-HLT 2015, pages 160-167, Denver, Colorado, May 31 - June 5, 2015. © 2015 Association for Computational Linguistics

Distributed Word Representations Improve NER for e-Commerce

This paper presents a case study of using distributed word representations, word2vec in particular, to improve the performance of Named Entity Recognition in the e-commerce domain. We also demonstrate that distributed word representations trained on a smaller amount of in-domain data are more effective than word vectors trained on a very large amount of out-of-domain data, and that their combination gives the best results.
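A minimal sketch of the in-domain training step, assuming gensim and a toy corpus of tokenized item titles (the paper's corpus, hyperparameters, and NER feature set are not specified here):

```python
from gensim.models import Word2Vec

# Hypothetical in-domain corpus: tokenized e-commerce item titles.
titles = [
    ["apple", "iphone", "6", "16gb", "unlocked"],
    ["nike", "air", "max", "mens", "running", "shoes"],
    ["vintage", "levis", "501", "denim", "jeans"],
]

# Small in-domain vectors of the kind the paper finds more effective
# than much larger out-of-domain ones (gensim >= 4 API).
model = Word2Vec(titles, vector_size=50, window=3, min_count=1, epochs=20)

def token_features(token):
    """Dense features to append to a token's NER feature vector."""
    return model.wv[token] if token in model.wv else None

print(token_features("iphone")[:5])
```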

40th International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015

Switching to and Combining Offline-Adapted Cluster Acoustic Models based on Unsupervised Segment Classification

The performance of an automatic speech recognition system degrades significantly when the incoming audio differs from the training data. Maximum likelihood linear regression has been widely used for unsupervised adaptation, usually in a multiple-pass recognition process. Here we present a novel adaptation framework in which offline, supervised, high-quality adaptation is applied to clustered channel/speaker conditions that are defined by automatic and manual clustering of the training data. During online recognition, each speech segment is classified into one of the training clusters in an unsupervised way, and the corresponding top acoustic models are used for recognition. Recognition lattice outputs are then combined. Experiments are performed on the Wall Street Journal data, and a 37.5% relative reduction in word error rate is reported. The proposed approach is also compared with a general speaker adaptive training approach.
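As a rough illustration of the unsupervised segment-classification step, one GMM per training cluster can route a test segment to its best-matching offline-adapted models (synthetic features; not the paper's classifier):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# One GMM per training cluster (e.g. channel/speaker condition); the
# synthetic 13-dim features stand in for real acoustic features.
clusters = {
    c: GaussianMixture(n_components=4, random_state=0).fit(
        rng.normal(loc=c, size=(200, 13)))
    for c in range(3)
}

def pick_cluster(segment_feats):
    """Route a test segment to the best-matching training cluster;
    that cluster's offline-adapted acoustic models would then be
    used for recognition, as described above."""
    return max(clusters, key=lambda c: clusters[c].score(segment_feats))

print(pick_cluster(rng.normal(loc=1, size=(50, 13))))   # likely 1
```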

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Correcting Keyboard Layout Errors and Homoglyphs in Queries

Keyboard layout errors and homoglyphs in cross-language queries impact our ability to correctly interpret user information needs and offer relevant results. We present a machine learning approach to correcting these errors, based largely on character-level n-gram features. We demonstrate superior performance over rule-based methods, as well as a significant reduction in the number of queries that yield null search results.
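A minimal sketch of the character n-gram idea, assuming scikit-learn and toy labeled queries (the Cyrillic-layout examples and the linear model are illustrative, not the paper's setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labels: 1 = query typed on the wrong keyboard layout (here,
# English queries typed with a Cyrillic mapping); data is illustrative.
queries = ["ipod nano", "samsung galaxy", "ыфьыгтп пфдфчн", "шзщв тфтщ"]
labels  = [0, 0, 1, 1]

# Character n-gram features feeding a simple linear model; the paper's
# exact feature set and classifier may differ.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(),
)
clf.fit(queries, labels)
print(clf.predict(["фззду шзфв"]))   # "apple ipad" mistyped -> [1]
```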

CHIMoney (Workshop at CHI-2014)

Shopping with Bonus Money: eBay, loyalty schemes and consumer spending

Darrell Hoy, Elizabeth Churchill, Atish Das Sarma, Kamal Jain

In ECIR 2014 (To Appear)

A Study of Query Term Deletion using Large-scale E-commerce Search Logs

Bishan Yang, Nish Parikh, Gyanit Singh, Neel Sundaresan

Query term deletion is a commonly used strategy for query rewriting. In this paper, we study the problem of query term deletion using large-scale e-commerce search logs. In particular, we focus on queries that do not lead to user clicks and aim to predict a reduced and better query that can lead to clicks through term deletion. Accurate prediction of term deletion can potentially help users recover from poor search results and improve the shopping experience.

To achieve this, we use various term-dependent and query-dependent measures as features and build a classifier to predict which term is most likely to be deleted from a given query. Unlike previous work on query term deletion, we compute the features not only based on the query history and the available document collection, but also conditioned on the query category, which captures the high-level context of the query.

We validate our approach using a large collection of query session logs from a leading e-commerce site, and show that it provides promising performance in term deletion prediction, significantly outperforming baselines that rely on query history and corpus-based statistics without incorporating query context information.
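As a toy illustration of the per-term classification setup, with an assumed feature set (term frequency, category-conditioned frequency, IDF, position) and synthetic labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative per-term features for a null query; the feature set,
# data, and model here are assumptions, not the paper's exact setup.
X = np.array([
    # q_freq  cat_freq  idf  position
    [0.90,    0.85,    1.2,  0.0],   # "nike"
    [0.40,    0.55,    2.0,  1.0],   # "running"
    [0.02,    0.01,    6.5,  2.0],   # "xl2017b": rare, likely deleted
])
y = np.array([0, 0, 1])              # 1 = term was deleted in the log

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])    # per-term deletion likelihood
```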

arXiv, May 2014

Enhancing Visual Fashion Recommendations with Users in the Loop

Anurag Bhardwaj, Vignesh Jagadeesh, Wei Di, Robinson Piramuthu, Elizabeth Churchill

We describe a completely automated, large-scale visual recommendation system for fashion. Existing approaches have relied primarily on purely computational models that ignore the role of users in the system.

In this paper, we propose to overcome this limitation by incorporating a user-centric design of visual fashion recommendations. Specifically, we propose a technique that augments 'user preferences' in models by exploiting elasticity in fashion choices. We further design a user study on these choices and gather results from the 'wisdom of crowd' for deeper analysis.

Our key insight, learnt through these results, is that fashion preferences, when constrained to a particular class, contain important behavioral signals that are often ignored in recommendation design.

Further, the presence of such classes also reflects strong correlations with visual perception, which can be utilized to provide aesthetically pleasing user experiences. Finally, we illustrate that user approval of visual fashion recommendations can be substantially improved by carefully incorporating this user-centric feedback into the system framework.

Tutorial at WWW-2014

E-commerce Product Search: Personalization, Diversification, and beyond

Atish Das Sarma, Nish Parikh, Neel Sundaresan

The focus of this tutorial is e-commerce product search. Several challenges appear in this context, from both a research and an application standpoint. We will present various approaches adopted in the industry, review well-known research techniques developed over the last decade, draw parallels to traditional web search while highlighting the new challenges in this setting, and dig deep into some of the algorithmic and technical approaches developed for this context.

A specific approach that will involve a deep dive into the literature, theoretical techniques, and practical impact is that of quickly identifying the most suitable results from a large database, with settings varying across cold-start users and those for whom personalization is possible.

In this context, top-k queries and skylines will be discussed specifically, as they form a key approach that spans the web, data mining, and database communities and presents a powerful tool for search across multi-dimensional items with clear preferences within each attribute, as in product search as opposed to regular web search.
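For readers new to the operator, a naive skyline over toy two-attribute products (illustrative background, not tutorial material):

```python
def dominates(a, b):
    """a dominates b if it is at least as good on every attribute and
    strictly better on one (here, lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(items):
    """Naive O(n^2) skyline: keep the items no other item dominates."""
    return [p for p in items
            if not any(dominates(q, p) for q in items if q is not p)]

# Hypothetical products as (price, shipping_days) pairs.
products = [(10, 5), (12, 2), (9, 7), (15, 1), (11, 2)]
print(skyline(products))   # Pareto-optimal offers; (12, 2) is dominated
```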

Journal of the Association for Computing Machinery (JACM) – 2013

Distributed Random Walks

Atish Das Sarma, Danupon Nanongkai, Gopal Pandurangan, Prasad Tetali

Performing random walks in networks is a fundamental primitive that has found applications in many areas of computer science, including distributed computing. In this article, we focus on the problem of sampling random walks efficiently in a distributed network and its applications. Given bandwidth constraints, the goal is to minimize the number of rounds required to obtain random walk samples. All previous algorithms that compute a random walk sample of length ℓ as a subroutine always do so naively, that is, in O(ℓ) rounds.

The main contribution of this article is a fast distributed algorithm for performing random walks, with time complexity sublinear in the length of the walk. Our algorithm performs a random walk of length ℓ in Õ(√(ℓD)) rounds (Õ hides polylog n factors, where n is the number of nodes in the network) with high probability on an undirected network, where D is the diameter of the network. For small-diameter graphs, this is a significant improvement over the naive O(ℓ) bound.

Furthermore, our algorithm is optimal within a polylogarithmic factor, as there exists a matching lower bound [Nanongkai et al. 2011]. We further extend our algorithms to efficiently perform k independent random walks in Õ(√(kℓD) + k) rounds. We also show that our algorithm can be applied to speed up the more general Metropolis-Hastings sampling. Our random-walk algorithms can be used to speed up distributed algorithms in applications that use random walks as a subroutine. We present two main applications.

First, we give a fast distributed algorithm for computing a random spanning tree (RST) in an arbitrary (undirected, unweighted) network, which runs in Õ(√(mD)) rounds with high probability (m is the number of edges). Our second application is a fast decentralized algorithm for estimating the mixing time and related parameters of the underlying network. Our algorithm is fully decentralized and can serve as a building block in the design of topologically-aware networks.
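Restating the abstract's bounds in display form:

```latex
\tilde{O}\big(\sqrt{\ell D}\big) \ \text{rounds for one walk of length } \ell, \qquad
\tilde{O}\big(\sqrt{k \ell D} + k\big) \ \text{rounds for } k \text{ independent walks}, \qquad
\tilde{O}\big(\sqrt{m D}\big) \ \text{rounds for a random spanning tree.}
```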

SDM 2013

Probabilistic Combination of Classifier and Cluster Ensembles for Non-transductive Learning

Ayan Acharya, Eduardo R. Hruschka, Joydeep Ghosh, Badrul Sarwar, Jean-David Ruvini

Unsupervised models can provide supplementary soft constraints to help classify new target data, under the assumption that similar objects in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place. This paper describes a Bayesian framework that takes as input class labels from existing classifiers (designed based on labeled data from the source domain), as well as cluster labels from a cluster ensemble operating solely on the target data to be classified, and yields a consensus labeling of the target data. This framework is particularly useful when the statistics of the target data drift or change from those of the training data.

We also show that the proposed framework is privacy-aware and allows distributed learning when data/models have sharing restrictions. Experiments show that our framework can yield superior results to those obtained by applying classifier ensembles only.

American Economic Journal: Microeconomics 2013, 5(1): 1–27

Set-Asides and Subsidies in Auctions

Susan Athey, Dominic Coey, Jonathan Levin

Set-asides and subsidies are used extensively in government procurement and resource sales. We analyze these policies in an empirical model of US Forest Service timber auctions.

The model fits the data well both within the sample of unrestricted sales used for estimation, and when we predict (out-of-sample) outcomes for small business set-asides. Our estimates suggest that restricting entry substantially reduces efficiency and revenue, although it increases small business participation.

An alternative policy of subsidizing small bidders would increase revenue and small bidder profit, with little efficiency cost. We explain these findings by connecting them to the theory of optimal auction design. (JEL D44, H57, L73, Q23)

PVLDB-2013

On the Embeddability of Random Walk Distances

Adelbert Chang, Atish Das Sarma, Haitao Zheng, Ben Y. Zhao

Analysis of large graphs is critical to the ongoing growth of search engines and social networks. One class of queries centers around node affinity, often quantified by random-walk distances between node pairs, including hitting time, commute time, and personalized PageRank (PPR).

Despite the potential of these "metrics," they are rarely, if ever, used in practice, largely due to extremely high computational costs. In this paper, we investigate methods to scalably and efficiently compute random-walk distances, by "embedding" graphs and distances into points and distances in geometric coordinate spaces.

We show that while existing graph coordinate systems (GCS) can accurately estimate shortest path distances, they produce significant errors when embedding random-walk distances.

Based on our observations, we propose a new graph embedding system that explicitly accounts for per-node graph properties that affect random walks. Extensive experiments on a range of graphs show that our new approach can accurately estimate both symmetric and asymmetric random-walk distances.

Once a graph is embedded, our system can answer queries between any two nodes in 8 microseconds, orders of magnitude faster than existing methods. Finally, we show that our system produces estimates that can replace ground truth in applications with minimal impact on application output.

Tutorial at CIKM-2013

Beyond Skylines and Top-k Queries: Representative Databases and e-Commerce Product Search

Atish Das Sarma, Ashwin Lall, Nish Parikh, Neel Sundaresan

Skyline queries have been a topic of intense study in the database area for over a decade now. Similarly, the top-k retrieval query has been heavily investigated by both the database as well as the web research communities. This tutorial will delve into the background of these two areas, and specifically focus on the recent challenges with respect to returning a small set of results to users, as well as requiring minimal intervention or input from them.

These are the two main concerns with skylines and top-k respectively, and they have therefore drawn a great deal of attention in recent years, with several interesting ideas proposed in the research community. This tutorial will cover the current approaches to representative database selection. We will focus on the theoretical models as well as the practical aspects from an industry standpoint.

The topics covered in this tutorial include identifying representative subsets of the skyline set, interaction-based approaches, e-commerce product search, and leveraging aggregate user preference statistics.

CIKM ’13 Proceedings of the 22nd ACM international conference on Conference on information & knowledge management Pages 1137-1146

On Segmentation of eCommerce Queries

Nish Parikh, Prasad Sriram, Mohammad AlHasan

In this paper, we present QSEGMENT, a real-life query segmentation system for eCommerce queries. QSEGMENT uses frequency data from the query log, which we call buyers' data, and frequency data from product titles, which we call sellers' data.

We exploit the taxonomical structure of the marketplace to build domain specific frequency models. Using such an approach, QSEGMENT performs better than previously described baselines for query segmentation.

Also, we perform a large-scale evaluation using an unsupervised IR metric that we refer to as user-intent-score. We discuss the overall architecture of QSEGMENT as well as various use cases and interesting observations around segmenting eCommerce queries.
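QSEGMENT's real models are per-category and built from log-scale frequency data; as a toy version of frequency-driven segmentation, a product-of-segment-probabilities dynamic program:

```python
from functools import lru_cache

# Hypothetical segment probabilities mined from query logs (buyers'
# data) and item titles (sellers' data); values are illustrative.
prob = {"ipod": 0.008, "nano": 0.004, "ipod nano": 0.003, "2gb": 0.002}

def score(seg):
    return prob.get(seg, 1e-6)       # smoothing for unseen segments

def segment(query):
    """Pick the segmentation maximizing the product of segment
    probabilities, a standard scheme in the spirit of QSEGMENT."""
    words = tuple(query.split())

    @lru_cache(maxsize=None)
    def best(i):
        if i == len(words):
            return 1.0, ()
        return max(
            ((score(" ".join(words[i:j])) * best(j)[0],
              (" ".join(words[i:j]),) + best(j)[1])
             for j in range(i + 1, len(words) + 1)),
            key=lambda t: t[0])

    return best(0)[1]

print(segment("ipod nano 2gb"))      # -> ('ipod nano', '2gb')
```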

In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2013, 829-836. (Best Paper Award Winner)

Chelsea Won, and You Bought a T-shirt: Characterizing the Interplay Between Twitter and E-Commerce

Haipeng Zhang, Nish Parikh, Neel Sundaresan

The popularity of social media sites like Twitter and Facebook opens up interesting research opportunities for understanding the interplay of social media and e-commerce. Until recently, most research on online behavior has focused on social media behaviors and e-commerce behaviors independently.

In our study we choose a particular global ecommerce platform (eBay) and a particular global social media platform (Twitter). We quantify the characteristics of the two individual trends as well as the correlations between them.

We provide evidence that about 5% of general eBay query streams show strong positive correlations with the corresponding Twitter mention streams, while the percentage jumps to around 25% for trending eBay query streams. Some categories of eBay queries, such as 'Video Games' and 'Sports', are more likely to have strong correlations.

We also discover that eBay trends lag Twitter for correlated pairs, and that the lag differs across categories. We show evidence that celebrities' popularity on Twitter correlates well with their relevant search and sales on eBay.

The correlations and lags provide predictive insights for future applications that might lead to instant merchandising opportunities for both sellers and e-commerce platforms.
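A minimal sketch of the lag analysis on synthetic streams (correlation over candidate lags; not the paper's methodology in detail):

```python
import numpy as np

def best_lag(ebay_counts, twitter_counts, max_lag=14):
    """Lag (in days) at which the Twitter mention stream best
    correlates with the eBay query stream; toy version of the lag
    analysis described above, run on synthetic data."""
    return max(
        range(max_lag + 1),
        key=lambda lag: np.corrcoef(
            twitter_counts[:len(twitter_counts) - lag],
            ebay_counts[lag:])[0, 1],
    )

rng = np.random.default_rng(7)
tweets = rng.poisson(20, 100).astype(float)
queries = np.roll(tweets, 3) + rng.normal(0, 1, 100)   # eBay lags by ~3 days
print(best_lag(queries, tweets))                        # expected: 3
```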

WSDM-2013

Arrival and Departure Dynamics in Social Networks

Shaomei Wu, Atish Das Sarma, Alex Fabrikant, Silvio Lattanzi, Andrew Tomkins

In this paper, we consider the natural arrival and departure of users in a social network, and ask whether the dynamics of arrival, which have been studied in some depth, also explain the dynamics of departure, which are not as well studied.

Through study of the DBLP co-authorship network and a large online social network, we show that the dynamics of departure behave differently from the dynamics of formation.

In particular, the probability of departure of a user with few friends may be understood most accurately as a function of the raw number of friends who are active. For users with more friends, however, the probability of departure is best predicted by the overall fraction of the user's neighborhood that is active, independent of size.

We then study global properties of the subgraphs induced by active and inactive users, and show that active users tend to belong to a core that is densifying and is significantly denser than the inactive users. Further, the inactive set of users exhibits a higher density and lower conductance than the degree distribution alone can explain. These two aspects suggest that nodes at the fringe are more likely to depart, and that subsequent departures are correlated among neighboring nodes in tightly-knit communities.

Lightning Talk and Poster @ Extremely Large Databases XLDB 2013.

Building a Network of E-commerce Concepts

Sandip Gaikwad, Sanjay Ghatare, Nish Parikh, Rajendra Shinde

We present a method for developing a network of e-commerce concepts. We define concepts as collections of terms that represent product entities or commerce ideas that users are interested in. We start by looking at large corpora (billions) of historical eBay buyer queries and seller item titles.

We approach the problem of concept extraction from corpora as a market-baskets problem by adapting statistical measures of support and confidence. The concept-centric meta-data extraction pipeline is built over a map-reduce framework. We constrain the concepts to be both popular and concise.

Evaluation of our algorithm shows that high precision concept sets can be automatically mined. The system mines the full spectrum of precise e-commerce concepts ranging all the way from "ipod nano" to "I'm not a plastic bag" and from "wakizashi sword" to "mastodon skeleton".

Once the concepts are detected, they are linked into a network using different metrics of semantic similarity between concepts. This leads to a rich network of e-commerce vocabulary. Such a network of concepts can be the basis for enabling powerful applications like e-commerce search and discovery as well as automatic e-commerce taxonomy generation. We present details of the extraction platform, and algorithms for segmentation of short snippets of e-commerce text as well as for detection and linking of concepts.
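As a toy version of the market-baskets framing, support and confidence over a handful of synthetic "baskets" (thresholds are arbitrary):

```python
from collections import Counter
from itertools import combinations

# Toy baskets: tokenized queries / item titles standing in for the
# billions of records the paper mines.
baskets = [{"ipod", "nano", "case"}, {"ipod", "nano"},
           {"wakizashi", "sword"}, {"wakizashi", "sword", "antique"}]

n = len(baskets)
pair_counts = Counter(p for b in baskets for p in combinations(sorted(b), 2))
item_counts = Counter(t for b in baskets for t in b)

def concept_candidates(min_support=0.4, min_confidence=0.9):
    """Term pairs frequent enough (support) and cohesive enough
    (confidence both ways) to be treated as a single concept."""
    for (a, b), c in pair_counts.items():
        if (c / n >= min_support and c / item_counts[a] >= min_confidence
                and c / item_counts[b] >= min_confidence):
            yield (a, b)

print(list(concept_candidates()))   # [('ipod', 'nano'), ('sword', 'wakizashi')]
```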

Workshop at WSDM-2014

Data Design for Personalization: Current Challenges and Emerging Opportunities

Elizabeth Churchill, Atish Das Sarma

Personalization is central to most Internet experiences. Personalization is a data-driven process, whether the data are gathered explicitly (e.g., by asking people to fill out forms) or implicitly (e.g., through analysis of behavioral data).

It is clear that designing for effective personalization poses interesting engineering and computer science challenges. However, personalization is also a user experience issue. We believe that encouraging dialogue and collaboration between data mining experts, content providers, and user-focused researchers will offer gains in the area of personalization for search and for other domains.

This workshop is part of a larger effort we are developing: D2D: Data to Design - Design to Data.

Our vision is to provide a forum for researchers and practitioners in computer and systems sciences, data sciences, machine learning, information retrieval, interaction and interface design, and human computer interaction to interact.

Our goal is to explore issues surrounding content and presentation personalization across different devices, and to set an agenda for cross-discipline, collaborative engagement.

in Proceedings of the 22nd international conference on World Wide Web (WWW ’13)

Anatomy of a Web-Scale Resale Market: A Data Mining Approach

Yuchen Zhao, Neel Sundaresan, Zeqian Shen, Philip Yu

Reuse and remarketing of content and products is an integral part of the internet. As e-commerce has grown, online resale and secondary markets have come to form a significant part of the commerce space. The intentions and methods for reselling are diverse. In this paper, we study an instance of such markets that affords interesting data at large scale for mining purposes, in order to understand the properties and patterns of this online market.

As part of knowledge discovery of such a market, we first formally propose criteria to reveal unseen resale behaviors by elastic matching identification (EMI) based on the account transfer and item similarity properties of transactions.

Then, we present a large-scale system that leverages the MapReduce paradigm to mine millions of online resale activities from petabyte-scale heterogeneous e-commerce data. With the collected data, we show that the number of resale activities follows a power-law distribution with a 'long tail', where a significant share of users resell only in very low numbers and a large portion of resales comes from a small number of highly active resellers.

We further conduct a comprehensive empirical study of different aspects of resales, including temporal and spatial patterns, user demographics, reputation, and the content of sale postings. Based on these observations, we explore the features related to "successful" resale transactions and evaluate whether they are predictable.

We also discuss uses of this information mining for business insights and user experience on a real-world online marketplace.

SIGIR 2013: 193-202

Faster and smaller inverted indices with treaps.

Roberto Konow, Gonzalo Navarro, Charles L. A. Clarke, Alejandro López-Ortiz

We introduce a new representation of the inverted index that performs faster ranked unions and intersections while using less space. Our index is based on the treap data structure, which allows us to intersect/merge the document identifiers while simultaneously thresholding by frequency, instead of the costlier two-step classical processing methods. To achieve compression we represent the treap topology using compact data structures. Further, the treap invariants allow us to elegantly encode differentially both document identifiers and frequencies. Results show that our index uses about 20% less space, and performs queries up to three times faster, than state-of-the-art compact representations.
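For intuition about why the treap helps: keeping postings BST-ordered by document id and max-heap-ordered by frequency lets thresholded traversals prune whole subtrees. A minimal sketch with a hand-built treap (not the paper's compact representation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """Treap node for a posting: BST-ordered by doc id, heap-ordered
    by term frequency (a simplified view of the paper's structure)."""
    doc: int
    freq: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def postings_above(node, min_freq):
    """Yield (doc, freq) with freq >= min_freq. The max-heap property
    on freq lets whole subtrees be pruned, which is what makes
    thresholded union/intersection fast."""
    if node is None or node.freq < min_freq:
        return                        # every descendant has freq <= node.freq
    yield node.doc, node.freq
    yield from postings_above(node.left, min_freq)
    yield from postings_above(node.right, min_freq)

# Tiny hand-built treap: doc ids in BST order, freqs in heap order.
root = Node(10, 9,
            Node(4, 5, Node(2, 1), Node(7, 3)),
            Node(20, 6, Node(15, 2), Node(30, 4)))
print(list(postings_above(root, 4)))   # [(10, 9), (4, 5), (20, 6), (30, 4)]
```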

To appear in Proceedings of IEEE International Conference on Computer Vision & Pattern Recognition (CVPR) 2013

Dense Non-Rigid Point-Matching Using Random Projections

Raffay Hamid, Dennis DeCoste, Chih-Jen Lin

We present a robust and efficient technique for matching dense sets of points undergoing non-rigid spatial transformations. Our main intuition is that the subset of points that can be matched with high confidence should be used to guide the matching procedure for the rest.

We propose a novel algorithm that incorporates these high-confidence matches as a spatial prior to learn a discriminative subspace that simultaneously encodes both the feature similarity as well as their spatial arrangement.

Conventional subspace learning usually requires spectral decomposition of the pair-wise distance matrix across the point-sets, which can become inefficient even for moderately sized problems. To this end, we propose the use of random projections for approximate subspace learning, which can provide significant time improvements at the cost of minimal precision loss.

This efficiency gain allows us to iteratively find and remove high-confidence matches from the point sets, resulting in high recall. To show the effectiveness of our approach, we present a systematic set of experiments and results for the problem of dense non-rigid image-feature matching.
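As background on the random-projection step, a Johnson-Lindenstrauss-style sketch on a synthetic matrix (dimensions arbitrary; the paper's subspace learning is more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 2000, 64

# Stand-in for the pairwise feature matrix whose spectral
# decomposition becomes the bottleneck at this scale.
D = rng.normal(size=(n, n))

# Johnson-Lindenstrauss-style random projection into k dimensions:
# the cheap approximation proposed in place of a full spectral
# decomposition.
R = rng.normal(size=(n, k)) / np.sqrt(k)
D_low = D @ R

# Norms (hence distances) are approximately preserved.
print(np.linalg.norm(D[5]), np.linalg.norm(D_low[5]))
```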

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), 2012.

Structuring E-Commerce Inventory

Karin Maugé, Khashayar Rohanimanesh, Jean-David Ruvini

Large e-commerce enterprises feature millions of items entered daily by a large variety of sellers. While some sellers provide rich, structured descriptions of their items, a vast majority of them provide unstructured natural language descriptions.

In this paper we present a two-step method for structuring items into descriptive properties. The first step consists of unsupervised property discovery and extraction. The second step involves supervised property synonym discovery using a maximum entropy based clustering algorithm.

We evaluate our method on a year's worth of e-commerce data and show that it achieves excellent precision with good recall.

CIKM 2012: 596-604

Large-scale Item Categorization for e-Commerce

Dan Shen, Jean-David Ruvini, Badrul Sarwar

This paper studies the problem of leveraging computationally intensive classification algorithms for large scale text categorization problems. We propose a hierarchical approach which decomposes the classification problem into a coarse level task and a fine level task.

A simple yet scalable classifier is applied to perform the coarse-level classification, while a more sophisticated model is used to separate classes at the fine level. However, instead of relying on a human-defined hierarchy to decompose the problem, we use a graph algorithm to automatically discover groups of highly similar classes.

As an illustrative example, we apply our approach to real-world industrial data from eBay, a major e-commerce site where the goal is to classify live items into a large taxonomy of categories.

In such an industrial setting, classification is very challenging due to the number of classes, the amount of training data, the size of the feature space, and the real-world requirements on response time. We demonstrate through extensive experimental evaluation that (1) the proposed hierarchical approach is superior to flat models, and (2) the data-driven extraction of latent groups works significantly better than the existing human-defined hierarchy.
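A toy sketch of the coarse-to-fine decomposition, with hypothetical item titles and hand-picked groups standing in for the automatically discovered ones:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy titles with coarse groups and fine categories (illustrative).
titles = ["iphone 6 16gb", "galaxy s5 case", "levis 501 jeans",
          "nike running shoes", "ipad mini 32gb", "denim jacket m"]
coarse = ["electronics", "electronics", "clothing",
          "clothing", "electronics", "clothing"]
fine = ["phones", "accessories", "jeans", "shoes", "tablets", "outerwear"]

# A cheap, scalable model at the coarse level...
coarse_clf = make_pipeline(TfidfVectorizer(), SGDClassifier()).fit(titles, coarse)

# ...and a more expensive model per coarse group at the fine level.
fine_clf = {}
for g in set(coarse):
    idx = [i for i, c in enumerate(coarse) if c == g]
    fine_clf[g] = make_pipeline(TfidfVectorizer(), SVC()).fit(
        [titles[i] for i in idx], [fine[i] for i in idx])

def categorize(title):
    g = coarse_clf.predict([title])[0]
    return g, fine_clf[g].predict([title])[0]

print(categorize("samsung galaxy s5 phone"))
```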

SDM 2012

Multi-Skill Collaborative Teams based on Densest Subgraphs

Atish Das Sarma, Amita Gajewar

We consider the problem of identifying a team of skilled individuals for collaboration, in the presence of a social network, with the goal of maximizing the collaborative compatibility of the team. Each node in the social network is associated with skills, and edge weights specify affinity between the respective nodes. We measure the collaborative compatibility objective as the density of the subgraph induced on the selected nodes.

This problem is NP-hard even when the team requires individuals of only one skill. We present a 3-approximation algorithm for the single-skill team formulation problem. We show the same approximation can be extended to a special case of multiple skills.

Our problem generalizes the formulation studied by Lappas et al. [KDD ’09] who measure team compatibility in terms of diameter or spanning tree. The experimental results show that the density-based algorithms outperform the diameter-based objective on several metrics.
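For background (not the paper's skill-constrained 3-approximation), the classical greedy peeling algorithm that 2-approximates the same density objective:

```python
def densest_subgraph(adj):
    """Charikar's greedy peeling: repeatedly delete a minimum-degree
    node and remember the densest intermediate subgraph. A classical
    2-approximation, shown as background; the paper's 3-approximation
    builds on the same density objective under skill constraints."""
    adj = {u: set(vs) for u, vs in adj.items()}
    nodes, edges = set(adj), sum(len(v) for v in adj.values()) // 2
    best, best_density = set(nodes), edges / len(nodes)
    while nodes:
        u = min(nodes, key=lambda x: len(adj[x]))   # min-degree node
        for v in adj[u]:
            adj[v].discard(u)
        edges -= len(adj[u])
        adj[u] = set()
        nodes.discard(u)
        if nodes and edges / len(nodes) > best_density:
            best, best_density = set(nodes), edges / len(nodes)
    return best, best_density

# K4 with a pendant node: the densest subgraph is the K4 itself.
g = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
print(densest_subgraph(g))   # -> ({1, 2, 3, 4}, 1.5)
```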

Journal of the American Society for Information Science and Technology (JASIST), 2012

Automatic Identification of Personal Insults on Social News Sites

Sara Owsley Sood, Elizabeth Churchill, Judd Antin

As online communities grow and the volume of user-generated content increases, the need for community management also rises. Community management has three main purposes: to create a positive experience for existing participants, to promote appropriate, socionormative behaviors, and to encourage potential participants to make contributions. Research indicates that the quality of content a potential participant sees on a site is highly influential; off-topic, negative comments with malicious intent are a particularly strong barrier to participation, or set the tone by encouraging similar contributions. A problem for community managers, therefore, is the detection and elimination of such undesirable content. As a community grows, this undertaking becomes more daunting. Can an automated system aid community managers in this task?

In this paper, we address this question through a machine learning approach to automatic detection of inappropriate negative user contributions. Our training corpus is a set of comments from a news commenting site that we tasked Amazon Mechanical Turk workers with labeling. Each comment is labeled for the presence of profanity, insults, and the object of the insults. Support vector machines trained on these data are combined with relevance and valence analysis systems in a multistep approach to the detection of inappropriate negative user contributions.

The system shows great potential for semiautomated community management.
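A minimal sketch of the SVM component, assuming scikit-learn and a tiny illustrative corpus in place of the Mechanical Turk-labeled data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus; 1 = insulting/inappropriate, 0 = acceptable.
comments = ["great article, thanks for sharing",
            "you are a clueless idiot",
            "interesting point about the economy",
            "only a moron would write this garbage"]
labels = [0, 1, 0, 1]

# TF-IDF features into a linear SVM, echoing (but greatly simplifying)
# the multistep SVM + relevance/valence system described above.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(comments, labels)
print(clf.predict(["what an idiot"]))   # -> [1]
```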

Second International ICST Conference, AMMA 2011. Vol. 80, Lecture Notes of the Institute for Computer Science, Social Informatics and Telecommunications Engineering. Springer, 2012. (Revised Selected Papers.)

Auctions, Market Mechanisms, and Their Applications

Sanmay Das, Sebastien Lahaie, Boleslaw Szymanski

WSDM 2012

Online selection of diverse results

Debmalya Panigrahi, Atish Das Sarma, Gagan Aggarwal, Andrew Tomkins

The phenomenal growth in the volume of easily accessible information via various web-based services has made it essential for service providers to provide users with personalized representative summaries of such information.

Further, online commercial services, including social networking and micro-blogging websites, e-commerce portals, and leisure and entertainment websites, recommend interesting content to users that is simultaneously diverse along many different axes, such as topic and geographic specificity.

The key algorithmic question in all these applications is the generation of a succinct, representative, and relevant summary from a large stream of data coming from a variety of sources. In this paper, we formally model this optimization problem, identify its key structural characteristics, and use these observations to design an extremely scalable and efficient algorithm.

We analyze the algorithm using theoretical techniques to show that it always produces a nearly optimal solution. In addition, we perform large-scale experiments on both real-world and synthetically generated datasets, which confirm that our algorithm performs even better than its analytical guarantees in practice, and also outperforms other candidate algorithms for the problem by a wide margin.
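As a toy stand-in for the streaming summarization problem formalized here, a greedy swap heuristic that maintains a k-item diverse summary (illustrative only; not the paper's algorithm or its guarantees):

```python
def online_diverse(stream, k, dist):
    """Maintain a k-item summary of a stream, greedily swapping in an
    arriving item when doing so raises total pairwise diversity."""
    def diversity(s):
        return sum(dist(a, b) for i, a in enumerate(s) for b in s[i + 1:])
    summary = []
    for item in stream:
        if len(summary) < k:
            summary.append(item)
            continue
        # Best position to replace, if replacing improves diversity.
        cands = [summary[:i] + [item] + summary[i + 1:] for i in range(k)]
        best = max(cands, key=diversity)
        if diversity(best) > diversity(summary):
            summary = best
    return summary

points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (0, 6)]
print(online_diverse(points, 3, lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])))
```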

WWW 2012

Your two weeks of fame and your grandmother’s

James Cook, Atish Das Sarma, Alex Fabrikant, Andrew Tomkins

Did celebrity last longer in 1929, 1992 or 2009? We investigate the phenomenon of fame by mining a collection of news articles that spans the twentieth century, and also perform a side study on a collection of blog posts from the last 10 years.

By analyzing mentions of personal names, we measure each person's time in the spotlight, and watch the distribution change from a century ago to a year ago. We expected to find a trend of decreasing durations of fame as news cycles accelerated and attention spans became shorter.

Instead, we find a remarkable consistency through most of the period we study. Through a century of rapid technological and societal change, through the appearance of Twitter, communication satellites and the Internet, we do not observe a significant change in typical duration of celebrity.

We also study the most famous of the famous, and find different results depending on our method for measuring duration of fame. With a method that may be thought of as measuring a spike of attention around a single narrow news story, we see the same result as before: stories last as long now as they did in 1930. A second method, which may be thought of as measuring the duration of public interest in a person, indicates that famous people's presence in the news is becoming longer rather than shorter, an effect most likely driven by the wider distribution and higher volume of media in modern times.

Similar studies have been done over much shorter timescales, specifically in the context of information spreading on Twitter and similar social networking sites. However, to the best of our knowledge, this is the first massive-scale study of this nature that spans over a century of archived data, thereby allowing us to track changes across decades.

WWW (Companion Volume) 2012: 73-82

Rewriting null e-commerce queries to recommend products

Nish Parikh, Neel Sundaresan

In e-commerce applications product descriptions are often concise. E-Commerce search engines often have to deal with queries that cannot be easily matched to product inventory resulting in zero recall or null query situations.

Null queries arise from differences in buyer and seller vocabulary or from the transient nature of products. In this paper, we describe a system that rewrites null e-commerce queries to find matching products as close to the original query as possible.

The system uses query relaxation to rewrite null queries in order to match products. Using eBay as an example of a dynamic marketplace, we show how temporal feedback that respects the product category structure, computed over the repository of expired products, improves the quality of the recommended results.

The system is scalable and can be run in a high volume setting. We show through our experiments that high quality product recommendations for more than 25% of null queries are achievable.
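A bare-bones sketch of query relaxation by term deletion, with a hypothetical two-item inventory (the paper's system additionally uses temporal feedback and category structure):

```python
from itertools import combinations

def match_count(tokens, inventory):
    """inventory: iterable of token sets, one per item title."""
    return sum(all(t in title for t in tokens) for title in inventory)

def relax(query, inventory, min_hits=1):
    """Drop as few terms as possible until the query matches items:
    a bare-bones version of the query-relaxation idea above."""
    tokens = query.split()
    for kept in range(len(tokens), 0, -1):          # prefer longer rewrites
        best = max(combinations(tokens, kept),
                   key=lambda c: match_count(c, inventory))
        if match_count(best, inventory) >= min_hits:
            return " ".join(best)
    return ""

items = [{"star", "wars", "vintage", "figure"}, {"star", "trek", "mug"}]
print(relax("star wars uniform", items))   # -> "star wars"
```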

SIGMOD 2012

Interactive regret minimization

Danupon Nanongkai, Ashwin Lall, Atish Das Sarma, Kazuhisa Makino

We study the notion of regret ratio proposed by Nanongkai et al. [VLDB 2010] to deal with multi-criteria decision making in database systems. The regret minimization query proposed there was shown to have features of both skyline and top-k: it does not need information from the user but still controls the output size. While this approach is suitable for obtaining a reasonably small regret ratio, it is still open whether one can make the regret ratio arbitrarily small. Moreover, it remains open whether reasonable questions can be asked to the users in order to improve the efficiency of the process.

In this paper, we study the problem of minimizing regret ratio when the system is enhanced with interaction. We assume that when presented with a set of tuples the user can tell which tuple is most preferred.

Under this assumption, we develop the problem of interactive regret minimization where we fix the number of questions and tuples per question that we can display, and aim at minimizing the regret ratio. We try to answer two questions in this paper:

(1) How much does interaction help? That is, how much can we improve the regret ratio when there are interactions?

(2) How efficient can interaction be? In particular, we measure how many questions we have to ask the user in order to make her regret ratio small enough.

We answer both questions from both theoretical and practical standpoints. For the first question, we show that interaction can reduce the regret ratio almost exponentially. To do this, we prove a lower bound for the previous approach (thereby resolving an open problem from Nanongkai et al.), and develop an almost-optimal upper bound that makes the regret ratio exponentially smaller.

Our experiments also confirm that, in practice, interactions help in improving the regret ratio by many orders of magnitude. For the second question, we prove that when our algorithm shows a reasonable number of points per question, it only needs a few questions to make the regret ratio small.

Thus, interactive regret minimization seems to be a necessary and sufficient way to deal with multi-criteria decision making in database systems.
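For context, the regret ratio from that line of work: for a subset S of a database D shown to a user with unknown utility function f, and its worst case over a family of utility functions F,

```latex
\mathrm{rr}_D(S, f) = \frac{\max_{p \in D} f(p) - \max_{p \in S} f(p)}{\max_{p \in D} f(p)},
\qquad
\mathrm{rr}_D(S) = \sup_{f \in \mathcal{F}} \mathrm{rr}_D(S, f).
```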

Information Economics and Policy, Volume 24, Issue 1, Pages 3–14, March 2012 (and also NBER Working Paper #16507)

Supply Responses to Digital Distribution: Recorded Music and Live Performances

Chris Nosko, Julie Mortimer, Alan Sorensen


Bayes-Nash Equilibria of the Generalized Second-Price Auction

Renato Gomes

We develop a Bayes-Nash analysis of the generalized second-price (GSP) auction, the multi-unit auction used by search engines to sell sponsored advertising positions. Our main result characterizes the efficient Bayes-Nash equilibrium of the GSP and provides a necessary and sufficient condition that guarantees the existence of such an equilibrium. With only two positions, this condition requires that the click-through rate of the second position be sufficiently smaller than that of the first.

When an efficient equilibrium exists, we provide a necessary and sufficient condition for the auction revenue to decrease as click-through rates increase. Interestingly, under optimal reserve prices, revenue increases with the click-through rates of all positions. Further, we prove that no inefficient equilibrium of the GSP can be symmetric.

Our results are in sharp contrast with the previous literature that studied the GSP under complete information.
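As standard background on the mechanism (not a result of this paper): the bidder in position i pays, per click, the (i+1)-st highest bid b_{(i+1)}, so with click-through rate c_i the expected payment for the slot is

```latex
p_i = c_i \, b_{(i+1)}.
```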

Design, Guidelines, and Requirements, Personal and Ubiquitous Computing, 2012

Live Mobile Collaboration for Video Production

Marco deSa, David Shamma, Elizabeth Churchill

Traditional cameras and video equipment are gradually losing the race with smartphones and small mobile devices that allow video, photo, and audio capture on the go. Users now quickly create movies and take photos whenever and wherever they go, particularly at concerts and events.

Still, in-situ media capture with such devices poses constraints for any user, especially amateurs. In this paper, we present the design and evaluation of a mobile video capture suite that allows for cooperative ad-hoc production. Our system relies on ad-hoc, in-situ collaboration, offering users the ability to switch between streams and cooperate with each other in order to capture better media with mobile devices.

Our main contribution arises from the description of our design process focusing on the prototyping approach and the qualitative analysis that followed. Furthermore, we contribute with lessons and design guidelines that emerged and apply to in-situ design of rich video collaborative experiences and with the elicitation of functional and usability requirements for collaborative video production using mobile devices.

STOC 2011 (Invited and accepted to SICOMP)

Distributed Verification and Hardness of Distributed Approximation

Atish Das Sarma, Stephan Holzer, Liah Kor, Amos Korman, Danupon Nanongkai, Gopal Pandurangan, David Peleg, Roger Wattenhofer

We study the verification problem in distributed networks, stated as follows. Let H be a subgraph of a network G, where each vertex of G knows which of its incident edges are in H. We would like to verify whether H has some property, e.g., whether it is a tree or whether it is connected (every node knows at the end of the process whether H has the specified property or not). We would like to perform this verification in a decentralized fashion via a distributed algorithm. The time complexity of verification is measured as the number of rounds of distributed communication.

In this paper we initiate a systematic study of distributed verification, and give almost tight lower bounds on the running time of distributed verification algorithms for many fundamental problems such as connectivity, spanning connected subgraph, and s-t cut verification.

We then show applications of these results in deriving strong unconditional time lower bounds on the hardness of distributed approximation for many classical optimization problems including minimum spanning tree, shortest paths, and minimum cut.

Many of these results are the first non-trivial lower bounds for both exact and approximate distributed computation and they resolve previous open questions. Moreover, our unconditional lower bound of approximating minimum spanning tree (MST) subsumes and improves upon the previous hardness of approximation bound of Elkin [STOC 2004] as well as the lower bound for (exact) MST computation of Peleg and Rubinovich [FOCS 1999]. Our result implies that there can be no distributed approximation algorithm for MST that is significantly faster than the current exact algorithm, for any approximation factor.

Our lower bound proofs show an interesting connection between communication complexity and distributed computing which turns out to be useful in establishing the time complexity of exact and approximate distributed computation of many problems.
