The phenomenal growth in the volume of easily accessible information via various web-based services has made it essential for service providers to provide users with personalized representative summaries of such information.
Further, online commercial services including social networking and micro-blogging websites, e-commerce portals, leisure and entertainment websites, etc. recommend interesting content to users that is simultaneously diverse on many different axes such as topic, geographic specificity, etc.
The key algorithmic question in all these applications is the generation of a succinct, representative, and relevant summary from a large stream of data coming from a variety of sources. In this paper, we formally model this optimization problem, identify its key structural characteristics, and use these observations to design an extremely scalable and efficient algorithm.
We analyze the algorithm using theoretical techniques to show that it always produces a nearly optimal solution. In addition, we perform large-scale experiments on both real-world and synthetically generated datasets, which confirm that our algorithm performs even better than its analytical guarantees in practice, and also outperforms other candidate algorithms for the problem by a wide margin.