The theme of this paper is that anomaly detection splits into two parts: developing the right features, and then feeding those features into a statistical system that detects anomalies in them. Most literature on anomaly detection focuses on the second part. Our goal is to illustrate the importance of the first part. We do this with two real-life examples of anomaly detectors in use at eBay.
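As a toy illustration of this two-part split, the sketch below pairs a hypothetical derived feature (an hourly failed-login ratio; the name and data are invented, not eBay's) with a simple z-score detector. The feature engineering is the part that carries the domain knowledge; the statistical part is deliberately generic.

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Return indices whose value deviates by more than `threshold`
    standard deviations from the mean of `values`."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical derived feature: hourly ratio of failed to total logins.
failed_login_ratio = [0.02, 0.03, 0.01, 0.02, 0.45, 0.02, 0.03]
print(zscore_anomalies(failed_login_ratio))  # the spike at index 4 is flagged
```

Note that the same detector applied to a poorly chosen feature (say, raw login counts, which swing with time of day) would flag mostly noise, which is exactly the paper's point.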
Nieves R. Brisaboa, Guillermo de Bernardo, Roberto Konow, Gonzalo Navarro, Diego Seco
Efficient processing of aggregated range queries on two-dimensional grids is a common requirement in information retrieval and data mining systems, for example in Geographic Information Systems and OLAP cubes. We introduce a technique to represent grids supporting aggregated range queries that requires little space when the data points in the grid are clustered, which is common in practice. We show how this general technique can be used to support two important types of aggregated queries: ranked range queries and counting range queries. Our experimental evaluation shows that this technique can speed up aggregated queries by more than an order of magnitude, with a small space overhead.
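To make the query semantics concrete, here is the classic uncompressed baseline for counting range queries: a summed-area (2D prefix-sum) table. This is not the paper's compressed representation, which exploits clustering; it is only a minimal sketch of what a counting range query answers, with illustrative names.

```python
def build_prefix(grid):
    """Summed-area table: P[i][j] = number of points in rows < i, cols < j."""
    n, m = len(grid), len(grid[0])
    P = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            P[i + 1][j + 1] = grid[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P

def count_range(P, r1, c1, r2, c2):
    """Count points in the inclusive rectangle [r1..r2] x [c1..c2]
    using inclusion-exclusion over four table entries."""
    return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]

grid = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
]
P = build_prefix(grid)
print(count_range(P, 0, 0, 1, 1))  # 2 points in the top-left 2x2 block
```

The dense table costs space proportional to the whole grid regardless of how the points are distributed, which is precisely the overhead the paper's clustered representation avoids.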
Given a text, it may be useful to determine the age, gender, native language, nationality, personality and other demographic attributes of its author. This task is called author profiling, and it has been studied in several areas, especially linguistics and natural language processing, typically by extracting content- and style-based features from training documents and then applying various machine learning approaches.
In this paper we address the author profiling task by using several compression-inspired strategies. More specifically, we generate different models to identify the age and the gender of the author of a given document without analysing or extracting specific features from the textual content, making them style-oblivious approaches.
We compare and analyse their behaviour on datasets of different natures. Our results show that with simple compression-inspired techniques we obtain very competitive accuracy, while being orders of magnitude faster in the evaluation phase than other state-of-the-art, complex and resource-demanding techniques.
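One well-known member of the compression-inspired family is classification by normalized compression distance (NCD): a document is assigned the label whose training corpus it compresses best with, and no explicit features are ever extracted. The sketch below is a minimal illustration of that idea using zlib, not necessarily the exact strategy of this paper; the corpora and labels are invented.

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed size of a byte string at maximum zlib level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar strings,
    near 1 for unrelated ones."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(doc: str, corpora: dict) -> str:
    """Assign the label whose training corpus is closest to `doc` under NCD."""
    d = doc.encode()
    return min(corpora, key=lambda label: ncd(d, corpora[label].encode()))

# Toy "profiles": two classes with very different vocabulary.
corpora = {
    "teen": "lol omg brb gonna wanna lol omg " * 40,
    "adult": "therefore moreover consequently notwithstanding " * 40,
}
print(classify("omg lol gonna brb lol", corpora))
```

Evaluation is a single compression pass per class, which hints at why such methods can be far cheaper at test time than feature-heavy pipelines.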
PcapWT: An efficient packet extraction tool for large volume network traces
Young-Hwan Kim, Roberto Konow, Diego Dujovne, Thierry Turletti, Walid Dabbous, Gonzalo Navarro
Network packet tracing has been used for many purposes over the last few decades, such as network software debugging, network performance analysis, and forensic investigation. Meanwhile, packet traces keep growing in size as network speeds rapidly increase. Handling such huge traces requires not only more hardware resources but also efficient software tools; however, traditional tools are inefficient at dealing with such large packet traces. In this paper, we propose pcapWT, an efficient packet extraction tool for large traces. PcapWT provides fast packet lookup by indexing an original trace using a wavelet tree structure. In addition, pcapWT supports multi-threading to avoid the synchronous I/O and blocking system calls used in file processing, and it is particularly efficient on machines with SSDs. PcapWT shows remarkable performance improvements over traditional tools such as tcpdump and recent tools such as pcapIndex in terms of index size and packet extraction time. Our benchmarks using large and complex traces show that pcapWT reduces the index size to below 1% of the volume of the original traces, while packet extraction is 20% faster than with pcapIndex. Furthermore, when a small number of packets is retrieved, pcapWT is hundreds of times faster than tcpdump.
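The wavelet tree at the core of pcapWT's index reduces symbol lookup to rank queries over bitmaps. The following is a minimal pointer-based sketch of that primitive (the actual tool uses an engineered succinct implementation; the class and sequence here are illustrative only):

```python
class WaveletTree:
    """Minimal pointer-based wavelet tree over a sequence of integers,
    supporting rank(c, i): occurrences of symbol c in seq[:i]."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        if lo == hi or not seq:
            self.bits = None          # leaf: every symbol here equals lo
            return
        mid = (lo + hi) // 2
        # bitmap: 1 if the symbol is routed to the right (high) child
        self.bits = [1 if c > mid else 0 for c in seq]
        # prefix sums of 1-bits, standing in for a succinct rank structure
        self.ones = [0]
        for b in self.bits:
            self.ones.append(self.ones[-1] + b)
        self.left = WaveletTree([c for c in seq if c <= mid], lo, mid)
        self.right = WaveletTree([c for c in seq if c > mid], mid + 1, hi)

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        if self.bits is None:
            return i if c == self.lo else 0
        mid = (self.lo + self.hi) // 2
        ones = self.ones[i]
        if c <= mid:
            return self.left.rank(c, i - ones)   # descend via 0-bit rank
        return self.right.rank(c, ones)          # descend via 1-bit rank

wt = WaveletTree([3, 1, 4, 1, 5, 1, 3])
print(wt.rank(1, 6))  # three occurrences of symbol 1 in the first six symbols
```

Each rank query walks one root-to-leaf path, so it costs logarithmically many bitmap ranks in the alphabet size; in a packet index, such queries let the tool count and locate matching packets without scanning the trace.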
An optimal index for top-k document retrieval [Navarro and Nekrich, SODA'12] answers queries in O(m+k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce the space to 1.5n-3n bytes, with O(m + (k + log log n) log log n) query time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical data structures with compressed ones, our main idea is to replace suffix tree sampling with frequency thresholding to achieve compression.