Notes Explorer: Toward Structured Retrieval in Semi-structured Information Spaces

Proceedings of IJCAI, the International Joint Conference on Artificial Intelligence, 1997

A semi-structured information space consists of multiple collections of textual documents containing fielded or tagged sections. The space can be highly heterogeneous, because each collection has its own schema, and there are no enforced keys or formats for data items across collections.

Thus, structured methods like SQL cannot be easily employed, and users often must make do with only full-text search. In this paper, we describe an intermediate approach that provides structured querying for particular types of entities, such as companies, people, and skills.

Entity-based retrieval is enabled by normalizing entity references in a heuristic, type-dependent manner. To organize and filter search results, entities are categorized as playing particular roles (e.g., company as client, as vendor, etc.) in particular collection types (directories, client engagement records, etc.).
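As a rough illustration of type-dependent normalization, the sketch below maps variant spellings of a company or person to one lookup key. The suffix list, the function names, and the specific rules are assumptions made for this example; the abstract does not spell out the heuristics the system actually uses.

```python
import re

# Hypothetical list of legal suffixes; not taken from the paper.
COMPANY_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                    "co", "company", "ltd", "limited", "llc"}


def _tokens(text):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def normalize_company(name):
    """Drop trailing legal suffixes so 'Acme Widgets, Inc.' matches 'ACME Widgets Corp'."""
    tokens = _tokens(name)
    while len(tokens) > 1 and tokens[-1] in COMPANY_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def normalize_person(name):
    """Map 'Last, First' and 'First Last' to the same 'first last' key."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    return " ".join(_tokens(name))


# One normalizer per entity type (assumed dispatch scheme).
NORMALIZERS = {"company": normalize_company, "person": normalize_person}


def normalize(entity_type, raw):
    """Normalize a raw entity reference using the rule for its type."""
    return NORMALIZERS[entity_type](raw)
```

Because each entity type has its own rule, a field known to hold a company name and a field known to hold a person name can be indexed under comparable keys even when the source collections format them differently.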

The approach can be used to retrieve documents and also to construct entity profiles: summaries of commonly sought information about an entity, based on the documents’ content. The approach requires only a modest amount of meta-information about the source collections, much of which is derived automatically. On a set of typical user queries in a large corporate information space, the approach produces a dramatic improvement in retrieval quality over knowledge-free methods like full-text search.
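One simplified way to picture an entity profile is as the set of documents mentioning an entity, grouped by collection type and role. The record layout, the sample data, and the `build_profile` helper below are all hypothetical; a real profile would presumably summarize document content rather than only list document identifiers.

```python
from collections import defaultdict

# Hypothetical mention records:
# (normalized entity key, collection type, role, document id).
MENTIONS = [
    ("acme widgets", "client engagement records", "client", "eng-001"),
    ("acme widgets", "client engagement records", "client", "eng-007"),
    ("acme widgets", "directories", "vendor", "dir-112"),
    ("globex", "directories", "vendor", "dir-113"),
]


def build_profile(entity_key, mentions):
    """Group the documents mentioning one entity by (collection type, role)."""
    profile = defaultdict(list)
    for key, collection_type, role, doc_id in mentions:
        if key == entity_key:
            profile[(collection_type, role)].append(doc_id)
    return dict(profile)


acme_profile = build_profile("acme widgets", MENTIONS)
```

Grouping by role lets a user asking about a company as a client skip the documents in which the same company appears only as a vendor.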

Another publication from the same category: Machine Learning and Data Science

IEEE Computing Conference 2018, London, UK

Regularization of the Kernel Matrix via Covariance Matrix Shrinkage Estimation

The kernel trick, formulated as an inner product in a feature space, facilitates powerful extensions to many well-known algorithms. While the kernel matrix involves inner products in the feature space, the sample covariance matrix of the data requires outer products. Therefore, their spectral properties are tightly connected. This allows us to examine the kernel matrix through the sample covariance matrix in the feature space and vice versa. The use of kernels often involves a large number of features compared to the number of observations. In this scenario, the sample covariance matrix is neither well-conditioned nor necessarily invertible, mandating a solution to the problem of estimating high-dimensional covariance matrices under small sample size conditions. We tackle this problem through the use of a shrinkage estimator that offers a compromise between the sample covariance matrix and a well-conditioned matrix (also known as the "target") with the aim of minimizing the mean-squared error (MSE). We propose a distribution-free kernel matrix regularization approach that is tuned directly from the kernel matrix, avoiding the need to address the feature space explicitly. Numerical simulations demonstrate that the proposed regularization is effective in classification tasks.
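The link between the two matrices, and the general shape of a shrinkage step, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's estimator: it uses a linear kernel, omits centering and the 1/n scaling of the covariance, shrinks toward a scaled identity target, and takes the shrinkage intensity `lam` as a hand-picked value, whereas the paper tunes it from the kernel matrix itself.

```python
def gram(rows):
    """Matrix of inner products between every pair of rows."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


def transpose(M):
    """Swap rows and columns."""
    return [list(col) for col in zip(*M)]


def trace(M):
    """Sum of the diagonal entries, which equals the sum of the eigenvalues."""
    return sum(M[i][i] for i in range(len(M)))


def shrink(K, lam):
    """Return (1 - lam) * K + lam * (trace(K) / n) * I for 0 <= lam <= 1.

    The scaled-identity target is an assumption of this sketch.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    n = len(K)
    mu = trace(K) / n  # average eigenvalue of K
    return [[(1.0 - lam) * K[i][j] + (lam * mu if i == j else 0.0)
             for j in range(n)] for i in range(n)]


# Two observations (rows) with three features each: more features than samples.
X = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 1.0]]

K = gram(X)             # 2 x 2 matrix of inner products between observations
S = gram(transpose(X))  # 3 x 3 sum of outer products (unscaled, uncentered)
K_reg = shrink(K, 0.5)  # hand-picked intensity, for illustration only
```

Here `K` and `S` have different sizes but the same trace, reflecting their shared nonzero eigenvalues. The shrunk matrix `K_reg` keeps that trace while pulling every eigenvalue toward the average, so for `lam > 0` its smallest eigenvalue is bounded away from zero.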