Google Tech Talks, January 25, 2011. Presented by Brett Bader.

ABSTRACT

Multilingual documents pose difficulties for clustering by topic, not least because translating everything to a common language is not feasible with a large corpus or many languages. This presentation will address those difficulties with a variety of novel algebraic methods for efficiently clustering multilingual text documents, and briefly illustrate their implementation via high-performance computing. The methods use a multilingual parallel corpus as a 'Rosetta Stone' from which algorithmic variations of Latent Semantic Analysis (LSA), including statistical morphological analysis that bypasses the need for stemming, learn concepts in term space. New documents are projected into this concept space to produce language-independent feature vectors for subsequent use in similarity calculations or machine learning applications. Our experiments show that the new methods outperform LSA and possess some interesting and counter-intuitive properties.

Brett W. Bader received his Ph.D. in computer science from the University of Colorado at Boulder, studying higher-order methods for optimization and solving systems of nonlinear equations. In 2003, Brett received the John von Neumann Research Fellowship at Sandia National Laboratories, where he now develops algorithms for multi-way data analysis and machine learning for informatics applications in networks and text.
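To make the LSA-based idea concrete, the following is a minimal sketch (not the speaker's exact method) of how a parallel corpus can serve as a 'Rosetta Stone': term-count columns for the two languages are stacked into one joint term-document matrix, a truncated SVD learns a shared concept space, and new monolingual documents are projected into it to get language-independent feature vectors. The tiny vocabulary and counts here are invented for illustration.

```python
# Sketch of cross-language LSA projection, assuming a toy parallel corpus.
# Each column stacks English and Spanish term counts for the same document,
# so the learned concepts span both languages' term spaces.
import numpy as np

# Hypothetical joint vocabulary: 3 English terms, then 3 Spanish terms.
# Columns are parallel documents (same content in both languages).
X = np.array([
    [2, 0, 1],   # "dog"
    [0, 3, 0],   # "cat"
    [1, 1, 2],   # "house"
    [2, 0, 1],   # "perro"
    [0, 3, 0],   # "gato"
    [1, 1, 2],   # "casa"
], dtype=float)

# Truncated SVD yields a k-dimensional concept space over the joint terms.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk = U[:, :k]

def project(doc_vec):
    """Map a term-count vector into the language-independent concept space."""
    return Uk.T @ doc_vec

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# An English-only document about dogs and a Spanish-only one about perros
# share no surface terms, yet land together in concept space.
en_doc = np.array([3, 0, 0, 0, 0, 0], dtype=float)  # "dog dog dog"
es_doc = np.array([0, 0, 0, 3, 0, 0], dtype=float)  # "perro perro perro"

print(cosine(project(en_doc), project(es_doc)))  # ≈ 1.0
```

The resulting concept-space vectors can then feed any standard clustering or classification routine, which is what makes the feature vectors language-independent downstream.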