Talks and Poster Presentations (with Proceedings-Entry):

W. Gansterer, A. Janecek, R. Neumayer:
"Spam filtering based on latent semantic indexing";
Talk: 2007 SIAM Conference on Data Mining, Minneapolis, MN, USA; 2007-04-28; in: "2007 SIAM Conference on Data Mining Workshop and Tutorial Proceedings", (2007), 9 pages.



English abstract:
This paper summarizes a study of the classification performance of a vector space model (VSM) and of latent semantic indexing (LSI) applied to the task of spam filtering. Based on a feature set used in the extremely widespread, de facto standard spam filtering system SpamAssassin, a vector space model and latent semantic indexing are applied to classify e-mail messages as spam or not spam. The test data sets are taken partly from the official TREC 2005 data set and are partly self-collected. The investigation of LSI for spam filtering summarized here evaluates the relationship between two central aspects: (i) the truncation of the SVD in LSI and (ii) the resulting classification performance in this specific application context. It is shown that a surprisingly large amount of truncation is often possible without heavy loss in classification performance. This forms the basis for good and extremely fast approximate (pre-)classification strategies, which are very useful in practice. The approaches investigated in this paper are shown to compare favorably to two important alternatives: (i) they achieve better classification results than SpamAssassin, and (ii) they are better and more robust than a related LSI-based approach using textual features that was proposed earlier.
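
The relationship between SVD truncation and classification performance described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic feature matrix, the function names (lsi_fit, lsi_project, classify) and the cosine nearest-neighbour rule are assumptions made for illustration only, whereas the paper itself works with SpamAssassin-derived features and TREC 2005 data.

```python
import numpy as np


def lsi_fit(X, k):
    """Rank-k truncated SVD of a (messages x features) matrix X.

    Returns the training messages' coordinates in the k-dimensional
    latent space and the basis used to project new messages.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]


def lsi_project(Q, Vt_k):
    """Project new message vectors Q into the latent space."""
    return Q @ Vt_k.T


def classify(train_coords, train_labels, query_coords):
    """Label each query with the label of its most similar training
    message (cosine similarity); an assumed, simple decision rule."""
    def normalize(A):
        norms = np.linalg.norm(A, axis=1, keepdims=True)
        return A / np.where(norms == 0, 1.0, norms)

    sims = normalize(query_coords) @ normalize(train_coords).T
    return train_labels[np.argmax(sims, axis=1)]


def make_messages(rng, n_per_class, n_features=20, noise=0.1):
    """Synthetic stand-in data: spam 'fires' the first half of the
    features, ham the second half, plus small Gaussian noise."""
    half = n_features // 2
    spam = np.zeros(n_features)
    spam[:half] = 1.0
    ham = 1.0 - spam
    X = np.vstack([np.tile(spam, (n_per_class, 1)),
                   np.tile(ham, (n_per_class, 1))])
    X = X + noise * rng.standard_normal(X.shape)
    y = np.array([1] * n_per_class + [0] * n_per_class)  # 1 = spam, 0 = ham
    return X, y


rng = np.random.default_rng(0)
X_train, y_train = make_messages(rng, 50)
X_test, y_test = make_messages(rng, 20)

# Compare classification accuracy for several truncation ranks k.
accuracy_by_rank = {}
for k in (2, 5, 20):
    train_coords, Vt_k = lsi_fit(X_train, k)
    predicted = classify(train_coords, y_train, lsi_project(X_test, Vt_k))
    accuracy_by_rank[k] = float(np.mean(predicted == y_test))

print(accuracy_by_rank)
```

On this deliberately well-separated toy data, the rank-2 truncation should classify about as accurately as the full-rank model (k = 20), which mirrors the abstract's qualitative claim only; it says nothing about the accuracy figures reported in the paper.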

Keywords:
information retrieval, vector space model, latent semantic indexing, feature selection, e-mail classification, spam filtering


Electronic version of the publication:
http://publik.tuwien.ac.at/files/pub-inf_4757.pdf


Created from the Publication Database of the Vienna University of Technology.