W. Gansterer, A. Janecek, R. Neumayer:
"Spam Filtering Based on Latent Semantic Indexing";
in: "Survey of Text Mining II: Clustering, Classification, and Retrieval",
published by: The University of Tennessee;
This chapter studies the classification performance of Latent Semantic Indexing (LSI)
applied to the task of detecting and filtering unsolicited bulk or commercial e-mail
(UBE, UCE, commonly called “spam”). Comparisons to the simple Vector
Space Model (VSM) and to the SpamAssassin system, the extremely widespread de facto
standard for spam filtering, are summarized. It is shown that VSM and LSI
achieve significantly better classification results than SpamAssassin.
The classification performance achieved in this application
context depends strongly on the feature sets used. Consequently, the various
classification methods are also compared using two different feature sets: (i) a set of
purely textual features of e-mail messages, based on standard word- and
token-extraction techniques, and (ii) a set of application-specific “meta features” of
e-mail messages as extracted by the SpamAssassin system. It is illustrated that the
latter tends to achieve consistently better classification results.
A third central aspect discussed in this chapter is problem reduction, which
lowers the computational effort for classification and is of particular
importance in the context of time-critical online spam filtering. In particular, the
effects of truncating the SVD in LSI and of reducing the underlying feature
set are investigated and compared. It is shown that a surprisingly large amount of
problem reduction is often possible in the context of spam filtering without heavy
loss in classification performance.
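As an illustration of what truncating the SVD in LSI means, the following is a minimal sketch in Python with NumPy. It is not the chapter's implementation: the toy feature-by-message matrix, the labels, the function names (`lsi_fit`, `classify`), and the nearest-neighbour decision rule based on cosine similarity are all assumptions made for this example.

```python
import numpy as np

def lsi_fit(A, k):
    """Rank-k truncated SVD of a (features x messages) matrix A.

    Returns U_k, the k largest singular values, and V_k^T.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def classify(Uk, sk, Vtk, labels, q):
    """Label a new message vector q by its most similar training message.

    Assumed decision rule (hypothetical, for illustration only):
    cosine similarity in the k-dimensional latent space.
    """
    docs = (sk[:, None] * Vtk).T   # training messages in latent space
    qk = Uk.T @ q                  # project the new message
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(qk) + 1e-12
    sims = (docs @ qk) / norms
    return labels[int(np.argmax(sims))]

# Toy data (invented): 4 features x 4 training messages.
A = np.array([[2.0, 3.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 3.0, 2.0],
              [0.0, 1.0, 2.0, 3.0]])
labels = ["spam", "spam", "ham", "ham"]

# Truncate to k = 2 latent dimensions instead of the full rank 4.
Uk, sk, Vtk = lsi_fit(A, k=2)

label_a = classify(Uk, sk, Vtk, labels, np.array([3.0, 1.0, 0.0, 0.0]))
label_b = classify(Uk, sk, Vtk, labels, np.array([0.0, 0.0, 2.0, 2.0]))
print(label_a, label_b)
```

A smaller `k` means fewer latent dimensions per comparison, which is the kind of trade-off between computational effort and classification performance that the chapter investigates.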
Text categorisation, spam filtering, latent semantic indexing
Created from the Publication Database of the Vienna University of Technology.