Unsolicited commercial e-mail, or “spam”, is a diverse category that includes advertisements for products or web sites, get-rich-quick schemes, chain letters, and pornography. This is a collection of spam and non-spam e-mails assembled by George Forman at Hewlett-Packard in June and July of 1999. Forman, together with a team of collaborators, also extracted 57 numeric features from the e-mails that could be used to classify them.
Note that this is a personal collection, and thus some of the features are highly specific (e.g., the name “George”, the phone number 650-857-7835, etc.).
y is equal to 1 if spam, 0 if not.
X contains:
word_freq_WORD: features that record the percent of words in the e-mail that match WORD. For example, if word_freq_you equals 1.43, it means that 1.43% of words in the e-mail are “you”.
char_freq_CHAR: features that record the percent of characters in the e-mail that match CHAR.
capital_run_length_average: average length of uninterrupted sequences of capital letters
capital_run_length_longest: length of longest uninterrupted sequence of capital letters
capital_run_length_total: sum of length of uninterrupted sequences of capital letters (i.e., the total number of capital letters in the e-mail)
Xtest and ytest: 1601 additional instances. Training and testing sets were sampled at random from the original data set, which contained 4601 instances.
I obtained this data set from the UCI Machine Learning Repository. The data set was originally created by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt at Hewlett-Packard Labs in Palo Alto, CA.
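
To make the feature definitions concrete, here is a minimal Python sketch of how features of this kind could be computed for a single e-mail. It is not the original extraction code, which is not described here. The function name spam_features, the default word and character lists, and the tokenization rule (a word is a run of letters or digits, matched case-insensitively) are all assumptions made for illustration.

```python
import re


def spam_features(text, words=("you", "george"), chars=("!", "$")):
    """Compute a few features in the style described above for one e-mail.

    Hypothetical helper: the tokenization below is an assumption, not the
    rule used to build the original data set.
    """
    # Assumption: a "word" is a maximal run of ASCII letters or digits.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    features = {}

    # word_freq_WORD: percent of words in the e-mail that match WORD.
    for w in words:
        count = tokens.count(w.lower())
        features["word_freq_" + w] = 100.0 * count / len(tokens) if tokens else 0.0

    # char_freq_CHAR: percent of characters in the e-mail that match CHAR.
    for c in chars:
        features["char_freq_" + c] = 100.0 * text.count(c) / len(text) if text else 0.0

    # Lengths of uninterrupted sequences of capital letters.
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]
    features["capital_run_length_average"] = sum(runs) / len(runs) if runs else 0.0
    features["capital_run_length_longest"] = max(runs, default=0)
    features["capital_run_length_total"] = sum(runs)
    return features


if __name__ == "__main__":
    example = "FREE money for you, George! Call NOW"
    for name, value in spam_features(example).items():
        print(name, round(value, 2))
```

Under these assumptions, the example message has seven words, so a single occurrence of “you” gives word_freq_you of about 14.29, and its capital runs (FREE, G, C, NOW) have a total length of 9.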