QSF(1)			   User Manuals			   QSF(1)

NAME
       qsf - quick spam filter

SYNOPSIS
       Filtering:      qsf [-snrta] [-d FILE] [-S SUBJECT]
       Training:       qsf -T SPAM NONSPAM [-d FILE] [-a]
       Retraining:     qsf -[m|M] [-d FILE] [-w WEIGHT] [-a]
       Database:       qsf -[p|D|R|O] [-d FILE]
       Database merge: qsf -E OTHERDB [-d FILE]
       Help:	       qsf -[h|l|V]

DESCRIPTION
       qsf reads a single email on standard input, and by default
       outputs it on standard output.  If the email is determined
       to  be  spam, an additional header ("X-Spam: YES") will be
       added, and optionally the subject line can  have	 "[SPAM]"
       prepended to it.

       qsf  is	intended to be used in a procmail(1) recipe, in a
       ruleset such as this:

	      :0 wf:$HOME/.qsflock
	      | qsf -ra

	      :0 H:
	      * ^X-Spam: YES
	      $HOME/mail/spam

       For more examples, including sample  procmail(1)	 recipes,
       see the EXAMPLES section below.

TRAINING
       Before  qsf  can be used properly, it needs to be trained.
       A good way to train qsf is to collect a copy of	all  your
       email  into  two	 folders - one for spam, and one for non-
       spam.  Once you have done this, you can use  the	 training
       function, like this:

	      qsf -T spam-folder non-spam-folder

       This  will  generate a database that can be used by qsf to
       guess whether email received in the future is spam or not.
       Note  that this initial training run may take a long time,
       but you should only need to do it once.

       To mark a single message as spam, pipe it to qsf with  the
       --mark-spam  or	-m  ("mark  as	spam") option.	This will
       update the database accordingly and discard the email.

       To mark a single message as non-spam, pipe it to qsf  with
       the  --mark-nonspam  or	-M  ("mark  as non-spam") option.
       Again, this will discard the email.

       If a message has been mis-tagged, simply send it to qsf as
       the  opposite  type, i.e. if it has been mistakenly tagged
       as spam, pipe it into qsf --mark-nonspam --weight=2 to add
       it  to  the  non-spam side of the database with double the
       usual weighting.

OPTIONS
       The qsf options are listed below.

       -d, --database FILE
	      Use  FILE	 as  the  spam/non-spam	 database.    The
	      default  is  to  use /var/lib/qsfdb and, if that is
	      not available or is read-only, $HOME/.qsfdb.   This
	      option can also be useful if there is a system-wide
	      database but you do not want to use it - specifying
	      your own here will override the default.

       -s, --subject
	      Rewrite  the  Subject  line of any email that turns
	      out to be spam, adding "[SPAM]" to the start of the
	      line.

       -S, --subject-marker SUBJECT
	      Instead of adding "[SPAM]", add SUBJECT to the Sub-
	      ject line of any email that turns out to	be  spam.
	      Implies -s.

       -n, --no-header
	      Do not add an X-Spam header to messages.

       -r, --add-rating
	      Insert an additional header, X-Spam-Rating, which
	      rates the "spamminess" of a message from 0 to 100.
	      A rating of 90 or above is counted as spam;
	      anything under 90 is not.

       -t, --test
	      Instead of passing the message out on standard out-
	      put,  output  nothing, and exit 0 if the message is
	      not spam, or exit 1 if the message is spam.

       -a, --allowlist
	      Enable  the  allow-list.	 This  causes  the  email
	      address given in the message's "From:" header to be
	      checked against a list; if  it  matches,	then  the
	      message  is  always treated as non-spam, regardless
	      of what the token database  says.	  When	specified
	      with  a  retraining flag, -a -m (mark as spam) will
	      remove that address from the allow-list as well  as
	      marking  the  message  as	 spam, and -a -M (mark as
	      non-spam) will add that address to  the  allow-list
	      as  well	as  marking the message as non-spam.  The
	      idea is that you add all of  your	 friends  to  the
	      allow-list,  and	then  none of their messages ever
	      get marked as spam.

       -T, --train SPAM NONSPAM
	      Train the database using the two mbox folders  SPAM
	      and NONSPAM, by testing each message in each folder
	      and updating the database each time  a  message  is
	      miscategorised.	This  is  done several times, and
	      may take a while to run.	Specify	 the  -a  (allow-
	      list)  flag  to  add  every  sender  in the NONSPAM
	      folder to your allow-list as a side-effect  of  the
	      training process.

       -m, --mark-spam
	      Instead of passing the message out on standard out-
	      put, mark its  contents  as  spam	 and  update  the
	      database	accordingly.   If  the allow-list (-a) is
	      enabled, the message's "From:" address  is  removed
	      from the allow-list.

       -M, --mark-nonspam
	      Instead of passing the message out on standard out-
	      put, mark its contents as non-spam and  update  the
	      database	accordingly.   If  the allow-list (-a) is
	      enabled, the message's "From:" address is added  to
	      the allow-list (see the -a option above).

       -w, --weight WEIGHT
	      When  marking  as	 spam  or  non-spam,  update  the
	      database with  a	weighting  of  WEIGHT  per  token
	      instead of the default of 1.  This is useful when
	      correcting mistakes: for example, a message that
	      has been mistakenly detected as spam should be
	      marked as non-spam using a weighting of 2, i.e.
	      double the usual weighting, to counteract the error.

       -p, --prune
	      Remove  redundant	 entries  from	the  database and
	      clean it up a little.  This is done automatically
	      after several calls to --mark-spam or --mark-nonspam,
	      and during training with --train if the training
	      takes a large number of rounds, so it should rarely
	      be necessary to use --prune manually.

       -D, --dump
	      Dump  the	 contents  of the database as a platform-
	      independent  text	 file,	suitable  for	archival,
	      transfer	to  another machine, and so on.	 The data
	      is output on stdout.

       -R, --restore
	      Rebuild the database from	 scratch  from	the  text
	      file on stdin.

       -O, --tokens
	      Instead  of  filtering, output a list of the tokens
	      found in the  message  read  from	 standard  input,
	      along  with  the	number	of  times  each token was
	      found.  This is only useful if you want to use  qsf
	      as a general tokeniser for use with another filter-
	      ing package.

       -E, --merge OTHERDB
	      Merge the OTHERDB database into the current
	      database.  This can be useful if, for instance, you
	      want to merge one user's database into the
	      system-wide one.  To do so, run
	      "qsf -d /var/lib/qsfdb -E /home/user/.qsfdb" as
	      root, and then remove /home/user/.qsfdb.

       -B, --benchmark SPAM NONSPAM
	      Benchmark the training process using the	two  mbox
	      folders  SPAM and NONSPAM.  A temporary database is
	      created and trained using the first 75% of the mes-
	      sages  in	 each  folder,	and then the final 25% of
	      messages in each folder is tested to see	how  many
	      false  positives	and  false  negatives occur. Some
	      timing information is also displayed.

	      This is probably only of use  to	developers,  when
	      trying different classification algorithms.

       -h, --help
	      Print  a	usage message on standard output and exit
	      successfully.

       -l, --license
	      Print details of the program's license on	 standard
	      output and exit successfully.

       -V, --version
	      Print  version  information  on standard output and
	      exit successfully.

FILES
       /var/lib/qsfdb
	      The default (system-wide) spam database.  If you
	      wish to install qsf system-wide, this should be
	      read-only to everyone except one user with write
	      access, who can update the spam database with
	      qsf --mark-spam and qsf --mark-nonspam when
	      necessary.

       $HOME/.qsfdb
	      The default spam database for per-user data.  Users
	      without write access to  the  system-wide	 database
	      will  have  their	 data  written	here, and the two
	      databases will be read together.

NOTES
       Currently, you cannot use qsf to check for spam while  the
       database	 is  being  updated.   This  means  that while an
       update is in progress, all email is passed through as non-
       spam.

       There is an upper size limit of 512KB on incoming email;
       anything larger than this is passed through as non-spam,
       to avoid tying up machine resources.

EXAMPLES
       To  filter  all	of your mail through qsf, with the allow-
       list enabled and the "spam rating" header being added, add
       this to your .procmailrc file:

	      :0 wf:$HOME/.qsflock
	      | qsf -ra

       If you want qsf to add "[SPAM]" to the subject line of any
       messages it thinks are spam, do this instead:

	      :0 wf:$HOME/.qsflock
	      | qsf -sra

       To automatically mark any email sent to
       spambox@yourdomain.com as spam (this is the "naive"
       version):

	      :0 H:$HOME/.qsflock
	      * ^To:.*spambox@yourdomain.com
	      | qsf -am

       To do the same, but cleverly, so that only email to
       spambox@yourdomain.com which qsf does NOT already
       classify as spam gets marked as spam in the database
       (this stops the database getting too heavily weighted):

	      # If sent to spambox@yourdomain.com:
	      :0
	      * ^To:.*spambox@yourdomain.com
	      {
		   :0 wf:$HOME/.qsflock
		   | qsf -a

	      # NB the above two lines can be skipped if you've
	      #	   already piped the message through qsf.

		   # If the qsf database says it's not spam,
		   # mark it as spam!
		   :0 H:$HOME/.qsflock
		   * ^X-Spam: NO
		   | qsf -am
	      }

       Remove the -a option in the above examples  if  you  don't
       want to use the allow-list.

       A  more complicated filtering example - this will only run
       qsf on messages which don't have	 a  subject  line  saying
       "your  <something>  is  on  fire"  and  which don't have a
       sender address ending in "@foobar.com", meaning that  mes-
       sages  with  that subject line OR that sender address will
       NEVER be marked as spam, no matter what:

	      :0 wf:$HOME/.qsflock
	      * ! ^Subject: Your .* is on fire
	      * ! ^From: .*@foobar.com
	      | qsf -ra

       For more on procmail(1) recipes, see the procmailrc(5) and
       procmailex(5) manual pages.

       A couple of macros to add to your .muttrc file, if you use
       mutt(1) as a mail user agent:

	      # Press F5 to mark a message as spam and delete it
	      macro index <f5> "<pipe-message>qsf -am\n<delete-message>"
	      macro pager <f5> "<pipe-message>qsf -am\n<delete-message>"

	      # Press F9 to mark a message as non-spam
	      macro index <f9> "<pipe-message>qsf -aM\n"
	      macro pager <f9> "<pipe-message>qsf -aM\n"

       Again, remove the -a option in the above examples  if  you
       don't want to use the allow-list.

THE ALLOW-LIST
       A feature called the "allow-list" can be switched on by
       specifying the --allowlist or -a option.  This causes each
       message's "From:" address to be checked against a list of
       senders whose messages you have chosen always to accept;
       if the address is in the list, the message is never marked
       as spam.  This means you can add all your friends to the
       allow-list, and qsf will then never mis-file their
       messages.  A quick way to do this is to use -a with -T
       (train): everyone who has sent you an email in your
       non-spam folder will be added to the allow-list
       automatically during training.

       In  general, you probably always want to enable the allow-
       list, so always specify the -a option when using qsf.

       The only times you might want to turn it off are when peo-
       ple  on your allow-list are prone to getting viruses or if
       a virus is causing email to be sent to you  that	 is  pre-
       tending to be from someone on your allow-list.

BACKUP AND RESTORE
       Because	the database format is platform-specific, it is a
       good idea to periodically dump the database to a text file
       using  qsf -D so that, if necessary, it can be transferred
       to another machine and restored with qsf -R later on.

       Also note that since the actual contents of email messages
       are  never stored in the database (see TECHNICAL DETAILS),
       you can safely share your qsf database with friends - sim-
       ply dump your database to a file, like this:

	      qsf -D > your-database-dump.txt

       Once  you have sent your-database-dump.txt to another per-
       son, they can do this:

	      qsf -R < your-database-dump.txt

       They will then have an identical database to yours.

TECHNICAL DETAILS
       When a message is  passed  to  qsf,  any	 attachments  are
       decoded,	 all  HTML  elements are removed, and the message
       text is then broken up into "tokens", where a "token" is a
       single  word  or	 URL.  Each token is hashed using the MD5
       algorithm (see below for why), and that hash is then  used
       to look up each token in the qsf database.
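
       The tokenise-and-hash step can be sketched in Python as
       follows.  This is only an illustration: the token pattern
       and the function names are assumptions, not qsf's actual
       tokeniser, and attachment decoding and HTML stripping are
       left out.

```python
import hashlib
import re

# Assumed token pattern: a URL, or a run of word characters.
# qsf's real tokeniser may split text differently.
TOKEN_RE = re.compile(r"https?://\S+|\w+")


def tokenise(text):
    """Return the lower-cased word and URL tokens found in text."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def hash_token(token):
    """Return the MD5 hex digest of a token, used here as the lookup key."""
    return hashlib.md5(token.encode("utf-8")).hexdigest()


def lookup_keys(text):
    """Map message text to the list of hashed keys to look up."""
    return [hash_token(token) for token in tokenise(text)]
```

       Only the digests returned by lookup_keys() would be used
       as database keys; the tokens themselves are discarded.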

       In  addition  to	 single-word  tokens, qsf adds doubled-up
       tokens so that word pairs get added to the database.  This
       makes  the  database  a bit bigger (although the automatic
       pruning tends to take care of  that)  but  makes	 matching
       more exact.
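
       A minimal sketch of how word pairs might be generated from
       a token list is shown here.  Joining adjacent tokens with
       a single space is an assumption made for illustration;
       the way qsf actually forms its doubled-up tokens is not
       documented in this manual.

```python
def with_pairs(tokens):
    """Return the tokens followed by each pair of adjacent tokens."""
    # zip() pairs every token with its successor; the last token has none.
    pairs = [first + " " + second for first, second in zip(tokens, tokens[1:])]
    return list(tokens) + pairs
```

       For a message containing n tokens this adds n - 1 pairs,
       which is why the database grows faster with pairs enabled.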

       Within the database, each token has two numbers associated
       with it: the number of times that token has been	 seen  in
       spam,  and  the	number	of times it has been seen in non-
       spam.  These two numbers, along with the total  number  of
       spam  and  non-spam messages seen, are then used to give a
       "spamminess" value for that particular token.  This "spam-
       miness"	value  ranges from "definitely not spammy" at one
       end of the scale, through "neutral" in the middle,  up  to
       "definitely spammy" at the other end.
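
       This manual does not give the formula qsf uses.  As a
       rough illustration, a commonly used Bayesian-style
       estimate can be written as below; the function name, the
       neutral value of 0.5 and the clamping limits of 0.01 and
       0.99 are all assumptions, not qsf's real values.

```python
def spamminess(spam_count, nonspam_count, total_spam, total_nonspam):
    """Estimate how spammy a token is, on a scale from 0.01 to 0.99.

    spam_count / nonspam_count: times the token was seen in each class.
    total_spam / total_nonspam: number of messages seen in each class.
    """
    # A token never seen before, or an untrained database, is neutral.
    if spam_count + nonspam_count == 0 or total_spam == 0 or total_nonspam == 0:
        return 0.5
    spam_freq = min(1.0, spam_count / total_spam)
    nonspam_freq = min(1.0, nonspam_count / total_nonspam)
    probability = spam_freq / (spam_freq + nonspam_freq)
    # Clamp so that no single token is treated as absolute proof.
    return max(0.01, min(0.99, probability))
```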

       Once  a	"spamminess" value has been calculated for all of
       the tokens in the message, a summary calculation	 is  made
       to give an overall "is this spam?"  probability rating for
       the message.  If the overall probability is 0.9 or  above,
       the message is flagged as spam.
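
       The summary calculation itself is not specified here.  One
       plausible way to combine per-token values, assuming the
       tokens are treated as independent evidence, is sketched
       below.  Only the 0.9 threshold comes from this manual; the
       combining rule and the function names are assumptions.

```python
import math

SPAM_THRESHOLD = 0.9  # from the manual: 0.9 or above is flagged as spam


def overall_probability(token_probs):
    """Combine per-token values (each strictly between 0 and 1)."""
    if not token_probs:
        return 0.5  # nothing to go on: neutral
    # Work with logarithms so that long messages do not underflow.
    log_spam = sum(math.log(p) for p in token_probs)
    log_nonspam = sum(math.log(1.0 - p) for p in token_probs)
    # spam / (spam + nonspam) rewritten as 1 / (1 + nonspam / spam).
    difference = min(log_nonspam - log_spam, 700.0)  # keep exp() in range
    return 1.0 / (1.0 + math.exp(difference))


def is_spam(token_probs):
    """Return True if the combined probability reaches the threshold."""
    return overall_probability(token_probs) >= SPAM_THRESHOLD
```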

       One  thing  to note is that any non-textual attachments in
       the message (images, binary executables, etc) are  treated
       as  single  tokens - that is, if an image is attached to a
       message, the whole image is treated as a single "word" for
       the  purposes of the database.  This allows binary attach-
       ments to be processed instead of ignoring them,	while  at
       the  same time avoiding filling up the database with junk.

       In addition to the probability test is  the  "allow-list".
       If  enabled  (with  the	-a option), the whole probability
       check is skipped if the sender of the message is listed in
       the allow-list, and the message is not marked as spam.
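
       The order of the two checks can be sketched as follows.
       The allow-list is modelled here as a set of MD5 digests of
       lower-cased addresses; that representation, and the
       function names, are assumptions for illustration only.

```python
import hashlib
from email import message_from_string
from email.utils import parseaddr


def sender_key(raw_message):
    """Return the MD5 digest of the "From:" address, or None if absent."""
    message = message_from_string(raw_message)
    _, address = parseaddr(message.get("From", ""))
    if not address:
        return None
    return hashlib.md5(address.lower().encode("utf-8")).hexdigest()


def classify(raw_message, allow_list, probability_check):
    """Return True if the message should be marked as spam."""
    key = sender_key(raw_message)
    if key is not None and key in allow_list:
        # Allow-listed sender: the probability check is skipped entirely.
        return False
    return probability_check(raw_message)
```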

       When  training  the  database,  a message is split up into
       tokens as described above, and then  the	 numbers  in  the
       database	 for  each token are simply added to: if you tell
       qsf that a message is spam, it adds one to the "number  of
       times  seen  in	spam"  counter for each token, and if you
       tell it a message is not spam, it adds one to the  "number
       of times seen in non-spam" counter for each token.  If you
       specify a weight, with -w, then the number you specify  is
       added instead of one.
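
       A minimal sketch of this update, with the database
       modelled as a dictionary mapping each hashed token to a
       [spam, non-spam] pair of counters.  The data structure and
       the function name are assumptions, and the per-message
       totals are omitted for brevity.

```python
def mark(database, token_keys, as_spam, weight=1):
    """Add weight to the spam or non-spam counter of every token key."""
    index = 0 if as_spam else 1  # 0 = seen in spam, 1 = seen in non-spam
    for key in token_keys:
        counters = database.setdefault(key, [0, 0])
        counters[index] += weight
```

       Correcting a message that was wrongly tagged as spam would
       then correspond to mark(database, keys, as_spam=False,
       weight=2), matching the use of -M with -w 2.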

       To  stop	 the database growing uncontrollably, it is auto-
       matically pruned after every 500th update (or, in training
       mode,  after  every  15 rounds).	 The pruning step removes
       any tokens with very small message counts, and scales  all
       counts down if they start getting too large.
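
       The pruning step might look roughly like this.  The
       thresholds min_count and max_count, and the choice to
       halve the counters, are placeholders chosen for
       illustration; the values qsf really uses are not given in
       this manual.

```python
def prune(database, min_count=2, max_count=1000):
    """Drop rarely seen tokens and scale down counters that grow too large.

    database maps each hashed token to a [spam, non-spam] counter pair.
    """
    # Remove tokens whose combined count is very small.
    rare = [key for key, (spam, nonspam) in database.items()
            if spam + nonspam < min_count]
    for key in rare:
        del database[key]
    # If any counter has grown too large, halve every counter.
    largest = max((max(pair) for pair in database.values()), default=0)
    if largest > max_count:
        for pair in database.values():
            pair[0] //= 2
            pair[1] //= 2
```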

       Finally,	 the  reason MD5 hashes were used is privacy.  If
       the actual tokens from the messages, and the actual  email
       addresses  in  the  allow-list, were stored, you could not
       share a single qsf database between multiple users because
       bits  of	 everyone's  messages  would be in the database -
       things like emailed passwords, keywords relating	 to  per-
       sonal  gossip, and so on.  So a hash is stored instead.	A
       hash is a "one-way" function; it is easy to turn	 a  token
       into  a	hash but very hard (some might say impossible) to
       turn a hash back into the token	that  created  it.   This
       means  that  you	 end  up with a database with no personal
       information in it.

AUTHOR
       The primary author of qsf, to whom  bug	reports,  patches
       etc should be sent, is:

	      Andrew Wood <andrew.wood@ivarch.com>
	      http://www.ivarch.com/

       Suggestions  and	 patches have been received from the fol-
       lowing people:

	      Tom Parker <http://www.bits.bris.ac.uk/palfrey/>

	      Dr Kelly A. Parker

	      Vesselin Mladenov <http://www.antipodes.bg/>

BUGS
       If you find any bugs, please contact the author, either by
       email or by using the contact form on the web site.

SEE ALSO
       procmail(1), procmailrc(5), procmailex(5)

       Someone	has  written a guide to using qsf with KMail that
       can be found at:
       http://www.softwaredesign.co.uk/Information.SpamFilters.html

LICENSE
       This  is	 free  software,  distributed  under the ARTISTIC
       license.

Linux			   August 2003			   QSF(1)
