# ----------------------------------------------------------
# TextOCR scanner and image validator SA-plugin v. 3.1
# Written by M. Blapp, ImproWare AG, Switzerland
# ----------------------------------------------------------
# 
# README:
# -------
# 
# textocr.pm is a plugin for SpamAssassin 3.1+ that detects
# suspect pictures and extracts text from them with gocr.
# The OCR dictionary functionality has been replaced with
# regexes. The plugin can also verify the validity of
# pictures and detect spoofing of the content type.
# 
# 
# HISTORY:
# --------
# 
# 31.03.2006, v. 1.00
# 
# Initial revision
# 
# 01.04.2006, v. 1.01
# 
# Added more words to scanlist
# 
# 02.04.2006, v. 1.02
# 
# Check return values of netpbm utils
# 
# 03.04.2006, v 1.1
# 
# Remove the eval function and replace
# it with parsed_metadata(). Now we can
# track errors and count the words we found.
# The plugin now detects forged content type
# entries.
# 
# 03.04.2006, v 1.11
# 
# Add a check for suspect pictures and add a
# score for them. There are new pics going around
# with obfuscated content, so OCR scanners are useless
# again :-(
# 
# 04.04.2006, v. 1.2
# 
# The GIF module from Image::ExifTool doesn't recognize
# GIFs without a color table as valid pics and just skips them.
# The result is a matching SPAMPIC_BROKEN_GIF entry, which is
# wrong. You should definitely patch your Image::ExifTool installation
# with the patch provided at http://antispam.imp.ch/patches/patch-GIF-Colortable
# 
# 04.04.2006, v. 1.2.1
# 
# Don't scan small pictures, not even for header parsing, as
# Image::ExifTool again seems to have problems with them. Fix the
# size calculations.
# 
# 04.04.2006, v. 1.2.2
# 
# Count the non-standard Image::ExifTool failures as soft errors
# and add NONSTD_ tests for them.
# 
# 08.04.2006, v 1.3
# 
# Many more words to scan for. Added a second method to scan jpeg
# pics, which helps with pics that have a white font and a lot of
# noisy distortion. Rename the SUSPECT_ tests and lower the scores
# for them.
# 
# 08.04.2006, v 1.3.1
# 
# Added two more jpeg scan methods, which increase the chance
# of a match.
# 
# 12.04.2006, v 1.4
# 
# Added a timeout (default 10 seconds) and changed the scan methods.
# We now also scan normalized pnm files; this seems to help a lot
# with some jpegs. Removed some debug statements.
# 
# 12.04.2006 v. 1.4.1
# 
# Add a scan limit: only scan a limited number of images.
# Fix a logic error: all error output is now really redirected to
# stderr. I implemented this some time ago, but only now does it work.
# 
# 12.04.2006 v. 1.4.2
# 
# Rename some vars to make them more logical, add new spamwords.
# Add some perldoc documentation. Change minpixratio_ocr to
# 4000 as there are more and more suspect pics around.
# 
# 14.04.2006 v. 1.5
#
# Important change. Ignore raw pnm files if parsing has failed
# or gocr dumped core (yes, this can happen; I'll soon post
# a fix for gocr).
#
# 14.04.2006 v. 1.6
#
# Important change. Alter the whole plugin to use pipes and
# kill stalled pids after we leave the 'helper_run_mode'. Added
# three count rules to count alphanumeric chars.
#
# 14.04.2006 v. 1.6.1
#
# Sort out identical chars. Moire effects and patterns are often found in
# pictures, and after an OCR scan they show up as repeated chars of the same
# type. Not really a sign of words. Added some examples for the ALPHA rules.
# 
# 06.06.2006 v. 1.6.2
#
# Fix typo: pngtpnm -> pngtopnm. Now png pictures finally work too.
#
# 09.06.2006 v. 1.7
#
# Add rules against multiple small pictures in HTML mails where
# OCR is almost useless.
#
# 03.09.2006 v. 1.8
#
# Add support for animated gifs. Mostly contributed by Romeo Benzoni.
# Thanks a lot! Add ~10 new rules.
#
# Important: You now need p5-Imager and libungif support.
#
# 08.09.2006 v. 1.9
#
# Handle broken gif pictures and try to fix them if possible. I've
# fixed some of the regexes and added a lot of new rules to match
# the recent spams.
#
# 21.10.2006 v. 2.0
#
# Catch the recent image spam with combined pictures and transparent
# backgrounds, or images which have different offsets. Try to catch those
# tricks all together.
#
# 22.10.2006 v. 2.1
#
# Composed anims were not really correctly combined. Fix this issue.
#
# 26.10.2006 v. 2.2
#
# Catch recent spampics with underline colors. Reorganize the plugin a bit.
# Fix logic error introduced in v 2.1
#
# 17.11.2006 v. 3.0
#
# Add fuzzy string support, but still match full and simple regex
# matches directly. Add a maximum score up to which OCR is still done,
# to prevent useless picture scans. The wordlist is now a simple array
# at the top of the config.
#
# Important: You now need the Perl module String::Approx.
#
# A lot of the new features have been borrowed from the
# Fuzzy OCR Plugin (Thanks, Christian!)
#
# 01.12.2006 v. 3.1
#
# Changed ocrtext_minpixels_ocr so that pictures need only 20000 pixels.
# Changed priority to 100, allowing metatests which did not work
# previously.
# Added ocrtext_pwords, a list of positive words which give negative
# counts. It is left almost empty, since releasing this information would
# give spammers a new opportunity.
#
# DESCRIPTION:
# ------------
# 
# Scan suspect pictures and parse them with gocr. Very large and
# very small pictures are skipped. The suspicious word list has to
# be defined in the SpamAssassin configuration.
#
# 
# NOTICE:
# -------
# 
# 'r' and 'n' look very similar, and many OCR programs often
# can't tell them apart. So just use '[rn]' instead
# of a single char.
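#
# As a sketch of this advice (the word "pharmacy" is only an
# illustrative assumption, not an entry from the plugin's word list),
# a pattern that tolerates the r/n confusion could look like this:
#
#   my $word_re = qr/pha[rn]macy/i;
#   # "phanmacy" is what OCR might return when it reads 'r' as 'n'
#   print "match\n" if "phanmacy" =~ $word_re;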
# 
# 
# INSTALLATION:
# -------------
# 
# You'll need:
# 
# - The Perl module Imager (p5-Imager). You need an already installed
#   libungif port. Please make sure gif support is really enabled.
#
# - Perl module Image::ExifTool and a patch for GIF pics:
#   http://antispam.imp.ch/patches/patch-GIF-Colortable
#
# - Perl module String::Approx
#
# - Gocr from http://jocr.sourceforge.net and a patch to
#   avoid segfaults with gocr:
#   http://antispam.imp.ch/patches/patch-gocr-segfault
#
# - Netpbm from http://netpbm.sourceforge.net
# 
# - Libungif for fixgif and animated gif support.
#
# You can extract the plugin with 'patch < patch-ocrtext'
#
# Check that the necessary tools (giftopnm, jpegtopnm, pngtopnm,
# djpeg, pnminvert) are all in the same path.
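#
# A minimal sketch for this check (not part of the plugin; it assumes
# only the core module File::Spec). It prints the directory in which
# each tool is first found, so you can see whether they share one path:
#
#   use File::Spec;
#   for my $tool (qw(giftopnm jpegtopnm pngtopnm djpeg pnminvert)) {
#       # first directory in PATH that holds an executable $tool
#       my ($dir) = grep { -x File::Spec->catfile($_, $tool) }
#                   File::Spec->path;
#       printf "%-10s %s\n", $tool, defined $dir ? $dir : 'NOT FOUND';
#   }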
#
