# ----------------------------------------------------------
# TextOCR scanner and image validator SA-plugin v 1.7
# Written by M. Blapp, ImproWare AG, Switzerland
# ----------------------------------------------------------
# 
# README:
# -------
# 
# textocr.pm is a plugin for spamassassin 3.1+ that detects
# suspect pictures and extracts text from them with gocr.
# The OCR dictionary functionality has been replaced with
# regexes. The plugin can also verify the validity of
# pictures and detect spoofing of the content type.
# 
# 
# HISTORY:
# --------
# 
# 31.03.2006, v. 1.00
# 
# Initial revision
# 
# 01.04.2006, v. 1.01
# 
# Added more words to scanlist
# 
# 02.04.2006, v. 1.02
# 
# Check return values of netpbm utils
# 
# 03.04.2006, v 1.1
# 
# Remove the eval function and replace
# it with parsed_metadata(). Now we can
# track errors and count the words we found.
# The plugin now detects forged content type
# entries.
# 
# 03.04.2006, v 1.11
# 
# Add a check for suspect pictures and add some
# score for it. There are new pics going around
# with obfuscated content so ocr scanners are useless
# again :-(
# 
# 04.04.2006, v. 1.2
# 
# The GIF module from Image::ExifTool doesn't recognize
# GIFs without a colortable as valid pics and just skips them.
# The result is a matching SPAMPIC_BROKEN_GIF entry, which is
# wrong. You should definitely patch your Image::ExifTool
# installation with the patch provided at
# http://antispam.imp.ch/patches/patch-GIF-Colortable
# 
# 04.04.2006, v. 1.2.1
# 
# Don't scan small pictures, not even for header parsing, as
# Image::ExifTool seems to have problems with them again. Fix the
# size calculations.
# 
# 04.04.2006, v. 1.2.2
# 
# Count the non-standard Image::ExifTool failures as soft errors
# and add NONSTD_ tests for them.
# 
# 08.04.2006, v 1.3
# 
# Many more words to scan for. Added a second method to scan jpeg
# pics, which helps with pics that have a white font and a lot of
# noisy distortion. Rename the SUSPECT_ tests and lower the scores
# for them.
# 
# 08.04.2006, v 1.3.1
# 
# Added two more jpeg scanmethods, which give a higher match
# probability.
# 
# 12.04.2006, v 1.4
# 
# Added a timeout (default 10 seconds) and changed the scanmethods.
# Now we also scan normalized pnm files, which seems to help a lot
# on some jpegs. Removed some debug statements.
# 
# 12.04.2006 v. 1.4.1
# 
# Add a scanlimit: only scan a limited number of images.
# Fix a logic error: really redirect all error output to
# stderr. This was implemented some time ago, but only now works.
# 
# 12.04.2006 v. 1.4.2
# 
# Rename some vars to make them more logical, add new spamwords.
# Add some perldoc documentation. Change minpixratio_ocr to
# 4000 as there are more and more suspect pics around.
# 
# 14.04.2006 v. 1.5
#
# Important change. Ignore raw pnm files if parsing has failed
# or gocr dumped core (yes this can happen, I'll soon post
# a fix for gocr).
#
# 14.04.2006 v. 1.6
#
# Important change. Alter the whole plugin to use pipes and
# kill stalled pids after we leave the 'helper_run_mode'. Added
# three count rules to count alphanumeric chars.
#
# 14.04.2006 v. 1.6.1
#
# Sort out identical chars. Moire effects and patterns are often found
# in pictures, and after an OCR scan they show up as repeated chars of
# the same type. These are not really a sign of words. Added some
# examples about the ALPHA rules.
# 
# 06.06.2006 v. 1.6.2
#
# Fix typo: pngtpnm -> pngtopnm. Now png pictures finally work too.
#
# 09.06.2006 v. 1.7
#
# Add rules against multiple small pictures in HTML mails where
# OCR is almost useless.
#
#
# DESCRIPTION:
# ------------
# 
# Scan suspect pictures and parse them with gocr. Very large and
# very small pictures are skipped. The suspicious word list has to
# be defined in the spamassassin conf.
#
# 
# NOTICE:
# -------
# 
# 'r' and 'n' look very similar, and many OCR programs often
# can't tell them apart. So use '[rn]' in your regexes instead
# of a single char.
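#
# A minimal sketch of this advice, not code from the plugin itself:
# the helper name ocr_tolerant_regex and the sample strings are made
# up for illustration. It widens every 'r' or 'n' in a plain lowercase
# word to the character class '[rn]' and shows which OCR-like strings
# the resulting regex accepts.
#
```perl
use strict;
use warnings;

# Hypothetical helper (not part of textocr.pm): widen every 'r' or 'n'
# in a plain lowercase word into '[rn]' and compile it case-insensitively.
sub ocr_tolerant_regex {
    my ($word) = @_;
    (my $pattern = quotemeta $word) =~ s/[rn]/[rn]/g;
    return qr/$pattern/i;
}

my $re = ocr_tolerant_regex('mortgage');    # pattern becomes mo[rn]tgage

# Made-up strings standing in for gocr output.
for my $ocr_text ('low MORTGAGE rates', 'low montgage rates', 'low mortage rates') {
    printf "%-20s %s\n", $ocr_text, ($ocr_text =~ $re ? 'match' : 'no match');
}
```
#
# The first two strings match, including the one where the OCR pass
# confused 'r' with 'n'. The third has a dropped letter and does not
# match, since the character class only covers the r/n confusion.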
# 
# 
# INSTALLATION:
# -------------
# 
# You'll need:
# 
# - Perl module Image::ExifTool and a patch for GIF pics:
#   http://antispam.imp.ch/patches/patch-GIF-Colortable
#
# - Gocr from http://jocr.sourceforge.net and a patch to
#   avoid segfaults with gocr:
#   http://antispam.imp.ch/patches/patch-gocr-segfault
#
# - Netpbm from http://netpbm.sourceforge.net
# 
# You can extract the plugin with 'patch < patch-ocrtext'
#
# Check that the necessary tools (giftopnm, jpegtopnm, pngtopnm,
# djpeg and pnminvert) are all installed in the same path.
#
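# One way to run that check, as a rough sketch and not part of the
# plugin: the helper name find_in_path is made up, and gocr is added
# to the list as an assumption because it is required as well. The
# script reports the PATH directory in which each tool was found.
#
```perl
use strict;
use warnings;
use File::Spec;

# Tool names from the installation notes; gocr is added as an assumption.
my @tools = qw(giftopnm jpegtopnm pngtopnm djpeg pnminvert gocr);

# Hypothetical helper: return the first PATH directory that contains an
# executable file with this name, or undef if there is none.
sub find_in_path {
    my ($tool) = @_;
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $tool);
        return $dir if -f $candidate && -x _;
    }
    return;
}

for my $tool (@tools) {
    my $dir = find_in_path($tool);
    printf "%-10s %s\n", $tool, defined $dir ? "found in $dir" : 'MISSING';
}
```
#
# If the tools show up in different directories, or any of them is
# reported as MISSING, fix the installation before enabling the plugin.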
