Mail Server Content Filter Download

The Content Filter requires that Microsoft XML 4.0 be installed on the system where the mail server is running. If this is the first time you are using this version of the filter, run the Registry Editor and check for the key "HKLM\Software\Classes\Msxml2.domdocument.4.0". If it is not present, first install XML 4.0 from http://www.microsoft.com/xml/.

The filter will function with XML 3.0 but it is recommended that XML 4.0 be installed.

Refer to the README.TXT file included in the specific distribution for instructions on installing and uninstalling the filter.

 

Version 2.3.7

for Communigate Pro for Windows
for Microsoft Exchange 2000
for Merak Mail (Unicode Version for Windows NT/2000/XP)
for Merak Mail (ASCII Version for Windows 95/98/ME) - (it will function on NT/2000/XP but offers no performance counter support)

 

Known Issues

   

Revision History

Version 2.4.0, Not Yet Released
Added global variables:
__is_auth
__is_not_auth
__is_local_ip
__is_not_local_ip
__is_ssl
__is_not_ssl

These are supported by Merak Mail version 8.0.0 and higher only. Each variable is 0 if false and 1 if true. For other versions of the filter, and for Merak Mail prior to v8.0.0, they evaluate as not authenticated, not local IP, and not SSL. These global variables can be used in any mathematical script during email processing. Generally, the intent is that rules can be relaxed, or the mail whitelisted outright, if the sender is authenticated or connecting from a trusted IP. The sample configuration files do not use these variables yet.

Added global variable __mail_file_size, which holds the number of bytes in the mail file being processed, including all headers and attachments.
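
As a rough illustration of how the 0/1 flags could relax a rule, the sketch below models them in Python. It is not the filter's rule syntax: the default_globals and spam_threshold helpers and the threshold values are hypothetical, and the complementary "not" variables are assumed to be 1 whenever the positive form is 0.

```python
# Hypothetical model of the 2.4.0 global variables; NOT the filter's rule syntax.

def default_globals():
    """Values assumed where the mail server cannot supply the information:
    not authenticated, not a local IP, and not SSL."""
    return {
        "__is_auth": 0, "__is_not_auth": 1,
        "__is_local_ip": 0, "__is_not_local_ip": 1,
        "__is_ssl": 0, "__is_not_ssl": 1,
    }

def spam_threshold(g, base=5, relaxed=10):
    """Return a more lenient (hypothetical) score threshold for trusted senders."""
    trusted = g["__is_auth"] or g["__is_local_ip"]
    return relaxed if trusted else base
```

With the defaults, spam_threshold returns the stricter base value, which matches the behavior described for mail servers that do not supply these flags.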
   
Version 2.3.7, Released 18 July 2004
Added <dnslookup> node to permit use of DNS-based RBLs. All IP addresses and domain names are extracted from the email based on the selected <scope>, and DNS is queried with a configurable domain name using the IP address, the extracted domain name, or the IP address of the extracted domain name. Sample configurations using multi.surbl.org and sbl-xbl.spamhaus.org are provided in regexflt_filters.xml. Documentation for the new nodes will be available online soon.
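
For readers unfamiliar with DNS-based lists, the sketch below shows the conventional way a query name is built: the IPv4 octets are reversed and the list's zone is appended, while a domain is simply prefixed to the zone. This is the general convention, assumed here for illustration; the helper names are hypothetical and the filter's own <dnslookup> behavior may differ in detail.

```python
import ipaddress

def ip_query_name(ip, zone):
    """Build the DNS name for an IP-based list: reversed octets plus the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def domain_query_name(domain, zone):
    """Build the DNS name for a domain-based list: the domain plus the zone."""
    return domain.strip(".").lower() + "." + zone
```

A listing is normally signalled by the query resolving to an address record; no answer means the IP or domain is not listed.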
   
Version 2.3.6, Released 12 April 2004
Added log output options to see actual collection contents being processed by an email.
When a header line appears to be a continuation (blanks or tabs at the start) but contains no content, the line had been appended to the previous line, effectively leaving the previous line unchanged. However, such a line, marked as a continuation but with no content, represents a potential Outlook folding vulnerability. The filter no longer appends such a line; instead it processes the line intact as a separate header line so that expressions can detect it and block the email.
Some firewalls drop an SMTP connection when detecting malformed data and leave the mail server with a partial email, usually with no body. The filter incorrectly counted the last line received as a body line when accumulating statistics, which prevented the rule that checks body size from seeing an empty email and rejecting it. This is now corrected.
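
The continuation-line change can be pictured with the following sketch, which unfolds header lines but keeps a whitespace-only continuation as its own line. The unfold_headers function is a hypothetical illustration, not the filter's parser.

```python
def unfold_headers(lines):
    """Join folded header lines onto the previous line, but keep a
    continuation line that has no content as a separate line."""
    result = []
    for line in lines:
        is_continuation = line[:1] in (" ", "\t")
        if is_continuation and line.strip() and result:
            # Normal folding: append the content to the previous header line.
            result[-1] += " " + line.strip()
        else:
            # New header line, or an empty continuation kept intact.
            result.append(line)
    return result
```

Because the empty continuation survives as a separate entry, an expression can match it directly.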
   
Version 2.3.5, Released 22 January 2004
Added global counter of HTML entity codes representing alphanumeric characters which can be used to identify attempts at obscuring common spam text. The global variable _html_quoted_alnum_count represents the number of occurrences of characters a-z, A-Z and 0-9 that are encoded using the &#nnn; form. A new rule using this was added to the regex_bulkmailers.xml configuration file.
Several global variables for use in rules were being updated during each pass of the email through the filters instead of one time only; this inflated totals that could be used to make decisions on mail disposition.
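
A minimal sketch of the kind of count described for the entity-code variable above: it tallies decimal &#nnn; entities that decode to a-z, A-Z, or 0-9. The function name is hypothetical and the filter's own counting rules may differ.

```python
import re

_NUMERIC_ENTITY = re.compile(r"&#(\d+);")

def quoted_alnum_count(text):
    """Count &#nnn; entities whose code point is an ASCII letter or digit."""
    count = 0
    for match in _NUMERIC_ENTITY.finditer(text):
        code = int(match.group(1))
        if code < 128 and chr(code).isalnum():
            count += 1
    return count
```

For example, the text "V&#105;agr&#97; &#33;" contains two encoded letters and one encoded punctuation mark, so only two entities are counted.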
   
Version 2.3.4, 15 December 2003
Collections "text", "text_packed", "text_stripped", "packed" and "stripped" enhanced to convert accented characters to their non-accented counterparts.
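
One common way to fold accented characters into their unaccented counterparts is Unicode decomposition, sketched below. This is an assumption for illustration only; the filter may use its own mapping table, and characters without a decomposition (such as "ø") are left unchanged by this approach.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters, then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```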
   
Version 2.3.3, 10 December 2003
Strip out uuencoded data from email text before processing expressions.
Fixed "text_packed" collection (was incorrectly stripping all text and leaving punctuation -- a typo).
Updated RegEx engine with latest code (no functional differences; performance improvements).
   
Version 2.3.2, 3 December 2003
Merak Mail: Automatically disable <multiplerecipients> processing when old Merak interface is used. The feature cannot be provided without the new Merak interface being supported by the version of Merak Mail in use.
Added "text_packed" collection.
   
Version 2.3.0, 10 November 2003
Communigate Pro: initial interface version inquiry failed because of expected parameter not being present in request.
Added collection attribute to <expression> and <expressions> nodes to specify the variation of the mail content against which to test the expression. Variations include translated (original), stripped HTML, stripped punctuation, stripped whitespace, HTML tags only, and certain combinations.
Added field to log output to identify the variation the expression was tested against.
Merak Mail: Corrected extraction of recipient addresses from old Merak filter interface parameters.
   
Version 2.2.6, 29 October 2003
Fixed buffer overflow in SMTP mailer related to a large list of recipients.
Fixed buffer overflow in <copy> processing related to a large list of recipients.
   
Version 2.2.5, 26 October 2003
If no default <result> was specified in the first filter processed for an email, a null pointer was dereferenced, causing processing of that email to fail. This condition is now detected and an appropriate undefined result error is written to the log.
The <clientip> node under <scope> was not properly parsed; it now functions as documented.
Some boundary header lines were not parsed properly resulting in attachments not being separated, which could result in some base64-encoded attachments not being scanned in their decoded state.
Added global variable __mail_body_size to be used to test for number of bytes in email body such as in a test for an empty email. Header content is not counted. See regexflt_filters.xml for example usage.
Exchange 2000: <copy> operation now performed using multibyte character set to aid in redelivery by dropping email into pickup directory. Unicode files were not accepted in the pickup directory.
Added two retries on SMTP connection failures when trying to process <email> nodes; additional error information is logged.
   
Version 2.2.4, 5 August 2003
Minor fix in HTML entity code expansion to correctly ignore entity codes not formatted correctly.
Minor fix in comment removal to check for badly terminated comments; too much was being ignored for "<!--" prefixed comments when there was no proper "-->" terminator.
Added trace output of regular expression compilation errors.
Fixed creation of URL-safe and Filename-safe macro substitutions when high ASCII characters are present (128-255).
   
Version 2.2.3, 24 July 2003
Adjusted code that reloads configuration files when changed so that if the reload fails the previous configuration continues to be used.
Merak Mail: When processing a message retrieved from a POP3 mailbox, Merak Mail does not provide recipient information to the filter. This special case was not handled properly and resulted in acceptance of the email regardless of the actual result code the filter determined was to be used.
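
The reload behavior in the first item above follows a familiar pattern: parse the new configuration fully, and only replace the active one if parsing succeeds. The sketch below illustrates that pattern with Python's standard XML parser; the ConfigHolder class is hypothetical and does not reflect the filter's internals.

```python
import xml.etree.ElementTree as ET

class ConfigHolder:
    """Keeps the last configuration that parsed successfully."""

    def __init__(self):
        self.current = None

    def reload(self, xml_text):
        """Try to parse xml_text; keep the previous configuration on failure."""
        try:
            new_config = ET.fromstring(xml_text)
        except ET.ParseError:
            return False  # previous configuration stays in use
        self.current = new_config
        return True
```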
   
Version 2.2.2, 18 July 2003
<result> nodes are now processed in the order the nodes within it appear. (Previously the order was all <copy> and <imap> nodes, then all <modifysource> nodes, then <http> nodes, then <email> nodes)
   
Version 2.2.1, 15 July 2003 (Private Release)
Tokenized email for Bayesian corpus database purposes. Bayesian filtering methods are being explored; this is not (yet) a feature of the RegEx filter that can be used.
Added full list of HTML entity codes to expansion table.
Expanded list of MIME types where attachments are skipped.
   
Version 2.2.0, 9 July 2003
Replaced RegEx engine with one designed from the ground up for C++, eliminating Unicode memory issues. The new engine has some advanced features, and slightly different syntax requirements, however at this time all RegEx expressions loaded are converted when required. No changes are required to your existing RegEx expressions at this time. (A future version will permit specifying that this translation is not needed so as to maintain compatibility with all expressions created prior to this engine.)
   
Version 2.1.9, 7 May 2003
Fixed an elusive bug that caused a variety of error reports in the logs (rare for some, common for others). It was related to the Unicode versions of the filter and the presence of characters outside the single-byte range of 0-255. A character mapping array that was indexed by the character value was not being allocated properly for Unicode characters. This fix may increase execution time or memory utilization, so be aware of this if you have a heavily loaded server with large user-customized XML files; it should only significantly affect the initial loading of an XML configuration. Optimizations are being explored, since in the majority of cases characters outside the single-byte range are not used. If you have a heavily loaded Merak mail server, we recommend temporarily using the ASCII version of the filter rather than the Unicode version to avoid problems caused by the increased memory requirements of this fix. We will optimize this in another release soon.
To facilitate individual per-user filtering configurations, emails destined for multiple recipients are now separated and remailed to each individual recipient so that processing and final disposition of each email can be handled independently. Handling of multiple-recipient emails is defined in the new <configuration> node <multiplerecipients>. If this node does not exist, multiple-recipient emails are handled as they were before this feature was implemented. To use this feature when upgrading, copy the entire <multiplerecipients> node from the sample configuration file to your installation. This feature is currently only supported in the Merak Mail version.
The whitelist filter from regexflt_whitelist.xml has been split into two different filters (primary & secondary) and the default configuration now uses one filter prior to any other filters, and one filter after some processing that can reject an email. This was done as an example of the many ways to create filters and to work through a problem where spam was beginning to use many common terms that were whitelisted (order/reservation/confirmation typically). If you copy the new regexflt_whitelist.xml file into your existing installation, you must make some adjustments within your main <index> XML file.
Tweaked some expressions in the regexflt_bulkmailers.xml configuration file based on feedback about false rejects. (Remember, the filter is just a tool and the samples provided are a good starting point for many sites, but the intention is that you customize these to your needs based on the types of email handled by your server.)
   
Version 2.1.8, 1 May 2003
Added noinreplyto option to <header> node.
Added stripping of null HTML sequences like <i></i> before expressions are processed.
Added <modifysource> node to specify actions to take to modify an email as part of a <result>. This feature is currently only available in the Merak Mail version.
Full conversion of all possible numeric quoted HTML characters (e.g., "&#123;") before expressions are processed.
   
Version 2.1.7, 23 April 2003
Added global variable __HTML_COMMENT_INSIDE_WORD_COUNT that contains count of number of HTML comments appearing bounded by alphanumeric characters, indicating obfuscated text.
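
To make the idea concrete, the sketch below counts HTML comments that have an alphanumeric character immediately before and after them, which is how a word such as "Vi<!-- x -->agra" is split. The regular expression and function name are hypothetical approximations of the statistic, not the filter's implementation.

```python
import re

# A comment whose neighbours on both sides are letters or digits.
_COMMENT_IN_WORD = re.compile(
    r"(?<=[A-Za-z0-9])<!--(?:(?!-->).)*-->(?=[A-Za-z0-9])",
    re.DOTALL,
)

def comment_inside_word_count(html):
    """Count HTML comments bounded by alphanumeric characters."""
    return len(_COMMENT_IN_WORD.findall(html))
```

A comment that sits between words, with whitespace on either side, is not counted.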
   
Version 2.1.6, 19 April 2003
Excluded Microsoft Word HTML comments surrounding version-dependent structures from HTML comment statistics.
   
Version 2.1.5, 17 April 2003
Changes made in v2.1.4 to treat a message part as text/html when no content type is specified affected how the filter skipped over binary attachments, causing expressions to be tested against decoded base64 binary content (and producing false rejects as a result). The content-type header is now an important factor in determining how to handle a message part. Additional content types have been added that will not be processed by the filter in any way. A future update will have a user-configurable list of content types (MIME types) to skip; for now, the internal list has been significantly expanded to cover many common MIME types.
Reorganized regexflt_bulkmailers.xml configuration file for easier maintenance.
   
Version 2.1.4, 4 April 2003
Mail content defaults to text/html unless an explicit content-type header is encountered. This accommodates filtering of email that takes advantage of some mail clients' ability to display an HTML email without a content type in the header and that obfuscates the content by embedding comments.
Added macro {source:bytecount} that expands into the number of bytes of the source email file.
HTML comment filter now looks for <!  through >.
Added global variables __HTML_COMMENT_BAD_COUNT and __HTML_COMMENT_COUNT, which contain the number of invalid and valid HTML comments, respectively, found in the message content. These can be used within the logic attribute of the <expressions> node.
   
Version 2.1.3, 7 November 2002
Return actual result buffer size to Merak Mail.
Fixed 2-byte overflow writing to result buffer that would overwrite result code being returned to Merak Mail.
All macros can now be accessed as HTML-safe text using ".html" at the end of the macro name.
All macros can now be accessed as URL-safe text using ".url" at the end of the macro name.
All macros can now be accessed as filename-safe text using ".file" at the end of the macro name.
Added log <option> type result buffer to control logging of raw result buffer returned to mail server.
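
The three macro suffixes above amount to three escaping modes applied to the same value. The sketch below shows one plausible reading using Python's standard library; the expand_macro function and the exact set of characters replaced for ".file" are assumptions, not the filter's rules.

```python
import html
import re
from urllib.parse import quote

def expand_macro(value, suffix=""):
    """Return value escaped according to a hypothetical macro suffix."""
    if suffix == ".html":
        return html.escape(value)        # & < > and quotes become entities
    if suffix == ".url":
        return quote(value, safe="")     # percent-encode reserved characters
    if suffix == ".file":
        # Replace characters that Windows does not allow in file names.
        return re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", value)
    return value
```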
   
Version 2.1.2, 1 November 2002
Fixed bug introduced in v2.1.0 where the filter was incorrectly counting accept and reject results for each recipient in multiple recipient emails to determine overall result to return to Merak. If you're using v2.1.0, upgrade to this version!
Removed debugging code that was timing RegEx processing for each expression. This was not supposed to be in the release version.
   
Version 2.1.1, 29 October 2002 (Private Release)
Added capability to provide global variables for use by filters so that information about the email can be used in logic expressions.
Changed sample to use global variables __lf_count and __cr_count to test for unbalanced CR & LF (instead of an internally determined result), allowing more fine tuning of this particular filter.
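
The CR/LF balance test mentioned above can be sketched as a comparison of two counters. The function names and the tolerance parameter below are hypothetical; the shipped sample expresses the test in the filter's own logic syntax using __lf_count and __cr_count.

```python
def line_ending_counts(raw):
    """Count carriage returns and line feeds in the raw message bytes."""
    return {"__cr_count": raw.count(b"\r"), "__lf_count": raw.count(b"\n")}

def has_unbalanced_line_endings(raw, tolerance=0):
    """True when the CR and LF counts differ by more than the tolerance."""
    counts = line_ending_counts(raw)
    return abs(counts["__cr_count"] - counts["__lf_count"]) > tolerance
```

Exposing the raw counts rather than a fixed yes/no result is what allows the tolerance to be tuned per site.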
   
Version 2.1.0, 22 October 2002
Significant improvement in performance processing large text/html and base64-encoded text/plain and text/html message parts. Previously, inefficient buffering of data prior to processing would take exponentially longer when processing larger message parts, often resulting in mail timeouts. For example, a selection of 4,222 emails collected in September 2002 processed in 5,029 seconds using v2.0.22. The same 4,222 emails processed in 656 seconds using this version. The improvement is significant only for emails that require buffering before processing; this is not a blanket improvement in processing speed for any email.
Added mail attribute to <copy> node to permit the exclusion of the email content in the copied file (useful when it is desired only to write data from the prepend or append attribute). The default if the attribute is not present includes copying the email.
Added operation attribute to <copy> node to permit specifying overwrite (replacing or creating the destination file) or append (appending to the destination file). Default is overwrite. Appending is multi-thread safe.
Added <maxbuffer> node to <parser> node to allow a maximum buffer size to be specified. When certain types of messages are found, such as text/html or encoded text/plain parts, they are buffered in full before processing instead of being processed line-by-line. If the total size of the buffered message part exceeds what is specified in this option, the buffer is trimmed before processing such that one half of the specified characters from the beginning of the buffer and one half of the specified characters from the end of the buffer make up what is processed.
Added high resolution time information to log lines.
Reorganized common code shared with the AntiVirus filter DLL.
Added macros for time of day.
Added log <option> type copy to control logging of copied files.
Added attribute enabled to <option> node to selectively enable/disable log options without removing the declaration.
Fixed a buffer overflow, triggered by emails that have TO header lines with more than 4,096 bytes of recipient data, that caused the SMTP process to terminate.
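
The <maxbuffer> trimming rule described above keeps the head and tail of an oversized part and discards the middle. A minimal sketch, assuming the limit is split evenly with integer division (trim_buffer is a hypothetical name):

```python
def trim_buffer(data, max_size):
    """Keep half of max_size from the start and half from the end."""
    if len(data) <= max_size:
        return data
    half = max_size // 2
    return data[:half] + data[len(data) - half:]
```

For example, trimming "abcdefghij" to a limit of 4 keeps "ab" from the start and "ij" from the end.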
   
Version 2.0.23, 12 October 2002 (private release)
Added prepend and append attributes to the <copy> node to configure writing of information to the copied email before and/or after the email itself.
<imap> node internally stored as if it were a <copy> node with the IMAP header data configured as if it were in a prepend attribute.
Multiple (unlimited) <copy> nodes can be specified in a <result> node.
   
Version 2.0.22, 11 October 2002
Fix to one XML function call during filter loading that was using the wrong method and did not function properly under XML 4.0 though it worked under XML 3.0.
   
Version 2.0.21, 10 October 2002
XML 4.0 supported if available on the system.
   
Version 2.0.20, 9 October 2002
Internal changes to how macros are processed.
<target_index> node implemented to permit a hierarchy of <index> nodes in other files.
Added name attribute to <index> node.
Added log <option> types interface and interface buffer to log information about the interface used to call the filter and the raw data passed using the interface.
<log> options were not reinitialized prior to reloading configuration when configuration changes were detected.
Extensive email support added as an optional action <email> node in a <result> node. Unlimited emails can be defined. Each email can have any number of body elements and attachments.
The benign error message regarding missing <rules> node in the sample "pass_everything" filter will no longer appear.
Separated white-list rules in sample configuration into their own filter file.
   
Version 2.0.19, 29 August 2002
Corrected buffer initialization for RegEx operator '[]'. In the Unicode version of the filter the buffer defining the character set was only half-initialized resulting in significant repetitious matching whenever characters over ASCII 127 were encountered within an email being processed.
Added a preliminary version of a filter designed to catch the infamous Nigeria bank scam emails.
   
Version 2.0.18, 22 August 2002
Exchange 2000: Fixed error introduced in 2.0.14 that allowed rejected mail to be delivered even though the SMTP client received an error indicating non-delivery.
Split mregexflt_filters.xml sample configuration file into several files to provide an example of using multiple <target_filter> nodes in a configuration.
Added several more expressions to mregexflt_bulkmailers.xml that capture additional bulk mailers based on link information.
Added fully qualified filename to logging information on main index load.
   
Version 2.0.17, 14 August 2002
Merak Mail: Fixed <copy> operation so that it functions under Windows 95/98/ME. Previously a function was used to perform the copy that wasn't supported under these operating systems so the copy failed.
   
Version 2.0.16, 13 August 2002 (private release)
Added error checking/log output on problems creating or writing to copied files (based on <copy> or <imap> nodes in <result>).
   
Version 2.0.15, 13 August 2002
Updated XML configuration sample.
Replaced <type> node with <option> node for <log> configuration. You can continue using the old <type> nodes, however if you use any of the new <option> nodes, you must convert the <type> nodes to <option> nodes.
   
Version 2.0.14, 11 August 2002 (private release)
Corrected parsing of message parts for a situation where a line containing spaces followed the end of a message boundary header. Prior to this fix, when this was encountered, processing of that email would terminate, an error would be logged, and the email would be accepted.
Log output of final result returned to mail server for the email. In the case of multiple recipients where some recipients would accept the email and some would reject it, it is necessary to accept the email for all recipients because there is not yet support for partial delivery of an incoming email to a subset of recipients. While log output per-recipient correctly identifies the proper disposition of the email for that recipient, the result actually controlling final disposition might be different because of other recipients' dispositions.
Added attribute "behavior" to <select> node to control whether processing uses the first matching node or all matching nodes. With the support for multiple <target_filter> nodes, it was necessary to provide a way to choose between stopping the attempt to match additional <select> nodes that exist within a <select> node (like a C++ switch/case statement) and continuing to find all applicable matches. See the <select> node documentation for a discussion of the methodology. The default behavior is "single" to maintain backward compatibility with the sample configuration files provided prior to this version.
Added macro "{smtp:recipient_id}". When an email arrives for multiple recipients, it is processed once for each recipient because there may be different filters that apply to some recipients. Using this macro in a filename for a <copy> or <imap> node you can ensure a unique copy of the email is made for each recipient.
Added <select on="all"> as a method of using behavior="single" or behavior="all" against a sub-group of <select> and <target_filter> nodes.
Added option to log results of all rules with any matched expression. This is helpful in reviewing performance of rules and expressions. See <option> for a description.
Removed performance counter support from ASCII version; it was not loadable under Windows 95/98 because of a DLL requirement only present in NT/2000/XP.
   
Version 2.0.13, 7 August 2002 (private release)
Implemented support for multiple <target_filter> nodes to select multiple filters to process against an email. The filters are processed in the order defined; duplicate filters are only run once. The first filter to create a result ends processing and that result is returned. If no filter creates a result, the default result defined in the first filter is returned.
RCPTO log line now is the raw RCPTTO data. A new log line RECIPIENT has been added with the individual recipient being processed.
Changes to logging output to more readily identify email-related log entries versus filter-wide and internal error log entries.
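
The processing order described for multiple <target_filter> nodes can be summarized in a short sketch: run filters in order, skip duplicates, stop at the first result, and otherwise fall back to the first filter's default. The tuple layout and run_filters name are hypothetical.

```python
def run_filters(filters, email):
    """filters is a list of (name, callable, default_result) tuples.
    Each callable returns a result, or None when it produces no result."""
    seen = set()
    for name, func, _default in filters:
        if name in seen:
            continue  # duplicate filters are only run once
        seen.add(name)
        result = func(email)
        if result is not None:
            return result  # first filter to create a result ends processing
    # No filter produced a result: use the first filter's default.
    return filters[0][2] if filters else None
```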
   
Version 2.0.12, 1 August 2002 (private release)
One obscure memory leak fixed in RegEx code; complex expressions on long source lines could leave a temporary buffer allocated after examination.
   
Version 2.0.11, 31 July 2002
A variety of memory leaks were fixed, including a major one introduced in v2.0.8. It is strongly recommended to upgrade to this version from v2.0.8 or later because of a significant growth in memory usage that occurs as each email is processed.
   
Version 2.0.10, 29 July 2002
Communigate Pro: Errors during processing will result in a FAILURE message returned to the mail server to prevent mail server timeouts.
Communigate Pro: Parsing queue header information from file to obtain MAIL FROM, RCPT TO, and CLIENT IP data.
Communigate Pro: Multiple RCPT TO addresses are processed individually against filter rules. If all recipients receive the same result, it is returned to Communigate Pro; otherwise the lowest level of non-delivery is returned. Essentially, if any recipient needs to get the email, all will. Communigate Pro offers no method of returning results per-recipient.
Mail processed by Merak's remote accounts feature was not getting processed by the filter because no recipient information was passed to the filter. The filter will now process an email received with no recipient information.
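
The Communigate Pro multiple-recipient rule above reduces to taking the least severe per-recipient result. In the sketch below the result names and their ordering are hypothetical placeholders; only the "lowest level of non-delivery wins" logic comes from the release note.

```python
# Hypothetical result names, ordered from delivery to strongest non-delivery.
SEVERITY = {"accept": 0, "quarantine": 1, "reject": 2}

def combined_result(per_recipient):
    """Return the least severe result among all recipients."""
    return min(per_recipient, key=SEVERITY.__getitem__)
```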
   
Version 2.0.9, 14 July 2002
Message parts with content type "text/html" or "text/plain" that are BASE64-encoded are decoded and processed against filter rules as a single line.
Minor fixes to "text/html" pre-processing; space added between source lines when concatenated.
Changed log output to allow for more digits in length of line field as entire message parts are processed as a single line. Log output will be revised in a future version to provide more formatting and configuration options.
Added node <helo> to <scope> options permitting filter rules against HELO string from SMTP client.
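
The BASE64 handling in the first item above can be sketched as: join the encoded lines, decode them, and flatten the decoded text into one line. The charset default and the use of a single space between decoded lines are assumptions made for this illustration.

```python
import base64

def decode_text_part(encoded_lines, charset="utf-8"):
    """Decode a BASE64 text part and return its content as a single line."""
    raw = base64.b64decode("".join(line.strip() for line in encoded_lines))
    text = raw.decode(charset, errors="replace")
    # Join the decoded lines with a space so expressions see one line.
    return " ".join(text.splitlines())
```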
   
Version 2.0.8, 13 July 2002
Message parts with content type "text/html" are pre-processed to remove all "<!--  -->" comments and <br>'s and convert certain escaped codes to real characters (such as &nbsp; and &amp;) and the entire message part is processed against filter rules as a single line.
Lengthy source lines are now logged truncated, only showing the relevant matched portion of the line plus a small range before and after the match.
Fixed buffer allocation that was one byte too small (occurred on continuation lines of quoted-printable message parts).
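
The truncated logging of long source lines described above could look like the following sketch, which keeps only the matched text plus a few characters of context on each side. The log_excerpt function, the context width, and the "..." markers are hypothetical.

```python
import re

def log_excerpt(line, pattern, context=10):
    """Return the matched portion of line with some surrounding context,
    or None when the pattern does not match."""
    match = re.search(pattern, line)
    if match is None:
        return None
    start = max(match.start() - context, 0)
    end = min(match.end() + context, len(line))
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(line) else ""
    return prefix + line[start:end] + suffix
```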
   
Version 2.0.7, 12 July 2002
Fixed quickresult evaluation from last header line (it was being ignored).
Added performance counter DLL to distribution to permit real-time monitoring of filter activities. Functional only for Windows NT/2000/XP.
   
Version 2.0.6, 9 July 2002 (private release)
Added workaround for an incorrectly formatted request buffer passed to the filter from Merak Mail 5.0.1beta (new interface format only).
Additional information logged when a problem occurs parsing the request buffer passed from Merak Mail.
   
Version 2.0.5, 8 July 2002 (private release)
Merak Mail only: Changed handling of multiple recipients to process a filter for each recipient. If any recipient accepts the message, the message is delivered to all recipients. (Information is returned to Merak Mail to distinguish those recipients the mail should be delivered to; however, this information is not presently used as of Merak Mail 5.0.1.)
Optimized code so email is not reformatted for XML transmission via HTTP unless a result defined in the applicable filter requires it.
RCPT TO data is now passed through the <select> nodes stripped of extraneous characters and is only the email address.
Output filter version information to log file.
   
Version 2.0.4, 3 July 2002 (private release)
XML format errors are now logged in detail. XML parsing errors are also logged, but currently with minimal information; you'll know something is not correctly configured, at least.
Full support for new Merak Mail filter interface allowing return of SMTP error codes and message to be delivered to the SMTP client.
Fixed problem with synchronization of reloaded configuration files that occurs on heavily loaded mail servers.
Errors during configuration file processing that occur before a log output file is defined will be appended to "regexflt.log" instead of ignored.
   
Version 2.0.3, 24 June 2002
Fixed error in assembling result codes in the experimental support for a proposed new content filter interface for Merak Mail. (This has no effect on current use in a production environment with released beta Merak Mail 5.0 versions; v2.0.2 is OK to keep using.)
   
Version 2.0.2, 22 June 2002
Enabled sending the full email (without attachments but with headers of attachments) via the <http> result node in XML form. Documentation and sample ASP code forthcoming in another release.
Experimental support for a proposed new content filter interface for Merak Mail implemented.
   
Version 2.0.1, 18 June 2002
Added <delay> node to permit specifying a delay in milliseconds as a form of tarpitting SMTP clients sending rejected mail.
Recompiled to include run-time code that was inadvertently left out and expected to exist as a separate DLL.
   
Version 2.0.0, 18 June 2002
First public release of beta version for Merak Mail and Microsoft Exchange 2000.
