Filtering in tin

0. Status

This is an overview of the new filtering capabilities of tin. This
document will be absorbed into the main documentation at some point.


1. Introduction

Tin's filtering mechanism has changed significantly since version 1.3beta.
Originally there were only two possibilities:

1) kill an article matching a rule.
2) hot-select an article matching a rule.

This led to constant confusion: it seemed important which rule came
first in the filter file, but it wasn't. Also, once an article was
selected for whatever reason, it couldn't be killed, even if it was Craig
Shergold telling you how to make money fast in a crosspost to alt.test.
This binary concept isn't modern anyway, so a much more up-to-date fuzzy
mechanism was necessary: scoring.

When using tin's new scoring mechanism you assign a "score" to each
filter rule. The scores of rules matching the current article are added
and the final score of the article decides if it is regular, marked hot
or killed.
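
As a rough illustration of the additive model, here is a minimal Python
sketch. The names Rule and article_score are hypothetical, fnmatchcase
from the Python standard library merely stands in for tin's own wildmat
matcher, and only the Subject: header is looked at.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Rule:
    subj: str   # wildcard pattern matched against the Subject: header
    score: int  # weight added to the article when the pattern matches


def article_score(subject: str, rules: list[Rule]) -> int:
    """Add up the scores of all rules whose pattern matches the subject."""
    return sum(r.score for r in rules if fnmatchcase(subject, r.subj))


rules = [
    Rule(subj="*money*", score=-100),
    Rule(subj="*tin*", score=100),
    Rule(subj="*ANNOUNCE*", score=50),
]

# Two rules match here, so the article ends up with -100 + 100.
print(article_score("make money with tin", rules))
# The order of the rules in the list does not change the sum.
print(article_score("make money with tin", list(reversed(rules))))
```

Because the scores are simply added, the position of a rule in the filter
file no longer matters.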

The standard "kill" and "select" rules already in your filter-file get
the scores "score_kill" and "score_select" respectively (see section 4).


2. Changes to the filter-file format

Tin understands the additional "score" command in the filter-file now.

Old style rule:

scope=*
type=0
case=0
subj=*$$$*

New style rule:

group=*
case=0
score=-100
subj=*$$$*
#####

So you can give each individual rule a weight, based on your opinion
of the rule. E.g. if you want to be sure never to read a certain
individual again, you may give the rule a score of -9000.

If you want only "classical" filtering and don't want to mess around
with score values, you can use the magic words "kill" and "hot" as score
values in your filter file. Example:

group=*
case=0
score=kill
subj=*$$$*
#####

These are handled as default values at program initialization time and
may be somewhat easier to remember.
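
A minimal Python sketch of how the magic words could be resolved. The
function name resolve_score is hypothetical, and the defaults are the
recommended values for "score_kill" and "score_select" from section 4.

```python
def resolve_score(value: str, score_kill: int = -100,
                  score_select: int = 100) -> int:
    """Turn the text after "score=" into a number."""
    word = value.strip()
    if word == "kill":
        return score_kill
    if word == "hot":
        return score_select
    # Anything else is expected to be a plain (possibly negative) integer.
    return int(word)


print(resolve_score("kill"), resolve_score("hot"), resolve_score("-9000"))
```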

You might have noticed from the examples above that tin now inserts a
line of hashes between two rules. This is *not* required, it just improves
readability.
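
The following Python sketch reads such a file into a list of rules. It is
an illustration, not tin's parser: the name parse_filter is hypothetical,
and it assumes that a rule begins at the first "comment=" or "group=" line
that follows the match lines of the previous rule. Lines starting with "#"
are skipped.

```python
def parse_filter(text: str) -> list[dict[str, list[str]]]:
    """Split filter-file text into rules; each rule maps a command to its values."""
    rules: list[dict[str, list[str]]] = []
    current = None
    in_header = False  # True while reading the comment=/group= lines opening a rule
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines of hashes carry no information.
        if not line or line.startswith("#") or "=" not in line:
            continue
        command, value = line.split("=", 1)
        starts_rule = command in ("comment", "group")
        if current is None or (starts_rule and not in_header):
            current = {}
            rules.append(current)
        in_header = starts_rule
        current.setdefault(command, []).append(value)
    return rules


with_hashes = (
    "group=*\ncase=0\nscore=-100\nsubj=*$$$*\n#####\n"
    "group=*\ncase=0\nscore=kill\nsubj=*$$$*\n"
)
without_hashes = with_hashes.replace("#####\n", "")

# Both variants yield the same two rules.
print(parse_filter(with_hashes) == parse_filter(without_hashes))
```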


3. Changes in the filter menu

The on-screen filter menu is now more compact and fits easily on small
terminals such as a small xterm or a 640x200 CON: window. It has been
enhanced to allow you to enter a score for the rule you are adding. The
score should be in the range from 1 to SCORE_MAX; otherwise it defaults
to "score_select" for select filter rules and "score_kill" for kill filter
rules (see section 4).
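
The range check can be sketched in Python as follows. The function name
menu_score is hypothetical, the constants are the recommended values from
section 4, and storing the entered value as a negative score for kill
rules is an assumption of this sketch, not something stated above.

```python
SCORE_MAX = 10000  # recommended value from section 4


def menu_score(entered: int, is_kill: bool,
               score_kill: int = -100, score_select: int = 100) -> int:
    """Return the score stored for a rule added through the filter menu."""
    if 1 <= entered <= SCORE_MAX:
        # Assumption: a kill rule stores the entered value as a negative score.
        return -entered if is_kill else entered
    # Out of range: fall back to the configured default for this rule type.
    return score_kill if is_kill else score_select


print(menu_score(250, is_kill=False), menu_score(0, is_kill=True))
```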


4. Internal defaults and config options

There are some constants defined in tin.h and tinrc:

SCORE_MAX is the maximum score an article can reach. Any value
above this is cut to SCORE_MAX, the same goes for negative scores.
recommended: 10000

"score_kill" is the default score given for any kill rule, if no other
is specified.
recommended: -100

"score_select" is the default score for any auto-selection rule, if no
other is specified.
recommended: 100

"score_limit_kill" and "score_limit_select" are the limits that must be
crossed to mark an article as killed or selected.
recommended when used with values given above: -50/+50.

"score_kill", "score_select", "score_limit_kill" and "score_limit_select"
are config options. You can find them in tin's configuration file
(~/.tin/tinrc). They can also be changed at runtime in the config menu.
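
To show how these values interact, here is a minimal Python sketch using
the recommended settings. The function names are hypothetical, and
treating a score that exactly reaches a limit as having crossed it is an
assumption of this sketch.

```python
SCORE_MAX = 10000


def clamp(score: int) -> int:
    """Cut a score to the range -SCORE_MAX .. SCORE_MAX."""
    return max(-SCORE_MAX, min(SCORE_MAX, score))


def classify(score: int, limit_kill: int = -50, limit_select: int = 50) -> str:
    """Decide whether an article is hot, killed or regular."""
    score = clamp(score)
    if score >= limit_select:
        return "hot"
    if score <= limit_kill:
        return "killed"
    return "regular"


# One default kill rule (-100) plus one rule with twice the default
# select score (+200) leaves the article selected.
print(classify(-100 + 200))
```

With these limits a single default kill or select rule is enough to cross
the threshold, while a kill and a select rule of equal weight cancel out
and leave the article regular.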


5. Overview of "filter"-commands

Everything here is also described in the file ~/.tin/filter, albeit more
concisely.

All lines are of the form:
command=value

Valid "command"s are:

add a comment to the following rule:

comment= a short text

Multiple comment lines may be used; comment lines _must_ be right before
the scope selection.

scope selection:

group=newsgroup_pattern_list

newsgroup_pattern_list is a comma-separated list of newsgroup_patterns

newsgroup_patterns can be a pattern (wildmat-style) or !pattern,
negating the match of pattern. This is the same format used for the
AUTO(UN)SUBSCRIBE environment variable.
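
A small Python sketch of matching a group name against such a list. The
function name group_matches is hypothetical, fnmatchcase stands in for
wildmat, and letting a later pattern override an earlier one is an
assumption of this sketch.

```python
from fnmatch import fnmatchcase


def group_matches(group: str, pattern_list: str) -> bool:
    """Check a newsgroup name against a comma-separated pattern list."""
    accepted = False
    for pattern in pattern_list.split(","):
        pattern = pattern.strip()
        if not pattern:
            continue  # ignore empty entries
        negated = pattern.startswith("!")
        if negated:
            pattern = pattern[1:]
        if fnmatchcase(group, pattern):
            # Assumption: the last matching pattern decides.
            accepted = not negated
    return accepted


scope = "news.software.*,!news.software.readers"
print(group_matches("news.software.nntp", scope))
print(group_matches("news.software.readers", scope))
```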

Tin doesn't rework your filter file; the new pattern matching is only
used when you enter new entries by hand.

additional info:

case=num    num: 0=casesensitive, 1=caseinsensitive
score=num   num: score value of rule, can now also be one of the magic words
                 "kill" or "hot", which are equivalent to
                 "score_kill" and "score_select" respectively (see section 4).
time=num    num: time_t value; when rule expires. When tin writes the filter
                 file it adds the time in human readable form as a comment in
                 parentheses after the numeric value. When reading the file
                 tin uses _only_ the numeric value, not the human readable form.
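
Reading a "time=" value can be sketched in Python like this. The function
names are hypothetical, the date text in the sample is only an example of
what such a comment might look like, and treating a rule as expired once
its time is reached is an assumption.

```python
import time
from typing import Optional


def parse_expiry(value: str) -> int:
    """Return the numeric time_t, ignoring any trailing comment."""
    return int(value.split()[0])


def is_expired(value: str, now: Optional[float] = None) -> bool:
    """Tell whether the rule's expiry time has been reached."""
    if now is None:
        now = time.time()
    return parse_expiry(value) <= now


sample = "1735689600 (Wed Jan  1 00:00:00 2025)"
print(parse_expiry(sample))
```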

matches:            matched to:

subj=pattern        Subject:
from=pattern        From:
                    Tin converts the contents of the From: header to an old-style
                    e-mail address, i.e. "some@body.example (John Doe)"
                    instead of "John Doe <some@body.example>", before trying to
                    match the patterns in the filter rule. That way a rule tailored
                    to match the full From: header "jsmith@ac.example (John Smith)"
                    will still work when John posts with a different newsreader
                    which uses "John Smith <jsmith@ac.example>".
msgid=pattern       Message-Id: *AND* full References:
msgid_last=pattern  Message-Id: and last References: entry only
msgid_only=pattern  Message-Id:
refs_only=pattern   References: line (e.g. <123@ether.net>) without Message-Id:
lines=num           Lines: ; <num matches less than, >num matches more than.
gnksa=[<>]?NUM      GNKSA parse_from() return code
xref=pattern        Xref: ; filter crossposts to groups matching pattern
xref_max=num                } Xref: ; potentially obsolete, collects
xref_score=num,pattern      } scores for crossposting in other
                            } newsgroups
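
The From: conversion described for "from=" can be approximated in Python.
This is only a sketch: oldstyle_from is a hypothetical name and
email.utils.parseaddr from the standard library stands in for tin's own
address parser. The regular expression is the one from the last example in
section 6.2.

```python
import re
from email.utils import parseaddr


def oldstyle_from(header: str) -> str:
    """Rewrite a From: header as "address (Real Name)"."""
    name, addr = parseaddr(header)
    return f"{addr} ({name})" if name else addr


rule = re.compile(r"@ac\.example\s\(John\sSmith\)$", re.IGNORECASE)

# Both spellings of the header end up in the same form, so one rule
# covers articles posted with either kind of newsreader.
for header in ("John Smith <jsmith@ac.example>",
               "jsmith@ac.example (John Smith)"):
    converted = oldstyle_from(header)
    print(converted, bool(rule.search(converted)))
```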

When you are using wildmat pattern-matching, patterns in ~/.tin/filter
should be delimited with "*", and verbatim wildcards in patterns must be
escaped with "\". When using the built-in filter-file functions, tin tries
to take care of this by itself, except when you are entering text in the
built-in kill/hot-menu. Then you have to quote manually because tin
doesn't know if e.g. "\[" is already quoted or not.
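
If you write patterns by hand, the quoting can be sketched in Python as
follows. The function name quote_wildmat is hypothetical, and the set of
characters treated as wildcards here ("*", "?", "[", "]" and "\") is an
assumption about what needs escaping.

```python
def quote_wildmat(literal: str) -> str:
    """Turn literal text into a wildmat pattern matching it anywhere in a line."""
    escaped = "".join("\\" + ch if ch in "*?[]\\" else ch for ch in literal)
    return "*" + escaped + "*"


print(quote_wildmat("[URGENT] 50% off?"))
```

The function must only be applied to unquoted text; running it twice would
escape the backslashes again.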

GNKSA return codes: these are the return codes of the From:-address
parser, enabling you to filter on certain kinds of syntactic and
semantic errors present in that header. For an up-to-date list see the
definitions in extern.h and the parser source code in misc.c; the
following is just a short introduction.

   0-99: internal codes
code   error description
   0   no error, valid address
   1   internal error, should not happen (blame me)

   100-199: general syntactical errors
code   error description
 100   left angle bracket ("<") missing in route address
 101   left parenthesis ("(") missing in oldstyle address (realname comment)
 102   right parenthesis (")") missing in oldstyle address (realname comment)
 103   at-sign ("@") missing in mail address

   200-299: right hand side (FQDN part) of address, syntax and semantics
code   error description
 200   right hand side (RHS) of address is a single component
 201   RHS has an unknown top level domain (3 or more characters)
 202   RHS has a malformed top level domain
 203   RHS has an unknown country code as top level domain
 204   illegal character in RHS
 205   leading or trailing dot or two consecutive dots in RHS
 206   RHS has a component longer than 63 characters
 207   RHS has a component with leading or trailing hyphen ("-")
 208   RHS has a component starting with a digit (with ENFORCE_RFC1034 only)
 209   RHS is not a valid IP address
 210   RHS is an IP address from private IP space (see RFC1918) or loopback
 211   brackets ("[", "]") around IP address missing in RHS

   300-399: syntactical errors left hand side (localpart) of address
code   error description
 300   there was no localpart found at all in address
 301   localpart contains illegal characters
 302   localpart has leading, trailing or consecutive dots

   400-499: syntactical errors in realname part
code   error description
 400   illegal character in unquoted word in realname part
 401   illegal character in quoted word in realname part
 402   illegal character in encoded word in realname part
 403   bad syntax in encoded word in realname part
 404   illegal character in oldstyle realname part (one of "()<>\")
 405   illegal character in realname part
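
The code ranges above and the "gnksa=[<>]?NUM" syntax can be sketched in
Python. Both function names are hypothetical, and reading "<NUM" as
"less than" and ">NUM" as "more than" (as documented for "lines=") is an
assumption for "gnksa=".

```python
def gnksa_category(code: int) -> str:
    """Map a return code to the range it belongs to."""
    categories = {
        0: "internal codes",
        1: "general syntactical errors",
        2: "right hand side (FQDN part)",
        3: "left hand side (localpart)",
        4: "realname part",
    }
    return categories.get(code // 100, "unknown")


def gnksa_rule_matches(spec: str, code: int) -> bool:
    """Compare a return code with the value of a "gnksa=" line."""
    if spec.startswith("<"):
        return code < int(spec[1:])
    if spec.startswith(">"):
        return code > int(spec[1:])
    return code == int(spec)


# ">0" catches every address the parser complained about.
print(gnksa_category(203), gnksa_rule_matches(">0", 203))
```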


6. EXAMPLES

6.1 WILDMAT EXAMPLES

none given, too simple, find out yourself ,-)

6.2 REGEXP EXAMPLES

Be sure to change the Wildcard setting from WILDMAT (default) to REGEX to
make the following examples work properly. This can be done using the
internal configuration menu or in the file ~/.tin/tinrc.

comment= this kills all articles about CNews, DNEWS or diablo
comment= in news.software.* but not in news.software.readers
group=news.software.*,!news.software.readers
case=1
score=kill
subj=([cd]news|diablo)


comment= this should mark all articles about tin, rtin, tind, ktin or cdtin
comment= as hot
group=*
case=1
score=hot
subj=\b(cd|[rk]?)?tin(d|pre)?[-.0-9]*\b
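
A pattern like this can be tried out before it goes into the filter file.
The Python sketch below uses the standard re module as a stand-in for
tin's regular expression engine, so small differences are possible; the
name mentions_tin is hypothetical, and "case=1" is modelled with
re.IGNORECASE.

```python
import re

TIN_SUBJECT = re.compile(r"\b(cd|[rk]?)?tin(d|pre)?[-.0-9]*\b", re.IGNORECASE)


def mentions_tin(subject: str) -> bool:
    """Tell whether the subject contains one of the tin program names."""
    return TIN_SUBJECT.search(subject) is not None


for subject in ("rtin crashes on startup", "TIN 2.6 released",
                "Testing a new server", "Latin grammar"):
    print(subject, mentions_tin(subject))
```

The \b anchors keep words such as "Testing" or "Latin" from matching.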


comment= mark own articles and followups to own articles as hot in all groups
comment= except local ones
comment= match From: (a bit complex) and/or
comment= Message-ID: (I'm the only user who's posting on this server)
group=*,!akk.*,!tin.*
case=1
score=hot
from=urs@(.*\.)?((akk\.uni-karlsruhe|arbeitsen)\.de|(karlsruhe|tin|akk)\.org|ka\.nu)
msgid=@akk3(?:-dmz)?\.akk\.uni-karlsruhe\.de>


comment= stupid ppl. sometimes read control.cancel to see if there are any
comment= forged cancels around... the next rule helps you a bit
comment= ignore known despammers and net.* cancels
group=control.cancel
case=1
score=kill
from=(news@news\.msfc\.nasa\.gov|clewis@ferret\.ocunix\.on\.ca|jem@xpat\.com|(jeremy|lysander)@exit109\.com|howardk@iswest\.com|cosmo.roadkill.*rauug\.mil\.wi\.us|spamless@pacbell\.net|cwilkins@.*\.clark\.net)
msgid_only=<net-monitor-cancel


comment= this might help when reading alt.*
comment= ignore all postings with $$$ or *** or !!!
comment= ignore all postings shorter than 3 lines
comment= ignore all postings crossposted into more than 10 groups
group=alt.*
case=1
score=-200
subj=[$*!]{3,}
lines=<3
xref_max=10


comment= mark own articles and direct replies based on message-id
comment= use 2*hot as score to unkill otherwise killed articles
group=*
case=1
score=200
msgid_last=doeblitz\.ts\.rz\.tu-bs\.de

comment= unmark own articles based on message-id
comment= -> only f'ups to own articles keep marked hot
group=*
case=1
score=-200
msgid_only=doeblitz\.ts\.rz\.tu-bs\.de


comment= kill all articles which do not have your message-id
comment= as last reference _if_ article has any references
group=de.newusers.questions
case=1
score=-100
refs_only=.*<[^@\s]+@\S+(?<!akk3\.akk\.uni-karlsruhe\.de)>$
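
The negative lookbehind in this rule is easy to get wrong, so here is a
quick check in Python. The standard re module stands in for tin's regular
expression engine, the name would_kill is hypothetical, and the sample
References: lines are made up.

```python
import re

NOT_MY_REPLY = re.compile(
    r".*<[^@\s]+@\S+(?<!akk3\.akk\.uni-karlsruhe\.de)>$", re.IGNORECASE)


def would_kill(references: str) -> bool:
    """True if the last reference does not come from akk3.akk.uni-karlsruhe.de."""
    return NOT_MY_REPLY.search(references) is not None


reply_to_me = "<1@foo.example> <2@akk3.akk.uni-karlsruhe.de>"
reply_to_other = "<2@akk3.akk.uni-karlsruhe.de> <3@foo.example>"

print(would_kill(reply_to_me), would_kill(reply_to_other), would_kill(""))
```

An article without any references never matches, so new threads are left
alone.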

comment= Kill all articles from John Smith, who writes under different
comment= addresses at ac.example, e.g. john@ac.example and boss@ac.example
group=*
case=1
score=kill
from=@ac\.example\s\(John\sSmith\)$

7. TODO

- make the time value in the filter file more human readable.
- rewrite filtering order to get optimal performance
- filtering on arbitrary header lines
- throw out "xref*" when it is obsolete for sure
