Bayes advanced questions

Michael Monnerie wrote:
Dear SA users, I've had an offlist comparison of bayes DBs, and we found
some interesting differences. We're trying to find out why bayes on
server #1 makes better scores:
Server #1 local.cf (SA 3.1.1):
Server #1 bayes dump:
0.000 0 93053 0 non-token data: nspam
0.000 0 53428 0 non-token data: nham
0.000 0 1261864 0 non-token data: ntokens
<snip>
Server #2 bayes dump:
0.000 0 155791 0 non-token data: nspam
0.000 0 80523 0 non-token data: nham
0.000 0 129852 0 non-token data: ntokens
From the numbers I would say that server #2 had learned more spam+ham,
but has about 1/10th of tokens. That server is also far less accurate
with bayes than server #1. Could the ntokens be the reason?
Yes, ntokens could be the reason. Another aspect to consider is the training
input. If either server gets manually trained it will have a big leg-up in
accuracy over one that does not.
Particularly on servers with a site-wide DB used against broadly diverse spread
of mail, increasing the token limit will improve accuracy.
However, this comes at the expense of increased storage needs and slower
performance. (In particular, expiry takes a LOT longer with larger DBs.)
With the new spam of the last few weeks that tries to poison bayes, could
the poisoning be effective against the default of 150,000 tokens?
Possibly, but unlikely. Realistically SA is largely immune to poisoning due to
the chi-squared combining. The size of the DB should matter very little in terms
of poison.
Another tip for all: With server #1 setting
8.00
you could expect this message to be autolearned:
>X-Spam-Status: Yes, hits=8.7 required=5.0 tests=BAYES_99=3.5,
>HTML_MESSAGE=0.001,HTML_MIME_NO_HTML_TAG=0,HTML_TAG_EXIST_TBODY=0.282,
>MIME_HTML_ONLY=0.389,RELAY_DE=0.01,REPLY_TO_EMPTY=0.512,
>SARE_FORGED_EBAY=4 autolearn=no bayes=1.0000
But it is autolearn=no.
I would not expect that message to be autolearned. The score used in checking
thresholds is NOT the same as the final message score. The score used is the
score the message would have got if:
bayes was disabled
the AWL was disabled
no userconf (ie:black/whitelists) rules were enabled.
Since that message scored 8.7, and derives 3.5 of its points from BAYES_99, it
does not surprise me at all that the message was not learned.
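To make the arithmetic concrete, here is a minimal Python sketch that recomputes the score without the Bayes rule, using the rule hits from the X-Spam-Status header quoted above. The helper names are made up for illustration. The sketch only strips BAYES_* hits; it does not model the AWL or userconf exclusions, and it assumes the remaining rule scores stay the same once Bayes is disregarded.

```python
# Rule hits taken from the quoted X-Spam-Status header.
STATUS_TESTS = (
    "BAYES_99=3.5,HTML_MESSAGE=0.001,HTML_MIME_NO_HTML_TAG=0,"
    "HTML_TAG_EXIST_TBODY=0.282,MIME_HTML_ONLY=0.389,RELAY_DE=0.01,"
    "REPLY_TO_EMPTY=0.512,SARE_FORGED_EBAY=4"
)

def parse_tests(tests):
    """Turn 'RULE=score,RULE=score' into a dict of rule name -> float score."""
    hits = {}
    for item in tests.split(","):
        name, _, score = item.strip().partition("=")
        hits[name] = float(score)
    return hits

def autolearn_points(hits):
    """Hypothetical helper: sum all hits except the Bayes rules."""
    return sum(s for name, s in hits.items() if not name.startswith("BAYES_"))

hits = parse_tests(STATUS_TESTS)
total = sum(hits.values())
learn_points = autolearn_points(hits)
threshold_spam = 8.0  # the 8.00 setting from server #1

print(f"reported score:      {total:.1f}")
print(f"score without Bayes: {learn_points:.1f}")
print("autolearn as spam?  ", learn_points >= threshold_spam)
```

Without BAYES_99 the message works out to roughly 5.2 points, well short of the 8.00 threshold, which matches autolearn=no.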
Also, EVEN if the learning score is over the threshold, SA will not learn a
message as spam unless:
there are at least 3.0 points of header rules
there are at least 3.0 points of body rules
Existing learning would not place the message in a low bayes category (ie:
don't learn as spam if the message would have hit BAYES_00 otherwise)
This shows, that manual re-feeding SPAM can be
effective for your Bayes, because this sure-is-spam would not have been
learned automatically.
Very true. You can definitely get great improvements in accuracy by training
manually.
I personally view the autolearner as a supplement to my own training. It is not
a perfect system.
Since it's already BAYES_99, you could say
"don't bother, I'll be fine" *g* but bayes needs to be trained
permanently, because tokens time out
Also realize that just because the message got BAYES_99 doesn't mean there are
no tokens in it that can be learned from. Spam mutates. New phrases and words
creep in. These need to be learned from, even if the current message is already
BAYES_99.
And why was SARE_FORGED_EBAY set down to 4? It was so nice at 100+
If I had to guess, FPs.
Also, we set bayes_expiry_max_db_size to 50000, and made
sa-learn
But still those numbers:
>0.000 0 242424 0 non-token data: nspam
>0.000 0 313252 0 non-token data: nham
>0.000 0 134001 0 non-token data: ntokens
Why are still 134k tokens there?
As per the docs for bayes_expiry_max_db_size, SA will never reduce the bayes DB
below 100,000 tokens, no matter how small you set this value.
bayes_expiry_max_db_size (default: 150000)
What should be the maximum size of the Bayes tokens database? When expiry
occurs, the Bayes system will keep either 75% of the maximum value, or 100,000
tokens, whichever has a larger value. 150,000 tokens is roughly equivalent to a
8Mb database file.
So SA was likely aiming for 100,000. However, SA performs expiry by picking a
"cut-off" age and dropping all the tokens older than that. It keeps stepping
back the cut-off time till it goes under the target token count. It then uses
the previous time.
This effectively prevents bayes from expiring out your whole bayes DB at once if
all the tokens have the same atime.
This time-step approach likely resulted in the 134k tokens. There are likely
more than 35k tokens all with the same age right at the cut-off mark. Had SA
stepped up the age any further, you'd have ended up with fewer than 100k tokens.
No.1 | | 2447 bytes |
Dear SA users, I've had an offlist comparison of bayes DBs, and we found
some interesting differences. We're trying to find out why bayes on
server #1 makes better scores:
Server #1 local.cf (SA 3.1.1):
bayes_expiry_max_db_size 2000000
bayes_auto_expire 0
bayes_file_mode 0777
8.00
1.0
Server #1 bayes files:
-rw-rw-rw-+ 1 vscan vscan 19738624 May 10 10:04 bayes_db_seen
-rw-rw-rw-+ 1 vscan vscan 41697280 May 10 10:04 bayes_db_toks
Server #1 bayes dump:
0.000 0 93053 0 non-token data: nspam
0.000 0 53428 0 non-token data: nham
0.000 0 1261864 0 non-token data: ntokens
Server #2 local.cf:
bayes_auto_learn 1
bayes_learn_to_journal 1
bayes_auto_expire 1
ok_languages de en es
ok_locales en
Server #2 bayes files:
21M 2006-05-10 10:20 bayes_seen
5,3M 2006-05-10 10:20 bayes_toks
Server #2 bayes dump:
0.000 0 155791 0 non-token data: nspam
0.000 0 80523 0 non-token data: nham
0.000 0 129852 0 non-token data: ntokens
From the numbers I would say that server #2 had learned more spam+ham,
but has about 1/10th of tokens. That server is also far less accurate
with bayes than server #1. Could the ntokens be the reason?
With the new spam of the last few weeks that tries to poison bayes, could
the poisoning be effective against the default of 150,000 tokens?
Another tip for all: With server #1 setting
8.00
you could expect this message to be autolearned:
X-Spam-Status: Yes, hits=8.7 required=5.0 tests=BAYES_99=3.5,
HTML_MESSAGE=0.001,HTML_MIME_NO_HTML_TAG=0,HTML_TAG_EXIST_TBODY=0.282,
MIME_HTML_ONLY=0.389,RELAY_DE=0.01,REPLY_TO_EMPTY=0.512,
SARE_FORGED_EBAY=4 autolearn=no bayes=1.0000
But it is autolearn=no. This shows, that manual re-feeding SPAM can be
effective for your Bayes, because this sure-is-spam would not have been
learned automatically. Since it's already BAYES_99, you could say
"don't bother, I'll be fine" *g* but bayes needs to be trained
permanently, because tokens time out
And why was SARE_FORGED_EBAY set down to 4? It was so nice at 100+
Also, we set bayes_expiry_max_db_size to 50000, and made
sa-learn
But still those numbers:
0.000 0 242424 0 non-token data: nspam
0.000 0 313252 0 non-token data: nham
0.000 0 134001 0 non-token data: ntokens
Why are still 134k tokens there?
Kind regards, zmi
No.2 | | 2247 bytes |
On Wednesday, 10 May 2006 23:41, Matt Kettler wrote:
Particularly on servers with a site-wide DB used against broadly
diverse spread of mail, increasing the token limit will improve
accuracy.
However, this comes at the expense of increased storage needs and
slower performance. (In particular, expiry takes a LOT longer with
larger DBs)
DB Files are about 60MB together, so not really big (I just got a
pricelist with the new 750GB SATA drive from Seagate *g*).
And tonight's expiry for server #1:
bayes: synced databases from journal in 11 seconds: 1968 unique entries
(3059 total entries)
So it's not too long also. Could possibly be longer on a server that
gets some million mails per day, of course.
score used is the score the message would have got if:
bayes was disabled
the AWL was disabled
no userconf (ie:black/whitelists) rules were enabled.
That's good info which should be in the man page.
Since that message scored 8.7, and derives 3.5 of its points from
BAYES_99, it does not surprise me at all the message was not learned.
Also, EVEN if the learning score is over the threshold, SA will not
learn a message as spam unless:
there are at least 3.0 points of header rules
there are at least 3.0 points of body rules
Existing learning would not place the message in a low bayes
category (ie: don't learn as spam if the message would have hit
BAYES_00 otherwise)
This is written in the man page, except the last line with the BAYES_00
wasn't clear to me from there. Is this valid just for BAYES_00 and
BAYES_99, or also BAYES_05 and BAYES_95?
Since it's already BAYES_99, you could say
"don't bother, I'll be fine" *g* but bayes needs to be trained
permanently, because tokens time out
Also realize that just because the message got BAYES_99 doesn't mean
there are no tokens in it that can be learned from. Spam mutates. New
phrases and words creep in. These need to be learned from, even if
the current message is already BAYES_99.
Yes, this is very valuable info for others also I believe.
Thanks for your help on this,
Kind regards, zmi
No.3 | | 3495 bytes |
Michael Monnerie wrote:
On Wednesday, 10 May 2006 23:41, Matt Kettler wrote:
>Particularly on servers with a site-wide DB used against broadly
>diverse spread of mail, increasing the token limit will improve
>accuracy.
>>
>However, this comes at the expense of increased storage needs and
>slower performance. (In particular, expiry takes a LOT longer with
>larger DBs)
>
>
DB Files are about 60MB together, so not really big (I just got a
pricelist with the new 750GB SATA drive from Seagate *g*).
And tonight's expiry for server #1:
bayes: synced databases from journal in 11 seconds: 1968 unique entries
(3059 total entries)
That's the journal sync, not the expiry part. The expiry part takes much
longer.
So it's not too long also. Could possibly be longer on a server that
gets some million mails per day, of course.
>score used is the score the message would have got if:
>bayes was disabled
>the AWL was disabled
>no userconf (ie:black/whitelists) rules were enabled.
>
>
That's good info which should be in the man page.
It is. In SA 3.1.x it's in the docs for the autolearn threshold plugin:
>Since that message scored 8.7, and derives 3.5 of its points from
>BAYES_99, it does not surprise me at all the message was not learned.
>>
>Also, EVEN if the learning score is over the threshold, SA will not
>learn a message as spam unless:
>there are at least 3.0 points of header rules
>there are at least 3.0 points of body rules
>Existing learning would not place the message in a low bayes
>category (ie: don't learn as spam if the message would have hit
>BAYES_00 otherwise)
>
>
This is written in the man page, except the last line with the BAYES_00
wasn't clear to me from there. Is this valid just for BAYES_00 and
BAYES_99, or also BAYES_05 and BAYES_95?
I looked into the code for SA 3.1.0's PerMsgStatus.pm and
Plugin/AutoLearnThreshold.pm.
The limitation is actually done by computing score of the bayes rules,
not the actual bayes percentage.
Learning as ham will be inhibited if the score of the "learn" rules (ie:
bayes) totals more than +1.0.
Learning as spam will be inhibited if the score of the "learn" rules (ie:
bayes) totals less than -1.0.
Note: by "learn" rules, I mean rules declared with the "learn" tflag,
which at this time is just bayes.
So in SA 3.1.0, existing training ranking BAYES_00 and BAYES_05 will
inhibit spam learning.
BAYES_60 or higher will inhibit ham learning.
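As a rough illustration of the two limits just described, the following Python sketch classifies a total "learn" rule score. The function name and the two non-3.5 example values are hypothetical; only the +1.0 / -1.0 limits and the 3.5 points for BAYES_99 come from this thread.

```python
def autolearn_inhibited(learn_points):
    """Return which kind of auto-learning the existing Bayes verdict blocks.

    learn_points is the total score of rules carrying the "learn" tflag
    (at this time just the BAYES_* rule that hit).
    """
    if learn_points > 1.0:
        return "ham"   # Bayes already leans spam: do not auto-learn as ham
    if learn_points < -1.0:
        return "spam"  # Bayes already leans ham: do not auto-learn as spam
    return None        # Bayes is close to neutral: nothing is inhibited

print(autolearn_inhibited(3.5))   # BAYES_99 at 3.5, as in the quoted header
print(autolearn_inhibited(-2.0))  # hypothetical strongly negative Bayes score
print(autolearn_inhibited(0.0))   # hypothetical neutral Bayes score
```

Because the check is on rule scores rather than on the Bayes probability, which BAYES_xx rules trigger it depends on the scores configured for them.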
Since it's already BAYES_99, you could say
"don't bother, I'll be fine" *g* but bayes needs to be trained
permanently, because tokens time out
>Also realize that just because the message got BAYES_99 doesn't mean
>there are no tokens in it that can be learned from. Spam mutates. New
>phrases and words creep in. These need to be learned from, even if
>the current message is already BAYES_99.
>
>
Yes, this is very valuable info for others also I believe.
Thanks for your help on this,
Kind regards, zmi
No.4 | | 2246 bytes |
On Thursday, 11 May 2006 08:06, Matt Kettler wrote:
And tonight's expiry for server #1:
bayes: synced databases from journal in 11 seconds: 1968 unique
entries (3059 total entries)
That's the journal sync, not the expiry part. The expiry part takes
much longer.
It comes from "sa-learn ". How could I see when it
expires something? Could it be because the ntokens are still not 2
mio., that I don't have an expire?
is says you
should stop SA before , is that a must or a
recommendation? The man page doesn't ask for it.
>score used is the score the message would have got if:
>bayes was disabled
>the AWL was disabled
>no userconf (ie:black/whitelists) rules were enabled.
>
That's good info which should be in the man page.
It is. In SA 3.1.x it's in the docs for the autolearn threshold
plugin:
>Plugin_AutoLearnThreshold.html
Not really. No mentioning that bayes/awl/userconf are not counted.
I looked into the code for SA 3.1.0's PerMsgStatus.pm and
Plugin/AutoLearnThreshold.pm.
The limitation is actually done by computing score of the bayes
rules, not the actual bayes percentage.
Learning as ham will be inhibited if the score of the "learn" rules
(ie: bayes) totals more than +1.0.
Learning as spam will be inhibited if the score of the "learn" rules
(ie: bayes) totals less than -1.0.
Note: by "learn" rules, I mean rules declared with the "learn" tflag,
which at this time is just bayes.
So in SA 3.1.0, existing training ranking BAYES_00 and BAYES_05 will
inhibit spam learning.
BAYES_60 or higher will inhibit ham learning.
This is very good info and would be nice documenting in man/wiki. I
could update the wiki, but I don't believe I'm qualified enough.
For example, the man page says:
* Also note that auto-learning occurs using scores from either scoreset
* 0 or 1
But who except the devs knows what's scoreset 0 or 1?
For people with several MXs this is good info also:
It explains why 2nd MX often generate FPs.
Kind regards, zmi
No.5 | | 894 bytes |
Hello Michael,
Wednesday, May 10, 2006, 5:21:14 PM, you wrote:
And why was SARE_FORGED_EBAY set down to 4? It was so nice at 100+
It was set to 104 to override user-driven whitelists, but
I felt it was somewhat out of standard practice to have a single
rule flag a message as spam. Say eBay hires some new company tomorrow
to send mailings for them; it's possible they could hit the forged rule.
The real change here was to make a meta rule with the
SARE_FORGED_EBAY rule that scores +100 if USER_IN_WHITELIST is
also hit. That way the rule only scores 100 if the message is going
to get -100 from the whitelist. In the end, the message is still
only going to get 4 points from the forgery. If you feel
comfortable with this, I suggest you just add a score line to your
local.cf and give it any score you like ;)
No.6 | | 4197 bytes |
Michael Monnerie wrote:
On Thursday, 11 May 2006 08:06, Matt Kettler wrote:
And tonight's expiry for server #1:
bayes: synced databases from journal in 11 seconds: 1968 unique
entries (3059 total entries)
>That's the journal sync, not the expiry part. The expiry part takes
>much longer.
It comes from "sa-learn ".
First, adding the sync option is redundant. Forcing an expiry implies a sync,
because it would be foolish for SA to attempt expiry without syncing first.
It won't hurt anything, but it's redundant.
>How could I see when it
expires something? Could it be because the ntokens are still not 2
mio., that I don't have an expire?
It's almost certainly going to run an expire. I'm just pointing out that those
11 seconds are NOT a part of the expire. They're just how long the sync part took.
Your expiry will take much longer. On server 1, with such a large bayes DB, it
could take 10 minutes or more.
That said, SA should report details of the expiry right after the sync:
# sa-learn
bayes: synced databases from journal in 1 seconds: 938 unique entries (986 total entries)
expired old bayes database entries in 118 seconds
214732 entries kept, 1312 deleted
token frequency: 1-occurrence tokens: 2.50%
token frequency: less than 8 occurrences: 71.89%
is says you
should stop SA before , is that a must or a
recommendation?
It's a recommendation. If SA is still running and is in the middle of
auto-learning, sa-learn will have to wait for it to finish before it can lock
the DB R/W.
The man page doesn't ask for it.
score used is the score the message would have got if:
bayes was disabled
the AWL was disabled
no userconf (ie:black/whitelists) rules were enabled.
That's good info which should be in the man page.
>It is. In SA 3.1.x it's in the docs for the autolearn threshold
>plugin:
>>
>
>Plugin_AutoLearnThreshold.html
Not really. No mentioning that bayes/awl/userconf are not counted.
>I looked into the code for SA 3.1.0's PerMsgStatus.pm and
>Plugin/AutoLearnThreshold.pm.
>>
>The limitation is actually done by computing score of the bayes
>rules, not the actual bayes percentage.
>>
>Learning as ham will be inhibited if the score of the "learn" rules
>(ie: bayes) totals more than +1.0.
>Learning as spam will be inhibited if the score of the "learn" rules
>(ie: bayes) totals less than -1.0.
>>
>Note: by "learn" rules, I mean rules declared with the "learn" tflag,
>which at this time is just bayes.
>>
>So in SA 3.1.0, existing training ranking BAYES_00 and BAYES_05 will
>inhibit spam learning.
>BAYES_60 or higher will inhibit ham learning.
This is very good info and would be nice documenting in man/wiki. I
could update the wiki, but I don't believe I'm qualified enough.
For example, the man page says:
* Also note that auto-learning occurs using scores from either scoreset
* 0 or 1
But who except the devs knows what's scoreset 0 or 1?
It's in the manpage, under the description of the "score" keyword.
If four valid scores are listed, then the score that is used depends on how
SpamAssassin is being used. The first score is used when both Bayes and network
tests are disabled (score set 0). The second score is used when Bayes is
disabled, but network tests are enabled (score set 1). The third score is used
when Bayes is enabled and network tests are disabled (score set 2). The fourth
score is used when Bayes is enabled and network tests are enabled (score set 3).
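The four-score layout quoted from the manpage can be sketched as a small lookup in Python. The function names and the example scores are hypothetical. The idea that auto-learning simply uses the Bayes-disabled counterpart of the active set (0 without network tests, 1 with them) is an assumption drawn from the "scoreset 0 or 1" note quoted earlier.

```python
def score_set(bayes_enabled, network_enabled):
    """Score set index as described in the manpage text: 0, 1, 2 or 3."""
    return (2 if bayes_enabled else 0) + (1 if network_enabled else 0)

def effective_score(scores, bayes_enabled, network_enabled):
    """Pick the score for the active set from a four-value score line."""
    if len(scores) != 4:
        raise ValueError("this sketch only handles the four-score form")
    return scores[score_set(bayes_enabled, network_enabled)]

def autolearn_score(scores, network_enabled):
    """Assumed: auto-learning looks at set 0 or 1, i.e. as if Bayes were off."""
    return scores[score_set(False, network_enabled)]

EXAMPLE = [0.5, 1.0, 1.5, 2.0]  # hypothetical scores for a made-up rule

# Normal scanning with Bayes and network tests on uses the fourth score.
print(effective_score(EXAMPLE, bayes_enabled=True, network_enabled=True))
# The auto-learn decision for the same setup would use the second score.
print(autolearn_score(EXAMPLE, network_enabled=True))
```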
For people with several MXs this is good info also:
It explains why 2nd MX often generate FPs.
Kind regards, zmi
No.7 | | 2145 bytes |
On Thu, May 11, 2006 at 06:17:14PM +0200, Michael Monnerie wrote:
bayes: synced databases from journal in 11 seconds: 1968 unique
entries (3059 total entries)
That's the journal sync, not the expiry part. The expiry part takes
much longer.
It comes from "sa-learn ". How could I see when it
expires something? Could it be because the ntokens are still not 2
mio., that I don't have an expire?
Yes. The expiry logic is well documented in the sa-learn POD, but the basics
are that:
You set the max db size to 2000000, which means that SA tries to expire down
to 2000000*0.75 = 1500000 tokens. According to your post, you only have
1261864 tokens which is less than 1500000, so there's nothing to do for an
expiry. You'd need a minimum of 1501000 tokens in the DB for an expire to
actually run.
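The numbers in this explanation can be checked with a short Python sketch. The function names are made up, and the 1,000-token margin is inferred from the "minimum of 1501000 tokens" figure above rather than taken from the SA source.

```python
def expiry_target(max_db_size):
    """Tokens kept after expiry: 75% of the maximum, but never below 100,000."""
    return max(max_db_size * 3 // 4, 100_000)

def expiry_has_work(ntokens, max_db_size, min_removable=1000):
    """True if the DB holds enough tokens above the target to bother expiring.

    min_removable=1000 is an assumption inferred from the 1,501,000 figure.
    """
    return ntokens - expiry_target(max_db_size) >= min_removable

# Server #1: max size 2,000,000 and 1,261,864 tokens in the DB.
print(expiry_target(2_000_000))
print(expiry_has_work(1_261_864, 2_000_000))

# The 50,000 setting from the original post falls back to the 100,000 floor.
print(expiry_target(50_000))
```

For server #1 the target is 1,500,000 tokens, so with 1,261,864 tokens there is nothing to expire yet.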
is says you
should stop SA before , is that a must or a
recommendation? The man page doesn't ask for it.
It's completely unnecessary to stop SA (that'd be a horrible requirement
wouldn't it?).
>score used is the score the message would have got if:
>bayes was disabled
>the AWL was disabled
>no userconf (ie:black/whitelists) rules were enabled.
>
That's good info which should be in the man page.
It is. In SA 3.1.x it's in the docs for the autolearn threshold
plugin:
>Plugin_AutoLearnThreshold.html
Not really. No mentioning that bayes/awl/userconf are not counted.
Really? Did you look at the plugin POD?
Note that certain tests are ignored when determining whether a message
should be trained upon:
* rules with tflags set to 'learn' (the Bayesian rules)
* rules with tflags set to 'userconf' (user configuration)
* rules with tflags set to
For example, the man page says:
* Also note that auto-learning occurs using scores from either scoreset
* 0 or 1
But who except the devs knows what's scoreset 0 or 1?
Anyone who's read the documentation for "score" ? ;)
No.8 | | 1955 bytes |
On Thursday, 11 May 2006 20:00, Matt Kettler wrote:
First, adding the sync option is redundant. Forcing an expiry implies a sync,
because it would be foolish for SA to attempt expiry without syncing
first.
I found that in the documentation after I sent the mail.
Your expiry will take much longer. On server 1, with such a large
bayes DB, it could take 10 minutes or more.
expired old bayes database entries in 118 seconds
214732 entries kept, 1312 deleted
token frequency: 1-occurrence tokens: 2.50%
token frequency: less than 8 occurrences: 71.89%
OK, because I still didn't reach my tokens limit, I never really had an
expire running.
is says you
should stop SA before , is that a must or a
recommendation?
It's a recommendation. If SA is still running and is in the middle of
auto-learning, sa-learn will have to wait for it to finish before it
can lock the DB R/W.
OK, learning is quick anyway, so no problem here. The expire will
definitely run much longer, as you say. But what happens when SA wants
to auto-learn another message while expire runs? Will it wait and
timeout or just skip autolearning? Skipping would be no problem for me,
but a timeout could be nasty.
It's in the manpage, under the description of the "score" keyword.
If four valid scores are listed, then the score that is used depends
on how SpamAssassin is being used. The first score is used when both
Bayes and network tests are disabled (score set 0). The second score
is used when Bayes is disabled, but network tests are enabled (score
set 1). The third score is used when Bayes is enabled and network
tests are disabled (score set 2). The fourth score is used when Bayes
is enabled and network tests are enabled (score set 3).
Ah, it's simply the four different score numbers. If English were my
mother tongue, I could probably have guessed that. Thx.
Kind regards, zmi
No.9 | | 1402 bytes |
On Thursday, 11 May 2006 20:06, Theo Van Dinter wrote:
is says you
should stop SA before , is that a must or a
recommendation? The man page doesn't ask for it.
It's completely unnecessary to stop SA (that'd be a horrible
requirement wouldn't it?).
OK, forget the question from my previous post. This answers it.
Not really. No mentioning that bayes/awl/userconf are not counted.
Really? Did you look at the plugin POD?
Note that certain tests are ignored when determining whether a
message should be trained upon:
* rules with tflags set to 'learn' (the Bayesian rules)
* rules with tflags set to 'userconf' (user configuration)
* rules with tflags set to
Uhm, those two paragraphs seem too complicated for my brain to translate
and understand. I didn't recognise this as describing the same thing
Matt did - I understood him, but not the docs.
But who except the devs knows what's scoreset 0 or 1?
Anyone who's read the documentation for "score" ? ;)
A reference to that manpage would have helped there. In general, SA is
very well documented, but the info is scattered in lots of places. The
wiki is a good (and necessary) way to find the connections between
several problems. I'm sure lots of people have had a hard time trying
to understand the full concept.
Kind regards, zmi
No.10 | | 787 bytes |
On Thu, May 11, 2006 at 10:10:39PM +0200, Michael Monnerie wrote:
OK, learning is quick anyway, so no problem here. The expire will
definitely run much longer, as you say. But what happens when SA wants
to auto-learn another message while expire runs? Will it wait and
timeout or just skip autolearning? Skipping would be no problem for me,
but a timeout could be nasty.
It depends what "timeout" means in this context. What's going on is that
anytime a process needs to get a write lock on the db, there's contention
if other processes already have it locked. By default, processes can
wait either 300s for a lock before giving up (all commands run through
sa-learn), or 10s (everything else).
So "skip autolearning" requires a "timeout".
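The give-up-after-a-wait behaviour can be illustrated with a small Python sketch. It uses an in-process threading.Lock as a stand-in for the database lock, which is an assumption made purely for illustration (SA locks the DB files, not a Python object). The function name and the short demo waits are hypothetical; only the 300s and 10s figures come from the post above.

```python
import threading

LOCK_WAIT_SA_LEARN = 300  # seconds, for commands run through sa-learn
LOCK_WAIT_OTHER = 10      # seconds, for everything else

def try_write(db_lock, wait_seconds, action):
    """Run action() under the write lock, or give up after wait_seconds."""
    if not db_lock.acquire(timeout=wait_seconds):
        return False  # timed out: the caller skips its update
    try:
        action()
        return True
    finally:
        db_lock.release()

db_lock = threading.Lock()
learned = []

db_lock.acquire()  # pretend a long expiry run is holding the lock
skipped = not try_write(db_lock, 0.05, lambda: learned.append("msg-1"))
db_lock.release()  # the expiry run has finished
done = try_write(db_lock, 0.05, lambda: learned.append("msg-2"))

print(skipped, done, learned)
```

The demo uses 0.05 seconds instead of the real waits so that it finishes quickly; the first message is skipped while the lock is held, and the second is learned once it is free.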
No.11 | | 705 bytes |
On Thursday, 11 May 2006 22:22, Theo Van Dinter wrote:
It depends what "timeout" means in this context. What's going on is
that anytime a process needs to get a write lock on the db, there's
contention if other processes already have it locked. By default,
processes can wait either 300s for a lock before giving up (all
commands run through sa-learn), or 10s (everything else).
So "skip autolearning" requires a "timeout".
That's OK, so normally, amavisd-new calling SA should timeout after 10s,
that should cause no problem. 300s would timeout amavis already, which
would mean it forwards the message without it being checked and marked.
Kind regards, zmi
No.12 | | 575 bytes |
Michael Monnerie wrote:
OK, learning is quick anyway, so no problem here. The expire will
definitely run much longer, as you say. But what happens when SA wants
to auto-learn another message while expire runs? Will it wait and
timeout or just skip autolearning? Skipping would be no problem for me,
but a timeout could be nasty.
Autolearning always skips if the r/w lock fails for any reason; it never
attempts to wait.
So autolearning might get skipped due to expiry, journal syncs, or even another
message already being learned.