Spamassassin's TxRep Reputation and Bayesian filters

August 17, 2025 by Roberto Puzzanghera 16 comments

TxRep was designed as an enhanced replacement for the AutoWhitelist (AWL) plugin. Like AWL, TxRep tracks the scores of messages previously received from a sender and adjusts the score of the current message accordingly, either boosting messages from senders who have a history of ham or penalizing senders who have previously sent spam. In effect, some senders are treated as if they were whitelisted and spammers as if they were blacklisted. Every message from a sender updates that sender's historical total score, so a reputation is never fixed: a sender regarded as a spammer can recover by sending non-spam messages, and a sender regarded as a non-spammer will be treated as a spammer after sending messages that look like spam. Put simply, TxRep is a score averaging system: it keeps track of each sender's historical average and pushes the score of any subsequent mail towards that average.
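The averaging idea can be sketched in a few lines of Python. This is a minimal illustration, not TxRep's actual algorithm: the real plugin combines several reputations (email address, IP and so on), while here a single record holding the msgcount and totscore columns of the txrep table is assumed. The names SenderHistory and adjust_score and the factor of 0.5 are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class SenderHistory:
    """One simplified reputation record (assumed shape, mirrors msgcount/totscore)."""
    msgcount: int = 0
    totscore: float = 0.0

    def mean(self):
        # Historical average score, or None for a sender never seen before.
        return self.totscore / self.msgcount if self.msgcount else None


def adjust_score(history, score, factor=0.5):
    """Push `score` towards the sender's historical mean, then record it."""
    mean = history.mean()
    adjusted = score if mean is None else score + factor * (mean - score)
    # The unadjusted score is what gets added to the history in this sketch.
    history.msgcount += 1
    history.totscore += score
    return adjusted


# A sender with four earlier messages averaging 10 points sends a message
# that scores only 2 on its own: the result is pulled halfway to the average.
spammer = SenderHistory(msgcount=4, totscore=40.0)
print(adjust_score(spammer, 2.0))
```

With a factor of 0 the history would be ignored, and with a factor of 1 every message would simply receive the sender's average.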

The Bayesian classifier in SpamAssassin tries to identify spam by looking at what are called tokens: words or short character sequences that are commonly found in spam or ham. If I've handed sa-learn 100 messages containing the phrase penis enlargement and told it that they are all spam, then when the 101st message comes in with the words penis and enlargement, the Bayesian classifier will be pretty sure that the new message is spam and will increase its spam score.

Bayes is essentially a statistical classifier: it looks at the tokens (words, headers, URLs, etc.) and calculates the probability that the message is spam based only on its content, regardless of the sender.
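As a rough illustration, the sketch below estimates how "spammy" a single token is from the number of spam and ham messages that contain it, and then combines several tokens into one probability. It is a textbook naive Bayes calculation under simplifying assumptions, not the combining formula SpamAssassin actually implements; the function names and the 0.01 clamp are hypothetical.

```python
def token_spam_probability(spam_count, ham_count, nspam, nham):
    """Probability that a token indicates spam, from per-corpus frequencies.

    spam_count / ham_count: learned spam / ham messages containing the token.
    nspam / nham: total learned spam / ham messages.
    """
    spam_freq = spam_count / nspam
    ham_freq = ham_count / nham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


def combine(probabilities, floor=0.01):
    """Naive Bayes combination of independent token probabilities."""
    spam = ham = 1.0
    for p in probabilities:
        # Clamp so a single token can never force the result to exactly 0 or 1.
        p = min(max(p, floor), 1 - floor)
        spam *= p
        ham *= 1 - p
    return spam / (spam + ham)


# Two tokens seen in all 100 learned spam messages and in no ham message.
tokens = [token_spam_probability(100, 0, 100, 100)] * 2
print(combine(tokens))
```

Note that the sender never appears in the calculation: only token counts do.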

TxRep, on the other hand, tracks the sender's reputation (email address + IP).

I assume that you already have a "spamassassin" DB and user, as set up on the previous page.

Changelog

  • Aug 18, 2025: improved the "Training Bayes" section

Create the DB tables

> mysql -u root -p

USE spamassassin;
CREATE TABLE txrep (
  username varchar(100) NOT NULL default '',
  email varchar(255) NOT NULL default '',
  ip varchar(40) NOT NULL default '',
  msgcount int(11) NOT NULL default '0',
  totscore float NOT NULL default '0',
  signedby varchar(255) NOT NULL default '',
  last_hit timestamp NOT NULL default CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (username,email,signedby,ip),
  KEY last_hit (last_hit)
) ENGINE=InnoDB;

CREATE TABLE bayes_expire (
  id int(11) NOT NULL default '0',
  runtime int(11) NOT NULL default '0',
  KEY bayes_expire_idx1 (id)
) ENGINE=InnoDB;

CREATE TABLE bayes_global_vars (
  variable varchar(30) NOT NULL default '',
  value varchar(200) NOT NULL default '',
  PRIMARY KEY  (variable)
) ENGINE=InnoDB;

INSERT INTO bayes_global_vars VALUES ('VERSION','3');

CREATE TABLE bayes_seen (
  id int(11) NOT NULL default '0',
  msgid varchar(200) binary NOT NULL default '',
  flag char(1) NOT NULL default '',
  PRIMARY KEY  (id,msgid)
) ENGINE=InnoDB;

CREATE TABLE bayes_token (
  id int(11) NOT NULL default '0',
  token binary(5) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY  (id, token),
  INDEX bayes_token_idx1 (id, atime)
) ENGINE=InnoDB;

CREATE TABLE bayes_vars (
  id int(11) NOT NULL AUTO_INCREMENT,
  username varchar(200) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  token_count int(11) NOT NULL default '0',
  last_expire int(11) NOT NULL default '0',
  last_atime_delta int(11) NOT NULL default '0',
  last_expire_reduce int(11) NOT NULL default '0',
  oldest_token_age int(11) NOT NULL default '2147483647',
  newest_token_age int(11) NOT NULL default '0',
  PRIMARY KEY  (id),
  UNIQUE bayes_vars_idx1 (username)
) ENGINE=InnoDB;

Configure

Load TxRep by editing v341.pre

# TxRep - Reputation database that replaces AWL 
loadplugin Mail::SpamAssassin::Plugin::TxRep

and commenting out this line in the /etc/mail/spamassassin/v310.pre file:

# loadplugin Mail::SpamAssassin::Plugin::AWL

Configure TxRep and Bayes by editing the config file /etc/mail/spamassassin/90-sql.cf:

# spamassassin MySQL user pwd
MYSQL_PWD=xxxxxxx
cat >> /etc/mail/spamassassin/90-sql.cf << __EOF__

# txrep
use_txrep 1
txrep_autolearn 2
txrep_factory                   Mail::SpamAssassin::SQLBasedAddrList
user_awl_dsn                    DBI:mysql:spamassassin:localhost
user_awl_sql_username           spamassassin
user_awl_sql_password           ${MYSQL_PWD}
user_awl_sql_table              txrep

# bayesian
use_bayes 1
bayes_auto_learn 1
bayes_store_module              Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn                   DBI:mysql:spamassassin:localhost
bayes_sql_username              spamassassin
bayes_sql_password              ${MYSQL_PWD}
# increasing bayes score for 99% and 99.9% probability  
# score BAYES_99  4.5 
# score BAYES_999 0.5 
# if bayes declared it as spam do not use other filters and discard the msg to speed up the sa process 
# shortcircuit BAYES_99           spam
# shortcircuit BAYES_00           ham 
__EOF__

Once your Bayesian system has been properly trained, you will find it effective enough that a great deal of confidence can be placed in it, so you may want to increase its spam scores. By default BAYES_99 (spam probability from 99 to 100%) scores 3.5 and BAYES_999 (spam probability from 99.9 to 100%) scores 0.2; when both rules hit, their scores are added together. For example, you can put this in your local.cf:

score BAYES_99  4.5 
score BAYES_999 0.5

Testing

To test the learning system, save a raw spam message into a file named spam.txt and run sa-learn as follows (assuming that postmaster@yourdomain.tld is the email recipient):

sa-learn --debug --spam -upostmaster@yourdomain.tld spam.txt

Training Bayes

The Bayesian classifier only scores new messages once it has learned at least 200 spam and 200 ham messages, so now is the time to train our learning system. Prepare one folder containing messages that you are sure are all spam (at least 200 of them) and another containing only ham.

Then run sa-learn. For a mailbox with spam:

sa-learn --showdots --username=user@domain.tld --spam spam-directory/*

For a mailbox with ham:

sa-learn --showdots --username=user@domain.tld --ham ham-directory/*

It is important to do both.

The previous commands specify the user to whom the training applies (this is stored in the username column of the bayes_vars table). This process must be repeated for each user. If you manage a small server with similar users (a family or a small business, for example) or very inactive users, you may want to use a single database. In this case, you need to add the following setting to spamassassin:

 bayes_sql_override_username spamd

where spamd is the name of a fictitious user, which can also be replaced with another name.

When users are new or inactive, you can train Bayes with public archives such as:

When Bayes doesn't seem to be working as expected, you need to inspect or even recreate the database. Here's how to dump the data in the database for a given user:

# sa-learn --username=user@domain.tld --dump magic 
0.000          0          3          0  non-token data: bayes db version
0.000          0      1403          0  non-token data: nspam
0.000          0      35828          0  non-token data: nham
0.000          0     131377          0  non-token data: ntokens
0.000          0 1752483848          0  non-token data: oldest atime
0.000          0 1755514834          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1755209886          0  non-token data: last expiry atime
0.000          0    2764800          0  non-token data: last expire atime delta
0.000          0      25694          0  non-token data: last expire reduction count

where nspam and nham are the number of messages learned as spam and ham, respectively, and ntokens is the number of tokens stored in the database.
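To check quickly whether a user has reached the 200 spam / 200 ham minimum mentioned earlier, the output of --dump magic can be parsed. The following Python sketch assumes the column layout shown above (the value in the third column, the label after "non-token data:"); the function names are hypothetical, and the threshold of 200 is simply the figure given in this guide.

```python
def parse_dump_magic(text):
    """Return {label: value} for the 'non-token data' lines of --dump magic."""
    values = {}
    for line in text.splitlines():
        # Four numeric columns, then the free-text description.
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[4].startswith("non-token data:"):
            continue
        label = fields[4].split(":", 1)[1].strip()
        values[label] = int(fields[2])
    return values


def has_enough_training(text, minimum=200):
    """True when both the spam and the ham counters reach `minimum`."""
    values = parse_dump_magic(text)
    return values.get("nspam", 0) >= minimum and values.get("nham", 0) >= minimum


sample = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       1403          0  non-token data: nspam
0.000          0      35828          0  non-token data: nham
"""
print(parse_dump_magic(sample)["nspam"], has_enough_training(sample))
```

In practice the text would come from the sa-learn command shown above rather than from a hard-coded sample.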

You can also list the individual tokens as follows. The first column is the probability that the token indicates spam, the second is the number of spam messages containing that token, the third is the number of ham messages containing it, the fourth is the time the token was last seen (atime) and the last is the hashed token:

# sa-learn --username=user@domain.tld --dump data
0.001          0         18 1734626230  0c3a0e5670
0.004          0          4 1625134245  d86d9aa596
0.016          1          1 1745383124  cbeff1a8d7
0.003          0          6 1737832208  5eeffebb61
0.016          0          1 1619626749  6349b16724
0.016          1          1 1632300906  44e5d3d677
0.016          0          1 1697654287  a113ad81e7
0.016          0          1 1711770221  7f1a642f1d

If you suspect that Bayes has been poorly trained (for example, you get the BAYES_00 tag for an obvious spam, as if you had trained Bayes by reversing ham and spam), you can clean up a user's database like this before training it again:

sa-learn --username=user@domain.tld --clear

You can check how a given spam message (spam.txt in the example below) is classified like this:

# spamc -r < spam.txt

pts rule name              description
---- ---------------------- --------------------------------------------------
0.7 SPF_SOFTFAIL           SPF: sender does not match SPF record (softfail)
4.5 BAYES_99               BODY: Bayes spam probability is 99 to 100%
                           [score: 1.0000]
0.5 BAYES_999              BODY: Bayes spam probability is 99.9 to 100%
                           [score: 1.0000]
-0.0 NO_RELAYS              Informational: message was not relayed via SMTP
0.1 URI_HEX                URI: URI hostname has long hexadecimal sequence
0.0 HTML_FONT_LOW_CONTRAST BODY: HTML font color similar or identical to
                           background
0.0 T_MXG_EMAIL_FRAG       BODY: URI with email in fragment
0.0 HTML_MESSAGE           BODY: HTML included in message
0.0 PDS_FROM_NAME_TO_DOMAIN From:name looks like To:domain
0.0 PDS_FRNOM_TODOM_NAKED_TO Naked to From name equals to Domain
0.0 TO_NO_BRKTS_HTML_IMG   To: lacks brackets and HTML and one image

As mentioned above, getting BAYES_00 on a spam message is a sign that the training was not done with folders containing only spam or only ham, and that it needs to be repeated.

Purging the txrep table

The txrep table will grow day after day, at a rate that depends on the traffic on your mail server. Most of the records are single spam events that will rarely produce another hit, so you may decide to clean them out to reduce the volume of data stored in the table and speed up the MySQL queries accordingly.

Thus, let's create a script which runs the MySQL query. Modify this example by entering the path of your MySQL executable and the password of the spamassassin MySQL account:

# spamassassin MySQL user pwd
MYSQL_PWD=xxxxxxx

cat > /usr/local/bin/txrep_purge.sh << __EOF__ 
#!/bin/sh 
/usr/bin/mysql -u spamassassin -p"${MYSQL_PWD}" -e "USE spamassassin; DELETE FROM txrep WHERE last_hit <= (now() - INTERVAL 120 day);" 
exit 0 
__EOF__

chown root:mysql /usr/local/bin/txrep_purge.sh 
chmod ug+x /usr/local/bin/txrep_purge.sh 
chmod o-rwx /usr/local/bin/txrep_purge.sh

Here "spamassassin" is the MySQL user and the value assigned to MYSQL_PWD is its password. This account must have privileges on the "spamassassin" DB from the mail server's IP, from the Apache server's IP (userprefs via Roundcube) and now also from the MySQL host (localhost). Don't add spaces after -p.

Finally add the cronjob:

cat >> /etc/cron.d/qmail << __EOF__ 
# txrep
1 1 25 * * /usr/local/bin/txrep_purge.sh >> /var/log/cron
__EOF__
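
Before scheduling the purge it can help to see what the DELETE does to a small data set. The sketch below reproduces the same rule (drop rows whose last_hit is at least 120 days old) in Python, with an in-memory SQLite table standing in for the MySQL txrep table; the reduced two-column schema, the sample addresses and the purge_old_rows name are assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timedelta


def purge_old_rows(conn, now, days=120):
    """Delete txrep rows not hit in the last `days` days; return how many."""
    cutoff = (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    # ISO-formatted timestamps compare correctly as plain strings.
    cur = conn.execute("DELETE FROM txrep WHERE last_hit <= ?", (cutoff,))
    conn.commit()
    return cur.rowcount


# Stand-in for the MySQL table: only the columns needed by the purge rule.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txrep (email TEXT, last_hit TEXT)")
conn.executemany(
    "INSERT INTO txrep VALUES (?, ?)",
    [
        ("old@example.com", "2025-01-01 00:00:00"),
        ("recent@example.com", "2025-08-01 00:00:00"),
    ],
)

# Pretend the cron job fires on the 25th at 01:01, as scheduled above.
removed = purge_old_rows(conn, datetime(2025, 8, 25, 1, 1))
remaining = [row[0] for row in conn.execute("SELECT email FROM txrep")]
print(removed, remaining)
```

Running the equivalent SELECT COUNT(*) with the same WHERE clause against the real table first shows how many rows the cron job would remove.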

Comments

Error SQL bayes

Hi!

In a new installation that I just did I find the following problem:

...

Thu Jul 15 23:21:07 2021 [1447790] dbg: bayes: _put_tokens: SQL error: Incorrect string value: '\xBE\x82]\x05\x96' for column `spamassassin`.`bayes_token`.`token` at row 1
Thu Jul 15 23:21:07 2021 [1447790] dbg: bayes: _put_tokens: SQL error: Incorrect string value: '\xFE\xE9\x99\xF0' for column `spamassassin`.`bayes_token`.`token` at row 1
Thu Jul 15 23:21:07 2021 [1447790] dbg: bayes: _put_tokens: SQL error: Incorrect string value: '\xCA\xDE\xA3' for column `spamassassin`.`bayes_token`.`token` at row 1
Thu Jul 15 23:21:07 2021 [1447790] dbg: bayes: _put_tokens: SQL error: Incorrect string value: '\xFB\x07' for column `spamassassin`.`bayes_token`.`token` at row 1

....

The database and the table use utf8mb4_unicode_ci.

Thanks!!

Error SQL bayes

Hi, try changing the bayes_token.token field from char to binary as shown here

Let me know if that solves it

Failed to parse line

Hi Roberto,

I am getting this error:

May 27 01:09:53 mail spamd[10301]: config: failed to parse line, skipping, in "/etc/mail/spamassassin/sql.cf": auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList

Cheers.

Failed to parse line

Did you comment out this line?

#auto_whitelist_factory          Mail::SpamAssassin::SQLBasedAddrList

Failed to parse line

Hi Roberto,

The error is in your guide. Where you have:

auto_whitelist_factory          Mail::SpamAssassin::SQLBasedAddrList

it should be

txrep_factory          Mail::SpamAssassin::SQLBasedAddrList

Thanks.

Failed to parse line

Thank you. Actually I modified that line on my server but forgot to do the same on this page

A small observation

and commenting out this line:

loadplugin Mail::SpamAssassin::Plugin::AWL

This line is located in the v310.pre file. So the text should read:

and commenting out this line in the /etc/mail/spamassassin/v310.pre file:

loadplugin Mail::SpamAssassin::Plugin::AWL

Is MySQL really required?

Hi Roberto,

We have here bayes and TxRep enabled without MySQL, with all data being written to /etc/mail/spamassassin/.spamassassin, since we have no interest in using MySQL as we don't need userpref here.

Do you think the MySQL approach is really necessary?

I'd rather keep things simple here.

Cheers

Is MySQL really required?

Hi, I chose the MySQL approach because I find it easier to purge the database by means of an SQL query, but I think you can get rid of MySQL if you do the same with a command line script...

Spamassassin 3.4.3 table column name changed

Hi Roberto,

The column "count" in table "txrep" is renamed to "msgcount" from version 3.4.3. Look into section "TxRep and Awl plugins has been modified..." at https://svn.apache.org/repos/asf/spamassassin/tags/spamassassin_release_3_4_3/UPGRADE.

Please update your guide as shown below when creating a new table:

CREATE TABLE txrep (
  username varchar(100) NOT NULL default '',
  email varchar(255) NOT NULL default '',
  ip varchar(40) NOT NULL default '',
  msgcount int(11) NOT NULL default '0',
  totscore float NOT NULL default '0',
  signedby varchar(255) NOT NULL default '',
  last_hit timestamp NOT NULL default CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (username,email,signedby,ip),
  KEY last_hit (last_hit)
) ENGINE=InnoDB;

Or modify the table with the following command when upgrading from a prior version:

ALTER TABLE `txrep` CHANGE `count` `msgcount` INT(11) NOT NULL DEFAULT '0';

Otherwise, the following error will be recorded in the spam log:

SQL error: Unknown column 'msgcount' in 'field list'

Spamassassin 3.4.3 table column name changed

Thank you, corrected.

cleanup old data

By adding

ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

to the awl table definition you can easily spot old entries and delete them.

echo -e 'USE spamassassin;
DELETE FROM awl WHERE ts < (NOW()-(28*24*3600));' | mysql -uspamassassin -pPASSWORD

Oops, the sql statement for

Oops, the sql statement for clean up should be

DELETE FROM awl WHERE ts < (NOW() - INTERVAL 28 DAY)

yes, that's even better. I'll

yes, that's even better. I'll update this page as soon as possible

AWL and Bayesean

hi,

Ended up with this error message:

Mon Jan 17 21:14:27 2011 [5769] info: config: failed to parse line, skipping, in "(no file)": bayes_auto_learn_threshold_spa 10
Mon Jan 17 21:14:27 2011 [5769] info: config: failed to parse line, skipping, in "(no file)": bayes_auto_learn_threshold_spa 10
Mon Jan 17 21:14:27 2011 [5769] info: spamd: processing message <4D34A31D.7050002@domain.xyz for user@domain.xyz:5002
Mon Jan 17 21:14:27 2011 [5769] info: spamd: identified spam (99.3/4.0) for user@domain.xyz:5002 in 0.0 seconds, 1226 bytes.

Userprefs

Hi,

It seems like the message was rejected correctly because the sender is blacklisted, as the score is close to 100. Was it rejected?
