Thanks, Thomas, for the info and explanation; that makes sense.
One question, though: I'm trying to understand the difference between spammy
and hammy entries in the database, so I ran the following query:
assp=# select * from hmmdb where pkey like '%testosterone%';
pkey | pvalue | pfrozen
-----+--------+--------
testosterone\x1Cand\x1Cfeel\x1Cstrong\x1Cssub | 0.9999999 | 0
@macmedics.com\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong | 0.9999999 | 0
ssub\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand | 0.9999999 | 0
@macmedics.com\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand | 0.9999999 | 0
@macmedics.com\x1Cboost\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel | 0.9999999 | 0
free\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong | 0.9999999 | 0
@macmedics.com\x1C98d6915738f9d2b8e981c34b\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone | 0.9999999 | 0
boost\x1Cfree\x1Ctestosterone\x1Cand\x1Cfeel | 0.9999999 | 0
98d6915738f9d2b8e981c34b\x1Cssub\x1Cboost\x1Cfree\x1Ctestosterone | 0.9999999 | 0
@macmedics.com\x1Ctestosterone\x1Cand\x1Cfeel\x1Cstrong\x1Cssub | 0.9999999 | 0
(10 rows)
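To make rows like these easier to read, here is a minimal Python sketch that splits a pkey into its tokens. It assumes the \x1C shown by psql is a separator between tokens and that it appears in the text as the four literal characters backslash, x, 1, C; the helper name split_key is made up for illustration and is not part of ASSP.

```python
# Separator as psql prints it: backslash, 'x', '1', 'C' (assumption: literal text).
SEP = "\\x1C"

def split_key(pkey: str) -> list[str]:
    """Split an hmmdb pkey (as printed by psql) into its tokens."""
    return pkey.split(SEP)

# Two rows copied from the query output above.
rows = [
    ("testosterone\\x1Cand\\x1Cfeel\\x1Cstrong\\x1Cssub", 0.9999999),
    ("free\\x1Ctestosterone\\x1Cand\\x1Cfeel\\x1Cstrong", 0.9999999),
]

for pkey, pvalue in rows:
    print(split_key(pkey), pvalue)
```

If the database stores the real control character instead of the escaped text, SEP would have to be "\x1c" rather than "\\x1C".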
These spammy entries look just like the disclaimer entries which, as I
understood you, belong in corrected-notspam. Sorry, I don't know enough
Perl to figure out how the code deals with this. Are these entries simply
in a different section of the database, or does each entry in fact contain
enough info to identify whether it is a spam or a ham word? When I dump the
database, spam and ham entries appear together, so it seems to be the
latter.
Thanks again,
- Phil
Re: [Assp-user] Disclaimers not being removed?
<https://sourceforge.net/p/assp/mailman/message/36699848/>
From: Thomas Eckardt - 2019-06-22 07:16:34
> I also noticed the regex had truncated words in some but not all cases so
> I fixed that
ASSP_WordStem.pm is installed and used, so word stemming is done and
stop words are removed. Any attempt to "fix" this is wrong!
If the disclaimer is not stemmed in the mail, a different language was
detected for the mail. There is nothing you can (or should) fix.
The disclaimer definition and every mail are processed as follows:
- remove all special characters and spaces
- detect the language
- stem all words according to the detected language
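The three steps above can be sketched in Python as follows. This is only an illustration of the idea, not what ASSP_WordStem.pm actually does: the stop-word lists, the suffix tables, and the function names are all made up, and the language detection is a toy that just counts stop-word hits.

```python
import re

# Hypothetical, tiny tables for illustration only.
STOP_WORDS = {
    "en": {"the", "is", "an", "and", "this"},
    "de": {"der", "die", "das", "und", "ist"},
}
SUFFIXES = {"en": ("ing", "ed", "s"), "de": ("en", "er", "e")}

def clean(text: str) -> str:
    """Step 1: replace non-word characters and collapse runs of whitespace."""
    no_specials = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", no_specials).strip()

def detect_language(words: list[str]) -> str:
    """Step 2 (toy): pick the language with the most stop-word hits."""
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOP_WORDS.items()}
    return max(scores, key=scores.get)  # ties fall back to the first entry, "en"

def stem(word: str, lang: str) -> str:
    """Step 3 (toy): strip the first matching suffix for the given language."""
    for suffix in SUFFIXES[lang]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text: str) -> tuple[str, list[str]]:
    """Run all three steps and drop stop words of the detected language."""
    words = clean(text).split()
    lang = detect_language(words)
    return lang, [stem(w, lang) for w in words if w not in STOP_WORDS[lang]]

print(normalize("This email is confidential, and intended for the recipients!"))
# The same word stems differently depending on the detected language:
print(stem("disclaimers", "en"), stem("disclaimers", "de"))
```

The last line shows why a disclaimer can fail to match: if a different language is detected for the mail than for the definition, the stems differ even though the text is the same.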
Another way to make sure the disclaimer is ignored by ASSP is to compose
one or more faked mails that contain only disclaimers (possibly
multiple times).
Put them in the opposite correction folder.
companyname\x1Cis\x1Can\x1Ciphone\x1Cpowered | 0.9999999 | 0
(here this would be corrected-notspam)
Make sure the MD5 hash of the body is different in all these mails.
Remove the disclaimer definition.
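As a rough illustration of the MD5 requirement, the Python sketch below builds a few faked bodies by repeating a disclaimer a different number of times and checks that their MD5 digests are all distinct. The disclaimer text and the function names are invented for the example, and how ASSP itself computes the body hash is not covered here; this only checks the raw body text.

```python
import hashlib

def make_fake_bodies(disclaimer: str, count: int) -> list[str]:
    """Build `count` bodies; body i repeats the disclaimer i + 1 times."""
    return ["\n\n".join([disclaimer] * (i + 1)) for i in range(count)]

def body_md5(body: str) -> str:
    """MD5 hex digest of the UTF-8 encoded body text."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()

# Hypothetical disclaimer text, for illustration only.
bodies = make_fake_bodies("This email is confidential.", 3)
digests = [body_md5(b) for b in bodies]

# Every faked mail must have a different body hash.
all_distinct = len(set(digests)) == len(digests)
print(all_distinct)
```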
The disclaimer content will then either get a weight between 0.4 and 0.6
and not be stored in the databases, or get a weight <= 0.4 and be
detected as good.
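The weight bands can be written down as a small Python sketch. The "good" and "not stored" bands come from the description above; treating everything from 0.6 upward as spam, and the exact handling of the 0.6 boundary, are assumptions made for this example. The function name is made up and is not an ASSP routine.

```python
def classify_weight(weight: float) -> str:
    """Map an entry weight to a band (boundaries partly assumed, see text)."""
    if weight <= 0.4:
        return "good"
    if weight < 0.6:
        return "neutral (not stored)"
    return "spam"  # assumption: the text above does not name this band

for w in (0.9999999, 0.5, 0.4, 0.1):
    print(w, classify_weight(w))
```

Under these assumptions an entry with a weight of 0.9999999 falls in the spam band, which is why the correction mails are meant to pull the disclaimer words down toward the middle or below.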
Thomas