YAM - Yet Another Mailer (#2) - the spam filter - howto (#6) - Message List

the spam filter - howto
 unsolved

hi.

after several years of sticking with microdot2 i'm now back to YAM. great progress so far. congrats!

now my question: how do i train the spam filter? the prefs tell me that 29 messages are marked as spam. but how did i do that?? selecting "set status to spam" in the context menu does not seem to raise the count of 29 spam mails. my spam folder currently holds over 80 mails.

if i try to mark a mail from the spam folder as spam (set status to spam) this entry is ghosted. but if i select several mails the entry in the popup menu is selectable (does nothing though).

thanks and byebye...

  • Message #17

    Yes, it is buggy, but it is still the best IRC client. :(

  • Message #16

    Just a comment on switching off: I never switch my Amiga off; it is always on, 24/7. The problem comes when, after some time, some memory gets trashed, usually when there are a lot of network problems (AmIRC is also always on) and IRC splits. This causes the Amiga to reboot, or simply to crash and get stuck, with no way of quitting YAM or any other program.

    So if there is a solution for saving the spam filter data, it would be very helpful for me too.

    Well, the memory trashing probably really comes from AmIRC, as AmIRC is known to be buggy and to trash memory from time to time. So perhaps you shouldn't keep AmIRC open all the time, or you should use a different IRC client.

    Other than that I can't see any solution, sorry.

  • Message #15

    Just a comment on switching off: I never switch my Amiga off; it is always on, 24/7. The problem comes when, after some time, some memory gets trashed, usually when there are a lot of network problems (AmIRC is also always on) and IRC splits. This causes the Amiga to reboot, or simply to crash and get stuck, with no way of quitting YAM or any other program.

    So if there is a solution for saving the spam filter data, it would be very helpful for me too.

  • Message #14

    In fact, this really should be fixed. So, if you are able to reproduce the issue, please submit a bug report at http://bugs.yam.ch/ with an attached step-by-step explanation of how you ended up with two SPAM folders. If we can reproduce your issue ourselves we will definitely fix it, so please spend some time trying to reproduce it and prepare a step-by-step bug report.

    It seems to be quite reproducible here.

    I have submitted a bug report.

    Thanks for the discussion...

  • Message #13

    Again you have slightly misread me :-) (or I mistyped)

    I have set SpamFlushThres to 1; SpamProbThreshold remains at 80, the value I set it to before the option disappeared from the config window. This achieves what I want, in that the .spamdata is updated even if only 1 email status is changed.

    My spam filter related settings.

    [Spam filter]
    SpamFilterEnabled= Y
    SpamFilterForNew = Y
    SpamMarkOnMove = N
    SpamMarkAsRead = N
    SpamABookIsWhite = Y
    SpamProbThreshold= 80
    SpamFlushInterval= 900
    SpamFlushThres = 1
    

    This appears to work well for me, although ironically I haven't had much spam to test on this last couple of days.
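
    For anyone who wants to inspect these values from a script, here is a minimal sketch in Python. It only parses the fragment shown above; the helper name `load_spam_settings` is made up for illustration, and it is an assumption that the section can be read as plain INI text (a real YAM config file may contain other sections or header lines that this does not handle).

```python
import configparser

# Fragment copied from the post above; a real YAM config file may contain
# other sections or header lines that this sketch does not handle.
SPAM_SECTION = """\
[Spam filter]
SpamFilterEnabled= Y
SpamFilterForNew = Y
SpamMarkOnMove = N
SpamMarkAsRead = N
SpamABookIsWhite = Y
SpamProbThreshold= 80
SpamFlushInterval= 900
SpamFlushThres = 1
"""


def load_spam_settings(text):
    """Hypothetical helper: parse the [Spam filter] section into a dict."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the key names' original capitalisation
    parser.read_string(text)
    settings = {}
    for key, raw in parser["Spam filter"].items():
        value = raw.strip()
        if value in ("Y", "N"):
            settings[key] = (value == "Y")  # Y/N flags become booleans
        else:
            settings[key] = int(value)      # everything else here is numeric
    return settings


settings = load_spam_settings(SPAM_SECTION)
print(settings["SpamProbThreshold"], settings["SpamFlushThres"])
```

    The Y/N flags are mapped by hand because `configparser`'s own `getboolean()` does not accept a bare `Y` or `N`.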

    Oh well, you are right. I again mixed up the various configuration options for the SPAM filter. Of course a SpamFlushThres of 1 should be fine, sorry for that. However, I will discuss with Thore whether it wouldn't be better to always keep that value at 1, and why we would need to set it to a higher value by default at all.

    One further thing on the spam front.

    I disabled and then re-enabled the spam filter, just after it got "more aggressive", and ended up with two spam folders showing in the folder list. They seem to be identical, and I can't remove the extra one. I seem to remember having this problem just after I started using the spam filter. I thought this was a bug that was fixed? I think I had to directly edit the folder list to remove them last time.

    In fact, this really should be fixed. So, if you are able to reproduce the issue, please submit a bug report at http://bugs.yam.ch/ with an attached step-by-step explanation of how you ended up with two SPAM folders. If we can reproduce your issue ourselves we will definitely fix it, so please spend some time trying to reproduce it and prepare a step-by-step bug report.

  • Message #12

    Ah ok, then I really mixed it up, sorry for that. However, changing the threshold to 1 doesn't really solve anything. As Thore already pointed out, the only acceptable values are between 75 and 99. Any value outside that range will automatically be aligned to the nearest allowed value. So in your case you are always running with 75 instead of 1. This was added to prevent misconfiguration by users, as a SPAM filter with a probability of 1 really doesn't work at all anymore.

    Again you have slightly misread me :-) (or I mistyped)

    I have set SpamFlushThres to 1; SpamProbThreshold remains at 80, the value I set it to before the option disappeared from the config window. This achieves what I want, in that the .spamdata is updated even if only 1 email status is changed.

    My spam filter related settings.

    [Spam filter]
    SpamFilterEnabled= Y
    SpamFilterForNew = Y
    SpamMarkOnMove = N
    SpamMarkAsRead = N
    SpamABookIsWhite = Y
    SpamProbThreshold= 80
    SpamFlushInterval= 900
    SpamFlushThres = 1
    

    This appears to work well for me, although ironically I haven't had much spam to test on this last couple of days.

    One further thing on the spam front.

    I disabled and then re-enabled the spam filter, just after it got "more aggressive", and ended up with two spam folders showing in the folder list. They seem to be identical, and I can't remove the extra one. I seem to remember having this problem just after I started using the spam filter. I thought this was a bug that was fixed? I think I had to directly edit the folder list to remove them last time.

  • Message #11

    You have misread me slightly: I set SpamFlushThres to 1, not SpamFlushInterval, which remains at 900. Thus it polls every 900 seconds, and saves if any mails have been marked spam or not spam in that period. The default was set to 50. That's way too high to ever get triggered for me, as my ISP has an excellent spam filter that gets nearly all my spam before I see it. (In a year of testing it never mistakenly marked a good mail as spam!)

    Ah ok, then I really mixed it up, sorry for that. However, changing the threshold to 1 doesn't really solve anything. As Thore already pointed out, the only acceptable values are between 75 and 99. Any value outside that range will automatically be aligned to the nearest allowed value. So in your case you are always running with 75 instead of 1. This was added to prevent misconfiguration by users, as a SPAM filter with a probability of 1 really doesn't work at all anymore.

    My question leading on from this is: why do you need to poll? Why not save the data as soon as the user makes an active decision to mark a mail as spam or not spam?

    Well, for performance reasons. As with the automatic index saving interval, you don't really want YAM to save the .spamdata after each single email, or you will run into performance problems very quickly.

    Why the heck is it such a big issue for you to shut down YAM before you switch off your Amiga? We really tried hard to explain to you that you need to do that anyway, or otherwise you also risk the index of some folders becoming invalid.

    I like to be able to switch on, work, switch off. This to me is an aspect of the Amiga, as opposed to Windows or others where shutting down is necessary. I understand what you explained about invalid indexes, but it is not an issue for me: it happens less than 1 in 20 times, and the index is rebuilt in a few seconds when I need it.

    Well, the times when you could safely switch off your Amiga at any moment are almost over. As YAM shows, caching techniques are getting more and more common on the Amiga, and while you might be able to switch off your Amiga instantly, chances are high that you risk an invalidated or corrupted filesystem if you switch it off right after a write operation. At least that is the case if you are running a modern Amiga filesystem like SFS. These filesystems also cache write operations and require you to wait a few seconds before you can safely switch your Amiga off. So IMHO the times when an Amiga could be safely switched off without worrying at all are really over.

    Then train the spam filter properly, step by step, and sooner or later it will automatically identify that mail coming from aweb-dev isn't spam. And if it doesn't work, reset your training data and start training it from scratch. And please, believe us when we say that such a "filter ordering" is not necessary if you train your SPAM filter correctly.

    Perhaps eventually the spam filter will be trained correctly. But 'filter ordering' achieves the desired result immediately. If the spam filter is so trainable, why allow the address book as a white list? Filter ordering is just a similar thing to a white list IMHO. If you don't agree, don't have time, or have technical reasons why it would be difficult to implement, then fair enough. It was just an idea that occurred to me whilst I was using YAM.

    And thanks of course for your ideas and passion. Indeed, that filter ordering idea might be possible, but currently we doubt that it will really bring the benefits you are hoping for. So currently we don't have any plans to implement it. However, feel free to post a feature request.

    Please don't try to be smarter than YAM or even like our developers. If we suggest to you to use the default you should do it!

    I simply must object to this statement. In all the 5 years or so I've spent working on AWeb I've never said anything so arrogant to some user who suggested an off the wall idea. Occasionally users can suggest things that the main developer never thought of.

    Sorry if it sounded arrogant, but as I first thought you were playing around with the timeout value of SpamFlushInterval, I thought I might have to be more aggressive in telling you that we, as the developers, know best which spam filter settings are needed for a working spam filter engine.

    Of course we are always open for new thoughts and ideas, but also be assured that we know very well what we are doing :)

  • Message #10

    Sorry, but setting SpamFlushThres to 1 is total nonsense and I highly recommend changing it back to the original value. Please don't try to be smarter than YAM or even like our developers. If we suggest to you to use the default you should do it! By setting the value to 1 you simply risk YAM using the processor unnecessarily, as it will then poll every second to see if the spam data needs to be saved.

    You have misread me slightly: I set SpamFlushThres to 1, not SpamFlushInterval, which remains at 900. Thus it polls every 900 seconds, and saves if any mails have been marked spam or not spam in that period. The default was set to 50. That's way too high to ever get triggered for me, as my ISP has an excellent spam filter that gets nearly all my spam before I see it. (In a year of testing it never mistakenly marked a good mail as spam!)

    My question leading on from this is: why do you need to poll? Why not save the data as soon as the user makes an active decision to mark a mail as spam or not spam?

    And in addition, it doesn't really solve your problem.

    But it does, spam data is saved every 900 seconds if any change has been made. I don't need to quit YAM.

    Why the heck is it such a big issue for you to shut down YAM before you switch off your Amiga? We really tried hard to explain to you that you need to do that anyway, or otherwise you also risk the index of some folders becoming invalid.

    I like to be able to switch on, work, switch off. This to me is an aspect of the Amiga, as opposed to Windows or others where shutting down is necessary. I understand what you explained about invalid indexes, but it is not an issue for me: it happens less than 1 in 20 times, and the index is rebuilt in a few seconds when I need it.

    Then train the spam filter properly, step by step, and sooner or later it will automatically identify that mail coming from aweb-dev isn't spam. And if it doesn't work, reset your training data and start training it from scratch. And please, believe us when we say that such a "filter ordering" is not necessary if you train your SPAM filter correctly.

    Perhaps eventually the spam filter will be trained correctly. But 'filter ordering' achieves the desired result immediately. If the spam filter is so trainable, why allow the address book as a white list? Filter ordering is just a similar thing to a white list IMHO. If you don't agree, don't have time, or have technical reasons why it would be difficult to implement, then fair enough. It was just an idea that occurred to me whilst I was using YAM.

    Please don't try to be smarter than YAM or even like our developers. If we suggest to you to use the default you should do it!

    I simply must object to this statement. In all the 5 years or so I've spent working on AWeb I've never said anything so arrogant to some user who suggested an off the wall idea. Occasionally users can suggest things that the main developer never thought of.

    Having got upset at the above I must temper it with the fact that I enjoy using YAM and that you guys are doing a good job.

  • Message #9

    Thanks for that. I've taken the risk of setting SpamFlushThres to 1 so that any changes to the spam training get saved straight away. (On my own head be it :-))

    Sorry, but setting SpamFlushThres to 1 is total nonsense and I highly recommend changing it back to the original value. Please don't try to be smarter than YAM or even like our developers. If we suggest to you to use the default you should do it! By setting the value to 1 you simply risk YAM using the processor unnecessarily, as it will then poll every second to see if the spam data needs to be saved. And in addition, it doesn't really solve your problem. Why the heck is it such a big issue for you to shut down YAM before you switch off your Amiga? We really tried hard to explain to you that you need to do that anyway, or otherwise you also risk the index of some folders becoming invalid.

    Another thing I've noticed is that the spam filter seems to be applied before the "ordinary filters". I would prefer it the other way round, as almost all of these filters deal with mailing lists (for me) and these are almost all spam free.

    Well, there are a couple with spam, so it might be nice to be able to set the order of filtering, i.e. filters a, b, c, then the spam filter, then filters d, e, f.

    Example: I've never received any spam from aweb-dev@… and it's a pain to have to dig out bug reports from the spam folder.

    Then train the spam filter properly, step by step, and sooner or later it will automatically identify that mail coming from aweb-dev isn't spam. And if it doesn't work, reset your training data and start training it from scratch. And please, believe us when we say that such a "filter ordering" is not necessary if you train your SPAM filter correctly.

  • Message #8

    SpamFlushInterval defines the length of the interval in seconds, while SpamFlushThres defines how many messages must have been classified for the save to actually be performed. You may change these values, but this is NOT recommended either!

    Thanks for that. I've taken the risk of setting SpamFlushThres to 1 so that any changes to the spam training get saved straight away. (On my own head be it :-))

    Another thing I've noticed is that the spam filter seems to be applied before the "ordinary filters". I would prefer it the other way round, as almost all of these filters deal with mailing lists (for me) and these are almost all spam free.

    Well, there are a couple with spam, so it might be nice to be able to set the order of filtering, i.e. filters a, b, c, then the spam filter, then filters d, e, f.

    Example: I've never received any spam from aweb-dev@… and it's a pain to have to dig out bug reports from the spam folder.

  • Message #7
    1. I'm sure there was an option to set a threshold on the % likelihood that a mail was spam. I had it set to 80% and it worked fine for me. (I'd much rather remove spam from my inbox than good mail from my spam box.)

    Has this option gone, I just can't seem to find it now!

    The option has been removed from the GUI because the default value should suffice. You can change it manually nevertheless. Just change SpamProbThreshold to the desired value, although this is NOT recommended. Accepted range is 75 to 99.
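
    The range rule can be sketched in a few lines of Python. This is only an illustration of the behaviour described in this thread (values outside 75 to 99 are aligned to the nearest allowed value), not YAM's actual code; the function and constant names are invented.

```python
# Accepted range for SpamProbThreshold as stated in this thread.
SPAM_PROB_MIN = 75
SPAM_PROB_MAX = 99


def clamp_spam_threshold(value: int) -> int:
    """Hypothetical helper: align a configured threshold to the nearest
    allowed value, as the thread describes for out-of-range settings."""
    return max(SPAM_PROB_MIN, min(SPAM_PROB_MAX, value))


for configured in (1, 80, 100):
    print(configured, "->", clamp_spam_threshold(configured))
```

    So a hand-edited value of 80 is kept, while 1 would silently become 75 and 100 would become 99.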

    2. I was confused, as the spam training data never used to "stick": every time I looked in the spam config section the data remained the same, despite me having marked or unmarked a number of mails the previous day / session.

    I've just realised that this is because .spamdata is only saved at program exit. Is there an option to save it without quitting YAM? Or can it be made to save every time I mark a mail as spam or not spam? I don't like to quit YAM; I almost never quit YAM or AWeb and similar programs when I switch off my Amiga. This is safe with AWeb but apparently not with YAM (or at least not in respect of .spamdata).

    Don't expect YAM to save every little bit of information in an instant. If you switch off your computer without quitting YAM you risk losing the up-to-date index of your mail folders. This index can be restored, though, because the mail files still exist. But the spam information is irrevocably lost if YAM didn't have the chance to save it to disk.

    YAM will save the training data:

    1. when being quit
    2. at definable intervals, but only if enough new mails have been classified since the last save.

    SpamFlushInterval defines the length of the interval in seconds, while SpamFlushThres defines how many messages must have been classified for the save to actually be performed. You may change these values, but this is NOT recommended either!
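
    To make the interplay of the two values concrete, here is a rough Python model of the save rule as described in this thread. It is a sketch, not YAM's implementation: the class and method names are invented, and the numbers used (900 seconds, 50 mails) are simply the values mentioned by posters here.

```python
from dataclasses import dataclass


@dataclass
class SpamDataFlusher:
    """Hypothetical model of the .spamdata save rule described above."""
    flush_interval: int = 900   # seconds between checks (SpamFlushInterval)
    flush_threshold: int = 50   # classified mails needed (SpamFlushThres)
    pending: int = 0            # mails classified since the last save
    last_check: float = 0.0     # time of the last interval check
    saves: int = 0              # number of saves performed so far

    def classify(self) -> None:
        """The user marked one mail as spam or not spam."""
        self.pending += 1

    def tick(self, now: float) -> bool:
        """Called periodically; saves only if the interval has elapsed
        AND enough mails were classified. Returns True if it saved."""
        if now - self.last_check < self.flush_interval:
            return False
        self.last_check = now
        if self.pending < self.flush_threshold:
            return False
        self._save()
        return True

    def quit(self) -> None:
        """Quitting always writes the training data."""
        self._save()

    def _save(self) -> None:
        self.saves += 1
        self.pending = 0


# With a threshold of 50, one marked mail is not saved at the next check.
default = SpamDataFlusher()
default.classify()
print(default.tick(900.0))   # interval elapsed, but only 1 of 50 mails

# With a threshold of 1, the same mail is saved at the next check.
eager = SpamDataFlusher(flush_threshold=1)
eager.classify()
print(eager.tick(900.0))
```

    In this model a threshold of 1 does not make the check run more often; it only means that a single classified mail is enough for the save to happen at the next 900-second check.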

    Better NEVER switch off your computer without shutting down all major applications. You can never know whether some applications keep information in memory to speed things up (like YAM does) and only save it upon exit. You will definitely lose data!!

  • Message #6

    I have been having problems with the spam filter lately.

    Since the last update I downloaded (21/04/2007 (GCC 4.0.4) OS4) the spam filter has become much more aggressive, treating almost all mail as spam, when before it was working about right.

    From reading the above I now understand how the filter is trained, and perhaps during the update the training data was overwritten.

    No, an update shouldn't overwrite your training data. However, as you are using a developer nightly build you have to expect anything. So please reset your training data and start the training from scratch. And then please also have a look at the FAQ item about how to train the SPAM filter correctly:

    http://yam.ch/wiki/FAQ/Using%20YAM#HowtotrainYAMsSPAMfilterproperly

    This should pretty much outline how you properly use the SPAM filter in YAM.

    But two questions arise:

    1. I'm sure there was an option to set a threshold on the % likelihood that a mail was spam. I had it set to 80% and it worked fine for me. (I'd much rather remove spam from my inbox than good mail from my spam box.)

    Has this option gone, I just can't seem to find it now!

    Yes, this option was removed because it was not necessary and just confused people. The internal default value should already be good enough for the SPAM filter to work correctly.

    2. I was confused, as the spam training data never used to "stick": every time I looked in the spam config section the data remained the same, despite me having marked or unmarked a number of mails the previous day / session.

    I've just realised that this is because .spamdata is only saved at program exit. Is there an option to save it without quitting YAM? Or can it be made to save every time I mark a mail as spam or not spam? I don't like to quit YAM; I almost never quit YAM or AWeb and similar programs when I switch off my Amiga. This is safe with AWeb but apparently not with YAM (or at least not in respect of .spamdata).

    Well, the answer is easy: you are not supposed to turn off or reset your Amiga while YAM is still running. Besides the problem with saving the SPAM data, there might also be other issues you provoke by just blindly turning off your Amiga. AWeb is just a web browser and doesn't keep that much temporary data that it needs for the next session, but YAM is different. It not only holds the SPAM data until the next save, it also holds mail folder index data until a predefined saving interval. So, for example, if you change the status of a mail and immediately turn off your Amiga, that invalidates the mail index of the folder where this mail is stored. So the simple answer/solution is: make sure YAM is closed before you turn off your Amiga.

  • Message #5

    I have been having problems with the spam filter lately.

    Since the last update I downloaded (21/04/2007 (GCC 4.0.4) OS4) the spam filter has become much more aggressive, treating almost all mail as spam, when before it was working about right.

    From reading the above I now understand how the filter is trained, and perhaps during the update the training data was overwritten.

    But two questions arise:

    1. I'm sure there was an option to set a threshold on the % likelihood that a mail was spam. I had it set to 80% and it worked fine for me. (I'd much rather remove spam from my inbox than good mail from my spam box.)

    Has this option gone, I just can't seem to find it now!

    2. I was confused, as the spam training data never used to "stick": every time I looked in the spam config section the data remained the same, despite me having marked or unmarked a number of mails the previous day / session.

    I've just realised that this is because .spamdata is only saved at program exit. Is there an option to save it without quitting YAM? Or can it be made to save every time I mark a mail as spam or not spam? I don't like to quit YAM; I almost never quit YAM or AWeb and similar programs when I switch off my Amiga. This is safe with AWeb but apparently not with YAM (or at least not in respect of .spamdata).

  • Message #4

    i have now reset the training data and it seems to behave as described, meaning all incoming mail is marked as spam. now i'm fine-tuning it :D always fun.

    byebye...

  • Message #3

    thanks jens.

    very nice and clear explanation of the spam filter stuff. i think i understood that the spam filter needs to be trained. that's clear. but i am not sure if i trained it correctly. i believe that the "number of (not) spam classified emails" reflects the current training status. here it tells me "81" as "not spam" and "17" as "spam". i wonder why only 17 are classified as spam as i am sure that i marked many more mails as spam. where is this value coming from? i'll play around with it a little more.

    Well, you really have to train the SPAM filter in *both* directions. That means you need to train it on mail that is actually SPAM and on mail that is obviously NOT spam. The more you train the SPAM engine on both of them, the more confident it will get in automatically recognizing real SPAM mails.

  • Message #2

    now my question: how do i train the spam filter? the prefs tell me that 29 messages are marked as spam. but how did i do that?? selecting "set status to spam" in the context menu does not seem to raise the count of 29 spam mails. my spam folder currently holds over 80 mails.

    if i try to mark a mail from the spam folder as spam (set status to spam) this entry is ghosted. but if i select several mails the entry in the popup menu is selectable (does nothing though).

    Well, if you know Thunderbird on other platforms then you should know how the SPAM filter implementation in YAM currently works. In fact, it is more or less a straight port from Thunderbird and therefore behaves the very same way.

    However, if you don't know Thunderbird's SPAM functionality, the way to work with the SPAM engine may be a bit unintuitive at first. In fact, you need to *train* the SPAM filter properly before it behaves the way you would like it to.

    If you enable the SPAM filter for the first time (or if you reset it) it will automatically mark ALL incoming mail as SPAM. Afterwards it is *your* job to tell the SPAM filter which mails are real SPAM and which mails are not SPAM, according to their content.

    So let's say you enable the SPAM filter for the first time: it will then automatically mark all new incoming mail as SPAM. After that you have to go through all that mail manually and check whether it really is SPAM or not. If not, you need to mark those false positives as so-called HAM or "not SPAM" in YAM. After you have done so, the filter will be a little bit trained and should no longer mark all new incoming mail as SPAM automatically. As time goes by, the number of mails whose SPAM status you have to correct manually decreases with every tagged/untagged SPAM mail. In fact, around 50 mails should already be enough to properly sort out 70-80% of common SPAM mails. And if you go on with the training you should very shortly reach the point where the SPAM filter works properly and is automatically confident up to 98-99%.

    So, as a matter of fact, all you need to do is train the SPAM engine *manually* by telling it which mails are SPAM and which mails are *not* SPAM. Once you understand that technique the SPAM engine is very intuitive...

    thanks jens.

    very nice and clear explanation of the spam filter stuff. i think i understood that the spam filter needs to be trained. that's clear. but i am not sure if i trained it correctly. i believe that the "number of (not) spam classified emails" reflects the current training status. here it tells me "81" as "not spam" and "17" as "spam". i wonder why only 17 are classified as spam as i am sure that i marked many more mails as spam. where is this value coming from? i'll play around with it a little more.

    thanks and byebye...

  • Message #1

    hi.

    after several years of sticking with microdot2 i'm now back to YAM. great progress so far. congrats!

    Thanks, we really worked hard to get YAM better and better and we hopefully can release the final and stable version of 2.5 during the next few months...

    now my question: how do i train the spam filter? the prefs tell me that 29 messages are marked as spam. but how did i do that?? selecting "set status to spam" in the context menu does not seem to raise the count of 29 spam mails. my spam folder currently holds over 80 mails.

    if i try to mark a mail from the spam folder as spam (set status to spam) this entry is ghosted. but if i select several mails the entry in the popup menu is selectable (does nothing though).

    Well, if you know Thunderbird on other platforms then you should know how the SPAM filter implementation in YAM currently works. In fact, it is more or less a straight port from Thunderbird and therefore behaves the very same way.

    However, if you don't know Thunderbird's SPAM functionality, the way to work with the SPAM engine may be a bit unintuitive at first. In fact, you need to *train* the SPAM filter properly before it behaves the way you would like it to.

    If you enable the SPAM filter for the first time (or if you reset it) it will automatically mark ALL incoming mail as SPAM. Afterwards it is *your* job to tell the SPAM filter which mails are real SPAM and which mails are not SPAM, according to their content.

    So let's say you enable the SPAM filter for the first time: it will then automatically mark all new incoming mail as SPAM. After that you have to go through all that mail manually and check whether it really is SPAM or not. If not, you need to mark those false positives as so-called HAM or "not SPAM" in YAM. After you have done so, the filter will be a little bit trained and should no longer mark all new incoming mail as SPAM automatically. As time goes by, the number of mails whose SPAM status you have to correct manually decreases with every tagged/untagged SPAM mail. In fact, around 50 mails should already be enough to properly sort out 70-80% of common SPAM mails. And if you go on with the training you should very shortly reach the point where the SPAM filter works properly and is automatically confident up to 98-99%.

    So, as a matter of fact, all you need to do is train the SPAM engine *manually* by telling it which mails are SPAM and which mails are *not* SPAM. Once you understand that technique the SPAM engine is very intuitive...
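
    To illustrate what training "in both directions" means, here is a toy word-count classifier in Python. It is a generic Laplace-smoothed naive Bayes sketch and NOT the algorithm YAM or Thunderbird actually uses; all names are invented, and unlike YAM as described above, this toy returns a neutral 0.5 when untrained instead of marking everything as SPAM.

```python
import math
import re
from collections import Counter


class ToySpamFilter:
    """Hypothetical toy classifier, only to show two-sided training."""

    def __init__(self):
        self.spam_tokens = Counter()  # in how many spam mails a word occurred
        self.ham_tokens = Counter()   # in how many ham mails a word occurred
        self.spam_mails = 0
        self.ham_mails = 0

    @staticmethod
    def tokenize(text):
        # Each distinct lowercase word counts once per mail.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_mails += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_mails += 1

    def spam_probability(self, text):
        total = self.spam_mails + self.ham_mails
        # Smoothed priors and per-word likelihoods, summed in log space.
        log_spam = math.log((self.spam_mails + 1) / (total + 2))
        log_ham = math.log((self.ham_mails + 1) / (total + 2))
        for tok in self.tokenize(text):
            log_spam += math.log((self.spam_tokens[tok] + 1) / (self.spam_mails + 2))
            log_ham += math.log((self.ham_tokens[tok] + 1) / (self.ham_mails + 2))
        diff = log_ham - log_spam
        # Numerically stable logistic function of -diff.
        if diff > 0:
            e = math.exp(-diff)
            return e / (1.0 + e)
        return 1.0 / (1.0 + math.exp(diff))


flt = ToySpamFilter()
for _ in range(3):
    flt.train("buy cheap pills now", is_spam=True)          # "this is SPAM"
    flt.train("aweb bug report about a crash", is_spam=False)  # "this is NOT spam"

print(round(flt.spam_probability("cheap pills"), 3))
print(round(flt.spam_probability("aweb bug report"), 3))
```

    The point of the sketch is only that the classifier needs examples of both kinds: the counts for "not spam" mail matter just as much as the counts for spam when a new mail is scored.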
