How Apple's Mail.app Junk Filter Works

Posted by pudge on Wednesday May 19, 2004 @12:55AM from the would-you-like-to-buy-a-monkey? dept.

fmorgan writes "O'Reilly has now posted the second part of an article about Mac OS X Mail.app spam filtering with more details on what this technology is (and isn't): 'Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic ... Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system.'"

This discussion has been archived. No new comments can be posted.

Magic (Score:4, Funny)

by Faust7 ( 314817 ) writes: on Wednesday May 19, 2004 @12:56AM (#9192713) Homepage

and no, it doesn't use white magic...

Black, then?
Or is that reserved exclusively for Microsoft?

- Re:Magic (Score:5, Funny)
  
  by Jameth ( 664111 ) writes: on Wednesday May 19, 2004 @01:18AM (#9192804)
  
  and no, it doesn't use white magic...
  
  Black, then? Or is that reserved exclusively for Microsoft?
  It's not reserved, they have a monopoly.
  
- Re:Magic (Score:2)
  
  by lpangelrob2 ( 721920 ) writes:
  
  Black, then?
  I would have to imagine it would be a little more like red magic [planetnintendo.com]. Pretty versatile, borrows a bit of both, and largely effective, but if you want hardcore effects, you'll have to go all white or all black.
- Re:Magic (Score:3, Funny)
  
  by Inf0phreak ( 627499 ) writes:
  
  Oh yes. I can just imagine how some of the code looks:
  
  if (isspam(mailentry)) HADOKEN(mailentry);
  
  Go here [nuklearpower.com] for an explanation (funny webcomic IMO).
- Information Retrieval (Score:5, Funny)
  
  by ScottGant ( 642590 ) writes: <scott_gant AT sbcglobal DOT netNOT> on Wednesday May 19, 2004 @08:52AM (#9194338) Homepage
  
  This is Information Retrieval not Information Dispersal...Information Transit got the wrong man. I got the right man. The wrong one was delivered to me as the right man, I accepted him on good faith as the right man. Was I wrong?
  
  My name's Lowry. Sam Lowry. I've been told to report to Mr. Warrenn.
  Thirtieth floor, sir. You're expected.
  Um... don't you want to search me?
  No sir.
  Do you want to see my ID?
  No need, sir.
  But I could be anybody.
  No you couldn't sir. This is Information Retrieval.
  
  There you are, your own number on your very own door. And behind that door, your very own office! Welcome to the team, D7-105! Welcome to Information Retrieval
  
Maybe... (Score:5, Interesting)

by ErichTheWebGuy ( 745925 ) writes: on Wednesday May 19, 2004 @01:02AM (#9192734) Homepage

Microsoft can learn a lesson here? Especially in the light of this hole [securityfocus.com], from which a spammer can clearly see that you have opened their messages and validate your address...

- Re:Maybe... (Score:5, Informative)
  
  by Anonymous Coward writes: on Wednesday May 19, 2004 @01:24AM (#9192832)
  
  That's why, at our site, all incoming email goes through the Anomy Sanitizer [anomy.net]. It removes unknown HTML tags, like <vframe> or <script>, and filters offsite images to eliminate so-called web-bugs [eff.org].
  
  Oh, and it's fast, too.
  
  - Re:Maybe... (Score:2)
    
    by ErichTheWebGuy ( 745925 ) writes:
    
    Sweet, thanks for the info. I will look into deploying it at our site.
  - Re:Maybe... (Score:5, Interesting)
    
    by nacturation ( 646836 ) writes: <nacturationNO@SPAMgmail.com> on Wednesday May 19, 2004 @02:29AM (#9193048) Journal
    
    I assume web bug images aren't filtered out if they are, for example:
    
    http://host.com/images/1F59C6EA.jpg
    
    A spammer could set up their server (mod_rewrite, I think?) so that this gets translated to:
    
    http://host.com/serve_image.php?email_id=1F59C6EA
    
    This would still verify the email address and would generally be transparent to the user. The filter could get smarter and search for numbers, but this is also easily overcome by dictionary words. If you used five-letter words, you'd have about 10,000 of them to use. You could then represent 100,000,000 (10,000 ^ 2) email addresses using only two five-letter words in succession in a URL, such as:
    
    http://host.com/img/abash/zymin/logo.jpg
    
    and rewriting it as before. Each user gets a unique combination of two words that uniquely identifies them. If abash is the 9th word and zymin is the 9914th word, then this is user id (9 * 10,000 + 9914) = 99,914.
    
    Really, the only solution to web bugs is to not load images from unknown senders. Make the user manually load images (Mail.app has this feature, as do many other clients) if they are not attached as files with the message.
    
    - Re:Maybe... (Score:3, Informative)
      
      by That's Unpossible! ( 722232 ) * writes:
      
      I assume web bug images aren't filtered out if they are, for example:
      
      http://host.com/images/1F59C6EA.jpg
      
      You assume wrong. The guy you're responding to said they remove offsite image tags. So unless the images are embedded in the email (i.e. not web-bugs), they aren't displayed.
      
      You cannot filter web-bugs and still leave images pointing offsite, obviously.
  - Re:Maybe... (Score:3, Interesting)
    
    by Merk ( 25521 ) writes:
    
    Why leave any HTML? Does <blink> make a message more compelling? Do you really need someone to send a message with balloons in the background? If someone really likes the handwriting font, should I be forced to see that in their email?
    
    Sure, sometimes in a complex email it would be nice to be able to use headers or bulleted lists. But nobody should be able to force me to display the message with their ugly-ass markup.
    The only thing that makes any sense here is to use strict stylesheet-based m
    - Re:Maybe... (Score:3, Insightful)
      
      by orasio ( 188021 ) writes:
      
      (I was going to mod you down, but I understood that it's a good comment; I just think you are wrong)
      
      Nonsense. HTML mail should be rendered as HTML. If you want to see text-only, or something, you can just read mail as text-only, in your client. If I send mail with balloons, it is because I want people to see my beautiful balloons and gothic handwriting. Messing with that is mangling communication; the other person thinks you saw something you didn't.
      
      No one I know abuses HTML mail to the extent of making it h
      - Re:Maybe... (Score:3, Funny)
        
        by ChaosDiscord ( 4913 ) writes:
        
        Maybe you just need to be more picky about giving your address to people.
        
        I tried that, but my boss got angry when I refused to give him my business address.
      - Re:Maybe... (Score:3, Funny)
        
        by Golias ( 176380 ) writes:
        
        <h1><blink><b>I totally agree!!! </b><blink></h1><table width="980"><tr><td width="206" align="center">It seems to me </td><td width="780">that converting HTML to <i>plain old text</i> should be a <blink><strong>perfectly fine</strong></blink> choice for those who don't want to read your <ul>dumbass, pointless markup. </ul></td></tr></table><p> Some people really <b
- Re:Maybe... (Score:5, Informative)
  
  by karmatic ( 776420 ) writes: on Wednesday May 19, 2004 @01:43AM (#9192906)
  
  Macs are vulnerable to the so-called "hole" as well. In fact, _any_ HTML-compliant email client with image support is.
  
  For example, I wrote some software which takes your email address, and assigns a 5 letter id. The img tag loads an image with the url http://mailserver/get/yourid/image.gif
  
  From this, it's possible to tell 1) If the email is valid, 2) If you click the image (the url contains your ID) 3) How long before you click 4) If you buy.
  
  So, if you're dumb enough to buy from spam you get on a sucker list.
  
  Quit blaming MS - they are unfortunately the ones who introduced HTML mail, but everyone else who follows suit has problems too.
  
  - Re:Maybe... (Score:5, Informative)
    
    by tkokesh ( 668827 ) writes: on Wednesday May 19, 2004 @01:57AM (#9192949) Homepage
    
    Actually, Mail.app in Mac OS X 10.3 (Panther) has an option in the "Viewing" Preferences: "Display images and embedded objects in HTML messages".
    When this option is unchecked, the user has to click a specific "Load Images" button in order to see the images in an HTML email, which means that the GIF does not get loaded unless the user lets it. For obvious spam emails, of course, the user can just junk the email, and the spammer gets no confirmation of delivery.
    
    - Re:Maybe... (Score:3, Informative)
      
      by myov ( 177946 ) writes:
      
      Messages flagged as spam do not display images (until you click Load Images). I requested this feature a while ago because of all the web bugs embedded in spam.
      - Good god, man (Score:5, Informative)
        
        by thatguywhoiam ( 524290 ) writes: on Wednesday May 19, 2004 @08:23AM (#9194118)
        
        Wow, a checkbox buried in the preferences options. Apple is unique and ahead of the curve. But wait! There is a fix for outlook too [msnwar.com].
        Well, since you brought it up, yes, let's compare:
        Apple method:
        Open Prefs
        Click Viewing Options
        Uncheck 'Display images and embedded objects in HTML messages'
        
        ... or I can go hunting on the web for this weirdo, non-sanctioned 'patch' for Outlook, and install that. Oh yeah, and ZoneAlarm.
        I'll stick with Apple's method thanks.
        - Re:Good god, man (Score:3, Informative)
          
          by fanfriggintastic ( 751454 ) writes:
          
          Images are off by default in Outlook 2003. You can turn them on for a particular sender or per email, easily, through a link at the top of the message. Piece of cake.
        - Re:Good god, man (Score:3, Informative)
          
          by geoffspear ( 692508 ) * writes:
          
          My three mouse buttons all work perfectly well with my Mac. They don't restrict you to anything, they just sell their machines with a one-button mouse.
          I don't even need to go hunting for drivers to install if I want to plug in another mouse, or damn near any other USB device. They just work.
  - Not if email is marked as junk... (Score:5, Informative)
    
    by SuperKendall ( 25149 ) * writes: on Wednesday May 19, 2004 @02:40AM (#9193080)
    
    If an email is marked as junk, even if you go to look at it to see if it's really junk no images are loaded so this tracker does not work.
    
    As others have mentioned you can also turn off images for all messages, which is what I would do if it ever started missing spam. So far only one miss in the last six months or so, and no false positives. I'm pretty impressed.
    
    - Re:Not if email is marked as junk... (Score:3, Informative)
      
      by soft_guy ( 534437 ) writes:
      
      I use Mail.app, I have Panther, and I keep everything current. Still, Mail often misses many pieces of spam every day and gives me false positives from time to time. YMMV. Still, I find the junk mail filter useful enough to leave on.
- Re:Maybe... (Score:3, Informative)
  
  by bigberk ( 547360 ) writes:
  
  from which a spammer can clearly see that you have opened their messages and validate your address...
  
  That's old news, I wrote the solution [pc-tools.net] three years ago. Just use a mail client such as this one that strips HTML.
- Re:Maybe... (Score:3, Informative)
  
  by rritterson ( 588983 ) * writes:
  
  Or you can just set Outlook 2003 to not parse HTML and show it as code instead. You can also tell it not to download images by default, which prevents another possible 'notifier'.
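
The two-word tracking scheme nacturation describes earlier in this thread is ordinary base-10,000 arithmetic: one word is the high digit, the other the low digit. A minimal Python sketch follows. It assumes a hypothetical 10,000-entry word list (generated here as placeholder five-letter strings, not real dictionary words) and reuses the made-up host.com URL layout from that comment; the function names are illustrative only.

```python
from itertools import islice, product
from string import ascii_lowercase

# Hypothetical word list: the first 10,000 five-letter strings in
# alphabetical order stand in for real dictionary words.
WORDS = ["".join(p) for p in islice(product(ascii_lowercase, repeat=5), 10_000)]
INDEX = {word: i for i, word in enumerate(WORDS)}
BASE = len(WORDS)  # 10,000 words, so two words cover 10,000 ** 2 ids


def user_id_to_url(user_id: int) -> str:
    """Encode a numeric id as two path segments that look like folders."""
    if not 0 <= user_id < BASE * BASE:
        raise ValueError("id out of range for two words")
    high, low = divmod(user_id, BASE)
    return f"http://host.com/img/{WORDS[high]}/{WORDS[low]}/logo.jpg"


def url_to_user_id(url: str) -> int:
    """Recover the id on the server side, as a rewrite rule would."""
    parts = url.split("/")
    high_word, low_word = parts[-3], parts[-2]
    return INDEX[high_word] * BASE + INDEX[low_word]


# The worked example from the comment: word 9 followed by word 9914.
tracking_url = user_id_to_url(9 * 10_000 + 9914)
recovered = url_to_user_id(tracking_url)
```

Because the identifier lives in the path and looks like ordinary folder names, a filter that hunts for digits or hex strings has nothing to match, which is why the thread keeps coming back to not loading remote images at all.
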
Vectors..... (Score:4, Interesting)

by BWJones ( 18351 ) * writes: on Wednesday May 19, 2004 @01:03AM (#9192738) Homepage Journal

Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. I know it sounds quite geeky but if you can visualize that, you're halfway there.

Ah, it uses vector math. With Altivec, no wonder Mail is so damned fast.

The other really interesting thing about Mail is that it implements clustering algorithms to rank and group, which makes me wonder why more GIS software is not running on OS X. Image classification would be a no-brainer for folks that spend their time examining images and multispectral datasets.
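
The quoted passage treats each message as a vector with one number per word in the corpus. A minimal Python sketch of that representation follows. It assumes naive lowercase whitespace tokenization and raw counts; the article does not say how Mail.app tokenizes or weights terms, and the sample messages are made up.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def build_vocabulary(documents: list[str]) -> list[str]:
    """Sorted list of every distinct word in the corpus: one dimension each."""
    return sorted({word for doc in documents for word in tokenize(doc)})


def to_vector(text: str, vocabulary: list[str]) -> list[int]:
    """Map a document to n counts, where n is the vocabulary size."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Made-up corpus of three short messages.
corpus = [
    "buy cheap pills now",
    "meeting notes for monday",
    "cheap cheap watches buy now",
]
vocab = build_vocabulary(corpus)
vectors = [to_vector(doc, vocab) for doc in corpus]
```

Every message ends up as a point in the same n-dimensional space, so messages of any length can be compared coordinate by coordinate.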

- Re:Vectors..... (Score:2)
  
  by mrpuffypants ( 444598 ) * writes:
  
  The other really interesting thing about Mail is that it implements clustering algorithms to rank and group, which makes me wonder why more GIS software is not running on OS X. Image classification would be a no-brainer for folks that spend their time examining images and multispectral datasets.
  
  Yes, that is important and all, but the real question is: "How fast does it play PORN?" Truly that is a real multispectral dataset that needs to be examined using floating points. heh.
- Fast?!? (Score:5, Interesting)
  
  by SuperBanana ( 662181 ) writes: on Wednesday May 19, 2004 @01:44AM (#9192915)
  
  With Altivec, no wonder Mail is so damned fast.
  Sorry, but I couldn't let this one slide. You've obviously got a special interpretation of "fast", because I tried migrating my Eudora mailboxes to Mail, on a 1GHz Powerbook G4.
  Mail CHOKED on them. The early version of Mail chugged for two-something hours and I gave up and killed it. The latest version was slightly better; 1000 messages or so still took well over 10 minutes. It takes Eudora about 10 seconds to rebuild those big mailboxes (deleted messages aren't actually deleted until Eudora gets around to rebuilding the mailbox; you can set the limit based on percentage of the mailbox, raw MB, I think even % remaining disk space), or you can force it manually with one click in that mailbox's window. My inbox is 820, and several mailing list boxes are well over 5,000 if I forget to clean them out. I have hundreds of MB of mail, and Eudora handles most operations with little performance hit no matter how big the mailbox gets (there is a limit of around 32,000 messages however, which someone I know hit).
  But that was just the importing; then it had to thread them or something, and THEN it had to index them all, both of which it did in the background, but it still took forever.
  Searching? Well, ok, it's "better" than Eudora in that it gives relevancy and Eudora is an on/off sorta deal, but that's fine, and I prefer 1 second for an exact search in a 2,000 message mailbox over 5-10 seconds for a fuzzy search.
  Sorry, but Eudora, despite being a lumbering dinosaur technology-wise (MIME support is broken; PGP-MIME just doesn't work right; no address book integration is another thing that really irritates me), is just plain hands-down the fastest mail client around.
  The MBOX-with-index format also works exceedingly well, is portable (although some minor massaging with text-processing tools may be needed in some cases), and hard to corrupt, unlike almost every other mail client's DB (especially Outlook). I've used Eudora for ten years, and never lost a single message except for one early beta version which munged a mailbox on me.
  
  - Re:Fast?!? (Score:5, Interesting)
    
    by pHDNgell ( 410691 ) writes: on Wednesday May 19, 2004 @02:03AM (#9192968)
    
    Sorry, but I couldn't let this one slide. You've obviously got a special interpretation of "fast", because I tried migrating my Eudora mailboxes to Mail, on a 1GHz Powerbook G4.
    
    Mail CHOKED on them.
    
    Everyone's got a story and a counter-story. I've got over 100,000 messages in IMAP (101,269 as of last night, but it goes up and down), fully synced to Mail.app (bodies and attachments) indexed for searching, and used every day. It's split over 250 mail boxes (one for each month I've sent or received email as long as I've been keeping stuff).
    
    It's amazingly fast. It makes my mail server seem fast (Sun IPX running SunOS 4.1.4 with a custom cyrus IMAPd that supports compressed mail stores and LDAP and some other stuff).
    
    (Sorry for all the parentheticals. :)
    
    - Re:Fast?!? (Score:5, Funny)
      
      by Alan ( 347 ) writes: <arcterex@ufies.oPARISrg minus city> on Wednesday May 19, 2004 @02:19AM (#9193013) Homepage
      
      Dude, you seriously need to seek help for your mail-archiving condition :)
      
      Or if nothing else move some of the mail to a backup directory so the poor little imap server doesn't have to deal with YOUR pack-rat habits!
      
      - Re:Fast?!? (Score:3, Interesting)
        
        by richie2000 ( 159732 ) writes:
        
        Has anything good come out since SunOS 4.1.4?
        I don't think so. Considering the time it took to get 4.1.4 as the proverbial gift from the Gods, I wouldn't hold my breath. ;-)
        Damn, I actually miss SunOS, SunView and the 3/80s we had at school...
  - Re:Fast?!? (Score:3, Interesting)
    
    by Rosyna ( 80334 ) writes:
    
    Uhm, I've got about 5 mailboxes that have hit this 32760 message limit (dunno why but they recently reduced it to 32000).
    
    My Mail folders contain 2.31 gigs of email. Mail cannot handle this and chokes on it horribly. Eudora handles it like a champ. Too bad its junk mail filter sucks.
    - Re:Fast?!? (Score:3, Interesting)
      
      by EvilTwinSkippy ( 112490 ) writes:
      
      As a network administrator I just have to do a paternalistic scowl at you.
      2.3 gig of email. Dear god, our server only has a 20 gig hard drive. I'd be camped out at your office (or send a co-op to camp in your office) and make disparaging remarks about "bloat" until you trimmed up a bit.
      If everything is important, nothing is important. 32,000 messages means you aren't real picky.
      - Re:Fast?!? (Score:5, Informative)
        
        by EvilTwinSkippy ( 112490 ) writes: <yoda.etoyoc@com> on Wednesday May 19, 2004 @10:53AM (#9195261) Homepage Journal
        
        Where to start...
        First off, servers take SATA or SCSI, not the cheap IDE drives you find on the net. Second, even if you could find equivalent sizes for equivalent prices for server-grade stuff, I can't speak for everyone, but users don't store anything on my network that isn't on a RAID. 2 drives for a RAID-1, 3 (at least) for RAID-5.
        Assuming that cost isn't an issue, and you have a miraculous RAID controller that is easy to program, you run into the problem of how to hook up the new drives. If you don't have enough bays and connectors you have to drop your old hard drives to tape, plug in your new drives, and restore.
        The last time I did a restore of 160GB it took 48 hours with a DLT autoloader. AIT might cut that down to 12 hours. But that's still a long time to be without data.
        I'll save the issues about premature failure on these uber-mega drives for another discussion.
        Now I insist our users use IMAP for email. Too many bad experiences of desktops croaking and taking all of a user's POP mailboxes with it. Making your system catalogue several gigabytes of email per user is going to slow things to a crawl, unless you are using something enlightened like maildir. Even then, you are going to be hard pressed to find a file system that efficiently handles both uber-mega attachments AND a few million tiny text files for individual messages.
        All for what? So some user doesn't have to be bothered to clean out their mailbox?
        No problem, except the next thing El' numbnuts is going to ask for is a tool to actually FIND something in all that mess.
        
      - Re:Fast?!? (Score:3, Informative)
        
        by Rosyna ( 80334 ) writes:
        
        The limit exists on OS X (at least) because of a limit of the Resource Manager. Each message in the mbox on OS X has its index and other data in the resource fork. One for each message. There is a 16-bit limit on the number of resources in a file (and a 16meg limit for the entire resource fork). It is also why some OS X developers keep asking Apple to FREAKIN IMPLEMENT NAMED FORKS ALREADY!
  - Re:Fast?!? (Score:4, Informative)
    
    by alannon ( 54117 ) writes: on Wednesday May 19, 2004 @02:50AM (#9193106)
    
    One of the reasons that Eudora tends to be fast for some things when Mail.app isn't is that Eudora does not store attachments with the mail. It splits them off at download-time into a separate folder. Mail.app keeps the entire mail envelope intact, including attachments. This makes Mail.app often very, very slow when moving large numbers of messages around, simply because it's doing a lot of file manipulation. I will admit, though, that Mail.app often feels very sluggish. Apple needs to work on that.
    
  - Re:Fast?!? (Score:4, Interesting)
    
    by nikster ( 462799 ) writes: on Wednesday May 19, 2004 @05:26AM (#9193576) Homepage
    
    Mail CHOKED on them
    
    it helps to check Apple apps _again_ from time to time since they tend to make huge improvements with every release. Mail.app has not been slow for a while now. Apple seems to pretty consistently follow the strategy "make it work first, make it fast later". i am running the latest version on OS X 10.3.
    
    I have about 1G of mail and it doesn't really seem slow in any situation, even though it's running on an almost 3-year-old 667MHz powerbook (with a sloooow hard disk).
    
    I just did a test of search entire message in all mailboxes (all 1G of them). the first results appeared after 3 seconds, and it stopped after 40 secs, rebuilding some indexes along the way. the second search was done in about 15 seconds.
    
    Every single criticism i had since Mail 1.0 - and there were a lot, including performance - has since been addressed. It is now fast, no annoying modal dialogs, no indexing behind your back, no weird delays. It's just a beautiful mail client.
    
    i recommend you try it again.
    
    On topic: The junk mail filter seems to indeed work pretty well. i just checked my junk mail folder (2000 unread messages, heh): All except for 5 were spam, and those 5 were all mass mailings, too. Even clever(?) subject lines like v$a.g.r.a and such were filtered out.
    
    Oddly, 3 of the 5 false positives were from Apple, sent to my .mac account.
    
  - Re: Fast?!? (Score:3, Interesting)
    
    by teridon ( 139550 ) writes:
    
    I had the same experience with Mail -- I let it chug away *overnight* to import my mail. The next day when I tried actually *using* Mail it was too slow compared to Eudora. What a waste of time :(
    
    FYI, Eudora 6.1 now has address book integration. See here [eudora.com]
- Re:Vectors..... (Score:5, Insightful)
  
  by RovingSlug ( 26517 ) writes: on Wednesday May 19, 2004 @01:45AM (#9192919)
  
  Ah, it uses vector math. ... Image classification would be a no brainer for folks that spend their time examining images and multispectral datasets.
  Ugh. The magic doesn't come from vectors. Vectors are just how you throw the numbers around. The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, you usually use the chi-squared test. So, forget all about "vectors", the real workhorse is the histogram. And, we can discuss "clustering", but it's just as important to know how you're measuring the distance from one document to another.
  Image clustering is hard, and the problem comes from picking a good representation of the image. Of course, a "word histogram" for an image makes no sense. Just considering pixel intensity or pixel color doesn't work either. You usually have to start looking at things like lines, curvatures, intersections, texture patterns, etc. Once you decide which tools you're going to use to describe an image and the algorithms to calculate them, you can start talking about how far away one image is from another, which then naturally leads to clustering techniques. But, the hard part about the clustering is getting them into a space in which they actually, nicely cluster.
  I had to stop reading the article because it was so clearly written by someone who had no comfort with the mathematical concepts or techniques. (Sorry, but seriously, it's the blind leading the blind.)
  
  - Re:Vectors..... (Score:5, Informative)
    
    by BWJones ( 18351 ) * writes: on Wednesday May 19, 2004 @02:01AM (#9192963) Homepage Journal
    
    The magic doesn't come from vectors. Vectors are just how you throw the numbers around
    
    And your point is?
    
    The reason the classification apparently works well is their choice of representation of the document: a word histogram -- the occurrence count for each word. To measure the distance between two histograms, you usually use the chi-squared test.
    
    For a univariate space (or perhaps bivariate space) this will work, but now try implementing standard chi-square analysis in multivariate (or hyperspectral) space. It starts to fall short rather quickly, hence cluster analysis based on measures of distance between clusters.
    
    Image clustering is hard, and the problem comes from picking a good representation of the image.
    
    Yes, I do image clustering almost every day. Well, at least a couple times a week. With proper discriminands one can overcome "good image representation" problems.
    
    Of course, a "word histogram" for an image makes no sense.
    
    Actually, it does in a sense when you realize that images are simply matrices of numbers just like sentences or paragraphs can be identified as matrices after assigning lookup values to certain properties.
    
    Just considering pixel intensity or pixel color doesn't work either.
    
    Actually, yes it does. This is how many standard measures of image cluster analysis work.
    
    You usually have to start looking at things like lines, curvatures, intersections, texture patterns, etc.
    
    Actually, no. For many image classification algorithms that examine pixel value (oil bearing strata, concrete vs granite, types of aluminum in missiles etc...), structure or anatomy play absolutely no role in the identification of classes.
    
    Once you decide which tools you're going to use to describe an image and the algorithms to calculate them, you can start talking about how far away one image is from another, which then naturally leads to clustering techniques.
    
    That is a very difficult approach to take for image classification that begins to rely on machine processing and image "interpretation" which is a much higher order problem.
    
    But, the hard part about the clustering is getting them into a space in which they actually, nicely cluster.
    
    Simply add more discriminands or filters and don't worry about "describing" the image. Other properties (like structure and anatomy) fall out after image clustering.
    
    - Re:Vectors..... (Score:5, Informative)
      
      by Hays ( 409837 ) writes: on Wednesday May 19, 2004 @03:05AM (#9193166)
      
      You're being overly hard on the grandparent. He makes some good points. And naive image vectorization IS a problem. Eigenfaces only works with extremely careful registration of images, because the images are vectorized naively. Basically this means throwing out any notion of spatial coherence. (You could vectorize the image in random order, scanline order, whatever.. as long as you did it consistently across the data set you'd get the same bases out. Shouldn't a system understand that an image shifted one pixel to the right is not arbitrarily far from its original version?).
      
      See http://www.cs.columbia.edu/~jebara/papers/iccv03.pdf for a good argument about this
      
      And responding to another point of yours, classification algorithms that look only at intensity are at best brittle. In the real world things have to be better. You have to be able to recognize an object under different lighting, etc. The fact that you can design and calibrate a system well enough to work on pixel intensity alone in a few specific cases doesn't convince me that it's robust.
      
      That's not to say that you can't do some vision tasks with relatively simple metrics like intensity histograms or naively vectorized images, but really data representation is a major bottleneck for a lot of vision work. But you look like you're qualified to know that so I don't know why you're jumping down the grandparent's throat.
      
    - Re:Vectors..... (Score:3, Interesting)
      
      by RovingSlug ( 26517 ) writes:
      
      The magic doesn't come from vectors. Vectors are just how you throw the numbers around
      
      And your point is?
      Ah, that's the main point. Both the article and your original post focus on the fact that vectors are being used. While true, this doesn't really impact the essence of the algorithm -- effectively addressing the lower-level data structures instead of the higher-level algorithms. Perhaps an analogy might be someone describing Google's search by explaining B-trees instead of getting into what proce
  - Document Vectors - Term Weights (Score:3, Interesting)
    
    by agentofchange ( 640684 ) writes:
    
    Forgetting about vectors is silly.
    In short: a vector is the result of a calculation based on the number of times a term is used in a document and the terms in the other documents it is being compared with (the document set).
    The angle between document (email) vectors is a representation of their likeness. For example if the angle is very small the documents have a lot in common.
    This is how the mail app works. It compares known junk emails (i.e. the query) to the incoming document set (new emails).
    Th
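
This thread names two ways to compare documents once they are word-count histograms: the chi-squared comparison RovingSlug mentions and the angle between vectors that agentofchange describes. A minimal Python sketch of both follows. It assumes raw counts, the common symmetric form of the chi-squared distance, and made-up toy histograms; it makes no claim about what Mail.app actually computes.

```python
import math


def chi_squared_distance(a: list[float], b: list[float]) -> float:
    """Symmetric chi-squared distance between two non-empty histograms.

    Each histogram is first normalized to sum to 1 so that document
    length does not dominate the result.
    """
    total_a, total_b = sum(a), sum(b)
    distance = 0.0
    for x, y in zip(a, b):
        p, q = x / total_a, y / total_b
        if p + q > 0:  # skip words absent from both documents
            distance += (p - q) ** 2 / (p + q)
    return 0.5 * distance


def angle_degrees(a: list[float], b: list[float]) -> float:
    """Angle between two term vectors; a small angle means similar documents."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard rounding
    return math.degrees(math.acos(cosine))


# Toy histograms over the same five-word vocabulary (made-up counts).
known_junk = [3, 2, 0, 0, 1]
new_spam = [2, 2, 0, 1, 1]
new_ham = [0, 0, 4, 3, 0]
```

With these toy numbers both measures agree: `new_spam` sits much closer to `known_junk` than `new_ham` does, and `new_ham`, which shares no words with the junk sample, lands at a right angle to it.
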
i know how (Score:5, Funny)

by ShallowThroat ( 667311 ) writes: on Wednesday May 19, 2004 @01:04AM (#9192742)

it's simple. it uses its extremely uninspired app name to scare away spam.

- Re:i know how (Score:5, Funny)
  
  by jjeffries ( 17675 ) writes: on Wednesday May 19, 2004 @01:14AM (#9192786)
  
  I hear that the next version will be known as "mail-enhancemant.app"
  
subspaces? (Score:5, Funny)

by thedogcow ( 694111 ) writes: on Wednesday May 19, 2004 @01:04AM (#9192743)

The article mentions...

"In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions."

Funny. When I took linear algebra I was wondering if there was a practical approach to this, and I guess there is... to eliminate penis enlargement advertisements.

- Re:subspaces? (Score:2)
  
  by DrEasy ( 559739 ) writes:
  
  Not only do maths help eliminate penis enlargement ads, but they eliminate penis growth altogether.
- Re:subspaces? (Score:4, Funny)
  
  by Capt'n Hector ( 650760 ) writes: on Wednesday May 19, 2004 @02:12AM (#9192989)
  
  When I took linear algebra I was wondering if there was a practical approach to this
  If by "this" you mean spam filtering, then cool. But if you're talking about applications in general... Are you kidding? Linear algebra is probably the most useful stuff you'll ever learn, especially if you're into computers. It's the stuff CG is made of. EVERYTHING uses linear algebra.
  So here's a guess on how this works: So you've got your document vector. You also have a vector space, call it S for "spam". Choose your basis for S to be a bunch of words commonly found in spam. Now, orthogonally project your document vector into S, take the Euclidean norm and if it's too long -- zap it! It's spam!
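The guess above can be turned into a toy sketch. Everything here is hypothetical: the word list `SPAM_BASIS`, the threshold and the function names are made up for illustration, and this is the commenter's guess rather than what Mail.app actually does. Because each basis vector is a single word axis, the orthogonal projection simply keeps the counts for those words.

```python
import math
from collections import Counter

# Hypothetical basis for the "spam" subspace S: one coordinate axis per word.
SPAM_BASIS = {"viagra", "mortgage", "enlargement", "free"}

def spam_projection_norm(text, basis=SPAM_BASIS):
    """Euclidean norm of the word-count vector projected onto S."""
    counts = Counter(text.lower().split())
    # Projecting onto coordinate axes keeps only the coordinates in the basis.
    return math.sqrt(sum(counts[word] ** 2 for word in basis))

def is_spam(text, threshold=2.0):
    """Zap the message if its projection onto S is too long (threshold assumed)."""
    return spam_projection_norm(text) > threshold
```

Note that a raw norm grows with message length, so a long legitimate mail that says "free" a few times could cross the threshold; dividing by the norm of the whole document vector would be one way around that.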
  
Face recognition (Score:4, Informative)

by dysprosia ( 661648 ) writes: on Wednesday May 19, 2004 @01:06AM (#9192749)

I believe I remember reading somewhere that the same sort of vector/clustering calculations are used in face recognition software?

Just goes to show how solid math/calculations can have some useful applications!

- Re:Face recognition (Score:5, Informative)
  
  by moyix ( 412254 ) writes: on Wednesday May 19, 2004 @01:25AM (#9192835) Homepage
  
  Yes, for example, the eigenfaces method [mcgill.ca] converts each image into a vector, and constructs a new subspace based on the highest ranked common features between them (using Principal Component Analysis, aka the Karhunen-Loève Transform). Then new images are projected into this space and the shortest distance between the new vector and the previously computed ones is found.
  
  It was the first thing that popped into my head while reading the article too :)
  
...moderation ideas.... (Score:5, Funny)

by j3ll0 ( 777603 ) writes: on Wednesday May 19, 2004 @01:07AM (#9192751)

Why wouldn't a similar algorithm work to provide automated moderation? It seems to me that you could certainly identify clusters of words that indicate low-value posts?

- Re:...moderation ideas.... (Score:2, Funny)
  
  by pvt_medic ( 715692 ) writes:
  
  and by that token, i could create something that would get me modded up every time so i can get more karma so i can mod...
  
  oh automated mod... scratch that plan, i will have to figure something else out for world domination.
- Re:...moderation ideas.... (Score:5, Funny)
  
  by wheresdrew ( 735202 ) writes: on Wednesday May 19, 2004 @01:16AM (#9192793) Journal
  
  Yes, but the combination of too many all too common terms could cause the system to implode.
  "In Soviet Russia imagine a beowulf cluster of insensitive clods who don't RTFA because they're using linux to beat the GNAA to the first post."
  
Full text search goodness (Score:3, Interesting)

by vikman ( 695272 ) writes: on Wednesday May 19, 2004 @01:07AM (#9192754) Homepage

Now we understand why Apple is so good at doing full text searches and filesystem wide searches. I wish we had the same type of search functionality in Mozilla that Mail.app boasts of.
That is the one feature that Mozilla's mail client really could use.

n-space (Score:5, Funny)

by Anonymous Coward writes: on Wednesday May 19, 2004 @01:09AM (#9192756)

Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. This coordinate is then mapped onto a unique position in the goatse.cx photograph. If it lands in an objectionable region, the message is discarded as spam.

It's an interesting method, but not having Mail.app myself, what I'm wondering is how well it works on the border regions; that is, when it is just barely objectionable. Say, on his leg.

how does it compare to Bayesian? (Score:5, Interesting)

by the quick brown fox ( 681969 ) writes: on Wednesday May 19, 2004 @01:11AM (#9192765)

Is there any hard data out there that shows the cluster analysis actually improves on the better Bayesian [paulgraham.com] algos out there? After all, most of the good ones also achieve the 98%+ that this article cites.
According to the FAQ of SpamBayes (I think), they're always getting suggestions of ways to tweak their algos that would "obviously" improve the result, but in almost every case it either makes no difference or hurts accuracy, when actually tested on real data.

- Re:how does it compare to Bayesian? (Score:2, Interesting)
  
  by turkmenistani ( 638203 ) writes:
  
  But, like the article mentions, what happens when your grandma sends you an email mentioning viagra? Traditional Bayesian algorithms would automagically flag it as spam and delete it. The problem with traditional spam filters is that they might block all incoming spam, but they might also block something you might have wanted to read.
  - Re:how does it compare to Bayesian? (Score:2, Interesting)
    
    by lupin_sansei ( 579949 ) writes:
    
    No they wouldn't. Bayesian filters would see the word "viagra" and give that a high spam score, but all the other words that your Aunty used would probably have a very high ham score (not spam). Thus it would probably score the entire email as ham.
    
    That's the great thing about Bayesian filters, they score the entire email not just look for single keywords.
  - Re:how does it compare to Bayesian? (Score:5, Funny)
    
    by inburito ( 89603 ) writes: on Wednesday May 19, 2004 @01:23AM (#9192829)
    
    Wow. If your grandma is suggesting you viagra I think your problems go way deeper than Bayesian misfirings..
    
  - Re:how does it compare to Bayesian? (Score:3, Informative)
    
    by the quick brown fox ( 681969 ) writes:
    
    That actually tends not to happen. Most Bayesian filtering packages are weighted very conservatively, so that one or two highly non-spam tokens (like your grandma's e-mail address, or the name of the uncle who is on the little blue pill) will more than counterbalance the spam tokens.
    Again, what's intuitive doesn't play out in practice... this seems to be a common theme in the world of statistical spam filtering. For example, you'd think the word "free" would be pretty spammy... in my corpus, it only get
  - Re:how does it compare to Bayesian? (Score:5, Informative)
    
    by SimplyCosmic ( 15296 ) writes: on Wednesday May 19, 2004 @01:29AM (#9192851) Homepage
    
    Bayesian spam filtering doesn't mark an email as spam simply because of the presence of one single word, but uses a mathematical equation based on the likelihood of each of the words in the message being a symptom of spam. What you're talking about is simply a spam filter based on a blacklist of words. Bayesian spam filtering uses mathematics to consider how those words are used in the context of the rest of the message, and does a surprisingly good job of it.
    
    Therefore, "viagra" in your grandmother's email might have a high indication of spamminess, but all the other words will lower the score below the rather high threshold needed to be considered spam.
    
    That's why training your Bayesian spam filter on the email you receive is so important, as it learns what you consider spam from the type of email you receive.
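To see how one spammy word gets outvoted, here is a minimal sketch of the combining rule from Paul Graham's "A Plan for Spam", where the combined probability is prod(p) / (prod(p) + prod(1 - p)). The per-token probabilities and the 0.9 threshold below are invented for illustration; a real filter learns the probabilities from your own mail, and other packages combine evidence differently.

```python
import math

def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities: prod(p) / (prod(p) + prod(1 - p)).

    Each probability must lie strictly between 0 and 1.
    """
    # Sum logs instead of multiplying, so long messages do not underflow.
    log_spam = sum(math.log(p) for p in token_probs)
    log_ham = sum(math.log(1.0 - p) for p in token_probs)
    diff = log_ham - log_spam
    if diff > 700.0:  # exp() would overflow; the result is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# Hypothetical learned probabilities that a mail containing the token is spam.
grandma_mail = {"viagra": 0.99, "grandma": 0.01, "knitting": 0.05,
                "sunday": 0.10, "love": 0.20}
spam_mail = {"viagra": 0.99, "discount": 0.95, "click": 0.90}

SPAM_THRESHOLD = 0.9  # assumed cut-off
p_grandma = combined_spam_probability(grandma_mail.values())
p_spam = combined_spam_probability(spam_mail.values())
```

With these made-up numbers grandma's mail lands far below the threshold despite "viagra", while the three-token spam lands well above it.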
    
    - Re:how does it compare to Bayesian? (Score:3, Informative)
      
      by ghamerly ( 309371 ) writes:
      
      Your post is a bit misleading. It's true that the words are all considered together, but it's not true that they are considered "in context" in the sense that phrases are considered. The thing that makes Naive Bayes classifiers viable for most applications is that they are "naive", and do not consider phrases. Instead, each word is considered conditionally independent of every other word (conditioned on the class label, in this case spam or not spam). The "spamminess" of each word has an additive effect, an
- Re: (Score:3, Interesting)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re:how does it compare to Bayesian? (Score:2)
    
    by the quick brown fox ( 681969 ) writes:
    
    Did you mean LSA, for Latent Semantic Analysis?
    Anyway, yeah, I understand that. My question is whether, for the specific purpose of spam filtering, it results in improved performance, and if so whether it's been documented anywhere.
    The clustering stuff is certainly interesting for other purposes, and I'm glad there are people out there not only writing the software, but integrating it into the OS. The graphic and industrial designers aren't the only smart people at Apple.
  - Re:how does it compare to Bayesian? (Score:4, Informative)
    
    by martin-boundary ( 547041 ) writes: on Wednesday May 19, 2004 @03:06AM (#9193171)
    
    Bayesian filtering is a subset of what LSM can do.
    
    I'm sorry, but that's just completely wrong. Whoever is propagating this deserves a slap on the forehead.
    Bayesian theory is the most general possible form of rational decision making. *Any* rational method based on belief structures can be represented in a Bayesian form. This was shown by Richard Cox in about 1944.
    Here's an excerpt from this wikipedia article [wikipedia.org], to whet your appetite:
    
    1. Divisibility and comparability - The plausibility of a statement is a real number and is dependent on information we have related to the statement.
    
    2. Common sense - Plausibilities should vary sensibly with the assessment of plausibilities in the model.
    3. Consistency - If the plausibility of a statement can be derived in two ways, the two results must be equal.
    
    Any system of reasoning which satisfies those assumptions has a Bayesian version, and conversely. (Read the whole article if you want to argue edge cases).
    So, if LSA (you wrote LSM?) works, then it's only to the extent that there's an underlying Bayesian model which makes it work.
    
    - Re:how does it compare to Bayesian? (Score:5, Informative)
      
      by NoOneInParticular ( 221808 ) writes: on Wednesday May 19, 2004 @08:10AM (#9194067)
      
      You're absolutely right, but note however that what the grandparent calls 'Bayesian filtering' is referring to something that is more commonly known as 'naive Bayes': Bayesian inference with a set of extremely limiting assumptions. This technique is known in information retrieval as both the 'multinomial' and the 'multivariate' model of word frequency manipulation (which is which depends on how you store the evidence: only word occurrences or also word counts). In this sense, 'Bayesian filtering' is a very narrow subset of 'Bayesian inference' and it's completely possible, and even quite likely, that latent semantic analysis subsumes it.
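The two ways of storing the evidence can be shown side by side. This is a small sketch with made-up names and a made-up vocabulary; the mapping of presence/absence features to the multivariate (Bernoulli) model and of counts to the multinomial model follows the usual textbook naming, which is assumed to be what the comment has in mind.

```python
from collections import Counter

def occurrence_features(tokens, vocabulary):
    """Presence/absence per vocabulary word (multivariate Bernoulli style)."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def count_features(tokens, vocabulary):
    """Number of occurrences per vocabulary word (multinomial style)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and message.
vocabulary = ["free", "viagra", "lunch"]
tokens = "free free viagra".split()
```

Here `occurrence_features(tokens, vocabulary)` gives `[1, 1, 0]` and `count_features(tokens, vocabulary)` gives `[2, 1, 0]`; only the second representation records that "free" was repeated.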
      
- Re:how does it compare to Bayesian? (Score:3, Interesting)
  
  by wirelessbuzzers ( 552513 ) writes:
  
  It's pretty hard to compare algorithms, at least ones that might work, such as chi squared (SpamBayes) vs Bayesian (Plan for Spam, CRM114, lots more) vs point totals (SpamAssassin) vs cluster analysis (Mail.app).
  
  As for implementations, CRM114 [slashdot.org] kicks the shit out of Mail.app's filter, at least on my and my roommate's mixes. About the only thing that CRM114 hasn't caught for me is those 1-line virus spams with a .zip attached, and new classes of spam (last week I received my first stock spam). The false pos
- Re:how does it compare to Bayesian? (Score:3, Interesting)
  
  by Nuclear Elephant ( 700938 ) writes:
  
  98% is pretty pathetic - 1 error in 50. Most good Bayesian filters (SpamProbe, CRM114, DSPAM) can reach at least 99.9% (1 error in 1000) with ease. Others can grow far beyond this and reach as high as 99.985%, as a recent slashdot article [slashdot.org] covered (and this one [wired.com]). I reset my stats a few weeks ago, and out of 1800 spams so far, 0 have made it through. The only problem with Bayesian filtering is that it's mismarketed by companies who insist they have a better solution (although it's less accurate).
  
  And to ans
Nitpick on one of their recommendations (Score:3, Insightful)

by Logic Bomb ( 122875 ) writes: on Wednesday May 19, 2004 @01:11AM (#9192767)

You can also ask that your potential correspondents resend emails if they do not receive answers in a certain timeframe.

If the Junk Mail filter snagged a message the first time, it'll probably get it on subsequent tries too. If the message is legitimate, it probably can't be changed enough to make it through. It's a much better idea to check Junk Mail for legit messages and only empty it manually (or automatically for messages that are at least a week old).

Summary Service (Score:5, Interesting)

by spankalee ( 598232 ) writes: on Wednesday May 19, 2004 @01:13AM (#9192779)

Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.

If you haven't played with it, select a bunch of text (in a Cocoa app) and select Summary from the Services menu.

Very cool...

- Re:Summary Service (Score:5, Funny)
  
  by Mikey-San ( 582838 ) writes: on Wednesday May 19, 2004 @01:37AM (#9192878) Homepage Journal
  
  Input:
  Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.
  If you haven't played with it select a bunch of text (in a Cocoa app) and select Summary from the Services menu.
  Very cool...
  Output:
  Wow, the article just turned me on to the Summary Service. And I just used it to read a short and sweet summary of the article.
  If you haven't played with it select a bunch of text (in a Cocoa app) and select Summary from the Services menu.
  
  Wow, look at that! Impressive!
  (I actually love Summary Service, but I couldn't resist that joke.)
  
  - - Re:Summary Service (Score:3, Interesting)
      
      by nikster ( 462799 ) writes:
      
      below is the default output:
      
      In today's article of this three-part series, I'm going to fine-tune this strategy, plus take a closer look at Mail.app, so that you can more fully unleash its potential.
      
      ...Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system, developed in the Apple labs to help users who managed thousands or millions of large documents find the one they were looking for easily.
      
      ...The Apple data kit allows the user to find
Re: Bayesian Filtering (Score:2, Informative)

by Anonymous Coward writes:

The author is awfully dismissive of bayesian filtering, which works extremely well for me and for lots of other people. See mozilla, spam assassin, others.
os x's mail filter is great (Score:4, Interesting)

by squarefish ( 561836 ) * writes: on Wednesday May 19, 2004 @01:18AM (#9192802)

but it's a whole lot better with junkmatcher central [versiontracker.com]

Apple spam (Score:5, Interesting)

by seanadams.com ( 463190 ) * writes: on Wednesday May 19, 2004 @01:22AM (#9192826) Homepage

I have marked every single announcement and special offer i've ever received from Apple as junk, and yet the filter still refuses to classify them as such automatically.

I wonder if there's a loophole here that spammers could take advantage of: masquerade as Apple using the hole they've left in their filter. Spam Mac users to your heart's content. Bundle a Mac virus along with it for extra damage.

Please don't mod this down just because you like Macs. I like Macs too, but it really looks like there is a back door in the spam filter and I'm just reporting it - not mac bashing.

- Re:Apple spam (Score:4, Informative)
  
  by k_187 ( 61692 ) writes: on Wednesday May 19, 2004 @01:29AM (#9192852) Journal
  
  There is, Apple puts a rule in by default that stops Mail from evaluating any mail from apple. Well, there is in Panther, don't know if you caught that or not, but that might fix your problem.
  
- Re:Apple spam (Score:5, Informative)
  
  by timgoh0 ( 781057 ) writes: on Wednesday May 19, 2004 @01:31AM (#9192858)
  
  This behaviour is due to the rules set up in apple mail. To disable this behaviour, go to the mail preferences, select rules and remove the entry "news from apple"
  
- Re:Apple spam (Score:5, Informative)
  
  by .com b4 .storm ( 581701 ) writes: on Wednesday May 19, 2004 @01:35AM (#9192871)
  
  Did you check your "rules" preferences? Mail.app by default includes a rule to "Stop evaluating rules" for mail from a whole host of Apple e-mail addresses. I've never tried deleting it to see if I can get Apple mail to get filed as spam because... well, they e-mail me maybe twice a year and it's always been worth reading. But you might want to check out that rule, it could be what's fouling you up.
  
- Re:Apple spam (Score:2, Informative)
  
  by Libraryman ( 721151 ) writes:
  
  There could be a back door in the spam filter, but I have another [slightly] less sinister possibility.
  Mail.app ships with a preset filtering rule to color-label messages from Apple in blue. The junk filter may be set not to act on messages which are already being filtered (colored, flagged, moved to a specific folder) by one of your rules. Try deleting the rule to colorize the mail from Apple and see if it starts junk filtering it.
  Also worth noting, Apple will remove you from its mailing lists, any
It's Cyberdog! (Score:2, Interesting)

by Blackbrain ( 94923 ) writes:

Apple has finally brought Cyberdog [cyberdog.org] back!
Kickin it Apple Old School.
vs bayesian filters ? (Score:3, Informative)

by Bugmaster ( 227959 ) writes: on Wednesday May 19, 2004 @01:27AM (#9192841) Homepage

How does this technology compare to Bayesian filters such as PopFile [sourceforge.net] ? PopFile was not made by Apple, so clearly it doesn't have the cult appeal, but it has been working flawlessly for me for about a year now. What really irks me about this article is how it implies that Apple invented trainable filters -- where, in reality, this is very far from the truth. Apple does the same thing with pretty much everything it sells... sort of like Soviet Russia, who claimed to have invented flight, radio, transistors, and probably elephants too.

- Re:vs bayesian filters ? (Score:3)
  
  by diamondsw ( 685967 ) writes:
  
  RTF... Oh yeah, this is Slashdot. Nevermind...
Hmmm. Document visualization (Score:4, Insightful)

by mveloso ( 325617 ) writes: on Wednesday May 19, 2004 @01:34AM (#9192870)

I wonder if that data is accessible by 3rd parties. You could make "mail maps" that let you visualize the clustering of your incoming messages, and you could actually see the spam...by looking at the outliers and noise.

In fact, you could do this with any large data set. How about the feds looking for anomalous chunks of data in the bitstream? Anomalous stuff would just pop out, literally. This would make the TSA's job much, much easier. How about that?

Crystal clear ... erm ... (Score:5, Insightful)

by Too Much Noise ( 755847 ) writes: on Wednesday May 19, 2004 @01:40AM (#9192892) Journal

Then, we can do the Latent Semantic Analysis. In this new space, each axis is a weighted combination of all the words: documents and words coexist in the same space.

ok, got it - get a sparse point distribution, scrap the biggest common null subspace you find for the word matrices, then do some rotation to get meaningful combinations of these words ... or something (lexical analysis).

(further down ...)

Of course, systems that rely on such keywords are continuously updated and refined. Nevertheless, they are never entirely satisfying, even when using sophisticated Bayesian filters that are essentially weighted keyword systems.

so, weighted keyword systems (in particular Bayesian filters) are not so cool. Erm ... wait a minute, WTF???

ok, maybe this vector approach is something entirely new and leaves existing methods in the dust. But this article seems to be doing a relatively poor job at explaining why.

- Re:Crystal clear ... erm ... (Score:3, Informative)
  
  by martin-boundary ( 547041 ) writes:
  
  so, weighted keyword systems (in particular Bayesian filters) are not so cool. Erm ... wait a minute, WTF???
  ok, maybe this vector approach is something entirely new and leaves existing methods in the dust. But this article seems to be doing a relatively poor job at explaining why.
  
  Well, the article explains very poorly, but the approach isn't that new. Look up cluster analysis in google.
  Latent Semantic Analysis broadly works as follows:
  First, you plot all documents as points in space, by using each
Missing functionality (Score:5, Interesting)

by nsayer ( 86181 ) writes: <{moc.ufk} {ta} {reyasn}> on Wednesday May 19, 2004 @01:40AM (#9192895) Homepage

Here's the problem I have with mail.app's spam filtering:

I have several macs, and an IMAP server. The simple fact is that Mail.app doesn't share the filtering database. So the training winds up being sort of haphazard.

I suppose I should designate a particular machine to be the spam filtering IMAP client and have the rest of them not participate, but then I can't train on those subservient machines.

It'd be much better if multiple Mail.app IMAP clients could store their database on the server and share it.

- Re:Missing functionality (Score:3, Informative)
  
  by ezthrust ( 564219 ) writes:
  
  There might be something of use for you in this thread on macosxhints.com
  http://www.macosxhints.com/article.php?story=20030320162436823 [macosxhints.com]
  Although there is a warning that once this is done, Mail stops learning.
- - Re:Missing functionality (Score:4, Informative)
    
    by n8_f ( 85799 ) writes: on Wednesday May 19, 2004 @03:12AM (#9193193) Homepage
    
    How that big server-level database of yours supposed to work?
    
    Uhh, how do you get any mail that he doesn't? The data would be stored in one of the user's mail folders, just like an attachment. You completely misunderstood the parent poster. He accesses the same IMAP account from multiple different machines, but he has to train each one of his clients FOR THE SAME ACCOUNT. So he gets 10 messages to homer@doh.com and his machine at work filters out messages 1 and 2. He gets home, and his client filters out message 7. His laptop filters out message 9. They've each been trained to recognize some of the spam, but their training is incomplete because only one of the 3 clients is trained for each message that comes in. The only way to make it consistent would be to move all of the junk messages back into the Inbox and select them as junk in each mail client. Pretty crappy. And it gets unsalvageable when you mark a message as Not Junk on client 2 that client 1 marked as Junk. I have the same issue. I just leave my home client running most of the time, so it handles all of the filtering as new messages come in, and then I mark the ones it missed when I get home. But the parent is right, Mail should just store it on the IMAP server.
    
    Which brings up an interesting point. I tend to store all of my notes on my personal IMAP server as drafts, so I can get to it anywhere. Why don't any programs use IMAP to store data? Can you not access them at a byte level, but only as whole messages? I haven't looked at the IMAP protocol. Could it be combined with WebDAV for a unified data store? I would love to have a server that allowed me to keep all of my e-mail, documents, contacts, etc. in one place that I could access from anywhere.
    
Word disguises? (Score:3, Interesting)

by Piquan ( 49943 ) writes: on Wednesday May 19, 2004 @02:31AM (#9193054)

The big problem I see in spamland today isn't the classification technology. It's the word recognition problem. Sure, "VIAGRA" may be deeply embedded in a "spam" cluster, but what about "V1_4G ra"? If spammers weren't disguising their words, I think that Bayesian filtering and other techniques would work fine. I'm not really sure that more advanced techniques in word classification are really needed here.

This is probably off-topic (Score:5, Interesting)

by teamhasnoi ( 554944 ) writes: <teamhasnoiNO@SPAMyahoo.com> on Wednesday May 19, 2004 @02:53AM (#9193117) Journal

All my emails to a couple of people suddenly started bouncing with a 550 'Administrative Prohibition' error last week - at first I blamed my ISP, then blamed my host, then the receiving host, all for naught. I then found I was on a couple of blacklists (probably because I apparently shared a virtual host with a scummy mortgage guy), but these had no bearing (I learned later)
I had emails out to every link in the chain, but no one knew what was going on.
In Apple Mail, I had my 'reply to' names set to my email addys - I changed them to short descriptive names and now they're not bouncing anymore. (odd error, so I thought I'd post it)
Why this started all of a sudden, and why no host or ISP had heard of this before, I don't know.
I do know that being on a blacklist and attempting to get off of it is nigh impossible, so I'd be all over Apple making spam filtering software so overzealous wizards of blacklists [blars.org] can be kicked to the curb. (Why is this in use anywhere..?)

There's plenty of LSI information online (Score:5, Informative)

by K-Man ( 4117 ) writes: on Wednesday May 19, 2004 @03:11AM (#9193190)

Latent Semantic Indexing has been around for a while, and I've forgotten many of the details. As some have mentioned it's a dimension reduction technique, and the result is a set of eigenvectors, each of which describes a set of terms which correlate well with each other (or anticorrelate, I think components can be negative too).

In English terms, the technique finds sets of words that occur together in different subject areas, and gives them weights which reflect how often they occur together. For instance, "baseball" and "bat" may emerge as common companions in some documents, so they might get weights of 1.0 for both (in one eigenvector/topic) if they always occur together - meaning a query for "bat" should always return hits for "baseball" too. However if "bat" gets diluted by documents about flying animals, then its weight in the "baseball"-"bat" vector will be reduced, say to 0.5. Then queries for "bat" will not necessarily map to baseball documents, but to both areas, represented by different eigenvectors.

That's confusing enough, but LSI gives a clean method for managing all of these relative probabilities in a global space of word occurrence vectors. The "latent" part is how it discovers these topic areas automatically, by clustering words which occur together. This process is similar to data mining for common subsets, but with LSI the members of the subsets are actually weighted for significance.

Latent Semantic Analysis (Score:5, Informative)

by Henry Stern ( 30869 ) writes: <henry@stern.ca> on Wednesday May 19, 2004 @08:41AM (#9194254) Homepage

After reading through the comments here, it is obvious that there are some misconceptions about what Apple is doing.

Latent Semantic Indexing (LSI) was invented by Deerwester et al. [1] as a method of reducing the dimensionality of a text corpus by finding a low-rank approximation of the term-document matrix.

The singular value decomposition (SVD) [2] factors a matrix A into the product of two orthogonal matrices and a diagonal matrix, A = U'SV. To find a rank k approximation of A using this factorisation, create matrices U^, S^ and V^ where S^ contains the first k rows and columns of S, U^ contains the first k rows of U and likewise for V^. Then, let A^ = U^'S^V^. The Frobenius norm [3] of the difference A - A^ is minimal among all rank-k approximations of A (least squares).

Rather than storing the full matrix, A^, in practice it is much more common to save U^ and S^ and project the columns and rows of A into a k-dimensional space. This allows both terms and documents to be clustered together and helps to associate keywords with documents.
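For the smallest case, k = 1, the low-rank approximation can be sketched without any libraries by using power iteration to find the largest singular value and its vectors. This is a toy illustration under stated assumptions, not how a production LSI system computes the SVD: the function names and the 3x3 term-document matrix are made up, and the all-ones starting vector only works when it is not orthogonal to the top singular vector. In the code the rank-1 approximation is written as sigma * u * v^T.

```python
import math

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def frobenius(A):
    return math.sqrt(sum(x * x for row in A for x in row))

def top_singular_triplet(A, iterations=200):
    """Largest singular value sigma with left/right vectors u, v (power iteration)."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))  # one step of power iteration on A^T A
        n = norm(w)
        v = [x / n for x in w]
    Av = matvec(A, v)
    sigma = norm(Av)
    u = [x / sigma for x in Av]
    return sigma, u, v

def rank1_approximation(A):
    sigma, u, v = top_singular_triplet(A)
    return [[sigma * ui * vj for vj in v] for ui in u]

# Hypothetical term-document matrix: rows are terms, columns are documents.
A = [[2.0, 2.0, 0.0],   # "pills"
     [1.0, 1.0, 0.0],   # "cheap"
     [0.0, 0.0, 1.0]]   # "lunch"
A1 = rank1_approximation(A)
error = frobenius([[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, A1)])
```

For this matrix the singular values are sqrt(10) and 1, so the rank-1 approximation keeps the "pills"/"cheap" block and the Frobenius norm of the leftover is the discarded singular value, 1.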

You can do many things with these approximated document vectors: clustering, classification, document retrieval. Apple is probably using a k-nearest neighbour classifier [4] to determine how a message is to be filed.
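A k-nearest neighbour classifier over reduced document vectors fits in a few lines. The sketch below is hypothetical throughout: the 2-D coordinates, the "junk"/"inbox" labels, the choice of Euclidean distance and k = 3 are assumptions for illustration, and whether Apple does anything like this is only the guess made above.

```python
import math
from collections import Counter

def knn_classify(query, labelled_points, k=3):
    """Label `query` by majority vote among its k nearest labelled points."""
    nearest = sorted(labelled_points,
                     key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical messages already projected into a 2-dimensional LSI space.
training = [
    ((0.90, 0.10), "junk"), ((0.80, 0.20), "junk"), ((0.85, 0.05), "junk"),
    ((0.10, 0.90), "inbox"), ((0.20, 0.80), "inbox"), ((0.15, 0.95), "inbox"),
]

label_a = knn_classify((0.7, 0.3), training)  # sits near the junk cluster
label_b = knn_classify((0.2, 0.7), training)  # sits near the inbox cluster
```

`math.dist` needs Python 3.8 or later. Using an odd k with two classes avoids tied votes.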

I would be most interested to see Apple's updating strategy. There are several algorithms that allow you to add new rows and columns to a matrix where you know the full SVD, but none that I know of for the truncated SVD.

For one of my graduate-level courses, I wrote a little search engine that uses LSI to cluster 1000 newspaper articles. You can play with it here [stern.ca]. My favourite query is "Rowan Gorilla." The Rowan Gorilla is an oil rig that frequents Halifax harbour. The search engine returns articles on the oil and gas industry that contain neither the word "Rowan" nor "Gorilla" but are still topical.

[1] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990.

[2] Singular Value Decomposition -- from MathWorld. http://mathworld.wolfram.com/SingularValueDecomposition.html [wolfram.com]

[3] Frobenius Norm -- from MathWorld. http://mathworld.wolfram.com/FrobeniusNorm.html [wolfram.com]

[4] Artificial Intelligence Wiki: NearestNeighbour. http://www.ifi.unizh.ch/ailab/aiwiki/aiw.cgi?NearestNeighbor [unizh.ch]

- Re:Kinda like Mozilla Mail? (Score:5, Informative)
  
  by BWJones ( 18351 ) * writes: on Wednesday May 19, 2004 @01:10AM (#9192760) Homepage Journal
  
  In fact I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.
  
  Actually data clustering algorithms are completely different beasts than a standard bayesian analysis. Do a search on k-means clustering or ISODATA clustering methods to see what I mean. However, if you are referring to a bayesian cluster analysis (like those implemented for genetic analysis of microarrays) then you might be correct. Only for reasons you might not intend.
  
  - Re:Kinda like Mozilla Mail? (Score:2, Funny)
    
    by Anonymous Coward writes:
    
    reading that has clearly shown me for the first time why my friends/family complain when i talk technical about chemistry to them.
    
    And i thought i spoke english!
- GD, RTFA! (Score:5, Informative)
  
  by Zen Programmer ( 518532 ) writes: on Wednesday May 19, 2004 @01:10AM (#9192761)
  
  If you had read the article, you would know it uses vector representation and latent semantic analysis, not Bayesian filters, which in the words of the author, "are essentially weighted keyword systems."
  
- Re:Kinda like Mozilla Mail? (Score:3, Redundant)
  
  by Yaztromo ( 655250 ) writes:
  
  This spam filtering feature seems pretty similar to the one found in Mozilla Mail. In fact I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.
  
  Actually, if you read the article it specifically states that Mail's spam filtering is not like Mozilla Mail's. You use it in much the same manner, but the underlying technology is completely different.
  Yaz.
- Comment removed (Score:5, Funny)
  
  by account_deleted ( 4530225 ) writes: on Wednesday May 19, 2004 @01:17AM (#9192799)
  
  Comment removed based on user account deletion
  
- Sounds sufficiently different to me (Score:5, Interesting)
  
  by Anonymous Coward writes: on Wednesday May 19, 2004 @01:25AM (#9192834)
  
  Actually, from my understanding of it, it's fairly different.
  
  I thought Mozilla used Bayesian filtering (which you've mentioned), where each word in the email gets assigned a probability factor of being spam. These factors are totaled at the end; if the total is greater than some predefined value, the message is flagged as spam.
  
  What this does (in my understanding) is count the number of occurrences of each word in every email and store that in a huge table. Then it relates messages to each other based on these word counts. So it's like you get email clusters in N-dimensional space, where each axis is a word and an email's position on that axis is the number of times that email uses that word. Then the clusters that have a lot of spam mail in them are marked as spam clusters. All the emails in such a cluster are then assumed to be spam.
  
  The advantage of this method, I would suppose, is twofold:
  
  A) When you reduce the N-dimensional space, you would start by eliminating noise words (i.e. words that only occur in a single email). Spam emails that put fake words in to lower their spam probability under the Bayesian method would not benefit under this method.
  
  B) Messages are grouped by content, so it's possible that the client could group email by a common subject, kind of like automatic intelligent sorting. They do mention that this technology can be used to generate email summaries. So (in theory) not only could spam be sorted out, but so could any other key topics, like work, relatives, viagra purchases...
  
  At least that's my understanding of it.
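
For anyone who wants to poke at the word-count picture described above, here is a minimal Python sketch. It is not Apple's code and it skips the SVD step that real latent semantic analysis uses. The function names (`word_counts`, `drop_noise_words`, `centroid`, `cosine`, `classify`) and the toy messages are made up for illustration, and the "clusters" are simply the hand-labelled spam and non-spam groups rather than clusters discovered automatically.

```python
from collections import Counter
import math

def word_counts(text):
    """Turn a message into a bag-of-words count vector (one axis per word)."""
    return Counter(text.lower().split())

def drop_noise_words(vectors):
    """Remove words that occur in only one message (the 'noise words')."""
    doc_freq = Counter()
    for vec in vectors:
        doc_freq.update(vec.keys())
    return [Counter({w: c for w, c in vec.items() if doc_freq[w] > 1})
            for vec in vectors]

def centroid(vectors):
    """Average a group of count vectors into one cluster centre."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {w: c / len(vectors) for w, c in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def classify(message, centroids):
    """Label a message with the cluster whose centre it is closest to."""
    vec = word_counts(message)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Toy training data (hypothetical); each label stands in for one cluster.
labelled = {
    "spam": ["cheap pills buy now", "buy cheap watches now"],
    "ham": ["meeting notes for monday", "lunch on monday then meeting"],
}
labels = [label for label, msgs in labelled.items() for _ in msgs]
vectors = drop_noise_words(
    [word_counts(m) for msgs in labelled.values() for m in msgs])
centroids = {
    label: centroid([v for v, l in zip(vectors, labels) if l == label])
    for label in labelled
}

print(classify("buy cheap pills xqzzy plorp", centroids))  # spam
print(classify("monday meeting moved", centroids))         # ham
```

The made-up words in the first test message point along axes that no cluster centre uses, so they cannot pull the message toward the non-spam cluster. That is the behaviour point A is getting at.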
  
- Re:Kinda like Mozilla Mail? (Score:5, Informative)
  
  by DrSchlock ( 762271 ) writes: on Wednesday May 19, 2004 @01:39AM (#9192890)
  
  This spam filtering feature seems pretty similar to the one found in Mozilla Mail. In fact, I'd be willing to bet that it's just another Bayesian e-mail filter with maybe a few extra bells and whistles.
  
  Not exactly Bayesian, no. It's a different kind of document classification algorithm, which the article calls Latent Semantic Analysis. Basically they represent each message as a point in a high-dimensional space (based on the unordered words in the document), and figure out which parts of the space tend to be occupied by spam e-mails. This involves quite a lot of computation to determine a likely boundary between the parts of the space representing spam and non-spam messages, given only a collection of labeled points.
  
  To make this train and run reasonably quickly, they have to do dimensionality reduction on the space: they collapse dimensions which tend to be correlated or redundant or useless. (If "teens" and "gushing" generally appear together in messages, they probably don't need two separate dimensions; if "hi" is equally likely to appear in spam and non-spam, it may not need a dimension at all.)
  
  A naive-Bayes classifier is much simpler: Assuming that the probabilities of words in a document are all independent, it selects the document type (spam or non-spam) that maximizes the total probability of the observed words. There's no training beyond counting how often each word occurs with each document type.
  
  Naive Bayes typically works nearly as well as more complex methods, and runs much faster. But presumably Apple feels their LSA implementation is fast enough, and sufficiently more accurate than simpler techniques to be worthwhile.
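
The naive-Bayes side of the comparison is small enough to sketch in a few lines of Python. This is an illustrative toy, not the code of Mozilla or any other mail client. The names `train` and `classify` and the sample messages are hypothetical, and add-one smoothing is an assumption added here so that unseen words do not zero out a class.

```python
from collections import Counter
import math

def train(labelled_messages):
    """Count how often each word occurs with each label; that is all the training."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in labelled_messages:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Pick the label that maximizes the probability of the observed words."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Log prior for the label, then one log likelihood per word,
        # treating the words as independent (the "naive" assumption).
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing (an assumption of this sketch).
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set.
training = [
    ("spam", "buy cheap pills now"),
    ("spam", "cheap watches buy now"),
    ("ham", "meeting notes for monday"),
    ("ham", "lunch on monday"),
]
word_counts, doc_counts = train(training)

print(classify("cheap pills", word_counts, doc_counts))   # spam
print(classify("monday lunch", word_counts, doc_counts))  # ham
```

Summing logs instead of multiplying raw probabilities avoids underflow on long messages and does not change which label wins.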
  
- Re:Mail & IMAP (Score:4, Informative)
  
  by elbobo ( 28495 ) writes: on Wednesday May 19, 2004 @02:46AM (#9193093)
  
  doesn't ... have the ability to move my mail to a Junk folder on my IMAP server.
  
  Yes it does:
  
  Preferences -> Accounts -> Special Mailboxes -> Store junk messages on the server.
  
  My personal IMAP complaint is that you can't create rules to move messages between folders on the server, only folders on the client.
  
- Re:But you still get the spam... (Score:5, Interesting)
  
  by rudedog ( 7339 ) writes: <dave@NOspAm.rudedog.org> on Wednesday May 19, 2004 @11:28AM (#9195594) Homepage
  
  The sender would just receive a message from the mail server saying that their mail was marked as spam
  
  Sadly, if it is spam, then you'll be punishing thousands of innocent people whose email addresses have been forged by the spammers by sending them the bounce messages. Very little actual spam gets past my Bayesian filters, but I do get a lot of bounces from other people's spam filters for messages and viruses that I never sent.
  

