Imagine your boss comes in one day and says, "We have over 100,000 web pages on our site, and 10,000 of them are from spammers. I need you to go through the list and figure out which pages are spam and which are genuine." How do you accomplish this task without going crazy? Wouldn't it be great if your computer just told you whether a web page was spam or not? Well, it can. Just give it some initial training and you'll have your own digital secretary in no time.
This is all possible through the CRM114 Discriminator, a machine-learning tool that classifies data according to samples you provide. In our case, we first feed it documents that are known to be "spam", then feed it documents that are known to be "genuine". In these two steps, we are "training" the program to recognize the difference between spam and genuine web pages. Finally, for any unknown document, we run it through CRM114's "classify" function, which guesses which group the document belongs to ("spam" or "genuine") and how probable that is, based on the training data.
Trying it out
Take a look at the sample code below. It uses Sam Dean's wrapper library, which provides an easy-to-use Python interface to the CRM114 Discriminator.
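import crm
# Create a classifier with two categories, backed by a data directory
c = crm.Classifier("/Users/iamthecheese/Desktop/crm_test_data", ["genuine", "spam"])
# Train it with one known example per category
c.learn("genuine", "did you see that jean claude van dam movie?")
c.learn("spam", "Jean claude van dam uses viagra, you should too, here's how...")
# Ask which category an unseen phrase most likely belongs to
c.classify("I went to see that movie about the dam today")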
If you type that into the Python interactive command prompt and all goes well, you should see the last command return to you:
('genuine', 0.65529999999999999)
This means that, based on the training data given to CRM114, the phrase "I went to see that movie about the dam today" has roughly a 65% chance of being genuine. Pretty cool, huh? Applying this to our spam problem, just find 20 pages that you know are spam and 20 pages you know are genuine, train CRM114 with this set of data, and unleash it on the remaining 99,960 or so pages. It'll save you a lot of time, and you can use your "personal classification secretary" for plenty of other problems in the future as well.
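Here is a rough sketch of what that train-then-unleash workflow could look like in Python. It only assumes an object with the same learn(category, text) and classify(text) methods used in the sample above. The train and triage helpers, the min_confidence cutoff, and the "review" bucket are my own illustrative names, not part of the wrapper library, and WordOverlapStub is a toy stand-in (nothing like CRM114's real algorithms) so that the sketch runs even without CRM114 installed.

```python
def train(classifier, genuine_docs, spam_docs):
    """Feed known examples of each category to the classifier."""
    for text in genuine_docs:
        classifier.learn("genuine", text)
    for text in spam_docs:
        classifier.learn("spam", text)


def triage(classifier, unknown_docs, min_confidence=0.9):
    """Sort documents into genuine, spam, and needs-a-human buckets.

    min_confidence is a hypothetical cutoff: anything the classifier
    is less sure about goes to "review" instead of being trusted.
    """
    buckets = {"genuine": [], "spam": [], "review": []}
    for text in unknown_docs:
        category, probability = classifier.classify(text)
        key = category if probability >= min_confidence else "review"
        buckets[key].append(text)
    return buckets


class WordOverlapStub:
    """Toy stand-in for crm.Classifier; it just counts shared words."""

    def __init__(self):
        self.words = {"genuine": set(), "spam": set()}

    def learn(self, category, text):
        self.words[category].update(text.lower().split())

    def classify(self, text):
        tokens = set(text.lower().split())
        scores = {c: len(tokens & w) for c, w in self.words.items()}
        total = sum(scores.values())
        if total == 0:
            # No known words at all: report a coin flip
            return ("genuine", 0.5)
        best = max(scores, key=scores.get)
        return (best, scores[best] / total)
```

To use it with the real thing, you would pass the crm.Classifier object from the sample above in place of the stub. Anything that lands in the "review" bucket is a page you'd still want to eyeball yourself.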
Installing the CRM114 Discriminator on Mac OS X
So now that you're hooked, let's get this program installed on Mac OS X. Unfortunately, there is currently no MacPorts port for the CRM114 Discriminator, so you'll have to do some digging through Makefiles to get everything working. Here's how to build and install the program from source.
  1. First off, install TRE, a regex library that CRM114 depends on, using MacPorts
    sudo port install tre
  2. Get the source code for CRM114
    cd ~/Desktop/some_folder
    wget http://crm114.sourceforge.net/src/
  3. Modify the "Makefile" under the src directory by replacing the following line:
    prefix?=/usr     should become -->    prefix?=/opt/local
    commenting out the following line:
    LDFLAGS += -static -static-libgcc
    and uncommenting the following lines:
    CFLAGS += -I/opt/local/include -I${HOME}/include
    LDFLAGS += -L/opt/local/lib -L${HOME}/lib
    LIBS += -lintl -liconv
  4. Now save the Makefile and run the make and make install commands in the src directory
    make && make install
  5. Congratulations, now you've got the CRM114 Discriminator installed on your computer! If everything went correctly, you should be able to run the following command in the terminal to get the current version
    crm -v
  6. Finally, to use the above sample code, go download Sam Dean's Python CRM114 wrapper library and put it in a place where you can import it from Python. (The site's login/password is "guest".)
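If you'd rather not edit the Makefile by hand, the changes from step 3 can be scripted. The sketch below is only an illustration: patch_makefile is a name I made up, and it assumes the lines appear in the Makefile exactly as written in step 3, with the commented-out ones prefixed by "#". Check the result against your copy of the Makefile before running make.

```python
def patch_makefile(text):
    """Apply the step 3 edits to Makefile text and return the result."""
    # Assumption: these lines exist in the Makefile, commented out with "#"
    to_uncomment = (
        "CFLAGS += -I/opt/local/include -I${HOME}/include",
        "LDFLAGS += -L/opt/local/lib -L${HOME}/lib",
        "LIBS += -lintl -liconv",
    )
    to_comment = "LDFLAGS += -static -static-libgcc"

    patched = []
    for line in text.splitlines():
        stripped = line.strip()
        bare = stripped.lstrip("#").strip()
        if stripped == "prefix?=/usr":
            # Install under the MacPorts prefix instead of /usr
            line = "prefix?=/opt/local"
        elif stripped == to_comment:
            # Disable static linking
            line = "#" + line
        elif stripped.startswith("#") and bare in to_uncomment:
            # Enable the MacPorts include and library paths
            line = bare
        patched.append(line)
    return "\n".join(patched) + "\n"
```

You could read the Makefile with open("Makefile").read(), pass the text through patch_makefile, and write the result back out. Keep a backup copy of the original file first.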
This piece of software uses many cool machine-learning classification techniques which are beyond my ability to explain here. If you're interested, you can read more about the algorithms below: