Representative keywords of DPL platforms

Are the DPL platforms too long, and could you use a very, very short executive summary? No problem: I have the technology for it.

After the results you can find the kit to build yourself an extractor in the comfort of your home.

The results

Acquiring the data

for i in 93sam aigarius ajt hertzog sho sjr stratus svenl wouter
do
    wget http://www.debian.org/vote/2007/platforms/$i
done

Tokenizing

#!/bin/sh

for file in "$@"
do
    lynx -dump -stdin < "$file" | tr -c '[a-zA-Z]' ' ' | tr '[A-Z]' '[a-z]' | sed -e 's/ /\n/g' | sed -e '/^$/d' > "$file.tok"
done

Extracting the most representative keywords

#!/usr/bin/python

import sys, math

def read_tokens(file):
    "Read all the tokens from one file"
    return [ line[:-1] for line in open(file) ]

# Read all the "documents"
docs = [ read_tokens(file) for file in sys.argv[1:] ]

# Aggregate token counts
aggregated = {}
for d in docs:
    for t in d:
        if t in aggregated:
            aggregated[t] += 1
        else:
            aggregated[t] = 1

def tfidf(doc, tok):
    "Compute a TFIDF-like score: term frequency times log(number of documents / total occurrences of the token in the corpus)"
    return doc.count(tok) * math.log(float(len(docs)) / aggregated[tok])

# Output the top 5 tokens by TFIDF for every document
for name, doc in zip(sys.argv[1:], docs):
    print name, sorted(set(doc), key=lambda tok: tfidf(doc, tok), reverse=True)[:5]
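
To run the whole thing, feed the tokenizer the downloaded pages and the extractor the resulting .tok files. The script names below (tokenize.sh and extract.py) are placeholders: the post does not name the two snippets, so substitute whatever you saved them as.

# Hypothetical script names for the two snippets above
sh tokenize.sh 93sam aigarius ajt hertzog sho sjr stratus svenl wouter
python extract.py *.tok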

Errata

Jacobo suggests using lynx -dump -nolist or w3m -dump for a more tokenizer-friendly text expansion.
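
Here is a sketch of the tokenizing loop with that suggestion applied; only the lynx invocation changes, the rest of the pipeline stays the same:

#!/bin/sh

# Same tokenizer as above, but -nolist drops lynx's numbered link list,
# so URLs from the reference section do not end up as tokens.
for file in "$@"
do
    lynx -dump -nolist -stdin < "$file" | tr -c '[a-zA-Z]' ' ' | tr '[A-Z]' '[a-z]' | sed -e 's/ /\n/g' | sed -e '/^$/d' > "$file.tok"
done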