Representative keywords of DPL platforms

Are the DPL platforms too long, and could you use a very, very short executive summary? No problem: I have the technology for it.

After the results you can find the kit to build yourself an extractor in the comfort of your home.

The results

Acquiring the data

for i in 93sam aigarius ajt hertzog sho sjr stratus svenl wouter
do
    wget http://www.debian.org/vote/2007/platforms/$i
done

Tokenizing

#!/bin/sh

for file in "$@"
do
    lynx -dump -stdin < "$file" | tr -c '[a-zA-Z]' ' ' | tr '[A-Z]' '[a-z]' | sed -e 's/ /\n/g' | sed -e '/^$/d' > "$file.tok"
done

Extracting the most representative keywords

#!/usr/bin/python

import sys, math

def read_tokens(file):
    "Read all the tokens from one file"
    return [ line[:-1] for line in open(file) ]

# Read all the "documents"
docs = [ read_tokens(file) for file in sys.argv[1:] ]

# Aggregate token counts
aggregated = {}
for d in docs:
    for t in d:
        if t in aggregated:
            aggregated[t] += 1
        else:
            aggregated[t] = 1

def tfidf(doc, tok):
    "Compute a TFIDF-like score: term frequency times log(number of documents / total occurrences of the token in the corpus)"
    return doc.count(tok) * math.log(float(len(docs)) / aggregated[tok])

# Output the top 5 tokens by TFIDF for every document
for name, doc in zip(sys.argv[1:], docs):
    print name, sorted(set(doc), key=lambda tok: tfidf(doc, tok), reverse=True)[:5]
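
To run the whole thing, feed the tokenizer the downloaded pages and the extractor the resulting .tok files. The script names below (tokenize.sh and extract.py) are placeholders: the post does not name the two snippets, so substitute whatever you saved them as.

# Hypothetical script names for the two snippets above
sh tokenize.sh 93sam aigarius ajt hertzog sho sjr stratus svenl wouter
python extract.py *.tok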

Errata

Jacobo suggests using lynx -dump -nolist or w3m -dump for a more tokenizer-friendly text expansion.
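
Here is a sketch of the tokenizing loop with that suggestion applied; only the lynx invocation changes, the rest of the pipeline stays the same:

#!/bin/sh

# Same tokenizer as above, but -nolist drops lynx's numbered link list,
# so URLs from the reference section do not end up as tokens.
for file in "$@"
do
    lynx -dump -nolist -stdin < "$file" | tr -c '[a-zA-Z]' ' ' | tr '[A-Z]' '[a-z]' | sed -e 's/ /\n/g' | sed -e '/^$/d' > "$file.tok"
done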