apt-xapian-index: smart way of querying tags

I've recently posted:

Note that I've rewritten all the old posts to only show the main code snippets: if you were put off by the large lumps of code, you may want to give it another go.

Today I'll show how to implement a really good way of searching for Debtags tags. When I say really good, I mean the sort of good where, after you run it, you wonder how it could possibly manage to do that.

The idea is simple: you run a package search, but instead of showing the resulting packages, you ask Xapian to suggest tags, as we saw in axi-query-expand.py.

For extra points, I'll use an adaptive cutoff when choosing the packages that go in the rset.

So, let's ask the user to enter some keywords to look for tags, and use them to run a normal package query:

# Build the base query
query = xapian.Query(xapian.Query.OP_OR, termsForSimpleQuery(args))

# Perform the query
enquire = xapian.Enquire(db)
enquire.set_query(query)

Now, instead of showing the results of the query, we ask Xapian which tags in the index are most relevant to this search.

First, we pick some representative packages for the expand:

# Use an adaptive cutoff to avoid picking bad results as references:
# only keep matches scoring at least 70% of the top result's weight
matches = enquire.get_mset(0, 1)
topWeight = matches[0].weight
enquire.set_cutoff(0, topWeight * 0.7)

# Select the first 10 documents as the key ones to use to compute relevant
# terms
rset = xapian.RSet()
for m in enquire.get_mset(0, 10):
    rset.add_document(m.docid)
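
The effect of the adaptive cutoff can be seen without Xapian at all. What follows is a plain-Python stand-in working on a made-up list of (docid, weight) pairs: the function name pick_reference_docs and the sample weights are assumptions for illustration only, while the 0.7 factor and the limit of 10 come from the snippet above.

```python
def pick_reference_docs(results, factor=0.7, limit=10):
    """Return the docids of up to `limit` results whose weight is at least
    `factor` times the weight of the best result.

    `results` is a list of (docid, weight) pairs sorted by decreasing
    weight, mimicking what a match set would contain.
    """
    if not results:
        # No matches at all: nothing can serve as a reference
        return []
    top_weight = results[0][1]
    threshold = top_weight * factor
    kept = [docid for docid, weight in results if weight >= threshold]
    return kept[:limit]


# Made-up results: two strong matches followed by a tail of weak ones
sample = [(101, 20.0), (102, 15.0), (103, 9.0), (104, 8.5), (105, 2.0)]
reference_docs = pick_reference_docs(sample)
print(reference_docs)
```

With a plain "first 10" rule, documents 103 to 105 would also have been used as references even though they match much less well than the top two; with the cutoff they are left out.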

Then we define the filter that only keeps tags:

# Filter out all the keywords that are not tags
class Filter(xapian.ExpandDecider):
    def __call__(self, term):
        "Return true if we want the term, else false"
        return term[:2] == "XT"
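
To see what the filter keeps and what it throws away, here is a small sketch of the same prefix test applied to a plain list of terms. The sample terms are made up for illustration, and the helper names is_tag_term and tag_name are hypothetical; only the "XT" prefix for tags comes from the code above.

```python
TAG_PREFIX = "XT"


def is_tag_term(term):
    """Same test as the Filter class: keep only terms prefixed with XT."""
    return term.startswith(TAG_PREFIX)


def tag_name(term):
    """Strip the prefix to get back the tag as a user would write it."""
    return term[len(TAG_PREFIX):]


# Made-up terms: plain words, other prefixed terms, and two tag terms
sample_terms = [
    "dungeon",
    "Zdungeon",
    "XTgame::rpg",
    "XPnethack",
    "XTuse::gameplaying",
]
tags = [tag_name(t) for t in sample_terms if is_tag_term(t)]
print(tags)
```

Everything that is not a tag term is rejected, so the expansion set can only ever contain tags.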

Then we print the tags:

# This is the "Expansion set" for the search: the 10 most relevant terms that
# match the filter
eset = enquire.get_eset(10, rset, Filter())

# Print out the results
for res in eset:
    print "%.2f %s" % (res.weight, res.term[2:])

That's it. We turned a package search into a tag search, and this allows us to search for tags using keywords that are not present in the tag descriptions at all:

$ ./axi-query-tags.py explore the dungeons
27.50 game::rpg:rogue
26.14 use::gameplaying
17.53 game::rpg
10.27 uitoolkit::ncurses
...

$ ./axi-query-tags.py total world domination
7.55 use::gameplaying
5.68 x11::application
5.35 interface::x11
5.05 game::strategy
...
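
To build some intuition for why keywords that never appear in a tag description can still lead to the right tags, here is a deliberately naive stand-in for the expand step. It is not Xapian's weighting formula: it only counts how many of the reference documents carry each tag. The function name suggest_tags and the sample documents are made up for illustration.

```python
from collections import Counter


def suggest_tags(reference_docs, limit=10):
    """Count in how many reference documents each tag term appears and
    return the most frequent tags, with the XT prefix stripped.

    `reference_docs` is a list of term lists, one per document.
    """
    counts = Counter()
    for terms in reference_docs:
        # Use a set so that each document counts at most once per tag
        counts.update({t[2:] for t in terms if t.startswith("XT")})
    return counts.most_common(limit)


# Made-up reference documents: the words come from package descriptions,
# the XT terms are the tags attached to the same packages
docs = [
    ["dungeon", "explore", "XTgame::rpg", "XTuse::gameplaying"],
    ["dungeon", "monster", "XTgame::rpg", "XTuitoolkit::ncurses"],
    ["cave", "explore", "XTuse::gameplaying", "XTgame::rpg"],
]
suggestions = suggest_tags(docs)
for tag, count in suggestions:
    print("%d %s" % (count, tag))
```

The keywords select packages through their descriptions, and the tags come along because they are indexed in the same documents. Xapian does the same kind of thing with a proper relevance weighting instead of a raw count.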

You can use the wsvn interface to get to the full source code and the module it uses.

You can see a similar technique working in the Debtags tag editor: enter a package, then choose "Available tags: search".

Next in the series: search as you type.