Contents

Why?

The current Chicago GLUG's Chump page is indexed with DBXML and searched by taking a users keyword query and converting it to an XPath query using the "contains()" function to find matches. While this is adequate for our groups needs, it is a very lacking implementation. That is partially why this project, to create a full-text search capability for our Chump's history which ranks results based relevance, was started.

The obvious question is, why not just use Lucene, or MySQL's full-text search, or any of the other possibilities? First, each of those would still require setting up and doing parsing on the XML documents, since we want relevance based on each Chump entry and not a document as a whole (which is based on days, not topics). Using MySQL as the storage for the index and Python for the implementation made this nearly as quick to complete as fiddling with any other solution. Additionally, it is now much easier to experiment with how the ranking works and add new features, since the basic indexing and searching is very small and simple. This gives us an interesting project to play with as a group. Also, Chump is real world data that is actually searched by users (not often, but it is). So, my selfish side of the project is that I am researching XML full-text retrieval and having this be a small, easily tweakable engine that I can try out new methods on will be useful.

Implementation Details

Tables

ENTRY

CREATE TABLE `entry` (
 `entry_id` int(11) NOT NULL auto_increment,
 `nick` char(50) default NULL,
 `date` char(50) default NULL,
 `entry` mediumtext,
 PRIMARY KEY  (`entry_id`)
)
entry_id nick date entry
1 kungfoog1ru Fri Jan 25 15:11:27 2008 <link>...</link>
... ... ... ..

ENTRY_TERM

CREATE TABLE `entry_term` (
 `entry_id` int(11) NOT NULL,
 `term` char(255) NOT NULL,
 `tf` int(11) NOT NULL,
 PRIMARY KEY  (`entry_id`,`term`)
)
entry_id term tf
1 fire 1
1 huffingtonpost 1
... ... ...
2 engadget 2
... ... ...

ENTRY_TERM_PROX

CREATE TABLE `entry_term_prox` (
 `entry_id` int(11) NOT NULL,
 `term` char(255) NOT NULL,
 `offset` int(11) NOT NULL,
 PRIMARY KEY  (`entry_id`,`term`,`offset`)
)
entry_id term offset
1 fire 15
1 huffingtonpost 10
... ... ...
2 engadget 10
2 engadget 29
... ... ...

IDF

CREATE TABLE `idf` (
 `term` char(255) NOT NULL default ,
 `idf` float default NULL,
 PRIMARY KEY  (`term`)
)

term idf
fire 3.46574
huffingtonpost 3.46574
... ...
engadget 3.46574
... ...

STOP_TERMS

CREATE TABLE `stop_terms` (
 `term` char(255) default NULL
)
term
a
an
and
...

QUERY

CREATE TEMPORARY TABLE `query` (
 `term` char(255), 
 `tf` INT, 
 PRIMARY KEY (`term`)
)
term
toothrot_
zombo

Queries

Searching/Ranking Query

SELECT d.entry_id, SUM(q.tf * i.idf * d.tf * i.idf), e.entry 
FROM query q, entry_term d, idf i, entry e 
WHERE q.term=i.term AND d.term=i.term AND e.entry_id=d.entry_id 
GROUP BY d.entry_id ORDER BY 2 DESC

Todo

  • Integrate with our Chump web interface
  • Use Prepared Statements (other SQL efficiency changes?)
  • Index contents of all links in Chump entries, as if it were content of that entry
  • Add stemmer, Porter Stemmer
  • Implement alternative ranking algorithms
    • Increased weight for matching in Title?
    • Increased weight for match in linked content vs. comment or link in comment?
  • Describe a bit about what the fields in these tables mean.

Download

Here is the first release of the Chump Search Engine, currently called py_indexer but that will change. I have little, basically no, Python experience. So, please look it over and make suggestions. Requires MySQL, Python and BeautifulSoup to run. It comes with two test files to show how to use the indexer and the search, plus a sample XML file taken from Chump.

References

Powered by MediaWiki