The bibliography

The website was designed with a simple alphabetical navigation system: you can browse authors, journals and so on by initial letter, and it was easy enough to find what you wanted. The site converts TeX to HTML on the fly using a recursive regular expression, which makes me feel both guilty and proud. But now, after 12 years, there are about 13,000 entries and the navigation has become unwieldy. They want a real search.
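
To give a flavour of the nested-markup problem, here is a minimal Python sketch. It is not the site's actual code: Python's built-in `re` module has no recursive patterns, so this version swaps in a different technique and repeatedly rewrites the innermost `\command{...}` group until nothing changes. The command-to-tag mapping in `TAGS` is an assumption made up for the example.

```python
import re

# Hypothetical mapping of TeX commands to HTML tags (assumed for illustration).
TAGS = {"emph": "em", "textit": "i", "textbf": "strong"}

# Matches a known command whose argument contains no braces at all,
# i.e. the innermost group at any level of nesting.
INNERMOST = re.compile(r"\\(" + "|".join(TAGS) + r")\{([^{}]*)\}")


def _to_tag(match):
    """Wrap the command's argument in the corresponding HTML tag."""
    tag = TAGS[match.group(1)]
    return f"<{tag}>{match.group(2)}</{tag}>"


def tex_to_html(text):
    """Rewrite innermost groups repeatedly until the text stops changing."""
    while True:
        converted = INNERMOST.sub(_to_tag, text)
        if converted == text:
            return converted
        text = converted


print(tex_to_html(r"\emph{a \textbf{b} c}"))
```

Unknown commands and bare brace groups are left untouched, so a real converter would need a much longer list of rules.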
Search engines
I threw together a very quick search using a few SQL queries (did I mention I work one day per week, with support queries from ~20 users to keep happy?) that did a tolerable job, but really they want to be able to specify multiple fields, use wildcards and all that sort of thing. Searching the raw database tables won’t work because of all the TeX markup.
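
A small illustration of the markup problem, using an in-memory SQLite database. The table name, column name and sample titles are invented for the example and say nothing about the real schema.

```python
import sqlite3

# In-memory database with a hypothetical "entries" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO entries (title) VALUES (?)",
    [
        (r"The \emph{Odyssey} in Irish",),
        (r"Notes on a \textbf{lost} manuscript",),
    ],
)


def search(term):
    """Naive substring search over the raw (TeX-encoded) titles."""
    rows = conn.execute(
        "SELECT title FROM entries WHERE title LIKE ?", (f"%{term}%",)
    )
    return [row[0] for row in rows]


# A single word is found, but a phrase that spans the markup is not,
# because the stored text contains "Odyssey} in", not "Odyssey in".
print(search("Odyssey"))
print(search("Odyssey in"))
```

The second query returns nothing even though a reader of the rendered page would see that phrase in the title.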
A quick Google later, my first stop was Apache Solr, which seems to be one of the best-known search engines there is. It has all the features I could think of, and lots of features I hadn’t thought of. My first thought was that it’s huge. It’s a standalone Java application, the zipped download alone is nearly 150MB, and there is a pile of documentation. I succeeded in building an XML-formatted output (including lots of embedded HTML) from the bibliography and got it into Solr, and tinkered with some searches.
In the end, though, I couldn’t shake the feeling that Solr was just too big. There was a lot of configuration. Just setting up Solr as a standalone daemon was a task in itself. Many of the potentially useful features were killed by the nature of our data; language-specific stemming and so on doesn’t work very well if you’ve got English, Irish and Greek mixed together in the one title.
So I started to look around again and came across Xapian. It’s smaller and more lightweight, has the features I need, and has direct bindings for several languages, including PHP, which is what I need. From what I can tell this means I don’t need a separate daemon. The documentation is written with code samples in Python (yay!), and I’m planning to use xml.etree.ElementTree to re-use my existing XML output and stick it into Xapian.
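
Roughly what that plan might look like, as a sketch only. The XML layout (`entries`, `entry`, `author`, `title`, an `id` attribute) is assumed, since the real export format isn't shown here, and the `A`, `S` and `Q` term prefixes follow the usual Xapian conventions for author, title and unique ID. The indexing function needs the Xapian Python bindings installed; the parsing part uses only the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of the XML export, with embedded HTML in the title.
SAMPLE = """<entries>
  <entry id="e1">
    <author>Murphy, G.</author>
    <title>Early <i>Irish</i> lyrics</title>
  </entry>
</entries>"""


def flatten(elem):
    """Return an element's text with embedded tags dropped and whitespace collapsed."""
    if elem is None:
        return ""
    return " ".join("".join(elem.itertext()).split())


def parse_entries(xml_text):
    """Yield one dict of plain-text fields per <entry> element."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        yield {
            "id": entry.get("id"),
            "author": flatten(entry.find("author")),
            "title": flatten(entry.find("title")),
        }


def index_entries(entries, db_path):
    """Index the parsed entries into a Xapian database at db_path."""
    import xapian  # imported here so the parsing code works without it

    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    termgenerator = xapian.TermGenerator()  # no stemmer: titles mix languages
    for entry in entries:
        doc = xapian.Document()
        termgenerator.set_document(doc)
        termgenerator.index_text(entry["author"], 1, "A")  # author field
        termgenerator.index_text(entry["title"], 1, "S")   # title field
        termgenerator.index_text(entry["title"])           # general search
        doc.set_data(entry["title"])
        idterm = "Q" + entry["id"]
        doc.add_boolean_term(idterm)
        db.replace_document(idterm, doc)  # re-running updates, not duplicates
    db.commit()


print(list(parse_entries(SAMPLE)))
```

Calling `index_entries(parse_entries(SAMPLE), "bibliography.db")` would then build the index in-process, with no daemon involved; the database path is just a placeholder.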
I said I’d have a very basic working example on one or two fields in about three weeks (which for me is the cumulative “spare” time from three busy working days). Wish me luck…