One of the projects I'm working on is a bibliography, a painstakingly produced list of publications for a specific research area. It wouldn't be so bad if it were all English-language, but much of it relates to linguistics. There are diacritics all over the place (sometimes stacked on top of each other), mixtures of languages and alphabets within the one title, individually italicised words, small-caps, superscript, subscript, and combinations of the above. This project started 12 years ago with the intention that, when complete, it would be a printed book. As a result, the markup language chosen to handle this mess was LaTeX (remember, this was well before the days of Markdown or similar). There could be another ten years of work to go.
The website was designed with a simple alphabetical navigation system — you can browse through authors or journals, etc., by initial letter, and it was easy enough to find what you wanted. The site converts TeX to HTML on the fly using a recursive regular expression that makes me feel both guilty and proud. But now, after 12 years, there are about 13,000 entries and the navigation has become unwieldy. They want a real search.
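The post doesn't show the actual conversion code, but the idea can be sketched with the standard library alone: rather than a single recursive pattern, a simple regex that matches only the innermost `\command{...}` can be applied repeatedly until nothing changes, which handles nesting. The command set and tag mapping here are illustrative assumptions, not the real site's rules.

```python
import re

# Illustrative subset of TeX commands and the HTML tags they map to.
TAGS = {"emph": "i", "textit": "i", "textbf": "b"}

# Matches a \command{...} whose braces contain no further braces, so
# repeated substitution converts the innermost markup first.
INNER = re.compile(r"\\(emph|textit|textbf)\{([^{}]*)\}")

def tex_to_html(s):
    """Convert nested TeX font commands to HTML, innermost first."""
    while True:
        s, n = INNER.subn(
            lambda m: "<{0}>{1}</{0}>".format(TAGS[m.group(1)], m.group(2)), s
        )
        if n == 0:  # no more TeX commands left to rewrite
            return s
```

For example, `tex_to_html(r"\emph{a \textbf{b} c}")` rewrites the inner `\textbf` on the first pass and the surrounding `\emph` on the second.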
I threw together a rudimentary search using a few quick SQL queries (did I mention I work one day per week, with ~20 users' support queries to keep happy?) that did a tolerable job, but really they want to be able to specify multiple fields, wildcards and all that sort of thing. Searching the raw database tables won't work because of all the TeX markup.
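To see why raw-table searching fails, here's a minimal sketch with an in-memory SQLite table (the table name, column and sample title are made up for illustration): a `LIKE` query over the stored TeX never matches the accented form a user would actually type.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bib (title TEXT)")
# Titles are stored with their TeX markup intact, e.g. \'{o} for ó.
con.execute("INSERT INTO bib VALUES (?)", (r"Focl\'{o}ir Gaeilge--B\'{e}arla",))

def search(term):
    """Naive substring search over the raw, TeX-marked-up column."""
    cur = con.execute(
        "SELECT title FROM bib WHERE title LIKE ?", ("%" + term + "%",)
    )
    return [row[0] for row in cur.fetchall()]
```

`search("Foclóir")` returns nothing, because the diacritic is spelled `\'{o}` in the stored text, while the markup-free prefix `search("Focl")` does match — exactly the mismatch that makes a proper search index necessary.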
A quick Google later, my first stop was Apache Solr, which seems to be one of the best-known search engines around. It has all the features I could think of, and lots of features I hadn't thought of. My first impression was that it's huge. It's a standalone Java application, the zipped download alone is nearly 150MB, and there's a pile of documentation. I succeeded in building an XML-formatted output (including lots of embedded HTML) from the bibliography, got it into Solr, and tinkered with some searches.
In the end, though, I couldn't shake the feeling that Solr was just too big. There was a lot of configuration, and just setting it up as a standalone daemon was a task in itself. Many of the potentially useful features were killed by the nature of our data; language-specific stemming, for instance, doesn't work very well if you've got English, Irish and Greek mixed together in the one title.
So I started to look around again and came across Xapian. It's smaller and more lightweight, has the features I need, and has direct bindings for several languages, including PHP, which is what I need. From what I can tell, this means I don't need a separate daemon; the documentation is written with code samples in Python (yay!), and I'm planning to use xml.etree.ElementTree to re-use my existing XML output and stick it into Xapian.
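The plan above could look something like this: parse the existing XML with xml.etree.ElementTree, then feed each entry into a Xapian database via its Python bindings. The element names (`entry`, `author`, `title`, `year`) and the term prefixes are my assumptions here, not the bibliography's actual schema — a sketch, not the finished indexer.

```python
import xml.etree.ElementTree as ET

# Hypothetical shape of the existing XML export.
SAMPLE = """<bibliography>
  <entry id="e1">
    <author>Ó Dónaill, Niall</author>
    <title>Foclóir Gaeilge-Béarla</title>
    <year>1977</year>
  </entry>
</bibliography>"""

def parse_entries(xml_text):
    """Yield one {field: text} dict per <entry> element."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        fields = {child.tag: (child.text or "").strip() for child in entry}
        fields["id"] = entry.get("id", "")
        yield fields

def index_entries(xml_text, dbpath):
    """Index parsed entries into Xapian (requires the xapian Python
    bindings; the 'A'/'S' field prefixes are conventional choices)."""
    import xapian  # deferred so parsing works without the bindings
    db = xapian.WritableDatabase(dbpath, xapian.DB_CREATE_OR_OPEN)
    tg = xapian.TermGenerator()
    for fields in parse_entries(xml_text):
        doc = xapian.Document()
        tg.set_document(doc)
        tg.index_text(fields.get("author", ""), 1, "A")  # prefixed author terms
        tg.increase_termpos()  # keep fields from phrase-matching together
        tg.index_text(fields.get("title", ""), 1, "S")   # prefixed title terms
        tg.increase_termpos()
        tg.index_text(" ".join(fields.values()))         # unprefixed free text
        doc.set_data(fields.get("id", ""))
        db.add_document(doc)
```

Note there's deliberately no stemmer configured on the TermGenerator — given the mix of English, Irish and Greek in the data, language-specific stemming would do more harm than good, which was part of the problem with Solr.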
I said I’d have a very basic working example on one or two fields in about three weeks (which for me is the cumulative “spare” time from three busy working days). Wish me luck…