Beagle data storage: Lucene 1.9 and SQLite 3
Beagle sure has come a long way in terms of maturity over the last few months.
I’ve been getting involved with Beagle’s interaction with dotLucene which is the C# port of Apache Lucene – a very powerful text search architecture. Beagle stores text content of indexed files within Lucene ‘databases’ and uses Lucene’s impressive search features to query on behalf of the user.
We previously used dotLucene 1.4.3 within Beagle, but I recently upgraded us to 1.9 RC1. Beagle is mostly unaffected by the changes, but there are some bug fixes and optimizations included. Perhaps the biggest win was the result of my extensive testing to make sure the upgrade didn’t break anything – I did identify and fix two bugs, and they were both also present in the 1.4 code.
The first bug was a file descriptor leak in a common code path (inside Beagle code), and the other, a fairly significant locking bug which was causing the locking often to not be having any effect at all. This explains some of the strange behaviour that has cropped up time to time in the past which we’ve never been able to pinpoint.
I also looked at some traces through the codepaths. I noticed that dotLucene was dealing with throwing and catching exceptions a hell of a lot – hundreds of exceptions being dealt with while indexing a small range of files. dotLucene was using exception catching where simple if/else combinations would work just fine. Exception handling is expensive as the runtime must jump through hoops keeping track of where to jump to if a certain type of exception occurs, so by greatly reducing the amount of exception handling that takes place, we have a nice small optimization in place.
After landing dotLucene 1.9, I’ve now turned some attention to another aspect of Beagle’s data storage mechanism. Beagle uses SQLite to store file attributes when extended attributes are not available, and for its file text cache.
Currently, Beagle only uses SQLite 2.x. Attempting to ‘port’ it to SQLite 3 revealed a problem in our SQLite interaction. You must always query a SQLite database from the same thread that the connection was originally established. Beagle is multi-threaded and we are using the same connection over multiple threads, which is (apparently) unsafe, and SQLite 3 explicitly checks this and returns error if you go beyond the original thread.
This creates a non-trivial problem to solve, and is a poor design decision from the SQLite developers. We’re going to stick with SQLite 2.x-only, as it seems to work just fine even despite sharing the connection over our thread pool. SQLite 3 wouldn’t bring any major benefits to us, and we are unable to use it due to its new explicit thread checking restriction. Sigh.