Enano’s new search engine: an inside look
Coblynau is scheduled to be released in about two days. But let’s rewind a little bit and take a look at what’s been going on over the past day or two.
Two days ago I booted Nighthawk into Vista because Fedora was making my sound card queasy and I needed to do some testing of Enano under the OS from “that company out west” anyway.
I’ll be the first to admit that while I’m competent with MySQL, I’m not the best-coded stored procedure in the database, so to speak, when it comes to the dinosaur open source DBMS. I’m still trying to get MySQL’s weird syntax down (SHOW INDEXES FROM table anyone?), and I just realized how clueless I was when I set up the FULLTEXT index that comes on Enano’s page_text table.
The problem became apparent to me when I realized how difficult it was to hook into the search system convincingly. I was fed up with MySQL’s indexing functionality, and the built-in search engine in Enano blows because it’s just incredibly slow. There was never any central result list that you could tap into and manipulate, so plugins like Decir and Snapr were pretty much doomed to either having their own search pages or segregating search results from different tables or types of pages for the same query. Blech.
It occurred to me last night: what if I just selected the indexed words from search_index and matched them against the query? You know, something like “SELECT page_names FROM search_index WHERE word='some_term';”? The index is already updated automatically each time a page is saved, so I decided that it was a go.
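To make the lookup idea concrete, here is a minimal sketch in Python using an in-memory SQLite database rather than Enano’s actual PHP and MySQL code. The table layout is an assumption: the post only names the search_index table and its word and page_names columns, so storing page IDs as a comma-separated string is purely a guess for illustration, and pages_for_word() is a hypothetical helper.

```python
import sqlite3

# Hypothetical schema: one row per indexed word, with page IDs stored as a
# comma-separated string in page_names. Enano's real layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_index (word TEXT PRIMARY KEY, page_names TEXT)")
conn.executemany(
    "INSERT INTO search_index (word, page_names) VALUES (?, ?)",
    [
        ("enano", "ns=Article;pid=Main_Page,ns=Article;pid=About"),
        ("search", "ns=Article;pid=Main_Page"),
    ],
)


def pages_for_word(conn, word):
    """Return the page IDs indexed under one word (empty list if none)."""
    row = conn.execute(
        "SELECT page_names FROM search_index WHERE word = ?", (word,)
    ).fetchone()
    return row[0].split(",") if row else []
```

Each search term becomes one exact-match lookup against the word column, so the database never has to scan page text at query time.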
I started rewriting the code around 10:30 PM. By 1:30 AM I had finally ironed out the 4 or 5 PHP and SQL syntax errors, and the newborn algorithm (hackers will want to know that it’s a function called perform_search()) was returning an array with trimmed/clipped/highlighted page info. I also had a little file called search-test.php that displayed the results in an increasingly human-friendly way.
Here’s how it all works: you have two arrays, $scores and $page_data. Each page has a unique string assigned to it in the format “ns=Article;pid=Main_Page”. $scores is an associative array containing one value for each page found. The value is incremented by 1 each time another search term is found on the page. $page_data just contains the unique ID, page text, page name, and the size of the page in bytes.
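The scoring step described above can be sketched roughly like this, again in Python rather than PHP. This is not the body of perform_search(). The shape of the index argument, keying scores by the page’s unique string, deduplicating repeated terms, and the rank() helper are all assumptions made for the example.

```python
def score_pages(terms, index):
    """Count how many distinct search terms appear on each page.

    index is a hypothetical dict mapping an indexed word to the list of
    page IDs that contain it (e.g. built from search_index lookups).
    """
    scores = {}
    # Assumption: a term repeated in the query only counts once.
    for term in dict.fromkeys(t.lower() for t in terms):
        for page_id in index.get(term, []):
            scores[page_id] = scores.get(page_id, 0) + 1
    return scores


def rank(scores):
    """Order page IDs by score, highest first (ties broken by ID)."""
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Example data in the ID format described in the post.
index = {
    "enano": ["ns=Article;pid=Main_Page", "ns=Article;pid=About"],
    "search": ["ns=Article;pid=Main_Page"],
}
scores = score_pages(["Enano", "search"], index)
```

In this sketch a page matching both terms ends up with a score of 2 and a page matching one term with a score of 1, which is all the ranking step needs.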
The reason this approach works so well is that you can easily have a plugin hook into the search algorithm and inject its own results, scoring them appropriately. The algorithm also was designed to consume almost no memory, and it’s working pretty well on my development site on Scribus.
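As a rough illustration of why shared score and page-data arrays make plugin integration easy, here is a hypothetical hook mechanism in Python. None of these names come from Enano: register_search_hook(), run_hooks(), the “Forum” namespace, and the sample post are invented for the sketch, and Enano’s real plugin API may look quite different.

```python
search_hooks = []  # hypothetical registry of plugin callbacks


def register_search_hook(func):
    """Register a callback that may add its own results to a search."""
    search_hooks.append(func)
    return func


def run_hooks(terms, scores, page_data):
    """Let every registered plugin inject results into the shared arrays."""
    for hook in search_hooks:
        hook(terms, scores, page_data)
    return scores, page_data


@register_search_hook
def forum_results(terms, scores, page_data):
    # Hypothetical plugin: scores its own content by the same rule,
    # one point per search term found.
    posts = {"ns=Forum;pid=42": "How do I install Enano plugins?"}
    for page_id, text in posts.items():
        hits = sum(1 for t in terms if t.lower() in text.lower())
        if hits:
            scores[page_id] = scores.get(page_id, 0) + hits
            page_data[page_id] = {
                "id": page_id,
                "name": "Forum post 42",
                "text": text,
                "size": len(text.encode("utf-8")),  # size in bytes
            }
```

Because the plugin writes into the same two structures as the core search, its results get ranked and displayed alongside regular pages instead of on a separate search page.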
An additional benefit, of course, was that we got to try something that’s never happened before in Enano history: during your upgrade, the search_cache table is actually going to be dropped. That’s right. The algorithm is so fast (it processed an 11-term query in 0.13 seconds, whereas the old algorithm took a whopping 10.4 seconds) that I decided a caching system wasn’t necessary. Only time and a heavy server load will tell, but so far Enano’s been performing very well with the new search code.