Full-text search with ColdFusion using Sphinx
Configuration (1/3)
In this example post we shall create indexes for forum postings which are stored in a PostgreSQL database; to keep things simple, we're just accessing one table with the columns message_id (int4), thread_id (int4), board_id (int4), title (text), text (text), father_id (int4), user_id (int4), created (timestamp without time zone). Remember that you can of course stick any SQL in there, so as long as you can access it, you can index it - use as many JOINs as you want.
Configuration is stored in /etc/sphinx/sphinx.conf (click to download example conf). There are four distinct parts of the configuration:
- In the first part you declare sources of data for your indexes (I avoid the term datasources as that might lead to some confusion in a J2EE environment);
- the second bit deals with the actual index configuration, i.e. which data source to use, where to store files, minimum word length, stemming, stopwords, phrase boundaries and normalization of special characters (i.e. é->e etc.);
- the third part contains configuration for the indexer
- and the fourth section deals with the settings for the search daemon searchd.
Let's take a look at each of these sections:
Sphinx datasources
This is the bit that's read by the indexer and compiled into indexes. Remember that it's not possible to update an index, so if the data changes we'll have to re-index from scratch. This inevitably means having to read all of the data you want to index. Our forums are quite large now - we're talking about a board_message table which is several GB in size. Indexing that may take a while, depending on hardware and current load. But we still want to have fairly current index data, so it should take no more than 5 minutes for a new forum posting to be available in search results, even when the complete index is just built once a day.
This problem is solved by having a so called delta index, i.e. a second index that's re-built every five minutes and takes in all the data changes that happened since the last full index build. Add a so called "killlist" into the equation, i.e. a list of document ids that have been deleted since the last full index build and you have a clean couple of indexes without stale entries.
In order to actually know what to index in the delta index, we need to store the maximum document id and the last index run somewhere in a tracker table. I have set up this table in a separate schema:
CREATE SCHEMA sphinx; CREATE TABLE sphinx.index_tracker ( id serial NOT NULL, max_doc_id BIGINT NOT NULL DEFAULT 0, description text NOT NULL, last_reindex TIMESTAMP WITHOUT TIME zone DEFAULT '1970-01-01 00:00:00'::TIMESTAMP WITHOUT TIME zone, CONSTRAINT "pk_index_tracker" PRIMARY KEY (id) )WITHOUT OIDS; CREATE INDEX idx_last_reindex ON sphinx.index_tracker USING btree (last_reindex);
Let's add our first index to the tracker table; each full index we use gets a new id assigned by the serial. the board_message index being our first, the corresponding id will obviously be 1.
INSERT INTO sphinx.index_tracker (max_doc_id,description) VALUES (1,'public.board_message index');
Okay, now on to /etc/sphinx/sphinx.conf - here is the configuration for the data source of our main board-message index; if a certain value is spanning several lines, a backslash is used at the end of a line:
source forummain
{
type = pgsql
sql_host = 123.123.123.123
sql_port = 5432
sql_user = username
sql_pass = password
sql_db = forumdb
sql_query_pre = UPDATE sphinx.index_tracker \
SET max_doc_id = (SELECT MAX(message_id) FROM public.board_message) \
, last_reindex = now() \
WHERE id = 1
sql_query_range = SELECT doc.min_message_id, t.max_doc_id \
FROM (SELECT MIN(m.message_id) AS min_message_id \
FROM public.board_message m \
WHERE m.message_id > 0) AS doc, \
sphinx.index_tracker t \
WHERE t.id = 1
sql_range_step = 1000
sql_query = SELECT message_id, thread_id, board_id, title, text, father_id, \
user_id, extract(epoch from created) as created_unixts, 1 as source_id \
FROM public.board_message \
WHERE message_id>=$start AND message_id<=$end
sql_attr_uint = user_id
sql_attr_uint = board_id
sql_attr_uint = thread_id
sql_attr_timestamp = created_unixts
sql_attr_uint = source_id
sql_attr_uint = father_id
#unpack_zlib = text
}
Let's go over this, it's fairly straightforward:
- Database configuration bit is quite obvious, no need for further explanation here.
- sql_query_pre sets a query that should be run before the indexing begins - we store the index run time and the maximum document id we are indexing now in our tracker table.
- Next we get the range of document ids in sql_query_range; the first value retrieved by this query must be the minimum document id, the second the maximum. These values are needed to calculate the $start and $end values in the actual data retrieval query.
- In sql_range_step we define how many documents should be fetched in one go. We don't want to overload the server by trying to get Gigabytes of data all in one go, so we'll set something sensible here like 1000.
- sql_query contains the actual data fetching query; the Sphinx indexer will walk over the range of document ids specified in sql_query_range in increments set with sql_range_step and set $start and $end accordingly for each query until the whole range is indexed.
In case you have wondered about the additional source_id field, I find it quite useful to be able to tell, which index exactly yielded a certain result row; we have for example another index (not covered here) for archived posts - and I allow searching both the current forum and the archives. The archive datasource has source_id=2 assigned. Knowing the source of a document id enables me to decide from which pool I need to actually fetch the corresponding document, i.e. from the table of current postings or from the archive table. - The last block of the source section deals with attribute fields - these can be used to filter and sort results. Sphinx doesn't know about database-specific date-types, so if you wish to sort or filter by date, you'll need to convert the field to a unix timestamp value in the query.
- The unpack_zlib statement would tell Sphinx to use compression on the database connection; this does not work with PostgreSQL, so it's commented. It might work with MySQL, though I haven't tried.
Now we've dealt with the full index, we need the configuration for the delta index:
source forumdelta : forummain
{
type = pgsql
sql_host = 123.123.123.123
sql_port = 5432
sql_user = username
sql_pass = password
sql_db = forumdb
sql_query_pre =
sql_query_range = SELECT t.max_doc_id, doc.max_message_id \
FROM (SELECT MAX(m.message_id) AS max_message_id \
FROM public.board_message m ) AS doc, \
sphinx.index_tracker t \
WHERE t.id = 1
sql_range_step = 1000
sql_query = SELECT message_id, thread_id, board_id, title, text, father_id,\
user_id, extract(epoch from created) as created_unixts, 1 as source_id \
FROM public.board_message \
WHERE message_id>$start AND message_id<=$end
sql_attr_uint = user_id
sql_attr_uint = board_id
sql_attr_uint = thread_id
sql_attr_timestamp = created_unixts
sql_attr_uint = source_id
sql_attr_uint = father_id
#unpack_zlib = texts
sql_query_killlist = SELECT trash.message_id FROM public_trashcan.message trash \
WHERE trash.deleted_date >= (SELECT track.last_reindex FROM sphinx.index_tracker track WHERE track.id = 1) \
UNION \
SELECT ar.message_id \
FROM archive.board_message ar , archive.board_thread at \
WHERE at.thread_id=ar.thread_id \
AND at.last_reply >= now() - '6 months 1 week'::interval
}
First you may have noticed the forumdelta : forummain name declaration of this source. This indicates that the configuration of forumdelta should inherit all the settings from forummain unless they are explicitly overwritten here. This means you do not have to repeat settings in the delta index source definition which are identical to the full index, you may leave out stuff like the database settings. This feature also applies to the actual index configurations later on - and you'll see that it is not only convenient but more or less a must for some settings...
As we 're not running any pre-indexing query for the delta index, we need to explicitly set the sql_query_pre to empty.
Apart from that there's just one new configuration item in there: sql_query_killlist. This specifies a number of documents that should be ignored in any search that uses this index. In our case we exclude messages that have been deleted or archived since the last full index run.
Next page: Sphinx configuration (2/3)

May 12th, 2010 - 19:39
Good stuff! Keep it up!
June 25th, 2010 - 16:28
nice post. thanks.