devbox@COMPUTEC The Computec development blog

5 May 2010

Full-text search with ColdFusion using Sphinx

Configuration (2/3)

Sphinx indexes

Now that our data sources are in place, we need to configure the actual indexes, i.e. how to index the data and where to store the index files. Several of these settings affect performance and resource consumption, so your mileage may vary depending on the amount of data and the available hardware. All of them are explained in detail in the docs, so let's just go over the most relevant bits.

index forummain
{
    source = forummain
    path = /var/lib/sphinx/data/forummain
    docinfo = extern
    charset_type = utf-8
    mlock = 1
    preopen = 1
    # uncomment the next line to reduce memory consumption at the cost of 1 additional disk I/O per keyword per query
    # ondisk_dict = 1
    min_word_len = 3
    morphology = stem_en, libstemmer_de
    min_stemming_len = 5
    stopwords = /etc/sphinx/german.stop /etc/sphinx/english.stop
    # wordforms = /etc/sphinx/wordforms.txt
    # exceptions = /etc/sphinx/exceptions.txt
    phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis
    phrase_boundary_step = 100
    index_exact_words = 1
    charset_table = \
    U+00C0->a, U+00C1->a, U+00C2->a, [...] \
    0..9, A..Z->a..z, a..z
}
  • path specifies where the index files are stored. Strictly speaking it is the path plus base file name without extension, so the example above creates files named forummain.* in /var/lib/sphinx/data. That directory needs to be readable by the Sphinx search daemon (searchd) and writable by the indexer.
  • morphology sets the stemmers for the languages you usually use. Stemming makes sure that plural forms and other inflections can also be found. Sphinx comes with English and Russian stemmers; additional stemmer libraries have to be installed separately, as described in the Compiling Sphinx section. If you have installed the Snowball stemmer library, you should already have stemmers for a number of languages available, for example libstemmer_de for German.
  • min_stemming_len specifies the minimum word length at which stemming is applied. Shorter words are indexed unstemmed.
  • stopwords specifies one or more files containing words that should not be indexed. You can download our stopword lists for German and English here and customize them for your needs.
  • The charset_table is the bit where the aforementioned configuration inheritance really comes in useful. In my example the full statement for this setting has nearly 190 not very readable lines, almost 50KB of more or less gibberish. It is very useful though, as this bit does all the tricky Unicode normalization.
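
To make the charset_table a little less cryptic, here is a minimal Python sketch of how such a table can be read. It is only an illustration and not Sphinx's own parser: the function names are made up for this example, it only handles the entry forms that appear in the shortened table above (single mappings, ranges and range mappings), and it leaves out the elided part. Characters that are not listed in the table are treated as separators.

```python
def parse_codepoint(token):
    """Turn 'U+00C0' or a single literal character such as 'a' into a code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) != 1:
        raise ValueError(f"unsupported token: {token!r}")
    return ord(token)


def parse_range(text):
    """Parse 'A' or 'A..Z' into an inclusive (start, end) pair of code points."""
    if ".." in text:
        start, end = text.split("..", 1)
        return parse_codepoint(start), parse_codepoint(end)
    cp = parse_codepoint(text)
    return cp, cp


def parse_charset_table(table):
    """Build a {source code point: target code point} mapping from a table string."""
    mapping = {}
    for entry in table.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if "->" in entry:
            # 'A..Z->a..z' or 'U+00C0->a': map source characters to targets
            src, dst = entry.split("->", 1)
            s0, s1 = parse_range(src)
            d0, d1 = parse_range(dst)
            if s1 - s0 != d1 - d0:
                raise ValueError(f"range sizes differ in {entry!r}")
            for offset in range(s1 - s0 + 1):
                mapping[s0 + offset] = d0 + offset
        else:
            # 'a..z' or '0..9': characters that are valid as they are
            s0, s1 = parse_range(entry)
            for cp in range(s0, s1 + 1):
                mapping[cp] = cp
    return mapping


def normalize(text, mapping):
    """Map every character through the table; unlisted characters split words."""
    out = []
    for ch in text:
        mapped = mapping.get(ord(ch))
        out.append(chr(mapped) if mapped is not None else " ")
    return "".join(out).split()


# Shortened table from the example above, without the elided entries
table = "U+00C0->a, U+00C1->a, U+00C2->a, 0..9, A..Z->a..z, a..z"
mapping = parse_charset_table(table)
words = normalize("\u00c0 la Carte, 2010!", mapping)
print(words)  # ['a', 'la', 'carte', '2010']
```

The accented capital is folded to a plain "a", upper case is folded to lower case, and the comma and exclamation mark act as word separators because they are not part of the table.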

Thanks to configuration inheritance, the configuration for the delta index couldn't be much simpler:

index forumdelta : forummain
{
    source = forumdelta
    path = /var/lib/sphinx/data/forumdelta
}
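
If the inheritance rule is unclear, it can be sketched in a few lines of Python. This is only a model of the idea, not how Sphinx implements it: the data structure and the function name resolve_index are invented for this example, and only a handful of the settings from forummain are included. The child index starts with a copy of all parent settings, and every setting it defines itself replaces the inherited value.

```python
def resolve_index(name, indexes):
    """Return the effective settings of an index: parent settings first, own settings on top."""
    definition = indexes[name]
    parent = definition.get("parent")
    # Start from the parent's effective settings (or from nothing), then override
    settings = resolve_index(parent, indexes) if parent else {}
    settings.update(definition["settings"])
    return settings


# Hypothetical in-memory model of the two index sections above
indexes = {
    "forummain": {
        "parent": None,
        "settings": {
            "source": "forummain",
            "path": "/var/lib/sphinx/data/forummain",
            "min_word_len": "3",
            "morphology": "stem_en, libstemmer_de",
        },
    },
    "forumdelta": {
        "parent": "forummain",
        "settings": {
            "source": "forumdelta",
            "path": "/var/lib/sphinx/data/forumdelta",
        },
    },
}

effective = resolve_index("forumdelta", indexes)
for key, value in sorted(effective.items()):
    print(f"{key} = {value}")
```

In this model forumdelta ends up with its own source and path, while min_word_len and morphology come from forummain unchanged.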

Next page: Sphinx configuration (3/3)
