Wellfire Interactive

Stretching Haystack's ElasticSearch Backend

A lot of feature requirements in Django projects are solved by domain-specific third-party modules that fit the bill so well they end up becoming something of a community standard. For search, Haystack is that touchstone: it supports some of the most common search engines and its API closely mirrors existing Django APIs, making it easy for developers to get started.

We’ve been using Haystack with a Lucene-backed engine called ElasticSearch - you know, for search. Unlike the popular Solr search engine, ElasticSearch uses schema-free JSON instead of XML and runs as a standalone server without requiring a separate Java servlet container. For our needs it strikes the right balance of simplicity and power.

Note: ElasticSearch support is only available in Haystack 2.0.0 beta. To use it you’ll need to grab the code from source, not PyPI.

What ElasticSearch can do

Rather than simply filtering your content, a search engine performs textual matching. Unlike a LIKE query in SQL, the query and the indexed content can be given different relevancy weights, analyzed with language-specific rules, and even expanded with synonyms. And it can do so across different types of content, or rather, different types of ‘documents’.

The search engine does so by tokenizing and filtering the content - both indexed content and query terms. ElasticSearch allows you to configure how these tokenizers and filters are used, and you can add your own as well. With the available filters and tokenizers, you can define analyzers that target different languages, use custom stop words, and expand synonyms. These analyzers are declared in the index settings, which ElasticSearch applies when the index is created.
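
To make the tokenize-then-filter idea concrete, here is a minimal pure-Python sketch of an analysis chain. It is not how ElasticSearch is implemented; the stop word list, the synonym map, and all function names are hypothetical and exist only for illustration.

```python
# Hypothetical data for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "for"}
SYNONYMS = {"couch": ["sofa"], "tv": ["television"]}


def whitespace_tokenizer(text):
    """Split text into tokens on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Normalize case so 'Couch' and 'couch' match."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Drop common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def synonym_filter(tokens):
    """Emit each token followed by any synonyms configured for it."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out


def analyze(text, tokenizer, filters):
    """An 'analyzer' here is one tokenizer followed by a chain of filters."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


FILTERS = [lowercase_filter, stop_filter, synonym_filter]
indexed = analyze("The Couch for a TV", whitespace_tokenizer, FILTERS)
query = analyze("sofa", whitespace_tokenizer, FILTERS)

# The query term matches because the synonym was added at index time.
matches = set(query) & set(indexed)
```

The same chain runs over both the indexed text and the query, which is why a search for "sofa" can find a document that only ever said "Couch".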

Here’s an example from the ElasticSearch docs for setting up an analyzer to filter on synonyms using a provided synonym file.

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "synonym" : {
                    "tokenizer" : "whitespace",
                    "filter" : ["synonym"]
                }
            },
            "filter" : {
                "synonym" : {
                    "type" : "synonym",
                    "synonyms_path" : "analysis/synonym.txt"
                }
            }
        }
    }
}

This looks like a pretty useful feature until you realize that Haystack’s ElasticSearch backend only supports a default setting configuration. Here’s what our index settings look like (source).

DEFAULT_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_ngram"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"]
                }
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15
                }
            }
        }
    }
}

And here’s the snippet showing how these are used (source).

if current_mapping != self.existing_mapping:
    try:
        # Make sure the index is there first.
        self.conn.create_index(self.index_name, self.DEFAULT_SETTINGS)
        self.conn.put_mapping(self.index_name, 'modelresult', current_mapping)
        self.existing_mapping = current_mapping
    except Exception:
        if not self.silently_fail:
            raise

The settings configure two nGram analyzers for Haystack, but we’re left without a way of changing the filter or tokenizer attributes, or of adding a new analyzer.
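
For readers unfamiliar with the two filter types, the following sketch approximates what `min_gram` and `max_gram` mean for a single token. It is a pure-Python illustration with hypothetical function names, not ElasticSearch's implementation, and the order in which the real filters emit grams may differ.

```python
def ngrams(token, min_gram, max_gram):
    """Every substring of token whose length is between min_gram and max_gram."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break  # longer sizes would run past the end of the token
            grams.append(token[start:start + size])
    return grams


def edge_ngrams(token, min_gram, max_gram):
    """Only the substrings anchored at the front of the token (prefixes)."""
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


# Same limits as the haystack_ngram and haystack_edgengram filters above.
inner = ngrams("stack", 3, 15)
prefixes = edge_ngrams("stack", 2, 15)
```

With these limits, `inner` holds the fragments "sta", "stac", "stack", "tac", "tack" and "ack", while `prefixes` holds only "st", "sta", "stac" and "stack". That is why edge nGrams suit autocomplete and plain nGrams suit partial matching anywhere in a word.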

Using custom index settings

The solution, for the time being, is to use a custom search backend. The first step is to replace the settings used when the index is created. Here’s a custom backend extending the original.

from django.conf import settings
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend

class ConfigurableElasticBackend(ElasticsearchSearchBackend):

    def __init__(self, connection_alias, **connection_options):
        super(ConfigurableElasticBackend, self).__init__(
                                connection_alias, **connection_options)
        # Default to None so a missing setting doesn't raise AttributeError.
        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
        if user_settings:
            setattr(self, 'DEFAULT_SETTINGS', user_settings)

This extended backend does nothing more than look for a custom settings dictionary in your project settings file and then replace the backend settings with your own. But now we can swap out those settings.

Choosing a new default analyzer

Even though we’ve updated the settings, our changes are still unused. Haystack assigns each search field an analyzer that is hard-coded in the backend.

The default analyzer for non-nGram fields is the “snowball” analyzer. The snowball analyzer is basically a stemming analyzer, which means it reduces words to a common root so that related forms match each other, as “swimming” is reduced to “swim”, for instance. It also adds in a stop word filter, which keeps common words such as prepositions and articles from entering the index. The analyzer is also language specific, which could be problematic: the default language is English, and to change it you need to specify the language in the index settings.
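
As a rough illustration of stemming plus stop word removal, here is a toy sketch. It is emphatically not the Snowball algorithm: the suffix rules and the stop word list are hypothetical simplifications chosen only to show the effect on matching.

```python
# Small illustrative list; the real analyzer ships its own per language.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def toy_stem(word):
    """Very rough suffix stripping; NOT the Snowball algorithm."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Collapse a doubled final consonant left behind: "swimm" -> "swim".
        if word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def toy_analyze(text):
    """Lowercase, drop stop words, then stem what remains."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


indexed = toy_analyze("Swimming in the pools")
query = toy_analyze("swim pool")
```

Both inputs reduce to the same two terms, "swim" and "pool", so the query matches the document even though none of the original words are identical.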

Here’s the snippet in which the default analyzer is set in the build_schema method, with minor formatting changes for this page (source).

if field_mapping['type'] == 'string' and field_class.indexed:
    field_mapping["term_vector"] = "with_positions_offsets"

    if not hasattr(field_class, 'facet_for') and not \
            field_class.field_type in('ngram', 'edge_ngram'):
        field_mapping["analyzer"] = "snowball"

The chosen analyzer should be configurable, so let’s make it so.

class ConfigurableElasticBackend(ElasticsearchSearchBackend):

    DEFAULT_ANALYZER = "snowball"

    def __init__(self, connection_alias, **connection_options):
        super(ConfigurableElasticBackend, self).__init__(
                                connection_alias, **connection_options)

        # Default to None so a missing setting doesn't raise AttributeError.
        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
        user_analyzer = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', None)

        if user_settings:
            setattr(self, 'DEFAULT_SETTINGS', user_settings)
        if user_analyzer:
            setattr(self, 'DEFAULT_ANALYZER', user_analyzer)

    def build_schema(self, fields):
        content_field_name, mapping = super(ConfigurableElasticBackend,
                                              self).build_schema(fields)

        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]

            if field_mapping['type'] == 'string' and field_class.indexed:
                if not hasattr(field_class, 'facet_for') and not \
                                  field_class.field_type in('ngram', 'edge_ngram'):
                    field_mapping['analyzer'] = self.DEFAULT_ANALYZER
            mapping.update({field_class.index_fieldname: field_mapping})
        return (content_field_name, mapping)

This update closely follows how the base method is written, including iterating through the fields as well as ignoring nGram fields. Now, on reindexing, all of your non-nGram indexed content will be analyzed with your specified analyzer. For explicitness, the fallback analyzer is set directly as a class attribute.

Search analyzers by field

We’ve now set up a configurable default analyzer, but why not control this on a field by field basis? It should be pretty straightforward. We’ll just subclass the fields, adding an analyzer attribute via a keyword argument.

class ConfigurableFieldMixin(object):

    def __init__(self, **kwargs):
        self.analyzer = kwargs.pop('analyzer', None)
        super(ConfigurableFieldMixin, self).__init__(**kwargs)

And then define a new field class using the mixin:

from haystack.fields import CharField as BaseCharField

class CharField(ConfigurableFieldMixin, BaseCharField):
    pass
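
The following sketch shows why the mixin pops the keyword argument before calling the parent constructor. `StandInField` is a hypothetical substitute for a Haystack field, assumed here to accept only a fixed set of keyword arguments, so the example runs without Haystack installed.

```python
class StandInField(object):
    """Hypothetical stand-in for a Haystack field with fixed keyword arguments."""

    def __init__(self, model_attr=None, indexed=True):
        self.model_attr = model_attr
        self.indexed = indexed


class ConfigurableFieldMixin(object):

    def __init__(self, **kwargs):
        # Remove 'analyzer' so the base class never sees an unknown argument.
        self.analyzer = kwargs.pop('analyzer', None)
        super(ConfigurableFieldMixin, self).__init__(**kwargs)


class CharField(ConfigurableFieldMixin, StandInField):
    pass


title = CharField(model_attr='title', analyzer='synonym')
body = CharField(model_attr='body')

# Passing the extra argument straight to the base class fails.
try:
    StandInField(model_attr='title', analyzer='synonym')
    base_accepts_analyzer = True
except TypeError:
    base_accepts_analyzer = False
```

Note that the mixin has to come first in the list of base classes; otherwise the base field's constructor runs first and rejects the `analyzer` argument.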

Just be sure to import and use the new field rather than the field from the indexes module as you’d normally do. This establishes which analyzer the field should use, but doesn’t actually use the analyzer for indexing. Again, we need to extend the subclassed backend to do so, focusing on the build_schema method.

class ConfigurableElasticBackend(ElasticsearchSearchBackend):

    DEFAULT_ANALYZER = "snowball"

    def __init__(self, connection_alias, **connection_options):
        super(ConfigurableElasticBackend, self).__init__(
                                connection_alias, **connection_options)
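        # Default to None so a missing setting doesn't raise AttributeError.
        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
        user_analyzer = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', None)
        if user_settings:
            setattr(self, 'DEFAULT_SETTINGS', user_settings)
        if user_analyzer:
            setattr(self, 'DEFAULT_ANALYZER', user_analyzer)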

    def build_schema(self, fields):
        content_field_name, mapping = super(ConfigurableElasticBackend,
                                              self).build_schema(fields)

        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]

            if field_mapping['type'] == 'string' and field_class.indexed:
                if not hasattr(field_class, 'facet_for') and not \
                                  field_class.field_type in('ngram', 'edge_ngram'):
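                    # Fall back to the default when the field sets no analyzer
                    # (the mixin stores None if the keyword is omitted).
                    field_mapping['analyzer'] = (getattr(field_class, 'analyzer', None)
                                                 or self.DEFAULT_ANALYZER)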
            mapping.update({field_class.index_fieldname: field_mapping})
        return (content_field_name, mapping)

If you wanted to control nGram analysis on a field by field basis, simply remove the field_type check from the conditional.
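
The resolution logic can be exercised on its own, without Haystack or a running ElasticSearch node. The sketch below mirrors the loop in build_schema using a hypothetical `StandInField` class and a hand-written mapping; it assumes a field whose analyzer is None should fall back to the default.

```python
DEFAULT_ANALYZER = "snowball"
NGRAM_TYPES = ("ngram", "edge_ngram")


class StandInField(object):
    """Hypothetical stand-in carrying only the attributes the loop reads."""

    def __init__(self, index_fieldname, field_type="string", indexed=True,
                 analyzer=None):
        self.index_fieldname = index_fieldname
        self.field_type = field_type
        self.indexed = indexed
        self.analyzer = analyzer


def assign_analyzers(mapping, fields, default=DEFAULT_ANALYZER):
    """Set an analyzer on each indexed, non-facet, non-nGram string field."""
    for field in fields.values():
        field_mapping = mapping[field.index_fieldname]
        if field_mapping["type"] != "string" or not field.indexed:
            continue
        if hasattr(field, "facet_for") or field.field_type in NGRAM_TYPES:
            continue
        # Prefer the field's own analyzer; fall back when it is missing or None.
        field_mapping["analyzer"] = getattr(field, "analyzer", None) or default
    return mapping


fields = {
    "title": StandInField("title", analyzer="synonym"),
    "body": StandInField("body"),
    "suggest": StandInField("suggest", field_type="edge_ngram"),
}
mapping = {
    "title": {"type": "string"},
    "body": {"type": "string"},
    "suggest": {"type": "string", "analyzer": "edgengram_analyzer"},
}
assign_analyzers(mapping, fields)
```

After the call, "title" uses its own "synonym" analyzer, "body" falls back to "snowball", and the edge nGram field keeps the analyzer it already had.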

Putting it all together

When you update your project settings to use the new backend, ensure that you’re referring to an engine class (a BaseEngine subclass), not a backend class (a BaseSearchBackend subclass). Given that we’ve just defined a new backend class, we’ll also need to define a new search engine that uses it.

from haystack.backends.elasticsearch_backend import ElasticsearchSearchEngine

class ConfigurableElasticSearchEngine(ElasticsearchSearchEngine):
    backend = ConfigurableElasticBackend

Now simply update your project settings accordingly to reference your new search engine backend and you’re good to go.

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'myapp.backends.ConfigurableElasticSearchEngine',
        'URL': env_var('HAYSTACK_URL', 'http://127.0.0.1:9200/'),
        'INDEX_NAME': 'haystack',
    },
}
ELASTICSEARCH_INDEX_SETTINGS = {
    # index settings
}
ELASTICSEARCH_DEFAULT_ANALYZER = "snowball"
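
Because the custom backend replaces DEFAULT_SETTINGS wholesale rather than merging, any nGram analyzers you still rely on need to be carried over into your own settings. Here is one hedged way to build the dictionary; `BASE_SETTINGS` is an abbreviated stand-in for the backend defaults, `with_synonym_analyzer` is a hypothetical helper, and the synonym file path is assumed to exist on the ElasticSearch node.

```python
import copy

# Abbreviated stand-in for the backend's DEFAULT_SETTINGS shown earlier.
BASE_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"],
                },
            },
            "filter": {
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15,
                },
            },
        }
    }
}


def with_synonym_analyzer(base, synonyms_path):
    """Return a copy of base with a 'synonym' analyzer and filter added."""
    merged = copy.deepcopy(base)  # leave the original defaults untouched
    analysis = merged["settings"]["analysis"]
    analysis.setdefault("analyzer", {})["synonym"] = {
        "tokenizer": "whitespace",
        "filter": ["synonym"],
    }
    analysis.setdefault("filter", {})["synonym"] = {
        "type": "synonym",
        "synonyms_path": synonyms_path,  # assumed to exist on the server
    }
    return merged


ELASTICSEARCH_INDEX_SETTINGS = with_synonym_analyzer(
    BASE_SETTINGS, "analysis/synonym.txt")
ELASTICSEARCH_DEFAULT_ANALYZER = "synonym"
```

Copying the defaults first keeps the existing nGram configuration intact while adding the synonym analyzer from the earlier ElasticSearch example.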

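Don’t forget to rebuild your index (for example with Haystack’s rebuild_index management command) so that the new settings and mappings take effect.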

Posted: by Ben