Search¶
A top-level description of search within tate.org.uk
Context¶
A user can search for occurrences of a particular term in tate.org.uk by using the search input at the top of all pages or the input on the dedicated search page, https://www.tate.org.uk/search
Either way, a user will be provided with a list of search results at https://www.tate.org.uk/search
A user can also land on the dedicated search page via a link (in site or externally) that might include particular filter/term parameter values to ensure a certain results list is displayed - in other words, a pre-filtered 'view'. For example, a link to the items within a particular archive: https://www.tate.org.uk/search?aid=628&type=archive
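As a minimal sketch of how such a pre-filtered link could be assembled, the helper below builds the URL from keyword arguments. The function name build_search_link is hypothetical and not part of the codebase; only the aid and type parameters come from the example above.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.tate.org.uk/search"


def build_search_link(**params):
    """Return a search page URL carrying the given filter/term parameters.

    Hypothetical helper for illustration; not part of the tate codebase.
    """
    # Drop parameters with no value so the link only carries active filters.
    active = {key: value for key, value in params.items() if value is not None}
    if not active:
        return SEARCH_URL
    return f"{SEARCH_URL}?{urlencode(active)}"


# Link to the items within a particular archive.
print(build_search_link(aid=628, type="archive"))
```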
The search UX is mostly unchanged from before the Wagtail migration project. However, in the previous version of the site, search results were provided by the Elasticsearch-driven legacy API.
The search back-end therefore required rewriting to use Wagtail page data.
Fundamentals¶
Backend¶
Searching uses whichever back-end is specified under WAGTAILSEARCH_BACKENDS in tate/settings/base.py. On production this will be tate.search.backends.es8, which is basically a Wagtail-maintained interface to the API of an instance of Elasticsearch.
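For orientation, a setting of this kind might look roughly like the sketch below. Only the BACKEND path is taken from this page; the URL, index name and timeout are placeholder assumptions, and the real values live in the project settings.

```python
# Illustrative only: everything except the BACKEND path is an assumed placeholder.
WAGTAILSEARCH_BACKENDS = {
    "default": {
        "BACKEND": "tate.search.backends.es8",
        "URLS": ["http://localhost:9200"],  # assumed local Elasticsearch instance
        "INDEX": "wagtail",  # assumed index name prefix
        "TIMEOUT": 5,  # seconds; assumed value
        "OPTIONS": {},
        "INDEX_SETTINGS": {},
    }
}
```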
Essentially, the Wagtail ES8 backend provides a Django queryset-like API for querying Elasticsearch. This provides convenience but it's sometimes helpful to understand what happens beyond this abstraction.
In terms of implementation, the steps required in performing a typical search are as follows:
- Start with a queryset of model types (corresponding to the content types of results expected by the user)
- Apply filtering to the queryset as necessary
- Call search() on the queryset with the user's query
- At this point, the Wagtail ES8 backend effectively translates the queryset into a conventional ES query. It does this by mapping the model types of the queryset to the necessary ES index/indices to be queried, and adds a request for filtering to the query in accordance with the filtering applied to the queryset
- ES returns a set of results, which the Wagtail ES8 backend then converts into something like (but fundamentally not) a queryset
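The translation step above can be pictured with a small, self-contained sketch. It is not the Wagtail backend's code: the function, model names and index names are hypothetical, and the real backend uses its own index and field naming. It only shows the general shape of an ES bool query, where the user's query is scored and the queryset filters are applied as non-scoring filter clauses.

```python
def build_es_request(index_by_model, models, filters, query_string):
    """Turn a queryset-like description into an ES-style request (illustrative)."""
    # Map the model types to the indices that hold them.
    indices = sorted({index_by_model[model] for model in models})
    # Each queryset filter becomes a non-scoring `term` clause.
    filter_clauses = [{"term": {field: value}} for field, value in filters.items()]
    # The user's query is the scoring part of the request.
    body = {
        "query": {
            "bool": {
                "must": [{"multi_match": {"query": query_string}}],
                "filter": filter_clauses,
            }
        }
    }
    return {"index": ",".join(indices), "body": body}


# Hypothetical model and index names, for illustration only.
INDEX_BY_MODEL = {"ArtworkPage": "example__artwork", "EventPage": "example__event"}

request = build_es_request(
    INDEX_BY_MODEL,
    models=["EventPage", "ArtworkPage"],
    filters={"live": True},
    query_string="splash",
)
print(request["index"])
print(request["body"])
```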
Filtering¶
The set of model instances returned by the Wagtail ES8 backend is fundamentally not a queryset because it cannot be filtered.
Unlike conventional queryset filtering, which is an abstraction of a database query, the filtering applied to the original queryset is performed in Elasticsearch.
One key point here is that, as long as a model field is explicitly set to be filtered on, using index.FilterField, its mapping in the ES index will specify it as enabled for filtering (in Elasticsearch).
Another key point is that Elasticsearch can only filter (or sort) a multi-index (i.e. multi-model) query on a field that's available to all the indices. The field needn't necessarily have a value - but it must be in each index's mapping.
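One way to reason about the multi-index constraint is to intersect the field names of each index's mapping. The sketch below does this over plain dictionaries shaped like ES mappings; the index and field names are invented for illustration, and the helper is not part of the codebase.

```python
def shared_filter_fields(mappings):
    """Return the field names that appear in every index mapping."""
    field_sets = [set(mapping["properties"]) for mapping in mappings.values()]
    if not field_sets:
        return set()
    return set.intersection(*field_sets)


# Hypothetical mappings for two indices.
MAPPINGS = {
    "example__artwork": {
        "properties": {
            "live": {"type": "boolean"},
            "first_published_at": {"type": "date"},
            "acquisition_year": {"type": "integer"},
        }
    },
    "example__event": {
        "properties": {
            "live": {"type": "boolean"},
            "first_published_at": {"type": "date"},
            "venue": {"type": "keyword"},
        }
    },
}

# Only these fields are safe to filter or sort on across both indices.
print(sorted(shared_filter_fields(MAPPINGS)))
```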
Filtering on a custom indexed value¶
Wagtail supports indexing extra fields (e.g. properties or methods). Out of the box this is only useful for searching, as these custom indexed fields can't be used for filtering.
A custom solution was developed for this project to solve this: tate.search.backends.es8.IndexedColumn.
To use it, first make sure your model uses CustomFilterField in its search_fields:
from wagtail.models import Page

from tate.search.utils.index import CustomFilterField


class MyPage(Page):
    ...

    search_fields = [
        ...,
        CustomFilterField("custom"),
    ]

    def custom(self):
        return ...  # Could be any value
Once that's done (and the index has been updated), you can use this field in a normal filter() call before calling search(), and the filtering will be done inside the Elasticsearch backend.
from tate.search.backends.es8 import IndexedColumn
queryset = MyPage.objects.annotate(custom=IndexedColumn("custom"))
queryset = queryset.filter(custom="...")
queryset = queryset.search(...)
Lookups are also supported (provided they're supported by the backend as well):
queryset = MyPage.objects.annotate(custom=IndexedColumn("custom"))
queryset = queryset.filter(custom__lte="...")
queryset = queryset.search(...)
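To give a feel for what "supported by the backend" means, the sketch below maps a Django-style lookup key to an ES-style clause. This is an assumption-laden illustration, not the project's or Wagtail's implementation: the function name and the set of handled lookups are hypothetical.

```python
RANGE_LOOKUPS = {"lt", "lte", "gt", "gte"}


def lookup_to_clause(key, value):
    """Map a Django-style filter key (e.g. 'custom__lte') to an ES-style clause."""
    field, _, lookup = key.partition("__")
    if not lookup or lookup == "exact":
        # Plain equality becomes a `term` clause.
        return {"term": {field: value}}
    if lookup in RANGE_LOOKUPS:
        # Comparison lookups become a `range` clause with the same operator name.
        return {"range": {field: {lookup: value}}}
    if lookup == "in":
        return {"terms": {field: list(value)}}
    # Anything else is treated as unsupported in this sketch.
    raise ValueError(f"Unsupported lookup: {lookup}")


print(lookup_to_clause("custom", "abc"))
print(lookup_to_clause("custom__lte", 10))
```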
Fuzzy matching¶
Fuzzy matching is currently in use on the site; see docs here
Note: use is conditional on the presence of the Elasticsearch backend
Autocomplete¶
How it works¶
The Wagtail docs are a bit brief on this, so this is an attempt to provide a bit more information on how this works.
Firstly, despite appearing to be part of the QuerySet API, autocomplete() is similar in nature to the way search() can be called on a QuerySet. If you have Elasticsearch set as your backend, autocomplete() makes specific calls (defined in the ES search interface) to the ES API. If not, it provides similar (but likely not exactly the same) results using the default backend.
The search backend for preprod and prod is currently Elasticsearch, so the ES implementation of autocomplete() is of more relevance here.
Requests to ES via autocomplete() specify use of ES's edge n-gram tokenizer, which is specifically recommended for search-as-you-type queries. In very broad terms, the query string is broken down into chunks (of specified length), with searches based on the resulting chunks. Read about the theory in rather more detail here!
The min-gram and max-gram values are passed with the request in the tokenizer details here.
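To make the chunking concrete, here is a rough pure-Python imitation of edge n-gram generation for a single word. It is not what ES runs internally, and the min-gram and max-gram values used below are arbitrary examples rather than the values sent by the backend.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Return the prefixes of `token` between min_gram and max_gram characters."""
    upper = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, upper + 1)]


def matches_prefix(typed, word, min_gram, max_gram):
    """True if what the user has typed so far is one of the word's edge n-grams."""
    return typed.lower() in edge_ngrams(word.lower(), min_gram, max_gram)


print(edge_ngrams("splash", 3, 5))
print(matches_prefix("Spla", "Splash", 1, 15))
```

A word shorter than min_gram produces no chunks at all, which is why very short inputs may return nothing under a larger min-gram setting.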
What fields does it use?¶
Search requests using search() run against all model fields that are flagged with index.SearchField()
e.g. index.SearchField("description")
However, search requests using autocomplete() run against only fields flagged with index.AutocompleteField()
Currently, the only field flagged for autocomplete is the core title field available to all Wagtail models. In effect this means a search like the one shown below returns a SearchResults object containing pages with titles that contain the term. In that sense, ES is not interested in any other attribute of the page.
from wagtail.models import Page
Page.objects.live().autocomplete("Spla")
# returns
<SearchResults [<Page: Splash>, <Page: Sketch titled ‘splash’>, <Page: El Anatsui: Ink Splash II – 1>, <Page: El Anatsui: Ink Splash II – 2>, <Page: El Anatsui: Ink Splash II – 3>, <Page: Lipspeaker tour: A Bigger Splash>, <Page: Workshop: Bernard Makes a Splash>, <Page: Donald Rodney's ‘Splash Crowns’>, <Page: Sketch titled ‘splash crown’>, <Page: Splash>, <Page: A Bigger Splash: Painting and Performance>, <Page: A Bigger Splash by David Hockney>, <Page: A Bigger Splash: Painting after Performance>, <Page: Curator’s tour: A Bigger Splash>, <Page: A Bigger Splash: Painting after Performance>, <Page: A Bigger Splash: Painting after Performance>, <Page: Sketch of a water splash>, <Page: Sketch of a water splash>, <Page: Sketch of a water splash>, <Page: Sketch of a water splash>, '...(remaining elements truncated)...']>
In theory, having index.AutocompleteField() on just the title field of all models is useful. However, it's not practical, or likely a nice UX, to show titles while a user types. An approach using some sort of 'keyword' field (tagged accordingly with autocomplete) is therefore preferable. This would be dependent on an appropriate source for the 'keyword' field's value.