December 18
Full-text search on Google App Engine (what's wrong with search.SearchableModel)
We've been using Google App Engine for the last couple of months. You've probably read that the platform currently has major inadequacies. Many of its limitations are annoyingly prohibitive and have cost us many wasted hours: the file count limit, the file size limit, no SQL-style joins, no background processes, CPU limits and so on. But none of these was a stumbling block for us until we hit full-text search. This issue will likely force us off Google App Engine.
You probably want to tell me that there's search.SearchableModel, and that's exactly the topic I want to address here.
Our application worked fine in our development environment. It even deployed and worked correctly on appspot with the full-text search functionality. The problem with search.SearchableModel didn't become apparent until we started redeploying new indexes. These redeployments triggered index rebuilds. Our indexes would be stuck in building mode for days, eventually erroring out. Initially we thought this was just a small problem that Google would eventually fix, but we were met with very little support or response.
A more detailed look and a few experiments yielded the following, which I want to share here as a warning for anyone considering App Engine as a platform. Let me demonstrate with an example.
Say we want to search for all jobs related to "Marketing", limit the search to jobs from a specific company, and return the results in descending date order. This would require the following index declaration.
- kind: Job
  properties:
  - name: __searchable_text_index
  - name: company
  - name: created_at
    direction: desc
Note that in this example there's only one search term, namely "Marketing". If we want to search for "Marketing Manager" (a two-word search term), the following index would be required.
- kind: Job
  properties:
  - name: __searchable_text_index
  - name: __searchable_text_index
  - name: company
  - name: created_at
    direction: desc
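One way to picture why the property repeats: as far as I can tell, each word in the search term ends up as its own equality filter on __searchable_text_index, alongside the company filter. The sketch below is plain Python and does not call App Engine at all. The helper name search_filters and the company value "Acme" are made up for illustration, and the real library may normalise or drop some words, which this ignores.

```python
def search_filters(query, company):
    """Hypothetical helper: list the filters a search would need.

    Assumption: every word of the search term becomes one equality
    filter on __searchable_text_index. This only models the pattern
    observed in the experiments; it is not App Engine code.
    """
    words = query.lower().split()
    filters = [("__searchable_text_index =", word) for word in words]
    filters.append(("company =", company))
    return filters


def text_filter_count(query, company):
    """Count how many times __searchable_text_index is filtered on."""
    return sum(
        1
        for name, _ in search_filters(query, company)
        if name.startswith("__searchable_text_index")
    )


if __name__ == "__main__":
    for term in ("Marketing", "Marketing Manager"):
        print(term, "->", text_filter_count(term, "Acme"), "text filter(s)")
```

Under that assumption, the number of text filters is the number of times __searchable_text_index has to appear in the index declaration.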
After repeated testing, the conclusion I eventually drew is that to support an n-word search term, an index needs to be created with the index field (__searchable_text_index) listed n times, to cater for the possibility of someone searching with n words. I've confirmed this pattern through experimentation. This means a separate index has to be created for 1-word, 2-word, 3-word ... n-word search terms. This is horribly unscalable, and it is really the root of the indexing errors we've been experiencing. Google calls it an exploding index, and if you read the documentation carefully you'll start to understand why it would be exploding.
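To make the pattern concrete, here is a minimal sketch that writes out the index declarations needed to cover search terms of 1 up to some maximum number of words, using the same Job, company and created_at example. The function names index_for and indexes_up_to are my own, and the output is only text in the shape of the declarations shown earlier, not something generated or validated by App Engine.

```python
def index_for(n_words, kind="Job"):
    """Build the index declaration for an n-word search term.

    __searchable_text_index is listed once per word, followed by the
    extra filter (company) and the sort order (created_at, descending).
    """
    lines = ["- kind: %s" % kind, "  properties:"]
    lines += ["  - name: __searchable_text_index"] * n_words
    lines += [
        "  - name: company",
        "  - name: created_at",
        "    direction: desc",
    ]
    return "\n".join(lines)


def indexes_up_to(max_words, kind="Job"):
    """Build one declaration per term length, from 1 to max_words."""
    return "\n\n".join(index_for(n, kind) for n in range(1, max_words + 1))


if __name__ == "__main__":
    # Three separate indexes, just to allow terms of up to three words.
    print(indexes_up_to(3))
```

Every additional word you want to allow in a search term means one more index for the datastore to build, and that is for a single combination of filter and sort order.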
Just a note: based on Google's documentation (in the same link above), this may not be a problem for you if you don't need to apply additional filters (e.g. company, created_at) to the full-text search query. I am guessing that's why a site like vorby.com is coping OK, but I haven't really verified this since it doesn't really do much for us.
I am actively trying to find a workaround at the moment, e.g. by removing indexes that would create an exploding number of index permutations. But it's looking likely that we'll have to move off. (Fortunately we've planned for this a bit by making use of Django.)
Chat to me if you want, I would love to have a discussion on it: james__z on Skype.