A scalable way to build a search engine.
With a constant or increasing rate of input documents, we at Reverbrain wanted to spend as little effort as possible on search engine resharding, moving data around, constantly watching shard load and so on. We just wanted to add new servers when needed, optionally running mundane operational tasks in the background (ideally started automatically by some agents).
Reverbrain had already developed the distributed storage system Elliptics. It was originally built on a distributed hash table, but we introduced logical entities named buckets to wrap replication into a structure that is easy to operate on. Our main goal for Elliptics was to create a really safe storage that keeps replicas across physically distributed datacenters, and with buckets it can be fine-tuned for different types of data. Buckets allow horizontal scaling in environments with a constant or increasing rate of input data and ever-growing storage size. When new buckets are added for new servers there is no resharding and no consistent hashing rebalancing, although it is also possible to scale a single bucket (a set of replicas) by adding servers into it. Elliptics has proved to handle scaling easily: in 2013 one of our installations held 53+ billion records; for comparison, Facebook had 450+ billion photos (250 billion by other sources) at that time.
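The key property — adding a bucket never moves existing data — can be sketched in a few lines of Python. This is a hypothetical toy model, not Elliptics' actual API: the bucket chosen at write time is recorded with the key, so reads never depend on how many buckets exist now.

```python
import random

class Bucket:
    """A set of replica servers treated as one placement unit."""
    def __init__(self, name, servers):
        self.name = name
        self.servers = servers

class BucketStore:
    """Toy sketch of bucket-based placement (not Elliptics' real API).

    The bucket chosen at write time is remembered alongside the key,
    so adding new buckets never relocates existing data, unlike a
    consistent-hashing ring that rebalances on membership change."""
    def __init__(self):
        self.buckets = []
        self.placement = {}   # key -> bucket name, recorded at write time
        self.data = {}        # (bucket name, key) -> value

    def add_bucket(self, bucket):
        # New capacity: no resharding, the bucket just becomes writable.
        self.buckets.append(bucket)

    def write(self, key, value):
        # Pick any writable bucket (a real system would weigh free space and load).
        bucket = random.choice(self.buckets)
        self.placement[key] = bucket.name
        self.data[(bucket.name, key)] = value

    def read(self, key):
        # Reads consult the recorded placement, unaffected by buckets
        # added after the key was written.
        return self.data[(self.placement[key], key)]

store = BucketStore()
store.add_bucket(Bucket("b1", ["dc1-host1", "dc2-host1"]))
store.write("doc:1", "hello")
store.add_bucket(Bucket("b2", ["dc1-host2", "dc3-host1"]))  # scale out
assert store.read("doc:1") == "hello"  # old key untouched by the new bucket
```

In a consistent-hashing scheme the second `add_bucket` would force a fraction of keys to migrate; here the recorded placement makes membership changes free for existing data.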
With this background we decided to build a search engine on top of Elliptics. That took all scaling, replication, distribution and recovery tasks off our shoulders, so we could concentrate on core search engine features. This is what we have:
- Automatic scaling via Elliptics buckets: no resharding, no data copying, no locked tables and related performance issues. One just adds new servers and they are used immediately, which is particularly useful to offload writes or fix out-of-space problems.
- Automatic replication management via Elliptics buckets – one can put index replicas into different datacenters across the globe, and read operations will select the fastest replica available for a given client.
- HTTP JSON-based REST API, which is quite popular in modern indexing servers.
- Basic set of search options such as index intersection and simple distance-based query relevance. We do not try to implement every possible search option, but will add them on demand.
- Automatic index recovery – most of the time we do not have to wait until Elliptics recovery kicks in; instead, replicas heal themselves when a client writes new keys.
- Search load balancing – the client always reads indexes from the fastest available replica in the given bucket.
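The index intersection and distance-based relevance mentioned above can be sketched roughly as follows. This is a toy Python sketch under our own naming, not Greylock's actual implementation: each term maps to a posting list of document positions, a multi-term query intersects the lists, and survivors are ranked by how close together the query terms appear.

```python
from collections import defaultdict

def build_index(docs):
    """Map term -> {doc_id: [positions]} for a corpus of text documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    """Intersect posting lists, rank by the spread of the query terms."""
    terms = query.lower().split()
    postings = [index.get(t, {}) for t in terms]
    # Intersection: only documents containing every query term survive.
    common = set(postings[0]) if postings else set()
    for p in postings[1:]:
        common &= set(p)
    results = []
    for doc_id in common:
        # Crude proximity score: the closer the terms, the better the rank.
        positions = [min(p[doc_id]) for p in postings]
        spread = max(positions) - min(positions)
        results.append((spread, doc_id))
    return [doc_id for spread, doc_id in sorted(results)]

docs = {
    "a": "distributed storage system with replicas",
    "b": "storage replicas kept in distributed datacenters",
    "c": "search engine scaling notes",
}
print(search(build_index(docs), "distributed storage"))  # → ['a', 'b']
```

Document "a" ranks first because "distributed" and "storage" are adjacent there, while in "b" they are four positions apart.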
Greylock is a fairly simple search engine; we created it to fill the scalability niche. Because of this it does not embed natural language processing such as lemmatization (or simpler stemming), spelling error correction, synonym search and so on. Instead we built a microservice architecture where NLP tasks are separated into their own service.
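The separation can be illustrated with a toy normalization pass that a front-end NLP service might apply before forwarding a query to the search engine. This is a hypothetical sketch: the suffix rules below are illustrative only, not a real stemmer such as Porter's.

```python
# Toy query normalizer that a separate NLP microservice might run
# before forwarding the query to the search engine itself.
# The suffix list is illustrative only, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(term):
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def normalize_query(query):
    return " ".join(stem(t) for t in query.lower().split())

print(normalize_query("Searching distributed storages"))  # → "search distribut storag"
```

Keeping this logic out of the engine means language-specific processing can be deployed, scaled and replaced independently of the index servers.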