A scalable way to build a search engine.

With a constant or increasing rate of input documents, we at Reverbrain wanted to spend as little effort as possible on search engine resharding, moving data around, constantly watching shard load, and so on. We just wanted to add new servers when needed, optionally running mundane operational tasks in the background (ideally started automatically by some agent).

Reverbrain had already developed the distributed storage system Elliptics. It was originally built on a distributed hash table basis, but we introduced logical entities named buckets to wrap replication into a structure that can be easily operated on. Our main goal for Elliptics was to create a really safe storage system that keeps replicas across physically distributed datacenters, and with buckets it can be fine-tuned for different types of data. Buckets allow horizontal scaling in environments with a constant or increasing rate of input data and an ever-growing storage size. When adding new buckets for new servers there is no resharding or consistent-hashing rebalancing, although it is possible to scale a single bucket (a set of replicas) by adding servers into selected buckets. Elliptics proved to handle scaling easily: in 2013 one of our installations held 53+ billion records; for comparison, Facebook had 450+ billion photos (250 billion by other sources) in those days.
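The reason adding a bucket requires no rebalancing can be sketched in a few lines. This is an illustrative toy, not the real Elliptics API (`BucketStore`, `add_bucket`, `write`, `read` are hypothetical names): new data is routed to any writable bucket, the chosen bucket name is stored alongside the key, and reads address that bucket directly, so there is no cluster-wide hash ring to rebuild when capacity grows.

```python
import random

class Bucket:
    """A bucket stands in for a set of replicas; here it just maps key -> value."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class BucketStore:
    """Toy sketch of bucket-based routing (hypothetical interface,
    not the Elliptics client API)."""
    def __init__(self):
        self.buckets = {}

    def add_bucket(self, name):
        # Scaling out: register a new bucket; no existing data moves.
        self.buckets[name] = Bucket(name)

    def write(self, key, value):
        # Route new data to some writable bucket (a real system would
        # weight the choice by free space, load, etc.).
        bucket = random.choice(list(self.buckets.values()))
        bucket.data[key] = value
        return bucket.name  # the caller keeps (bucket, key) together

    def read(self, bucket_name, key):
        # Reads go straight to the named bucket; no ring lookup needed.
        return self.buckets[bucket_name].data[key]

store = BucketStore()
store.add_bucket("b1")
ref = store.write("doc1", "hello")
store.add_bucket("b2")  # scale out: "doc1" stays exactly where it was
assert store.read(ref, "doc1") == "hello"
```

Contrast this with consistent hashing, where adding a node changes the key-to-node mapping and forces a fraction of existing records to migrate.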

With this background we decided to build a search engine on top of Elliptics. That took all the scaling, replication, distribution and recovery tasks off our shoulders. Basically, we concentrated on the core search engine features, and here is what we have:

Greylock features

Greylock is a quite simple search engine; we created it to fill the scalability niche. Because of this it has no embedded natural language processing such as lemmatization (or simpler stemming), spelling error correction, synonym search and so on. Instead, we built a microservice architecture where NLP tasks are separated into their own service.
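The split described above can be sketched as follows. This is a minimal illustration, not Greylock's actual interface: `nlp_service` is a hypothetical stand-in for the separate NLP microservice (which would really do lemmatization, spelling correction, synonym expansion over HTTP), while the index core stays NLP-free and matches exactly the tokens it is given. The same normalization must run on both the indexing and the query path.

```python
def nlp_service(text):
    """Stand-in for the external NLP microservice (hypothetical interface).
    Here it only lowercases and crudely strips a plural 's' to show the
    contract; a real service would do proper stemming/lemmatization."""
    tokens = text.lower().split()
    return [t[:-1] if t.endswith("s") else t for t in tokens]

class SimpleIndex:
    """The search engine core: an inverted index with no linguistics inside."""
    def __init__(self):
        self.postings = {}  # token -> set of document ids

    def index(self, doc_id, tokens):
        for t in tokens:
            self.postings.setdefault(t, set()).add(doc_id)

    def search(self, tokens):
        # Intersect posting lists of all query tokens (AND semantics).
        sets = [self.postings.get(t, set()) for t in tokens]
        return set.intersection(*sets) if sets else set()

idx = SimpleIndex()
idx.index(1, nlp_service("Cats sleep"))
# The query goes through the same NLP service before hitting the index,
# so "cat" matches the document that contained "Cats".
assert idx.search(nlp_service("cat")) == {1}
```

Keeping linguistics out of the core means the index scales on Elliptics like any other data, and the NLP service can be replaced or upgraded independently.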