Range headers and native HTML5 streaming support in Backrunner

Plain HTML5 streaming requires the server to properly handle the RFC 2616 Range header; in particular, video seek/rewind uses the Range header to specify the exact position within the file to start streaming from.

The HLS protocol is a bit different, but once again the player uses the Range header to specify a position within a timeframe.

Previously the Elliptics HTTP server Backrunner used size/offset URI parameters for this purpose, which are not AJAX-friendly and obviously are not supported by standard players.

With this new Backrunner update we add support for the Range and If-Modified-Since headers.
The former allows HTML5 players to work with the Elliptics HTTP proxy out of the box; If-Modified-Since is quite useful for client-side caching.
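Below is a minimal sketch of what these headers look like from the client side. The host, bucket and object names are hypothetical placeholders, and the URL layout is not necessarily Backrunner’s exact scheme; the point is only the header semantics.

```go
// Sketch: issuing a Range request (seek) and an If-Modified-Since request
// (cache revalidation) against a hypothetical Backrunner GET endpoint.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	url := "http://backrunner.example.com/get/video-bucket/movie.mp4" // placeholder

	// Seek: ask for bytes starting at offset 1048576, as an HTML5 player would.
	req, _ := http.NewRequest("GET", url, nil)
	req.Header.Set("Range", "bytes=1048576-")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// A server with proper Range support answers 206 Partial Content
	// and sets Content-Range to the returned byte window.
	fmt.Println(resp.StatusCode, resp.Header.Get("Content-Range"))
	io.Copy(io.Discard, resp.Body)

	// Client-side caching: revalidate a cached copy instead of re-downloading it.
	req2, _ := http.NewRequest("GET", url, nil)
	req2.Header.Set("If-Modified-Since", time.Now().Add(-time.Hour).UTC().Format(http.TimeFormat))
	resp2, err := http.DefaultClient.Do(req2)
	if err != nil {
		panic(err)
	}
	defer resp2.Body.Close()
	// 304 Not Modified means the cached copy is still valid.
	fmt.Println(resp2.StatusCode)
}
```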

Here is a simple example of our video-on-demand service.

Our future plans include real-time HLS generation and transcoding for live broadcasts, built on top of Elliptics and related technologies.

Greylock tutorial – a distributed search engine based on Elliptics

We’ve heavily updated the Reverbrain documentation pages at doc.reverbrain.com, and I’m pleased to present our distributed search engine Greylock. The documentation includes general information about the search engine and a tutorial which covers the installation process, configs and two types of clients: a plain HTTP API (similar to what is expected from a search engine like Greylock or ElasticSearch) and a Python client (which works via HTTP too, but also uses Consul to acquire mailbox locks). If you need a C++ tutorial, you can check the Greylock test suite, which includes insertion/selection/removal as well as various iterators over data, self-recovery tests, statistics and other interesting bits.

I get a fair number of questions about how Greylock differs from ElasticSearch, Solr or, say, Amazon Cloud Search. They all have an enormous number of features and work with large amounts of data, so what’s the point?

And the answer is just two words: scalability and automation.
If you have worked with Elastic, you know which operations have to be performed to reshard a cluster when the current sharding scheme becomes a bottleneck (how long it takes, what the performance penalty is and how dangerous the process is). When you work in an environment where new documents keep arriving and space consumption grows over time, this resharding process has to be started again and again as new servers are added. At some point this becomes a serious issue.

With Greylock this is not needed at all. Thanks to the Elliptics bucket system, there is virtually no data or index movement when new servers are added. This design has proven to work really well in Elliptics storage installations, where upload rates reach tens of terabytes daily, and that’s only our clients’ data; there are other, considerably larger installations, for example at Yandex.

We concentrated on the scalability problem and solved it. And yet we do have a set of features. It is not comparable with Elastic, of course, even without counting the NLP tasks which we will release later (language models and spelling correction, for instance, for any language where you can find a reasonably large corpus). Greylock supports a basic relevance model based on the word distance between the client request and the words in the document.

Probably the two worst issues are the absence of numerical indexes and of client locks. Both decisions were made deliberately. Numerical indexes break pagination, which in turn means that if you want 100 documents out of a million, you have to read them all into RAM, re-sort them either into numeric order or into lexical order (that’s how document IDs are stored in the inverted indexes), intersect the whole million keys and return the first 100. For every subsequent request this has to be done again and again. Without numerics, pagination works with iterators pointing into the inverted indexes only; the whole index (and its millions of entries) is never read, only some pages are accessed sequentially.
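To make the iterator argument concrete, here is a minimal sketch (not Greylock code) of producing a page by walking sorted posting lists with per-list cursors instead of materializing and intersecting the full indexes. Only the entries needed for the requested page are ever touched.

```go
// Iterator-based pagination over sorted posting lists: intersect by advancing
// one cursor per list and stop as soon as the page is full.
package main

import "fmt"

// intersect collects up to limit document IDs present in every sorted list.
func intersect(lists [][]int, limit int) []int {
	pos := make([]int, len(lists)) // one cursor per inverted index
	var out []int
	for len(out) < limit && pos[0] < len(lists[0]) {
		candidate := lists[0][pos[0]]
		matched := true
		for i := 1; i < len(lists); i++ {
			// Skip documents smaller than the candidate.
			for pos[i] < len(lists[i]) && lists[i][pos[i]] < candidate {
				pos[i]++
			}
			if pos[i] >= len(lists[i]) {
				return out // one list exhausted, no more matches possible
			}
			if lists[i][pos[i]] != candidate {
				matched = false
				break
			}
		}
		if matched {
			out = append(out, candidate)
		}
		pos[0]++
	}
	return out
}

func main() {
	page := intersect([][]int{
		{1, 4, 7, 9, 12, 15},
		{2, 4, 9, 11, 15, 20},
	}, 2)
	fmt.Println(page) // [4 9] – the first page, without reading whole indexes
}
```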

To help with numerics, Greylock supports document timestamps, i.e. a single 64-bit number per document ID which is used in the inverted index sorting order. Of course this is not a replacement for a proper numeric index, but it does cover almost all of our use cases.

The second major issue is consistency and client locking. Greylock, like Elliptics, is not a strictly consistent storage. With Greylock things are even worse – the amount of data overwritten by a single client insert can be large, and the index structure (which originally started as a distributed B+/*-tree) does not tolerate broken pages. Elastic and others implement a consistency model (like Raft, Paxos or ZAB) internally; Greylock doesn’t. That’s why we require clients to acquire locks in some other system like Consul, Etcd or ZooKeeper to work properly. Our tutorial shows a basic locking scheme implemented using the strictly consistent Consul key-value storage.
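For illustration, here is a minimal sketch of a Consul-based writer lock in the spirit of the tutorial’s scheme (the tutorial uses the Python client; this uses Consul’s plain HTTP API). The Consul address and the lock key name are placeholders, and error handling is shortened.

```go
// Sketch: acquire a Consul session-based lock before touching the index.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

const consul = "http://127.0.0.1:8500" // placeholder address

// createSession registers a Consul session; a lock held by the session is
// released automatically if the session TTL expires.
func createSession() (string, error) {
	body := bytes.NewBufferString(`{"TTL": "30s", "Behavior": "release"}`)
	req, _ := http.NewRequest(http.MethodPut, consul+"/v1/session/create", body)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var out struct{ ID string }
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.ID, nil
}

// acquire tries to take the lock key with the given session; Consul returns
// true only for the single winner, so concurrent index writers are serialized.
func acquire(key, session string) (bool, error) {
	url := fmt.Sprintf("%s/v1/kv/%s?acquire=%s", consul, key, session)
	req, _ := http.NewRequest(http.MethodPut, url, nil)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var ok bool
	if err := json.NewDecoder(resp.Body).Decode(&ok); err != nil {
		return false, err
	}
	return ok, nil
}

func main() {
	session, err := createSession()
	if err != nil {
		panic(err)
	}
	locked, err := acquire("greylock/locks/index-shard-0", session) // placeholder key
	if err != nil {
		panic(err)
	}
	fmt.Println("lock acquired:", locked)
	// ... perform the index insert while holding the lock, then release it
	// with PUT /v1/kv/<key>?release=<session>.
}
```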

We have major plans for the Greylock distributed search engine, so expect new features, and give it a try: http://doc.reverbrain.com/greylock:greylock
If you have any questions, you are welcome to ask:
Google group: https://groups.google.com/forum/?fromgroups=#!forum/reverbrain,
or on this site in the comments.

Server-side operations

The Elliptics distributed storage is built on a client-server architecture, and although servers may discover each other, exchange various statistics and forward requests, most of the time they serve clients’ requests.

Recovery in a distributed storage is a tricky operation which requires careful thinking about which keys have to be copied to which destinations. In Elliptics, recovery is just another client process which iterates over remote nodes, reads the data to be recovered and updates the needed keys.

But there are cases when this round trip to the client is useless, for example when you need to restore a missing replica, or when you have a set of keys you want to copy or move to a new destination.

Thus I introduced two server-side operations which allow content to be sent from one server to multiple replicas. They are intended for various recovery tools, which can optimize by not copying data from the local node to a temporary recovery location; instead, they may tell the remote node to send the data directly to the required destinations. They can also be used to move data from one low-level backend (for example eblob) to a newer version or to a different backend without server interruption.

There is now a new iterator type which sends all iterated keys to a set of remote groups. It does so at the speed of the network or the disk (whichever is slower); in local tests, iterating over 200GB of blobs and sending the data over the network to one remote node via write commands ended up with a sustained speed of ~78MB/s. There were spikes, though, especially when the remote node synced its caches. Both sender and recipient had 30GB of RAM. Rsync shows ~32MB/s on these machines, not because it is that slow, but because of ssh, which maxed out the CPU with packet encryption.
The iterator sends a dnet_iterator_response structure for each write result for every key it has processed, just like the usual iterator, so neither API nor ABI is broken.

The second server-send command accepts a vector of keys. It finds all remote nodes/backends which host the given keys in the one specified group, splits the keys on a per-node/backend basis and tells the remote backends to send the appropriate keys to the specified remote groups. The same iterator response is generated for every key which has been processed.
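A minimal sketch of that per-node/backend splitting step is below. The route lookup is a hypothetical stand-in (the real implementation consults the Elliptics route table), but the grouping idea is the same.

```go
// Group keys by the node/backend that hosts them, so each remote backend
// receives a single "send these keys to the remote groups" request.
package main

import "fmt"

// destination identifies where a key lives in the source group.
type destination struct {
	node    string // hypothetical node address
	backend int    // backend id on that node
}

// routeLookup stands in for the real route-table query.
func routeLookup(key string) destination {
	if len(key)%2 == 0 {
		return destination{node: "host-a:1025", backend: 1}
	}
	return destination{node: "host-b:1025", backend: 3}
}

func main() {
	keys := []string{"profile.1", "photo.22", "video.333", "doc.4444"}

	perDest := make(map[destination][]string)
	for _, k := range keys {
		d := routeLookup(k)
		perDest[d] = append(perDest[d], k)
	}

	for d, ks := range perDest {
		fmt.Printf("tell %s/backend %d to send %v to the remote groups\n", d.node, d.backend, ks)
	}
}
```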

All operations are asynchronous and can run in the background while other client requests are handled in parallel.

There are 3 operation modes (a sketch of the logic follows this list):
1. default – data is written to the remote node using compare-and-swap, i.e. only if it either doesn’t exist there or is the same on the remote servers. Server-side sending (iterator or per-key) running in this mode is especially useful for recovery – there is no way it can overwrite a newer copy with old data.
2. overwrite – when a special flag is set, the data is overwritten (the compare-and-swap logic is disabled).
3. move – if the write has been successful, the local key is removed.
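Here is a minimal sketch of those three modes. The storage types are hypothetical; this only illustrates the compare-and-swap / overwrite / move decision described above, not the actual Elliptics server code.

```go
// Sketch: decide what to do with one key for each server-send mode.
package main

import (
	"bytes"
	"errors"
	"fmt"
)

type mode int

const (
	modeDefault   mode = iota // compare-and-swap write
	modeOverwrite             // unconditional write
	modeMove                  // write remotely, then remove the local copy
)

// remoteReplica is a stand-in for a destination group.
type remoteReplica struct{ data map[string][]byte }

var errCASMismatch = errors.New("remote copy differs, refusing to overwrite")

func sendKey(local map[string][]byte, remote *remoteReplica, key string, m mode) error {
	value := local[key]
	if existing, ok := remote.data[key]; ok && m == modeDefault {
		// Default mode: never replace a copy that already diverged
		// (e.g. a newer version written by a client).
		if !bytes.Equal(existing, value) {
			return errCASMismatch
		}
	}
	remote.data[key] = value
	if m == modeMove {
		delete(local, key) // move: drop the local key only after a successful write
	}
	return nil
}

func main() {
	local := map[string][]byte{"k1": []byte("v1")}
	remote := &remoteReplica{data: map[string][]byte{"k1": []byte("newer")}}

	fmt.Println(sendKey(local, remote, "k1", modeDefault))   // CAS mismatch
	fmt.Println(sendKey(local, remote, "k1", modeOverwrite)) // nil, overwritten
	fmt.Println(sendKey(local, remote, "k1", modeMove))      // nil, local copy removed
	fmt.Println(local)                                       // map[]
}
```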

There is an example tool in the examples directory which iterates over a remote node and its backends and copies/moves the iterated keys. The next step is to update our Backrunner HTTP proxy to use this new logic to automatically recover all buckets in the background.

Stay tuned!

backrunner: HTTPS support

We at Reverbrain.com develop Elliptics, a highly scalable distributed storage for medium and large objects. In the web era the most frequently used API is HTTP, so we developed an HTTP proxy for Elliptics named Backrunner; it supports a wide range of options like ACLs, streaming, redirects, partial upload/download, static file downloading and many others.

But if you build a system behind HTTPS, you likely want to secure your work with the storage as well; in particular, your CDN will likely require you to work via HTTPS in this case.

So we have updated Backrunner to support HTTPS. It can listen for unencrypted and secured connections simultaneously on different ports/addresses; you only have to provide the certificate and private key files.
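For illustration, here is a minimal Go sketch of serving plain HTTP and HTTPS simultaneously on different addresses, the way the updated proxy does. The addresses and certificate paths are placeholders; this is not Backrunner’s actual code.

```go
// Sketch: one handler, two listeners – plain HTTP and TLS on different ports.
package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})

	// Plain HTTP listener.
	go func() {
		log.Fatal(http.ListenAndServe(":9090", mux))
	}()

	// TLS listener on a different port, using the provided cert/key pair.
	log.Fatal(http.ListenAndServeTLS(":9443", "server.crt", "server.key", mux))
}
```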

This rather small change allows you to deploy fully secured storage access for your frontend.

Reverbrain packages repository

The Reverbrain package repository now hosts packages for the following distributions: RHEL6, RHEL7 (CentOS supported), Ubuntu Precise, Ubuntu Trusty, Debian Wheezy and Debian Jessie.

The repository includes all packages needed to install the Elliptics distributed storage and the Eblob low-level storage.

Here is a small tutorial on how to automatically enable the repository in your setup: http://doc.reverbrain.com/elliptics:server-tutorial

Backrunner HTTP elliptics proxy can be found in Docker repo: https://registry.hub.docker.com/u/reverbrain/backrunner/

LSM and fractal trees and how to really work with large files

An LSM tree (log-structured merge tree) is a rather simple structure which can hardly be called a tree.

It is an append-only log which is sorted when written to disk. An LSM tree is intended for write-heavy workloads, since a read may require up to O(number of on-disk log files) disk-seek operations.

There is a fair number of read optimizations for LSM trees, in particular the bloom filter, which can reduce the number of disk seeks to a minimum, albeit with some probability of a false positive (it can be made arbitrarily small, though).

An LSM tree behaves much better under write workloads than a B-tree and friends (B+, B* and so on), since there is only one write of the sorted tree and it is always sequential. A B-tree potentially has to update multiple nodes (on the order of the logarithm of the total number of keys) when performing a single write. Those nodes are likely located at random positions on disk, which results in random writes. These are slow.

Quite the contrary, B-tree reading is usually faster than that of LSM trees – the logarithm of the number of keys is less than the number of sorted logs in an LSM tree. But this does not take bloom filters into account, which in turn does not take node caching in B-trees into account.
Having to perform multiple operations for a single request – like multiple page reads to fetch a single key in the B-tree case – is called amplification. The fractal tree is aimed at write amplification – it is still a B+tree, but it stores data in intermediate nodes (not in leaves) for some time, until a page split is required. This reduces the number of writes needed to insert or update a key.
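To make the read-path argument concrete, here is a minimal sketch of an LSM read: check the in-memory table first, then consult each on-disk sorted run, asking a per-run bloom filter before paying for a disk seek. The structures are simplified illustrations, not Eblob code.

```go
// Sketch: LSM read path with per-run bloom filters.
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a tiny bloom filter: no false negatives, tunable false positives.
type bloom struct{ bits []bool }

func (b *bloom) positions(key string) (int, int) {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	return int(sum % uint64(len(b.bits))), int((sum >> 32) % uint64(len(b.bits)))
}
func (b *bloom) add(key string) { i, j := b.positions(key); b.bits[i], b.bits[j] = true, true }
func (b *bloom) mayContain(key string) bool {
	i, j := b.positions(key)
	return b.bits[i] && b.bits[j]
}

// sortedRun stands in for one sorted on-disk log file.
type sortedRun struct {
	filter *bloom
	data   map[string]string
}

type lsm struct {
	memtable map[string]string
	runs     []*sortedRun // newest first
}

func (t *lsm) get(key string) (string, bool) {
	// 1. The in-memory table holds the most recent writes.
	if v, ok := t.memtable[key]; ok {
		return v, true
	}
	// 2. On-disk runs, newest to oldest. Without bloom filters every run
	//    could cost a disk seek; with them most non-matching runs are skipped.
	for _, r := range t.runs {
		if !r.filter.mayContain(key) {
			continue // skip the seek entirely
		}
		if v, ok := r.data[key]; ok { // a binary search over the sorted run in reality
			return v, true
		}
	}
	return "", false
}

func main() {
	run := &sortedRun{filter: &bloom{bits: make([]bool, 1024)}, data: map[string]string{"cold": "2"}}
	run.filter.add("cold")
	t := &lsm{memtable: map[string]string{"hot": "1"}, runs: []*sortedRun{run}}
	fmt.Println(t.get("hot"))     // from the memtable
	fmt.Println(t.get("cold"))    // from the on-disk run, after the filter says "maybe"
	fmt.Println(t.get("missing")) // usually rejected by the filter without a seek
}
```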

Anyway, B-trees are considered faster than LSM trees for reading and slower for writing. The latter is a fact – LSM trees are designed for that; the former is questionable.

The Elliptics distributed storage can use many backends, and the most popular one is Eblob – a low-level storage built with LSM tree design in mind.

LSM trees do not support in-place data rewrite – a key update is appended to a new log, and the older version is either marked as removed or a special lookup sequence is used to find the newer keys first. Eventually the LSM tree merges the logs and removes old versions of duplicate keys.

In Eblob this process is called defragmentation, and it is a bit different from the LSM tree process. An LSM tree writes already sorted data to disk, having sorted it in RAM. But if your storage is intended to hold large files, like Elliptics – we store objects which are sometimes considerably larger than the amount of RAM in the system – the plain LSM tree approach (sort in memory and sync to disk) doesn’t work.

Instead, Eblob stores an unsorted log on disk (optionally overwriting data in place) and maintains an in-memory index of the keys. This simple scheme would be quite naive, since the number of keys multiplied by the key size – the amount of RAM needed to hold the key index in memory – can be much larger than the amount of RAM on any given server. So we have an additional on-disk index of stored keys; it can be sorted, so binary search allows the needed key to be found rather quickly.
But not quickly enough – that would be log2(number of keys) random seeks – so we split the sorted keys into smaller ranges called index blocks and store a start/stop pair for each index block in RAM. This lets us find the right index block quickly without any on-disk operations, then perform a single read to fetch the whole index block (tens to hundreds of keys) and do an in-memory binary search.
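Here is a minimal sketch of that index-block lookup: the start/stop key of every on-disk block is kept in RAM, the block is located with an in-memory binary search, then a single read fetches the block and the key is found by another binary search inside it. The structures are illustrative, not Eblob’s actual layout.

```go
// Sketch: two-level lookup – in-RAM block boundaries, then one block read.
package main

import (
	"fmt"
	"sort"
)

type indexBlock struct {
	start, stop string   // kept in RAM for every block
	keys        []string // sorted keys; in Eblob this part lives on disk
}

// lookup finds which block may contain key using only in-memory data,
// then "reads" that single block and binary-searches inside it.
func lookup(blocks []indexBlock, key string) (int, bool) {
	// First block whose stop key is >= key.
	b := sort.Search(len(blocks), func(i int) bool { return blocks[i].stop >= key })
	if b == len(blocks) || blocks[b].start > key {
		return -1, false // no block covers this key, no disk access needed
	}
	blk := blocks[b] // the only on-disk read in the real structure
	i := sort.SearchStrings(blk.keys, key)
	if i < len(blk.keys) && blk.keys[i] == key {
		return b, true
	}
	return -1, false
}

func main() {
	blocks := []indexBlock{
		{start: "aaa", stop: "ccc", keys: []string{"aaa", "abc", "bbb", "ccc"}},
		{start: "dda", stop: "fff", keys: []string{"dda", "eee", "fff"}},
	}
	fmt.Println(lookup(blocks, "bbb")) // found in block 0 after one block read
	fmt.Println(lookup(blocks, "zzz")) // not present, rejected purely in memory
}
```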

And even this is not enough. Iterators and, for example, recovery work with sorted keys – recovery merges index lists from different nodes and produces a sorted list of keys which have to be recovered – and since our data is not sorted yet, reads of the keys to be recovered would actually be random reads. Instead, we can turn that purely random read pattern into sequential reads plus the occasional head positioning. So we sort the data, which is done when the defragmentation process is started for the first time.

This allows Elliptics+Eblob to be the ultimate solution for distributed storage of medium-to-large files. For smaller files a pure LSM tree is usually enough.

Backrunner – next generation HTTP proxy for Elliptics distributed storage

Elliptics is a powerful distributed storage for medium and large data, but it is rather low-level. It doesn’t know about ACLs or a REST API, for example; I would compare it to the block level in the Linux filesystem hierarchy. In particular, Elliptics only provides C/C++, Python and Golang API bindings.

For the vast majority of users an HTTP REST API is a must, so we created Backrunner – a new Swiss-army-knife HTTP proxy for the Elliptics distributed storage. It supports ACLs, automatic bucket selection based on disk and network speed, errors and amount of free space, automatic defragmentation, header extension and local static file handling, and provides a simple REST API for clients.

We call Backrunner the entry point to the Elliptics distributed storage. It not only provides the externally visible interfaces, but also takes on many administrative tasks like running defragmentation, showing properly crafted monitoring stats and so on.

Backrunner’s load balancing operates in real time; for example, it gathers upload metrics (speed, latency, errors) on every request to properly tune the algorithm that places data around the cluster. It also takes into account the amount of free space, disk activity, internal errors, timings from other clients, network speed and many other metrics.
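As an illustration of metric-driven selection, here is a minimal sketch where each bucket gets a weight from its free space and recent latency/error numbers, and uploads are routed proportionally to those weights. The formula is a made-up example, not Backrunner’s actual algorithm.

```go
// Sketch: weighted bucket selection driven by per-bucket metrics.
package main

import (
	"fmt"
	"math/rand"
)

type bucketStats struct {
	name      string
	freeSpace float64 // fraction of free space, 0..1
	latencyMs float64 // recent average upload latency
	errorRate float64 // recent fraction of failed requests, 0..1
}

// weight favours buckets with free space, low latency and few errors.
func weight(b bucketStats) float64 {
	w := b.freeSpace * (1.0 - b.errorRate) / (1.0 + b.latencyMs/100.0)
	if w < 0 {
		return 0
	}
	return w
}

// pick selects a bucket with probability proportional to its weight.
func pick(buckets []bucketStats, r *rand.Rand) string {
	total := 0.0
	for _, b := range buckets {
		total += weight(b)
	}
	x := r.Float64() * total
	for _, b := range buckets {
		x -= weight(b)
		if x <= 0 {
			return b.name
		}
	}
	return buckets[len(buckets)-1].name
}

func main() {
	buckets := []bucketStats{
		{"b1", 0.80, 20, 0.00},
		{"b2", 0.30, 35, 0.01},
		{"b3", 0.05, 250, 0.10}, // nearly full and slow: rarely chosen
	}
	r := rand.New(rand.NewSource(42))
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pick(buckets, r)]++
	}
	fmt.Println(counts)
}
```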

We will extend it to run basic recovery operations; right now Backrunner detects that replicas are out of sync, but does not run recovery, because that would likely heavily affect timings, which is generally a bad idea. That’s why Elliptics is an eventually consistent system – we pay this price for the highest possible scalability levels.

Backrunner is also distributed in docker images: https://registry.hub.docker.com/u/reverbrain/backrunner/
Documentation: http://doc.reverbrain.com/backrunner:backrunner

Tutorial is coming, stay tuned!

Elliptics 2.26 changelog

Here is a human-readable changelog of 2.26 major version of elliptics distributed storage: http://doc.reverbrain.com/elliptics:major-version-updates#v226

The main feature is multiple backends in a single server. One can turn them on and off and change their state, and each backend has its own IO execution pool. Basically, it allows the old scheme of many Elliptics servers, one per disk/directory/mountpoint, to be replaced by just one server with multiple backends.
This greatly simplifies node setup and heavily decreases route table updates.

We also added the new C++ structured logger Blackhole. One can send logs to ElasticSearch or syslog, or use old-school files.

We also cleaned up the code and client logic, introduced new kinds of errors, simplified the protocol and fixed a bunch of bugs.

Enjoy and stay tuned!