Server-side operations

Elliptics distributed storage is built on a client-server architecture, and although servers may discover each other, exchange various statistics and forward requests, most of the time they serve client requests.

Recovery in a distributed storage is a tricky operation which requires careful thinking about which keys have to be copied to which destinations. In Elliptics, recovery is just another client process which iterates remote nodes, reads the data to be recovered and updates the needed keys.

But there are cases when this round trip through the client is useless: for example, when a replica is missing, or when you have a set of keys you want to copy or move to a new destination.

Thus I introduced two server-side operations which allow sending content from one server to multiple replicas. They are intended for various recovery tools, which can optimize by not copying data from the local node to a temporary recovery location; instead they may tell the remote node to send data directly to the required locations. They can also be used to move data from one low-level backend (for example eblob) to a newer version or a different backend without server interruption.

There is now a new iterator type which sends all keys being iterated to a set of remote groups. It does so at the speed of the network or the disk (whichever is slower); in local tests, iterating over 200GB of blobs and sending the data over the network to one remote node via write commands sustained ~78MB/s. There were spikes though, especially when the remote node synced its caches. Both sender and recipient had 30GB of RAM. Rsync shows ~32MB/s on these machines – not because it is that slow, but because ssh maxed out the CPU with packet encryption.
The iterator sends a dnet_iterator_response structure with the write result for every key it has processed, just like the usual iterator, so neither API nor ABI is broken.

The second server-send command accepts a vector of keys. It searches for all remote nodes/backends which host the given keys in a single specified group, splits the keys on a per-node/backend basis and tells the remote backends to send the appropriate keys to the specified remote groups. The same iterator response is generated for every key which has been processed.

All operations are asynchronous and run in the background, with other client requests being handled in parallel.

There are 3 operation modes (a conceptual sketch follows the list):
1. default – write data to the remote node using compare-and-swap, i.e. only write the data if it either doesn't exist or is the same on the remote servers. Server-side sending (iterator or per-key) running in this mode is especially useful for recovery – there is no way it can overwrite a newer copy with old data.
2. overwrite – when a special flag is set, it overwrites data (clears the compare-and-swap logic).
3. move – if the write has been successful, remove the local key.
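
To make these modes concrete, here is a minimal sketch of their semantics on the receiving side. This is illustrative Python, not the actual Elliptics API; the store objects are plain dictionaries standing in for the local and remote backends.

def apply_server_send(local_store, remote_store, key, mode="default"):
    # Illustrative semantics of the three server-send modes (not real API calls).
    data = local_store[key]

    if mode == "default":
        # Compare-and-swap: write only if the key is missing remotely or the
        # remote copy is identical, so old data can never clobber a newer copy.
        remote = remote_store.get(key)
        if remote is not None and remote != data:
            return "skipped: remote copy differs"
        remote_store[key] = data
    elif mode == "overwrite":
        # The special flag clears the compare-and-swap logic: always write.
        remote_store[key] = data
    elif mode == "move":
        # Write first, and drop the local copy only once the write succeeded.
        remote_store[key] = data
        del local_store[key]
    return "written"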

There is an example tool in the examples directory which iterates over a remote node and its backends and performs a copy/move of the keys being iterated. The next step is to update our Backrunner HTTP proxy to use this new logic to automatically recover all buckets in the background.

Stay tuned!

Reverbrain packages repository

The Reverbrain package repository now hosts packages for the following distributions: RHEL6, RHEL7 (CentOS supported), Ubuntu Precise, Ubuntu Trusty, Debian Wheezy and Debian Jessie.

The repository includes all packages needed to install the Elliptics distributed storage and the Eblob low-level storage.

Here is a small tutorial on how to automatically turn on the repository in your setup: http://doc.reverbrain.com/elliptics:server-tutorial

The Backrunner HTTP Elliptics proxy can be found in the Docker repo: https://registry.hub.docker.com/u/reverbrain/backrunner/

LSM and fractal trees and how to really work with large files

An LSM tree (log-structured merge tree) is a rather simple structure which can hardly be called a tree.

It is an append-only log which is sorted when written to disk. LSM trees are intended for write-heavy workloads, since reading requires at least O(number of on-disk log files) disk-seek operations.

There is a fair number of read optimizations for LSM trees, in particular bloom filters, which can reduce the number of disk seeks to a minimum, albeit with some false-positive probability (it can be made arbitrarily small though).
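
As an illustration of the idea (this is a generic sketch, not eblob's implementation): before paying for a disk seek into an on-disk log, the reader consults the filter; a negative answer is definitive, a positive one is only probable.

import hashlib

class BloomFilter:
    def __init__(self, nbits=1 << 20, nhashes=4):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions from one strong hash of the key.
        digest = hashlib.sha256(key).digest()
        for i in range(self.nhashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "little") % self.nbits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False: the key is definitely not in this log, skip the disk seek.
        # True: the key may be there (with a small false-positive probability).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))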

An LSM tree behaves much better for write workloads than a B-tree and friends (B+, B* and so on), since there is only one write of the sorted tree and it is always sequential. A B-tree potentially has to update multiple nodes (some logarithm of the total number of keys) when performing a single write. Those nodes are likely located at random positions on disk, which ends up as random writes. These are slow.
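
A back-of-the-envelope comparison of the two write paths; the fanout, key count and one-write-per-insert amortization below are illustrative assumptions, not measurements.

import math

# A B-tree with fanout 100 over a billion keys touches roughly log_100(1e9)
# pages per insert, each at a random disk location. An LSM tree appends
# sequentially and amortizes the sorting over a whole in-memory batch.
keys = 1_000_000_000
fanout = 100

btree_page_writes_per_insert = math.ceil(math.log(keys, fanout))  # ~5 random writes
lsm_writes_per_insert = 1                                         # sequential append

print(btree_page_writes_per_insert, lsm_writes_per_insert)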

Quite the contrary, B-tree reads are usually faster than LSM tree reads – the logarithm of the number of keys is smaller than the number of sorted logs in an LSM tree. But that does not take bloom filters into account. Which in turn does not take node caching in B-trees into account.
Needing multiple low-level operations to serve a single request – like multiple page reads to fetch a single key in the B-tree case – is called amplification. The fractal tree is aimed at write amplification: it is still a B+tree, but it stores data in intermediate nodes (not only in leaves) for some time, until a page split is required. This reduces the number of writes needed to insert or update a key.

Anyway, B-trees are considered to be faster than LSM trees for reading and slower for writing. The latter is a fact – LSM trees are designed for that; the former is questionable.

The Elliptics distributed storage can use many backends, and the most popular one is Eblob – a low-level storage built with the LSM tree design in mind.

LSM trees do not rewrite data in place – a key update is appended to the newest log, and the older version is either marked as removed or a special lookup order is used which finds newer keys first. Eventually the LSM tree merges its logs and removes old versions of duplicate keys.
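
A toy sketch of that lookup order – newest data wins, older duplicates are ignored – with dictionaries standing in for the in-memory table and the on-disk sorted logs (illustrative only, not eblob code):

TOMBSTONE = object()  # marker appended when a key is removed

def lsm_get(key, memtable, sorted_runs):
    # memtable: in-memory dict holding the freshest writes.
    # sorted_runs: on-disk sorted logs as dicts, ordered from newest to oldest.
    for source in [memtable] + list(sorted_runs):
        if key in source:
            value = source[key]
            return None if value is TOMBSTONE else value
    return None  # the key was never written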

In Eblob this process is called defragmentation, and it is a bit different from the LSM tree process. An LSM tree sorts data in RAM and writes it to disk already sorted. But if your storage is intended for large files, as Elliptics is – we store objects which are sometimes considerably larger than the amount of RAM in the system – the plain LSM tree approach (sort in memory and sync to disk) doesn't work.

Instead, Eblob stores an unsorted log on disk (optionally overwriting data in place) and maintains an in-memory index of the keys. This simple scheme would be rather naive, since the number of keys multiplied by the key size – the amount of RAM needed to hold the key index in memory – can be much larger than the amount of RAM on any given server. So there is an additional on-disk index of stored keys; it can be sorted, and binary search then finds a needed key rather quickly.
But not quickly enough – that would still be log2(number of keys) random seeks – so we split the sorted keys into smaller ranges called index blocks and store a start/stop key pair for each index block in RAM. This lets us find the right index block without any on-disk operations, then perform a single read to fetch the whole index block (tens to hundreds of keys) and do an in-memory binary search.
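
Here is an illustrative sketch of that two-level lookup – an in-memory binary search over per-block start/stop keys, then a single read of the chosen index block – with plain Python lists standing in for the on-disk structures:

import bisect

class SortedIndex:
    def __init__(self, blocks):
        # blocks: index blocks, each a sorted list of (key, offset, size) records;
        # only the first and last key of every block are kept in RAM.
        self.blocks = blocks
        self.starts = [blk[0][0] for blk in blocks]
        self.stops = [blk[-1][0] for blk in blocks]

    def lookup(self, key):
        # 1. In-memory binary search over the start/stop pairs: no disk I/O.
        i = bisect.bisect_right(self.starts, key) - 1
        if i < 0 or key > self.stops[i]:
            return None  # the key cannot be in any block
        # 2. A single disk read would fetch the whole block (tens to hundreds
        #    of keys); here the "read" is just indexing into a list.
        block = self.blocks[i]
        j = bisect.bisect_left(block, (key,))
        if j < len(block) and block[j][0] == key:
            return block[j]  # (key, offset, size) of the record
        return None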

And even this is not enough. Iterators and, for example, recovery work with sorted keys – recovery merges index lists from different nodes and produces a sorted list of keys which have to be recovered. Since our data is not sorted yet, reads of the to-be-recovered keys would actually be random reads. Instead we can turn that purely random read pattern into sequential reads plus the occasional head repositioning. So we sort the data, which happens the first time the defragmentation process is started.

This allows Elliptics+Eblob to be the ultimate solution for medium-to-large-file distributed storage. For smaller files a pure LSM tree is usually enough.

React monitoring tool released

During the last months we've been actively developing our monitoring tools, and now we are ready to present you React!

React (Real-time Call Tree) is a library for measuring the time consumption of different parts of your program. You can think of it as a real-time callgrind with very small overhead and user-defined call branches. A simple and minimalistic API allows you to collect per-thread traces of your system's workflow.

A trace is represented as JSON that contains a call tree description (actions) and arbitrary user annotations:

{
    "id": "271c32e9c21d156eb9f1bea57f6ae4f1b1de3b7fd9cee2d9cca7b4c242d26c31",
    "complete": true,
    "actions": [
        {
            "name": "READ",
            "start_time": 0,
            "stop_time": 1241,
            "actions": [
                {
                    "name": "FIND",
                    "start_time": 3,
                    "stop_time": 68
                },
                {
                    "name": "LOAD FROM DISK",
                    "start_time": 69,
                    "stop_time": 1241,
                    "actions": [
                        {
                            "name": "READ FROM DISK",
                            "start_time": 70,
                            "stop_time": 1130
                        },
                        {
                            "name": "PUT INTO CACHE",
                            "start_time": 1132,
                            "stop_time": 1240
                        }
                    ]
                }
            ]
        }
    ]
}

This kind of trace can be very informative; on top of it you can build complex analysis tools that, on the one hand, trace a specific user request and, on the other hand, measure the overall performance of a specific action across all requests.
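
As a trivial example of such a tool, the script below walks a trace in the format shown above and sums up how much time each action name consumed; the trace.json filename and the time units are assumptions.

import json
from collections import defaultdict

def accumulate(actions, totals):
    # Recursively add (stop_time - start_time) for every action name.
    for action in actions:
        totals[action["name"]] += action["stop_time"] - action["start_time"]
        accumulate(action.get("actions", []), totals)
    return totals

with open("trace.json") as f:
    trace = json.load(f)

totals = accumulate(trace["actions"], defaultdict(int))
for name, spent in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(name, spent)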

Also, for a human-readable representation of React traces we've built a simple web-based visualization instrument.

[Image: React trace]

React has already proven itself in practice. Under high load, the performance of our caching system degraded dramatically after cache overflow. Call tree statistics showed that during cache operations most of the time was consumed in the function that resizes cache pages. After careful examination we optimized that function, and performance immediately and significantly increased.

Now React is used in Elliptics and Eblob, and we intend to use it in other Reverbrain products.

For more details check out the documentation and the official repository.

Elliptics monitoring spoiler alert

Here is the Elliptics per-command monitoring:

[Image: cache/eblob command timing]

Monitoring includes very detailed statistics about the most interesting bits of the storage. The picture above shows the write command execution flow (the whole command and how time was spent within it) – one can turn this on for one's own commands by setting a special trace bit.

Other monitoring parts include queues (network and processing), cache, eblob (command timings, data stats, counters), VFS and other stats.
More details on how to use that data in JSON format are coming.

Eblob documentation and code samples

Eblob is the main low-level storage supported in Elliptics. It was created as an append-only base, but eventually we added rewrite support, data sorting, in-memory and on-disk indexes, added and then removed columns, bloom filters, iterators, blob merge and defragmentation, and so on.

Eblob has rather simple interfaces accessible from C, C++ and Python.

So, we documented our most actively used low-level storage from every point of view: design and architecture, interfaces, known features, gotchas and best practices.
It is generally a good idea to read how things are done at the lowest level in order to understand how the whole system behaves, so here is the eblob documentation link: http://doc.reverbrain.com/eblob:eblob

Enjoy!

The year of eblob

Hi, this is rbtz speaking again. I've been the engineer responsible for the eblob codebase for almost a year now. Here is a small recap of what has been happening with eblob since v0.17.2, with some commit highlights.


v0.17.2
* Eblob now builds under Mac OS X. This improved the experience of developers with Macs.
* Changed most links to point to the newly created http://reverbrain.com.
* Added comments to all eblob subsystems: e254fc3. This improves the learning curve for new developers.
v0.17.3
* Added l2hash support: c8fa62c. This reduces memory consumption of elliptics metadata by 25% on LP64.
* Added the first edition of the eblob stress test. A year later it is responsible for catching 99% of the bugs that would otherwise go to testing: 8eab8ed.
v0.17.4

v0.17.6
* Added config variables for index blocks and the bloom filter: a106d9d. This allows sysadmins to limit the memory occupied by the bloom filter.
* Added a config variable to limit total blob size: f7da001. This allows sysadmins to limit eblob's size in case many databases are located on one shared drive.
v0.17.7
* Reduced memory consumption of "unsorted" blobs by 20% on LP64: 19e8612
v0.17.8
* First static analyzer crusade (feat. clang static analyzer) – a number of "almost impossible to spot" bugs were found.
* Added data-sort and binlog v1. This allows "on the fly" eblob defragmentation and memory cleanups.
* Added travis-ci tests after each commit: f08fea2.
* Removed the custom in-memory cache in favor of the OS page cache: a7e74a7. This removed a number of nasty races in the eblob code and also opened the way for some future optimizations.
* Added a Doxyfile stub, so that in the future libeblob man pages may be autogenerated: aac9cb3.
* Decreased memory consumption of in-memory data structures by 10% on LP64: c6afffa.
v0.18.0

v0.18.2
* Replaced core mutexes with rwlocks. This improves our Intel vTune concurrency benchmarks, along with our QA tests.
v0.18.3

v0.18.5-1
* Second static analyzer crusade (feat. Coverity).
* Switched to adaptive mutexes (https://en.wikipedia.org/wiki/Spinlock#Alternatives) when available: 43b35d8.
* Sped up statistics update v1: 40a60d7. Do not hold the global lock while computing and writing stats to disk.
* Rewritten bloom filter v1: 6f08e07. This improves speed and reduces memory fragmentation.
* Allocate index blocks in one big chunk instead of millions of small ones, thus speeding up initialization and reducing memory fragmentation: b87e273.
v0.18.6
v0.19.0
* Do not hold the global lock for the whole duration of sync(): 6f6be68. This removes "stalls" in configs where sync > 0.
v0.19.1
* Switched to POSIX.1-2008 + XSI extensions: 6ece045.
* Build with -Wextra and -Wall by default: 0e8c713. This should substantially improve code quality in the long term.
* Added options to build with hardening and sanitizers: c8b8a34, 2d8a42c. This improves our internal automated tests.
v0.19.2

v0.19.7
* Do not set bloom filter bits on start for removed entries: 36e7750. This improves lookup times for "long removed" but still not defragmented entries.
v0.19.8
v0.19.9
* Added a separate thread for small periodic tasks: ea17fc0. In the future this can be upgraded to a simple background task manager.
* Moved statvfs(3) to the periodic thread, which speeds up write-only micro-benchmarks by 50%: f36ab9d.
* Lock the database on init to prevent data corruption by simultaneous accesses to the same database from different processes: 5e5039d. See more about EB0000 in the kb article.
v0.19.10
v0.19.11
* Removed columns aka types: 6b1f173. This greatly simplifies the code and, as a side effect, improves elliptics memory usage and startup times.
* Removed compression: 35ac55f. This removes the dependency on Snappy.
* Removed the bsize knob for write alignment: 8d87b32.
* Rewritten stats v2: 94c85ec. Stats updates are now very lightweight and atomic.
* Added a writev(2)-like interface to eblob, so that the elliptics backend can implement very efficient metadata handling: b9e0391.
v0.20.0

v0.21.3
* Replaced the complex binlog with a very tiny binlog v2: 1dde6f3. This greatly simplifies the code and improves data-sort speed and memory efficiency.
* Made tests multithreaded: 1bd2f43. Now we can spot even more errors via automated tests before they hit staging.
* Moved to the GNU99 standard: f65955a. It's already 15 years old =)
* Fixed a very old bug with log message truncation/corruption on multithreaded workloads: 10b6d47.
v0.21.4
* Bloom filter rewrite v2: 1bfadaf. Now we use many hash functions instead of one, thus trading CPU time for improved IO efficiency. This improved bloom filter efficiency by an order of magnitude.
v0.21.5
* Merge small blobs into one on defrag: ace7ca7. This improves eblob performance on databases with high record rotation by maintaining an almost fixed number of blobs.
v0.21.6
v0.21.7
* Added a record validity check on start: bcdb0be. See more about database corruption EB0001 in the kb article.
v0.21.8

v0.21.11
* More robust eblob_merge tool that can be used to recover corrupted blobs.
v0.21.12
v0.21.13
* Reduced memory consumption of in-memory data-structures by 10% on LP64: e851820;
v0.21.14

v0.21.16
* Added schedule-based data-sort: 2f457b8. More on this topic in a previous post: data-sort implications on eblob performance.
v0.21.17

Here I've mentioned only the most notable commits, mostly performance- and memory-usage-oriented changes. There is of course lots of other stuff going on, like bugfixes, minor usability improvements and some internal changes.

Here are some basic stats for this year:

Total commits: 1375
Stats total: 65 files changed, 8057 insertions(+), 4670 deletions(-)
Stats excl. docs, tests and ci: 39 files changed, 5368 insertions(+), 3782 deletions(-)

Also, if you are interested in what's going to happen in the near future in the eblob world, you should probably take a look at its roadmap.

By the way, for those of you who are interested in numbers and pretty graphs – after a recent upgrade of our internal elliptics cluster, which stores billions of records, to the new LTS releases of elliptics 2.24.X and eblob 0.22.X we've got:

Response time reduction (log scale):

[Graph: the 98th percentile that was around 100ms dropped below 50ms]

Disk IO (linear scale):

[Graph: IO dropped by more than one order of magnitude]

Memory (linear scale):

[Graph: there is much more "cached" memory now; the periodic data-sort routine successfully frees unused cached keys]

Post by Alexey “savetherbtz” Ivanov

Eblob as append-mostly database

Post by Alexey Ivanov

In this post I want to dig a little bit deeper into one of eblob's least understood parts: record overwrite. Let's start with how overwrite was meant to work in eblob: it simply shouldn't. Indeed, eblob started as an append-only database, so no overwrites were allowed. Then overwrite was added to reduce fragmentation for use cases where some records were continuously overwritten, so that appending them over and over again was inefficient. So a special "overwrite-in-place" case for same-or-smaller-size record overwrites was added. It was controllable via two knobs – a global one for the database called EBLOB_TRY_OVERWRITE and a per-write flag BLOB_DISK_CTL_OVERWRITE. Both of them are now deprecated; let me explain why.

From this point on eblob became what I call an "append-mostly" database, which means it usually only appends data to the end of a file, but in some rare cases it can overwrite data "in place".

During the binlog rewrite it was decided that having flags that drastically change eblob's write behavior is too complicated, so they were replaced with one "canonical" write path: we try to overwrite a record in place only if the overwrite is of the same or smaller size and the blob is not currently under data-sort. So we basically turned the EBLOB_TRY_OVERWRITE flag on by default, which should greatly reduce fragmentation on some workloads. There are possible performance implications of that decision, so we did a lot of testing via our internal QA before applying this change.
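
The resulting write path can be summarized roughly as follows; this is an illustrative sketch of the rule described above, not the actual eblob source.

def choose_write_path(existing_record, new_size, blob_under_data_sort):
    # Rough summary of the "canonical" write path after the binlog rewrite.
    if (existing_record is not None
            and new_size <= existing_record.size
            and not blob_under_data_sort):
        # Same or smaller size and the blob is not being sorted:
        # overwrite the record in place to avoid fragmentation.
        return "overwrite-in-place"
    # Everything else is appended to the end of the currently open blob.
    return "append"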

As a side effect, it allowed us to remove 90% of the binlog code, since there is no longer a need to "log" data overwrites and keep track of their offsets in the original blob in order to apply them later to the sorted one.

By the way, we considered three possible alternatives:
– Always append data to the end of the "opened" blob on overwrite (a true append-only database).
– Overwrite in place only in the last blob (almost append-only).
– Always overwrite in place when the blob is not under data-sort (an append-mostly database).

Here are some graphs from our benchmark system yandex-tank (yay, it's open source!).

Setup:
10000 rps, 1kb records, eblob 0.21.16

Always append:

[Graph: (no-overwrite) read:write 10k, eblob 0.21.16, flags 14 – always append]

[Graph: (no-overwrite) read:write 10k, eblob 0.21.16, flags 142 – always append + EBLOB_AUTO_DATASORT]

Overwrite in last blob only:

[Graph: (partial-overwrite) read:write 10k, eblob 0.21.16, flags 14 – overwrite in last blob only]

[Graph: (partial-overwrite) read:write 10k, eblob 0.21.16, flags 142 – overwrite in last blob only + EBLOB_AUTO_DATASORT]

Always overwrite:

[Graph: (overwrite) read:write 10k, eblob 0.21.16, flags 14 – always overwrite]

[Graph: (overwrite) read:write 10k, eblob 0.21.16, flags 142 – always overwrite + EBLOB_AUTO_DATASORT]

We picked "always overwrite in place" because it behaves really well with background data-sort – as you can see, there is almost no noticeable difference between the graph with data-sort in AUTO mode and the one with data-sort disabled. One common choice also simplifies the write path and makes performance independent of "magic knobs".

NB! Future eblob versions may change this behavior or reintroduce some of the knobs in case regressions of the "always overwrite" approach are found.

Data-sort implications on eblob performance

Lots of stuff has been written about data-sort and defragmentation in recent eblob versions (>= 0.18.0), both in the documentation and in blog posts. Today I want to talk about eblob memory management, its data structures and why regular data-sort is essential for eblob/elliptics performance.

First, when a key is written to the data file, its basic information – the key itself, the record size and its location on disk – is stored in the in-memory index (internally an rb-tree) and also written to the so-called "unsorted" index. So both the data file and the "unsorted" index have records sorted by their write time, but we can still very efficiently find a given key in the in-memory index or iterate over the "unsorted" index, because the order of its records matches the one used in the data file.
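
A simplified sketch of that write path, with plain files and a dict standing in for eblob's real record headers and rb-tree (illustrative only):

import struct

class ToyBlob:
    # Simplified model of eblob's write path; new, empty files are assumed.
    def __init__(self, data_path, index_path):
        self.data = open(data_path, "wb")
        self.unsorted_index = open(index_path, "wb")
        self.ram_index = {}  # eblob keeps an rb-tree; a dict is enough here
        self.offset = 0

    def write(self, key: bytes, value: bytes):
        # 1. Append the record to the data file.
        self.data.write(value)
        # 2. Remember key -> (offset, size) in the in-memory index.
        self.ram_index[key] = (self.offset, len(value))
        # 3. Append the same entry to the "unsorted" on-disk index: both files
        #    end up ordered by write time, so their record order matches.
        self.unsorted_index.write(key + struct.pack("<QQ", self.offset, len(value)))
        self.offset += len(value)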

But having all keys in memory is not always possible (especially when you have billions of them). So on each startup eblob sorts all but the last of the "unsorted" indexes by key, so it can use more efficient data structures instead of keeping all keys in the in-memory rb-tree. Memory-efficient as it is, this breaks the record ordering between the data file (records sorted by write time) and the new "sorted" index (records sorted by key). That makes iteration over such a blob very inefficient (consuming way too many IOPS).

To mitigate those problems the data-sort routine was introduced. It combines three purposes:
* Purges records from the in-memory index, so that recently unused keys won't occupy precious RAM and cause OOM on write-heavy workloads.
* Defragments data by physically deleting keys that were marked as "removed". It's analogous to some NoSQL databases' compact procedure or SQL's VACUUM command.
* Restores the record ordering between the data file and the index, so that iteration speed is maximized.

As fast, efficient and useful as it is, data-sort is a rather heavyweight routine, because it can theoretically move terabytes across the drive, so it's rather unwise to run it during peak hours. Given that, a number of knobs were introduced so that an administrator can manage when data-sort starts.

Starting with eblob v0.21.17 and elliptics v2.24.14.11, an admin may select between four different methods of managing data-sort startup times:

  • AUTO – run data-sort on startup for each "unsorted" blob (but the last one). Also run it on every blob's "close". This is the preferred method for small databases.
  • TIMED – the old way of running defrag every defrag_timeout seconds. It's useful for autogenerated config files where each server gets its own timeout.
  • SCHEDULED – the most sophisticated built-in method. It automagically spreads the data-sort load across nodes in time, based on the given defrag_time and defrag_splay, so that each node selects some random time in the range [defrag_time - defrag_splay, defrag_time + defrag_splay] hours (see the sketch after this list). This is the preferred method for medium/big clusters.
  • NONE – when none of the given options is selected, one must periodically run data-sort via the provided API – for example, with elliptics one can use the dnet_ioclient -r host:port:family -d start command to run defrag on the node given by the host:port:family tuple. This is the preferred method for very big clusters that require some external synchronization based on e.g. replica location, current time or node load.
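
For example, the SCHEDULED start time boils down to something like the sketch below; this is an illustration of the idea, not the actual implementation, with defrag_time and defrag_splay being the config knobs mentioned above (in hours).

import random

def pick_defrag_hour(defrag_time, defrag_splay):
    # Pick a random start hour in [defrag_time - defrag_splay,
    # defrag_time + defrag_splay], wrapped to a 24-hour clock, so that
    # nodes spread the heavyweight data-sort load over time.
    return random.randint(defrag_time - defrag_splay, defrag_time + defrag_splay) % 24

# e.g. defrag_time=3, defrag_splay=2 -> some hour between 01:00 and 05:00
print(pick_defrag_hour(3, 2))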

NB! Failing to run data-sort periodically will lead to excessive memory usage, wasted HDD space and slow iteration (e.g. dnet_recovery) speed.

For more information see:

Recent eblob wiki updates

Hi, it's SaveTheRbtz speaking. I'm the engineer responsible for eblob code maintenance and improvement. There has been much work done on the eblob code base over the past half a year, including big stuff like data-sort and l2hash, but also tons of small but useful improvements like rwlocks, lower memory usage and faster startup.

I'm here to present recent wiki updates regarding some eblob features, e.g. the one that explains how reading and writing in eblob work and what optimizations make it faster than a plain directory with billions of files.

Also, we've expanded the data-sort manual: now it has more info about data-sort and defragmentation, including pros and cons.

The next release of eblob will break backward compatibility in terms of both ABI and API, but it will allow us not only to improve performance and provide new features, but also to greatly simplify the API itself and remove some useless functionality. For example, we are now working on a writev()-like (aka scatter/gather) interface for eblob that will allow us to efficiently implement metadata in elliptics.

The full roadmap for eblob can be found in the corresponding wiki article too.