Skip to content
Unbelievably space efficient data structures in Golang.
Branch: master
Clone or download
Latest commit 4430458 May 22, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
.github refactor: slimtrie: remove searchWords() Apr 18, 2019
_tmpl doc: contri: add workflow.md Apr 26, 2019
array new-feature: array: NewBitsJoin() accept a "dense" argument May 19, 2019
benchhelper new-feature: array: NewBitsJoin() accept a "dense" argument May 19, 2019
bits optimize: bits: use math/bits; 10% faster Apr 8, 2019
docs doc: bump-ver: 0.5.5 May 22, 2019
encode new-feature: encode: add encode.Int to convert int to byte and back Apr 19, 2019
genhelper new-feature: array: add specific type array: I16, I32 and I64 Apr 8, 2019
index api-change: slimtrie: range based SlimTrie must provides all keys May 21, 2019
iohelper refactor: iohelper: use testify May 7, 2019
polyfit new-feature: polyfit: add polyfit for fit a polynomial curve over points May 15, 2019
scripts refactor: dep: update pip requirement: yaml May 19, 2019
serialize serialize: remove redudent codes Mar 28, 2019
strhelper refactor: strhelper: complete comments May 17, 2019
tools/reports/memusage api-changes: trie: SlimTrie.Get returns value and found in bool Apr 16, 2019
trie new-feature: slimtrie: max key limit extends to 2^31 May 21, 2019
typehelper api-change: typehelper: ToSlice now just panic if input is not a slice. May 14, 2019
vendor refactor: dep: add testify/require May 16, 2019
version array: add GetVersion() Feb 25, 2019
.gitignore gitignore .DS_Store Mar 14, 2019
.gitsubrepo .github: import CODE_OF_CONDUCT.md Mar 16, 2019
.travis.yml refactor: travis: add coveralls May 19, 2019
LICENSE Initial commit Feb 2, 2019
Makefile refactor: travis: add coveralls May 19, 2019
README.md doc: bump-ver: 0.5.5 May 22, 2019
appveyor.yml add appveyor.yml for auto build on appveyor.com Feb 20, 2019
git-subrepo Squashed git-subrepo master:git-subrepo 4cf23f478 (2019-02-14 19:18:30) Mar 16, 2019
go.mod
go.sum refactor: dep: add gonum May 7, 2019
int-types.md array: converter.go add doc; use int type for Unmarshal() return value Feb 21, 2019

README.md

Slim - surprisingly space efficient data types in Golang

Travis-CI AppVeyor GoDoc Report card GolangCI Sourcegraph Coverage Status

Slim is collection of surprisingly space efficient data types, with corresponding serialization APIs to persisting them on-disk or for transport.

Why slim

As data on internet keeps increasing exponentially, the capacity gap between memory and disk becomes greater.

Most of the time, a data itself does not need to be loaded into expensive main memory. Only the much more important information, WHERE-A-DATA-IS, deserve a seat in main memory.

This is what slim does, keeps as little information as possible in main memory, as a minimized index of huge amount external data.

  • SlimIndex: is a common index structure, building on top of SlimTrie.

    GoDoc

  • SlimTrie is the underlying index data structure, evolved from trie.

    GoDoc

    Features:

    • Minimized: requires only 6 bytes per key(even less than an 8-byte pointer!!).

    • Stable: memory consumption is stable in various scenarios. The Worst case converges to average consumption tightly. See benchmark.

    • Unlimited key length: You can have VERY long keys, without bothering yourself with any waste of memory(and money). Do not waste your life writing another prefix compression:). (aws-s3 limits key length to 1024 bytes). Memory consumption only relates to key count, not to key length.

    • Ordered: like btree, keys are stored in alphabetic order, but faster to access. Range-scan is supported!(under development)

    • Fast: time complexity for a get is O(log(n) + k); n: key count; k: key length. With comparison to btree, which is O(log(n) * k)(tree-like). And golang-map is O(k)(hash-table-like).

    • Ready for transport: SlimTrie has no gap between its in-memory layout and its on-disk layout or transport presentation. Unlike other data-structure such as the popular red-black-tree, which is designed for in-memory use only and requires additional work to persist or transport it.

    • Loosely coupled design: index logic and data storage is completely separated. Piece of cake using SlimTrie to index huge data.

Memory overhead

Bits/key is very stable when key-count(n) increaes, and does not relate to key-length(k) either!

The more dense a key set is, the less memory a trie-like data structure costs. Above 5000 keys, it becomes stable at 13 bits/key.

key-count k=64 k=128 k=256
1000 16 16 16
2000 13 13 13
5000 14 14 14

Performance benchmark

Time(in nano second) spent on a get operation with SlimTrie, golang-map and btree by google.

Smaller is better.

benchmark-get-png

Key count Key length SlimTrie Map Btree
1 1024 86.3 5.0 36.9
10 1024 90.7 59.7 99.5
100 1024 123.3 60.1 240.6
1000 1024 157.0 63.5 389.6
1000 512 152.6 40.0 363.0
1000 256 152.3 28.8 332.3

It is about 2.6 times faster than the btree by google.

Time(in nano second) spent on a get with different key count(n) and key length(k):

See: trie/report/

Status, Roadmap and Change-log

Currently slim v0.5.0 APIs are stable, and has been used in a production env.

Meanwhile we focus on optimizing memory usage and query performance.

Internal data structure may change before v1.0.0.

Roadmap
  • Large key set benchmark
  • Query by range
  • Support up to 2 billion keys
  • Reduce false positive rate
  • Reduce memory usage from 40 to 25 bits/key
  • 2019-04-20 v0.4.3 Range index: many keys share one index item
  • 2019-04-18 v0.4.1 Marshaling support
  • 2019-03-08 v0.1.0 SlimIndex SlimTrie
Change-log
v0.5.5:
  api-change:
    slimtrie:
    - range based SlimTrie must provides all keys; by drdr xp; 2019-05-21
    treestr:
    - interface Tree adds a new method LabelInfo to format tree branch label; by drdr
      xp; 2019-05-21
  new-feature:
    slimtrie:
    - max key limit extends to 2^31; by drdr xp; 2019-05-21
    tree:
    - add depth-first walker DepthFirst(); by drdr xp; 2019-05-21
    trie:
    - add removeSameLeaf() to remove leaves with the same value; by drdr xp; 2019-05-21
v0.5.4:
  api-change:
    array:
    - NewDense do not need eltWidth; only support int32; protobuf structure change;
      optimize Dense creating; by drdr xp; 2019-05-13
    - rename Bitmap to Bits; by drdr xp; 2019-05-17
  new-feature:
    array:
    - add Bitmap to store series of bits; by drdr xp; 2019-05-16
    - Base.Indexes() to retrieve all indexes of present elements; by drdr xp; 2019-05-17
    - add NewBitmapJoin() to create a big bitmap by joining sub bitmaps; by drdr xp;
      2019-05-16
    - add Bitmap16 which is compatible with U32; by drdr xp; 2019-05-18
    - NewBitsJoin() accept a "dense" argument; by drdr xp; 2019-05-19
    benchhelper:
    - add SizeStat() to describe data struct and size; by drdr xp; 2019-05-18
    polyfit:
    - add polyfit for fit a polynomial curve over points; by drdr xp; 2019-05-15
    slimtrie:
    - use Bitmap16 and reduce memory usage; by drdr xp; 2019-05-18
    strhelper:
    - add ToBin() converts integer or slice of integer to binary format string; by drdr
      xp; 2019-05-17
v0.5.3:
  api-change:
    trie:
    - values to create trie must be slice or it panic; by drdr xp; 2019-05-01
    typehelper:
    - ToSlice now just panic if input is not a slice.; by drdr xp; 2019-05-01
v0.5.2:
  new-feature:
    array:
    - add Dense array to compress incremental ints; by drdr xp; 2019-05-06
    benchhelper:
    - add SizeOf() to get size of a value; by drdr xp; 2019-05-06
    - add RandI32Slice; by drdr xp; 2019-05-07
v0.5.1:
  new-feature:
    slimtrie:
    - do not store Step on leaf node; by drdr xp; 2019-04-29
    trie:
    - Tree convert a tree-like data strcture to string; by drdr xp; 2019-05-02
v0.5.0:
  api-change:
    trie:
    - Append() do not need isStartLeaf; by drdr xp; 2019-04-22
v0.4.3:
  new-feature:
    slimtrie:
    - RangeGet() to get value of a key in indexed range; by drdr xp; 2019-04-20
    - String(); by drdr xp; 2019-04-23
    trie:
    - add String() to output human readable trie structure; by drdr xp; 2019-04-19
v0.4.1:
  new-feature:
    encode:
    - add encode.Int to convert int to byte and back; by drdr xp; 2019-04-18
    slimtrie:
    - add proto.Marshaler and proto.Unmarshaler interface; by liubaohai; 2019-04-18
    strhelper:
    - add func to convert word of bits back to string; by drdr xp; 2019-04-19
v0.4.0:
  api-changes:
    trie:
    - trie.Node add squash; by wenbo; 2019-04-11
    - remove marshalAt and unmarshalAt; use SectionReader and SectionWriter; by drdr
      xp; 2019-04-10
    - fix method name encode->marshal; by drdr xp; 2019-04-10
    - SlimTrie.Get returns value and found in bool; by drdr xp; 2019-03-27
  new-feature:
    array:
    - add MemSize() to get memory occupied by array; by drdr xp; 2019-04-15

Synopsis

Index keys, get by key

Show me the code
package index_test

import (
	"fmt"
	"strings"

	"github.com/openacid/slim/index"
)

type Data string

func (d Data) Read(offset int64, key string) (string, bool) {
	kv := strings.Split(string(d)[offset:], ",")[0:2]
	if kv[0] == key {
		return kv[1], true
	}
	return "", false
}

func Example() {

	// Accelerate external data accessing (in memory or on disk) by indexing
	// them with a SlimTrie:

	// `data` is a sample of some unindexed data. In our example it is a comma
	// separated key value series.
	//
	// In order to let SlimTrie be able to read data, `data` should have
	// a `Read` method:
	//     Read(offset int64, key string) (string, bool)
	data := Data("Aaron,1,Agatha,1,Al,2,Albert,3,Alexander,5,Alison,8")

	// keyOffsets is a prebuilt index that stores key and its offset in data accordingly.
	keyOffsets := []index.OffsetIndexItem{
		{Key: "Aaron", Offset: 0},
		{Key: "Agatha", Offset: 8},
		{Key: "Al", Offset: 17},
		{Key: "Albert", Offset: 22},
		{Key: "Alexander", Offset: 31},
		{Key: "Alison", Offset: 43},
	}

	// `SlimIndex` is simply a container of SlimTrie and its data.
	st, err := index.NewSlimIndex(keyOffsets, data)
	if err != nil {
		fmt.Println(err)
	}

	// Lookup
	v, found := st.Get("Alison")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "Alison", found, v)

	v, found = st.Get("foo")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "foo", found, v)

	// Output:
	// key: "Alison"
	//   found: true
	//   value: "8"
	// key: "foo"
	//   found: false
	//   value: ""
}

Index key ranges, get by key

Show me the code

Create an index item for every 4(or more as you wish) keys.

Let several adjacent keys share one index item reduces a lot memory cost if there are huge amount keys in external data. Such as to index billions of 4KB objects on a 4TB disk(because one disk IO costs 20ms for either reading 4KB or reading 1MB).

package index_test

import (
	"fmt"
	"strings"

	"github.com/openacid/slim/index"
)

type RangeData string

func (d RangeData) Read(offset int64, key string) (string, bool) {
	for i := 0; i < 4; i++ {
		if int(offset) >= len(d) {
			break
		}

		kv := strings.Split(string(d)[offset:], ",")[0:2]
		if kv[0] == key {
			return kv[1], true
		}
		offset += int64(len(kv[0]) + len(kv[1]) + 2)

	}
	return "", false
}

func Example_indexRanges() {

	// Index ranges instead of keys:
	// In this example at most 4 keys shares one index item.

	data := RangeData("Aaron,1,Agatha,1,Al,2,Albert,3,Alexander,5,Alison,8")

	// keyOffsets is a prebuilt index that stores range start, range end and its offset.
	keyOffsets := []index.OffsetIndexItem{
		// Aaron  +--> 0
		// Agatha |
		// Al     |
		// Albert |

		// Alexander +--> 31
		// Alison    |

		{Key: "Aaron", Offset: 0},
		{Key: "Agatha", Offset: 0},
		{Key: "Al", Offset: 0},
		{Key: "Albert", Offset: 0},

		{Key: "Alexander", Offset: 31},
		{Key: "Alison", Offset: 31},
	}

	st, err := index.NewSlimIndex(keyOffsets, data)
	if err != nil {
		panic(err)
	}

	v, found := st.RangeGet("Aaron")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "Aaron", found, v)

	v, found = st.RangeGet("Al")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "Al", found, v)

	v, found = st.RangeGet("foo")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "foo", found, v)

	// Output:
	// key: "Aaron"
	//   found: true
	//   value: "1"
	// key: "Al"
	//   found: true
	//   value: "2"
	// key: "foo"
	//   found: false
	//   value: ""
}

Getting started

Install

go get github.com/openacid/slim

All dependency packages are included in vendor/ dir.

Prerequisites

  • For users (who'd like to build cool stuff with slim):

    Nothing.

  • For contributors (who'd like to make slim better):

    • dep: for dependency management.
    • protobuf: for re-compiling *.proto file if on-disk data structure changes.

    Max OS X:

    brew install dep protobuf

    On other platforms you can read more: dep-install, protoc-install.

Who are using slim

baishancloud

Feedback and contributions

Feedback and Contributions are greatly appreciated.

At this stage, the maintainers are most interested in feedback centered on:

  • Do you have a real life scenario that slim supports well, or doesn't support at all?
  • Do any of the APIs fulfill your needs well?

Let us know by filing an issue, describing what you did or wanted to do, what you expected to happen, and what actually happened:

Or other type of issue.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE file for details.

You can’t perform that action at this time.