mmap
使用mmap处理数据库中的内容使得我们像处理内存中数据一样,但是它不会占用实际的物理RAM,只有在我们访问指定的字节区域时,操作系统才将其从disk延迟还在到内存页。操作系统负责我们持久化数据的所有内存管理。
- Moving from millions of small files to a handful of larger allows us to keep all files open with little overhead. This unblocks the usage of mmap(2), a system call that allows us to transparently back a virtual memory region by file contents. For simplicity, you might want to think of it like swap space, just that all our data is on disk already and no writes occur when swapping data out of memory.
- This means we can treat all contents of our database as if they were in memory without occupying any physical RAM. Only if we access certain byte ranges in our database files, the operating system lazily loads pages from disk. This puts the operating system in charge of all memory management related to our persisted data. Generally, it is more qualified to make such decisions, as it has the full view on the entire machine and all its processes. Queried data can be rather aggressively cached in memory, yet under memory pressure the pages will be evicted. If the machine has unused memory, Prometheus will now happily cache the entire database, yet will immediately return it once another application needs it.
- Therefore, queries can longer easily OOM our process by querying more persisted data than fits into RAM. The memory cache size becomes fully adaptive and data is only loaded once the query actually needs it.
- From my understanding, this is how a lot of databases work today and an ideal way to do it if the disk format allows — unless one is confident to outsmart the OS from within the process. We certainly get a lot of capabilities with little work from our side.
Compaction
t0 t1 t2 t3 t4 now
+-----------+ +-----------+ +-----------+ +-----------+ +-----------+
| 1 | | 2 | | 3 | | 4 | | 5 mutable | before
+-----------+ +-----------+ +-----------+ +-----------+ +-----------+
+-----------------------------------------+ +-----------+ +-----------+
| 1 compacted | | 4 | | 5 mutable | after (option A)
+-----------------------------------------+ +-----------+ +-----------+
+--------------------------+ +--------------------------+ +-----------+
| 1 compacted | | 3 compacted | | 5 mutable | after (option B)
+--------------------------+ +--------------------------+ +-----------+
storage
周期性地创建一个新的block,把之前的block写入磁盘,只有block被成功持久化才能删除WAL。
block默认大小为2h,从而避免内存中存积大量的数据。当查询多个blocks时,我们需要合并查询的结果。而合并是需要开销的。因此引入压缩的功能,压缩是将一个或多个blocks写入一个更大的block中,压缩时会丢弃被删除的数据,重构samples以提高性能。
- The storage has to periodically “cut” a new block and write the previous one, which is now completed, onto disk. Only after the block was successfully persisted, the write ahead log files, which are used to restore in-memory blocks, are deleted.
- We are interested in keeping each block reasonably short (about two hours for a typical setup) to avoid accumulating too much data in memory. When querying multiple blocks, we have to merge their results into an overall result. This merge procedure obviously comes with a cost and a week-long query should not have to merge 80+ partial results.
- To achieve both, we introduce compaction. Compaction describes the process of taking one or more blocks of data and writing them into a, potentially larger, block. It can also modify existing data along the way, e.g. dropping deleted data, or restructuring our sample chunks for improved query performance.
Retention, MinBlockDuration, MaxBlockDuration
|
+-----------+ +-----------+ +-----------+ +-----------+ +-----------+
| 1 | | 2 | | | 3 | | 4 | | 5 | .....
+-----------+ +-----------+ +-----------+ +-----------+ +-----------+
|
retention boundary
随着数据的增长,block因为压缩变得越来越大。而block也应该有上限,其上限为其不应该跨整个数据库。经验值MaxBlockDuration
取的是不超过Retention window
的10%
。而MinBlockDuration
默认取的是2h
。
- The older data gets, the larger the blocks may become as we keep compacting previously compacted blocks. An upper limit has to be applied so blocks don’t grow to span the entire database and thus diminish the original benefits of our design.
- Conveniently, this also limits the total disk overhead of blocks that are partially inside and partially outside of the retention window, i.e. block 2 in the example above. When setting the maximum block size at 10% of the total retention window, our total overhead of keeping block 2 around is also bound by 10%.
网友评论