How Prometheus Implements Storage

date

Mar 8, 2023


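About Prometheus

Prometheus is a CNCF graduated project and a staple of cloud-native monitoring. This post does not cover how to use it; it looks only at how Prometheus stores time series data.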

What is a series?

Wikipedia's explanation of time series: https://en.wikipedia.org/wiki/Time_series

In Prometheus:

Every time series is uniquely identified by its metric name and optional key-value pairs called labels

It looks like this:

<metric name>{<label name>=<label value>, ...}
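To make the identity rule concrete, here is a minimal Python sketch. The `series_key` helper is hypothetical (it is not a Prometheus API); it only illustrates that the metric name plus the full label set, regardless of label order, identifies exactly one series.

```python
def series_key(metric_name, labels=None):
    """Render a series identity as <metric name>{<label name>="<label value>", ...}.

    Labels are sorted by name so the same label set always yields the same
    key, no matter in which order the labels were supplied.
    """
    labels = labels or {}
    if not labels:
        return metric_name
    body = ", ".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{metric_name}{{{body}}}"


# Same name and same labels in a different order: the same series.
a = series_key("http_requests_total", {"method": "POST", "code": "200"})
b = series_key("http_requests_total", {"code": "200", "method": "POST"})
# Changing a single label value produces a different series.
c = series_key("http_requests_total", {"code": "500", "method": "POST"})
```

Here `a` and `b` are equal while `c` differs, which is why adding a high-cardinality label multiplies the number of series.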

Storage layout

Ingested samples are grouped into blocks of two hours. Each two-hour block consists of a directory containing a chunks subdirectory containing all the time series samples for that window of time, a metadata file, and an index file (which indexes metric names and labels to time series in the chunks directory). The samples in the chunks directory are grouped together into one or more segment files of up to 512MB each by default. When series are deleted via the API, deletion records are stored in separate tombstone files (instead of deleting the data immediately from the chunk segments).

The current block for incoming samples is kept in memory and is not fully persisted. It is secured against crashes by a write-ahead log (WAL) that can be replayed when the Prometheus server restarts. Write-ahead log files are stored in the wal directory in 128MB segments. These files contain raw data that has not yet been compacted; thus they are significantly larger than regular block files. Prometheus will retain a minimum of three write-ahead log files. High-traffic servers may retain more than three WAL files in order to keep at least two hours of raw data.

A Prometheus server's data directory looks something like this:

./data
├── 01BKGV7JBM69T2G1BGBGM6KB12
│   └── meta.json
├── 01BKGTZQ1SYQJTR4PB43C8PD98
│   ├── chunks
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K
│   └── meta.json
├── 01BKGV7JC0RY8A6MACW02A2PJD
│   ├── chunks
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
├── chunks_head
│   └── 000001
└── wal
    ├── 000000002
    └── checkpoint.00000001
        └── 00000000

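The official description of the storage is already fairly detailed. In short: historical data is compacted into one block per two hours, new data is kept in memory, and a WAL protects that new data so it can be recovered if the server crashes.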
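The block directory names in the listing above (for example 01BKGV7JBM69T2G1BGBGM6KB12) are ULIDs, whose first 10 characters encode a millisecond timestamp in Crockford base32. The sketch below decodes that timestamp so blocks can be ordered by creation time. The helper names are my own, and the timestamp is when the block was created, not the time range of the samples inside it (that range lives in meta.json).

```python
from datetime import datetime, timezone

# Crockford base32 alphabet used by ULIDs (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_timestamp_ms(ulid):
    """Decode the 48-bit millisecond timestamp held in the first 10 characters."""
    if len(ulid) != 26:
        raise ValueError("a ULID has exactly 26 characters")
    value = 0
    for ch in ulid[:10].upper():
        value = value * 32 + CROCKFORD.index(ch)
    return value


def ulid_time(ulid):
    """Return the ULID creation time as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ulid_timestamp_ms(ulid) / 1000, tz=timezone.utc)


blocks = [
    "01BKGV7JBM69T2G1BGBGM6KB12",
    "01BKGTZQ1SYQJTR4PB43C8PD98",
    "01BKGTZQ1HHWHV8FBJXW1Y3W0K",
    "01BKGV7JC0RY8A6MACW02A2PJD",
]
# Oldest block first.
ordered = sorted(blocks, key=ulid_timestamp_ms)
```

Because the timestamp occupies the leading characters, sorting the directory names as plain strings gives the same order.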

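How is data read?

That covers how data is stored. So how do we fetch the data for a given time range? When a query arrives, Prometheus reads the matching data from every block whose time range overlaps the query (including chunks_head), merges the results, and returns them to the user.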
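The read path can be sketched roughly as follows. This is a simplified single-series model under my own assumptions: the `Block` class and `query_range` function are hypothetical, and the real querier works on the index and chunk files rather than on Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    min_time: int                                 # inclusive lower bound
    max_time: int                                 # exclusive upper bound
    samples: list = field(default_factory=list)   # (timestamp, value) pairs

    def overlaps(self, start, end):
        """True if the block may hold samples with start <= t <= end."""
        return self.min_time <= end and start < self.max_time


def query_range(blocks, start, end):
    """Collect samples in [start, end] from every overlapping block and merge them."""
    merged = {}
    for block in blocks:
        if not block.overlaps(start, end):
            continue  # skip blocks entirely outside the query range
        for t, v in block.samples:
            if start <= t <= end:
                merged[t] = v
    return sorted(merged.items())


two_hours = 7200
blocks = [
    Block(0, two_hours, [(0, 1.0), (3600, 2.0), (7100, 3.0)]),
    Block(two_hours, 2 * two_hours, [(7200, 4.0), (10000, 5.0)]),
    # Stand-in for the in-memory head block.
    Block(2 * two_hours, 3 * two_hours, [(14400, 6.0), (15000, 7.0), (16000, 8.0)]),
]
result = query_range(blocks, 7000, 15000)
```

The query touches all three blocks here, and the caller sees a single time-ordered stream without knowing where each sample was stored.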

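Write-Ahead Log (WAL)

Widely used in relational databases to provide durability (the D in ACID).

Persisting every state change as a command to the append-only log.

The wal directory holds the data that is currently being written. It contains multiple segment files, each at most 128MB by default. Prometheus keeps at least three of them, and a high-traffic server keeps enough to cover at least two hours of data. The data in the wal directory has not been compacted, so it is significantly larger than the same data in a block.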
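The idea of "log first, apply second, replay on restart" fits in a few lines. The `TinyWAL` class below is a toy under my own assumptions: it stores JSON lines and uses a tiny segment size for the demo, whereas the real WAL uses a binary record format with checksums.

```python
import json
import os
import tempfile


class TinyWAL:
    """Append-only log split into numbered segment files."""

    def __init__(self, directory, max_segment_bytes=128 * 1024 * 1024):
        self.directory = directory
        self.max_segment_bytes = max_segment_bytes
        os.makedirs(directory, exist_ok=True)
        existing = self.segments()
        self.current = existing[-1] if existing else 0

    def segments(self):
        """Return the existing segment numbers in ascending order."""
        return sorted(int(n) for n in os.listdir(self.directory) if n.isdigit())

    def _path(self, index):
        return os.path.join(self.directory, f"{index:08d}")

    def append(self, record):
        line = (json.dumps(record) + "\n").encode()
        path = self._path(self.current)
        size = os.path.getsize(path) if os.path.exists(path) else 0
        # Start a new segment when the current one would exceed the limit.
        if size > 0 and size + len(line) > self.max_segment_bytes:
            self.current += 1
            path = self._path(self.current)
        with open(path, "ab") as f:
            f.write(line)
            f.flush()
            os.fsync(f.fileno())  # make the record durable before returning

    def replay(self):
        """Yield every record, oldest segment first."""
        for index in self.segments():
            with open(self._path(index), "rb") as f:
                for line in f:
                    yield json.loads(line)


wal_dir = tempfile.mkdtemp()
wal = TinyWAL(wal_dir, max_segment_bytes=64)   # tiny limit to force rotation
state = {}
for t in range(5):
    record = {"series": "up", "t": t, "v": 1.0}
    wal.append(record)                 # 1. write to the log
    state[record["t"]] = record["v"]   # 2. then update the in-memory state

# Simulated restart: the in-memory state is gone, rebuild it from the log.
recovered = {r["t"]: r["v"] for r in TinyWAL(wal_dir, max_segment_bytes=64).replay()}
```

After the simulated restart, `recovered` matches `state`, and the log has been split across several segment files.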

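The roles of chunks_head and checkpoints

A Prometheus metric is made up of series and samples. Let's first look at when each of them is written:

A series is written only once, the first time it is recorded.

Samples are written continuously.

For example: suppose a target exposes http_request_total{ip="1.2.3.4"}. The first time Prometheus scrapes it, a series http_request_total{ip="1.2.3.4"} is recorded; after that, every sample collected at each scrape interval is recorded.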
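The "series once, samples every time" rule can be sketched like this. The `Recorder` class and its tuple-shaped records are hypothetical; the point is only that the label set is logged once and later samples refer to it by a small numeric reference.

```python
class Recorder:
    """Logs a series record on first sight, then only sample records."""

    def __init__(self):
        self.refs = {}   # series identity -> numeric reference
        self.log = []    # records in the order they were written

    def append(self, series, timestamp, value):
        ref = self.refs.get(series)
        if ref is None:
            # First time this series is seen: write its labels exactly once.
            ref = len(self.refs) + 1
            self.refs[series] = ref
            self.log.append(("series", ref, series))
        # Every scrape: write a sample that points back at the series.
        self.log.append(("sample", ref, timestamp, value))


recorder = Recorder()
series = 'http_request_total{ip="1.2.3.4"}'
for i, value in enumerate([10.0, 12.0, 15.0]):
    recorder.append(series, 15 * i, value)   # assumed 15s scrape interval

series_records = [r for r in recorder.log if r[0] == "series"]
sample_records = [r for r in recorder.log if r[0] == "sample"]
```

Three scrapes produce one series record and three sample records, which keeps the repeated label strings out of the log.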

The official description I quoted above says:

The current block for incoming samples is kept in memory and is not fully persisted

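Before v2.19, the most recent two hours of metric data lived entirely in memory. v2.19 introduced memory-mapped head chunks: recent data is still kept in memory, but once a chunk is full it is flushed to disk, and memory only holds a reference to the flushed data.

This relies on mmap. The benefit of mmap is that the kernel balances memory against persistence, so the developer does not have to worry as much about memory blowing up when there are too many metrics.

When a chunk in the head reaches 120 samples, it is written to disk and memory-mapped, ending up in the files under the chunks_head directory shown earlier.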
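The chunk-cutting rule can be modeled as below. This is a sketch under assumptions: `HeadSeries` is a hypothetical class, and the `flushed` list merely stands in for the chunks that would be written under chunks_head and memory-mapped.

```python
SAMPLES_PER_CHUNK = 120   # the chunk size quoted in the text


class HeadSeries:
    """One series in the head: an active chunk plus already-flushed chunks."""

    def __init__(self):
        self.active = []    # samples of the chunk still being filled
        self.flushed = []   # stand-in for chunks written under chunks_head

    def append(self, timestamp, value):
        self.active.append((timestamp, value))
        if len(self.active) == SAMPLES_PER_CHUNK:
            # The chunk is full: hand it off and start a fresh one.
            self.flushed.append(tuple(self.active))
            self.active = []


series = HeadSeries()
for i in range(250):
    series.append(15 * i, float(i))   # assumed 15s scrape interval
```

With a 15s scrape interval a chunk fills up every 30 minutes, so after 250 samples two full chunks have been flushed and 10 samples remain in the active chunk.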

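As samples keep arriving, the chunks in chunks_head accumulate. Once the head covers enough time (1.5 times the block range, which is 3 hours by default), the oldest chunks are compacted into a block and persisted.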

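What checkpoints do

As covered above, the WAL is mainly an operation log used to recover data, because the data in the head lives in memory. The WAL consists of many sequentially numbered files called segments, and the segments hold a backup of the series and samples that are in memory.

So here is the problem: this backup has to change as the data in the head is persisted. In other words, redundant data has to be cleaned up. But not everything in a segment can be deleted, because part of it may still be needed. That is what the checkpoint is for. Think of it as a sieve: the data that must not be deleted is filtered out and stored under a directory whose name is prefixed with checkpoint.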

data
└── wal
    ├── checkpoint.000003
    │   ├── 000000
    │   └── 000001
    ├── 000004
    └── 000005

Let's see where the "3" in checkpoint.000003 comes from. Suppose the data looked like this before cleanup:

data
└── wal
    ├── 000000
    ├── 000001
    ├── 000002
    ├── 000003
    ├── 000004
    └── 000005

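Cleanup deletes the first 2/3 of the segments. As for why the first 2/3 are chosen for the checkpoint, I have not found an official explanation of this number; it may simply be a value that tested well as a trade-off between time and space.

So the four files 000000-000003 will be deleted, and any data in them that must be kept is written into the checkpoint directory, reorganized into segments numbered from 000000 again. As for why it is called checkpoint.000003: the number records that the last cleanup removed 000000-000003. When data has to be recovered, replay starts from the checkpoint and then continues, in order, through the segments numbered after it (000004, 000005, and so on).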
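The naming and filtering described above can be sketched as follows. Both functions are hypothetical, and the 2/3 arithmetic simply mirrors the six-segment example in this post; the exact rounding inside Prometheus may differ. The filter assumes a checkpoint keeps series that are still in the head and samples newer than the truncation time.

```python
def plan_truncation(segments):
    """Pick the first 2/3 of the segments for removal and name the checkpoint.

    Returns (checkpoint_name, removed_segments, remaining_segments).
    """
    cut = len(segments) * 2 // 3
    removed = segments[:cut]
    if not removed:
        return None, [], list(segments)
    # The checkpoint is named after the last segment it replaces.
    return f"checkpoint.{removed[-1]:06d}", removed, segments[cut:]


def build_checkpoint(records, live_series, keep_from):
    """Keep only the records from the removed segments that are still needed."""
    kept = []
    for record in records:
        kind, ref = record[0], record[1]
        if ref not in live_series:
            continue                      # series no longer in the head
        if kind == "series":
            kept.append(record)
        elif kind == "sample" and record[2] >= keep_from:
            kept.append(record)           # sample not yet persisted in a block
    return kept


name, removed, remaining = plan_truncation([0, 1, 2, 3, 4, 5])

old_records = [
    ("series", 1, 'http_request_total{ip="1.2.3.4"}'),
    ("series", 2, 'http_request_total{ip="5.6.7.8"}'),
    ("sample", 1, 100, 1.0),
    ("sample", 2, 100, 1.0),
    ("sample", 1, 900, 2.0),
]
checkpoint = build_checkpoint(old_records, live_series={1}, keep_from=500)
```

For the six segments in the example this yields checkpoint.000003, with 000004 and 000005 left to be replayed after it.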

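A few small problems I ran into

A year ago a business team came to me with a small request: they wanted Prometheus to collect some figures from ClickHouse and turn them into Prometheus metrics, so that the existing systems could be reused for alerting and dashboards. Why Prometheus? Because these figures came out of a very complex SQL query, and with the usual approach, pulling a week of data directly on a BI page would be very slow.

I have to mention a handy little tool here, sql-exporter. I thought that was the end of the story, but there was a catch: because some upstream data arrived late, the figures returned by the SQL actually carried an offset. For example, the data I selected at 18:30:00 was really the business data for 18:00:00, so the timeline in Prometheus did not match the actual business timeline. Prometheus has anticipated this problem and lets you supply a custom timestamp at write time.

However, this nice idea only holds while the offset stays within a certain window (2h by default), because throughout a chunk's life cycle it is writable only while it is in the head. So we had a small mishap back then: whenever the offset was too large, the data never showed up.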
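The failure mode can be reproduced with a toy model. Everything here is an assumption for illustration: the `Head` class, the `OutOfBounds` exception, and the 2h window (taken from the text above); the real limits depend on the Prometheus version and configuration.

```python
TWO_HOURS_MS = 2 * 60 * 60 * 1000


class OutOfBounds(Exception):
    """Raised when a sample is too old to be written into the head."""


class Head:
    """Accepts only samples that fall inside the writable window."""

    def __init__(self, window_ms=TWO_HOURS_MS):
        self.window_ms = window_ms
        self.max_time = None
        self.samples = []

    def append(self, timestamp_ms, value):
        if self.max_time is not None and timestamp_ms < self.max_time - self.window_ms:
            # Older data already belongs to an immutable block.
            raise OutOfBounds(f"sample at {timestamp_ms} is older than the window")
        self.samples.append((timestamp_ms, value))
        if self.max_time is None or timestamp_ms > self.max_time:
            self.max_time = timestamp_ms


head = Head()
now = 18 * 3_600_000 + 30 * 60_000        # 18:30:00 as milliseconds since midnight
head.append(now, 42.0)                     # current sample: accepted
head.append(now - 30 * 60_000, 41.0)       # 30 minutes late: still inside the window
try:
    head.append(now - 3 * 3_600_000, 40.0)  # 3 hours late: outside the window
    rejected = False
except OutOfBounds:
    rejected = True
```

The sample with a 30-minute offset is stored, while the one with a 3-hour offset is rejected, which matches the symptom of the data never showing up.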

The mistakes I made back then for being a rookie. PS: I'm still a rookie.

References

https://www.luozhiyun.com/archives/725

https://www.youtube.com/watch?v=qB40kqhTyYM

TSDB Head Improvements (part 1 - part 2)

https://prometheus.io/docs/concepts/data_model/

https://github.com/prometheus/prometheus/blob/main/docs/storage.md