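Alarm webhooks in SkyWalking
SkyWalking supports several kinds of webhook, but I could not find a way to make it route alarm messages to different webhooks. I am running v9.1; the latest release is already v9.4. The documentation does describe the message format delivered to a webhook (there is also a gRPC hook, which I skip here), so I decided to build something myself.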
- method: POST
- Content-Type: application/json
- body:
[{
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "serviceA",
  "id0": "12",
  "id1": "",
  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage xxxx",
  "startTime": 1560524171000,
  "tags": [{
    "key": "level",
    "value": "WARNING"
  }]
}, {
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "serviceB",
  "id0": "23",
  "id1": "",
  "ruleName": "service_resp_time_rule",
  "alarmMessage": "alarmMessage yyy",
  "startTime": 1560524171000,
  "tags": [{
    "key": "level",
    "value": "CRITICAL"
  }]
}]
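Key fields:
- scope: the scope of the entity the alarm is about (SERVICE in the sample above)
- name: the name of the entity in that scope; for a SERVICE alarm this is the service name
- ruleName: the name of the alarm rule that fired
- alarmMessage: the alarm message
- startTime: the alarm timestamp, in milliseconds since the epoch
- tags: the tags attached to the alarm
- I am not yet sure how to use the remaining fields (scopeId, id0, id1).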
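Since tags arrive as a list of key/value pairs rather than a dictionary, a small helper makes them easier to work with. This is a minimal sketch; the helper name tag_value and the "UNKNOWN" fallback are my own choices, not part of SkyWalking.

```python
def tag_value(alarm, key, default=None):
    """Return the value of the first tag named `key` in an alarm dict.

    `alarm` is one element of the JSON array shown above. The helper
    name and the default handling are illustrative assumptions.
    """
    for tag in alarm.get("tags") or []:
        if tag.get("key") == key:
            return tag.get("value")
    return default


# One alarm, trimmed from the sample body above.
alarm = {
    "scope": "SERVICE",
    "name": "serviceA",
    "ruleName": "service_resp_time_rule",
    "tags": [{"key": "level", "value": "WARNING"}],
}
level = tag_value(alarm, "level", default="UNKNOWN")
```

With the sample alarm, level ends up as "WARNING"; an alarm with no level tag falls back to the default.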
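Based on the above, I want to build a relay service with Flask that receives the alarms SkyWalking sends, applies my own filtering policy, and then forwards them to other services.
Create an alert endpoint in Flask and print the data it receives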
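from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/alert', methods=['POST'])
def alert():
    # SkyWalking posts a JSON array of alarm objects
    data = request.get_json()
    print(data)
    return jsonify({'results': 'success'})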
Hooking up a DingTalk enterprise-internal robot
Build a plain-text message template
for i in data:
    msg = {
        "scope": i['scope'],
        "name": i['name'],
        "rule_name": i['ruleName'],
        "alarm_message": i['alarmMessage'],
        # startTime is in milliseconds; format it as local time
        "start_time": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(i['startTime']) / 1000)),
        "tags": str(i['tags'])
    }
    send_msg_tpl = {
        "msgtype": "text",
        "text": {
            "content": "Scope: {scope} \nName: {name} \nRule: {rule_name} \nStart time: {start_time} \nAlarm message: {alarm_message} \nTags: {tags}".format(
                **msg)
        },
        "at": {
            "atUserIds": [
                "manager5345"
            ],
            "isAtAll": False
        }
    }
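Get the enterprise-internal robot's webhook
This is not an ordinary group robot but an enterprise-internal robot, the kind users can interact with.
In DingTalk, go to group settings > Robots and open the enterprise-internal robot to view its webhook.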
headers = {'Content-Type': 'application/json'}
webhook = 'https://oapi.dingtalk.com/robot/send?access_token=dddddd'
Send the message to DingTalk
    # still inside the for loop above: one request per alarm
    requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
    time.sleep(0.2)
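Test it with Postman
The basics work: there is no real processing policy yet, just straight forwarding. Next I deploy it to a server and hook it up to SkyWalking.
Adjust the SkyWalking webhook
Edit alarm-settings.yml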
# Previous DingTalk hook
#dingtalkHooks:
#  textTemplate: |-
#    {
#      "msgtype": "text",
#      "text": {
#        "content": "Apache SkyWalking Alarm: \n %s."
#      }
#    }
#  webhooks:
#    - url: https://oapi.dingtalk.com/robot/send?access_token=xxxxdfdf
#      secret: xxxxx
# Changed to the following
webhooks:
  - http://alert-center.xadocker.cn/alert
The final result
Webhook dispatch policy
Match name against a set of regular expressions and return the corresponding webhook
import re

def get_webhook(name):
    # Regex pattern (matched against the alarm's name) -> DingTalk webhook
    policy = {
        "default": "https://oapi.dingtalk.com/robot/send?access_token=default",
        "bff-center": "https://oapi.dingtalk.com/robot/send?access_token=aaaa",
        "products-center": "https://oapi.dingtalk.com/robot/send?access_token=bbbb",
        "inventory-center": "https://oapi.dingtalk.com/robot/send?access_token=cccc",
        "member-center": "https://oapi.dingtalk.com/robot/send?access_token=dddd",
    }
    for pattern, webhook in policy.items():
        if pattern == "default":
            continue  # "default" is the fallback, not a pattern
        if re.search(pattern, name):
            return webhook
    return policy['default']
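At this point alarms can be dispatched to different webhooks by service name, falling back to the default hook when nothing matches. There is plenty of room to extend this:
- Add user IDs to the policy so a message can @-mention people of a given level
- Split on the level tag: for severe levels, @ everyone or the senior people
- Integrate other webhooks (WeChat, Feishu, Slack, and so on); the message format has to change to match
- Move the policy into a separate configuration file, or manage it in the Nacos configuration center so it can be changed dynamically
- Store the alarms in a database? Then @ the robot and have it query an alarm summary table?
- Rate-limit alarms (third-party platforms cap call frequency, so this needs some form of control): bring in an MQ, then group or perhaps inhibit?
- Wrap the alarms and forward them to Alertmanager...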
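As a rough idea of what the rate-limiting item could look like without an MQ, here is a minimal sketch that suppresses repeats of the same alarm inside a time window. Everything in it is an assumption of mine: the class name AlarmSuppressor, the choice of (name, ruleName) as the identity of an alarm, and the 300-second default. State lives in memory, so it only works for a single relay process and is lost on restart.

```python
import time


class AlarmSuppressor:
    """Drop repeats of the same (name, ruleName) pair inside a time window.

    Hypothetical helper for the relay; not part of SkyWalking or DingTalk.
    """

    def __init__(self, window_seconds=300, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock        # injectable so the logic can be tested
        self._last_sent = {}      # (name, ruleName) -> time of last forward

    def should_send(self, alarm):
        key = (alarm.get("name"), alarm.get("ruleName"))
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False          # still inside the window: suppress
        self._last_sent[key] = now
        return True
```

In the relay, the loop over incoming alarms would call should_send(i) first and skip the DingTalk request when it returns False.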
Forwarding to Alertmanager?
API endpoint: /api/v2/alerts
Alertmanager's message format:
[
  {
    "labels": {
      "alertname": "<requiredAlertName>",
      "<labelname>": "<labelvalue>",
      ...
    },
    "annotations": {
      "<labelname>": "<labelvalue>",
    },
    "startsAt": "<rfc3339>",
    "endsAt": "<rfc3339>",
    "generatorURL": "<generator_url>"
  },
  ...
]
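- startsAt: when the alert started
- endsAt: when the alert ended
If neither is specified, startsAt defaults to the time the alert is received and endsAt to startsAt + resolve_timeout (default 5m). For an alert that keeps firing, each time Alertmanager receives the same alert again (with startsAt and endsAt unset) it pushes endsAt out to the time of that receipt plus resolve_timeout, so the alert stays active for as long as it keeps arriving.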
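To make the timing concrete, here is a toy model of that default as I read it. It is not Alertmanager's code; the function name default_window is made up, and it only covers the case where both startsAt and endsAt are left out.

```python
from datetime import datetime, timedelta, timezone


def default_window(received_at, resolve_timeout=timedelta(minutes=5)):
    """Toy model: (startsAt, endsAt) for an alert posted without either field."""
    return received_at, received_at + resolve_timeout


# First receipt at 12:00 UTC with the default 5m resolve_timeout.
first = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
starts_at, ends_at = default_window(first)

# The same alert posted again 4 minutes later pushes endsAt further out.
_, ends_at_after_repeat = default_window(first + timedelta(minutes=4))
```

Here ends_at is 12:05 and ends_at_after_repeat is 12:09, so an alert that is re-sent more often than resolve_timeout never resolves on its own.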
My Alertmanager currently has resolve_timeout set to 10m:
[root@k8s-master manifests]# cat alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.24.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "10m"
    # (rest omitted)
Try constructing an alert body
[
  {
    "labels": {
      "skywalking_scope": "service",
      "skywalking_name": "inventory-center",
      "skywalking_rulename": "service_resp_time_rule",
      "alert_from": "skywalking",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a inventory center alert",
      "description": "Response time of service inventory-center is more than 100ms in last 2 minutes."
    }
  }
]
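Simulate it with Postman
That works well. All that remains is to convert the SkyWalking message body into the form Alertmanager understands, which saves me from writing inhibition and grouping logic myself. I omit that step here.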
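For readers who want a starting point for that conversion, here is one possible shape for it. It is a sketch under assumptions, not tested against a live Alertmanager: the function names are made up, the label names follow the hand-built body above, using ruleName as alertname and "warning" as the fallback severity are my own choices, and the summary wording is invented.

```python
from datetime import datetime, timezone


def to_alertmanager_alert(alarm):
    """Map one SkyWalking alarm dict to one Alertmanager alert dict."""
    severity = "warning"  # assumed fallback when no level tag is present
    for tag in alarm.get("tags") or []:
        if tag.get("key") == "level":
            severity = str(tag.get("value", severity)).lower()
    # startTime is epoch milliseconds; Alertmanager expects RFC 3339
    starts_at = datetime.fromtimestamp(int(alarm["startTime"]) / 1000, tz=timezone.utc)
    return {
        "labels": {
            "alertname": alarm["ruleName"],  # assumption: reuse the rule name
            "skywalking_scope": alarm["scope"].lower(),
            "skywalking_name": alarm["name"],
            "skywalking_rulename": alarm["ruleName"],
            "alert_from": "skywalking",
            "severity": severity,
        },
        "annotations": {
            "summary": "SkyWalking alarm for {}".format(alarm["name"]),
            "description": alarm["alarmMessage"],
        },
        "startsAt": starts_at.isoformat(),
    }


def to_alertmanager_payload(alarms):
    """Convert the whole webhook body (a list of alarms) in one go."""
    return [to_alertmanager_alert(a) for a in alarms]
```

The resulting list would be sent as the JSON body of a POST to /api/v2/alerts, in place of the DingTalk request in the relay.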
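Writing all of this yourself is also an option, and Alertmanager is a good reference for many of the extensions. I wrote this relay because I could not find a SkyWalking setting for this kind of routing, and someone asked me about it recently. If you know how to configure it in SkyWalking itself, please let me know 😭
References: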
- https://skywalking.apache.org/docs/main/v9.1.0/en/setup/backend/backend-alarm/
- https://prometheus.io/docs/alerting/latest/clients/
- https://github.com/prometheus/alertmanager/blob/main/api/v2/openapi.yaml