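Amazon CloudWatch is a service that monitors AWS resources and the applications running on AWS in real time. CloudWatch can send alarm messages through Amazon SNS. After you configure the URL of the Simple Log Service alert ingestion endpoint in Amazon SNS, CloudWatch alarm messages are sent to Simple Log Service, and the Simple Log Service alerting system handles alert denoising, notification, and related processing.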

Prerequisites

An alert ingestion application whose protocol is CloudWatch is created. For more information, see Configure alert ingestion endpoints.

Configure CloudWatch

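  1. Log on to the AWS Management Console.
  2. Create an SNS topic.
    Configure the following required parameters in the Amazon SNS console. For more information, see To create an SNS topic.
    • Type: the type of the topic. Select Standard.
    • Name: the name of the topic.
  3. Subscribe to the SNS topic.
    Configure the following required parameters in the Amazon SNS console. For more information, see To subscribe to an SNS topic.
    • Topic ARN: the ARN of the topic that you created in Step 2.
    • Protocol: the protocol. Select HTTP.
    • Endpoint: the endpoint information (the complete URL) that is generated after you create the alert ingestion service and application in Simple Log Service. For more information, see Obtain endpoint information.
    • Enable raw message delivery: select the Enable raw message delivery check box.
    After the configuration is complete, the subscription is in the Pending confirmation state. Amazon SNS then sends a subscription confirmation message to Simple Log Service, and Simple Log Service automatically visits the confirmation link in that message. If the visit succeeds, the subscription changes to the Confirmed state, which indicates that the subscription is successful.
    Note If the subscription does not succeed, select the subscription and click Request confirmation to send another confirmation message. If it still fails, check the error logs in the alert troubleshooting center of Simple Log Service.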
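  4. Select the alarm that you want to ingest into Simple Log Service and add notifications.
    On the edit page of the alarm in the CloudWatch console, add two notifications as described below. For more information, see To edit an alarm.
    • Alarm state trigger: select the alarm state that triggers the notification.
      • For one notification, set Alarm state trigger to In alarm or Insufficient data. An alarm notification is sent when the alarm enters the selected state.
      • For the other notification, set Alarm state trigger to OK. A recovery notification is sent when the alarm recovers.
    • Select an SNS topic: select Select an existing SNS topic.
    • Send a notification to…: select the topic that you created in Step 2.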
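Steps 2 and 3 can also be scripted. The following is a minimal sketch, not part of the original procedure. It assumes that sns_client is an object that behaves like boto3.client("sns"); the function name subscribe_sls_endpoint and its parameters are hypothetical, and endpoint_url stands for the complete URL that you copy from Simple Log Service.

```python
from urllib.parse import urlparse


def subscribe_sls_endpoint(sns_client, topic_name, endpoint_url):
    """Create a Standard SNS topic and subscribe an endpoint URL to it.

    sns_client is assumed to behave like boto3.client("sns").
    Returns the topic ARN and the subscription ARN.
    """
    # The SNS subscription protocol has to match the URL scheme.
    protocol = urlparse(endpoint_url).scheme.lower()
    if protocol not in ("http", "https"):
        raise ValueError(f"unsupported endpoint scheme: {protocol!r}")

    # A topic is Standard unless FIFO attributes are requested.
    topic_arn = sns_client.create_topic(Name=topic_name)["TopicArn"]

    response = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol=protocol,
        Endpoint=endpoint_url,
        # Same effect as the "Enable raw message delivery" check box.
        Attributes={"RawMessageDelivery": "true"},
        # Return an ARN even while the subscription is pending confirmation.
        ReturnSubscriptionArn=True,
    )
    return topic_arn, response["SubscriptionArn"]
```

With boto3 installed and AWS credentials configured, the function could be called as subscribe_sls_endpoint(boto3.client("sns"), "my-topic", endpoint_url). The subscription still has to reach the Confirmed state as described in Step 3, and Step 4 is not covered by this sketch.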

CloudWatch alarm messages

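CloudWatch alarms are classified into static threshold alarms and anomaly detection alarms. The value of the Trigger field differs between the two kinds of alarm message. For more information, see the CloudWatch::Alarm property reference.
  • In a static threshold alarm message, the value of the Trigger field contains fields such as MetricName and Dimensions.
  • In an anomaly detection alarm message, the value of the Trigger field contains fields such as Metrics. The value of the Metrics field is a list of metric data queries.
  • Static threshold alarm message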
    {
        "AlarmName": "test-alert",
        "AlarmDescription": "this is a test alert",
        "AWSAccountId": "123456",
        "NewStateValue": "ALARM",
        "NewStateReason": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (04/08/21 03:06:00)] was greater than or equal to the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).",
        "StateChangeTime": "2021-08-04T03:10:10.215+0000",
        "Region": "US East (Ohio)",
        "AlarmArn": "arn:aws:cloudwatch:us-east-2:123456:alarm:test-alert",
        "OldStateValue": "OK",
        "Trigger":
        {
            "MetricName": "NumberOfMessagesPublished",
            "Namespace": "AWS/SNS",
            "StatisticType": "Statistic",
            "Statistic": "SUM",
            "Unit": null,
            "Dimensions":
            [
                {
                    "value": "my-topic",
                    "name": "TopicName"
                }
            ],
            "Period": 60,
            "EvaluationPeriods": 1,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "Threshold": 1.0,
            "TreatMissingData": "- TreatMissingData:                    missing",
            "EvaluateLowSampleCountPercentile": ""
        }
    }
  • Anomaly detection alarm message
    {
        "AlarmName": "cpu alrm",
        "AlarmDescription": "this is a cpu alarm",
        "AWSAccountId": "123456",
        "NewStateValue": "INSUFFICIENT_DATA",
        "NewStateReason": "Threshold Crossed: no datapoints were received for 2 periods and 2 missing datapoints were treated as [Breaching].",
        "StateChangeTime": "2021-08-05T08:38:47.104+0000",
        "Region": "US East (Ohio)",
        "AlarmArn": "arn:aws:cloudwatch:us-east-2:123456:alarm:cpu alrm",
        "OldStateValue": "OK",
        "Trigger":
        {
            "Period": 60,
            "EvaluationPeriods": 2,
            "ComparisonOperator": "GreaterThanUpperThreshold",
            "ThresholdMetricId": "ad1",
            "TreatMissingData": "- TreatMissingData:                    breaching",
            "EvaluateLowSampleCountPercentile": "",
            "Metrics":
            [
                {
                    "Id": "m1",
                    "MetricStat":
                    {
                        "Metric":
                        {
                            "Dimensions":
                            [
                                {
                                    "value": "i-1a2b3c4d",
                                    "name": "InstanceId"
                                }
                            ],
                            "MetricName": "CPUUtilization",
                            "Namespace": "AWS/EC2"
                        },
                        "Period": 60,
                        "Stat": "Average"
                    },
                    "ReturnData": true
                },
                {
                    "Expression": "ANOMALY_DETECTION_BAND(m1, 0.1)",
                    "Id": "ad1",
                    "Label": "CPUUtilization (expected)",
                    "ReturnData": true
                }
            ]
        }
    }
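
The difference between the two message types can be checked in code. The following is a minimal sketch that applies only the distinction described in this section (Metrics versus MetricName inside Trigger). The function name cloudwatch_alarm_type is hypothetical, and the returned strings are borrowed from the __cloud_watch_alert_type__ values used in this document. Other kinds of CloudWatch alarm are not considered.

```python
import json


def cloudwatch_alarm_type(message):
    """Classify a parsed CloudWatch alarm message by its Trigger field."""
    trigger = message.get("Trigger") or {}
    # Anomaly detection messages carry a list of metric data queries.
    if "Metrics" in trigger:
        return "AnomalyDetection"
    # Static threshold messages name the metric directly.
    if "MetricName" in trigger:
        return "StaticThreshold"
    raise ValueError("Trigger contains neither Metrics nor MetricName")


# With raw message delivery enabled, the request body is the alarm JSON itself.
raw_body = '{"AlarmName": "test-alert", "Trigger": {"MetricName": "NumberOfMessagesPublished"}}'
alarm_type = cloudwatch_alarm_type(json.loads(raw_body))
print(alarm_type)
```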

Alarm message mapping

After a CloudWatch alarm is ingested into Simple Log Service, it is mapped to a Simple Log Service alert. Examples:

  • Static threshold alarm message
    {
        "aliuid": "aliuid1",
        "alert_instance_id": "{automatically generated}",
        "alert_id": "CloudWatch_test-alert",
        "alert_type": "sls_pub",
        "alert_name": "test-alert",
        "region": "{region of the alert center project}",
        "project": "{project to which the alert center belongs}",
        "project_id": 0,
        "next_eval_interval": 60,
        "alert_time": 1628046610,
        "fire_time": 1628046610,
        "fire_results": null,
        "fire_results_count": 0,
        "resolve_time": 0,
        "status": "firing",
        "results": null,
        "labels":
        {
            "TopicName": "my-topic",
            "__comparison_operator__": "GreaterThanOrEqualToThreshold",
            "__statistic__": "SUM",
            "__statistic_type__": "Statistic",
            "__threshold__": "1",
            "metric_name": "NumberOfMessagesPublished"
        },
        "annotations":
        {
            "__alarm_arn__": "arn:aws:cloudwatch:us-east-2:123456:alarm:test-alert",
            "__aws_accountId__": "123456",
            "__aws_region__": "US East (Ohio)",
            "__cloud_watch_alert_type__": "StaticThreshold",
            "__config_app__": "sls_pub_alert",
            "__pub_alert_app__": "{alert ingestion application ID}",
            "__pub_alert_protocol__": "cloud_watch",
            "__pub_alert_region__": "{region of the endpoint that receives the alarm message}",
            "__pub_alert_service__": "{alert ingestion service ID}",
            "desc": "this is a test alert",
            "title": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (04/08/21 03:06:00)] was greater than or equal to the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition)."
        },
        "severity": 10,
        "policy":
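        {
            "alert_policy_id": "{ID of the alert policy configured in the alert ingestion application}",
            "action_policy_id": "{ID of the action policy configured in the alert ingestion application}",
            "use_default": false,
            "repeat_interval": "{repeat interval configured in the alert ingestion application}"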
        },
        "template": null,
        "drill_down_query": "https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#alarmsV2:alarm/test-alert"
    }
  • Anomaly detection alarm message
    {
        "aliuid": "aliuid1",
        "alert_instance_id": "{automatically generated}",
        "alert_id": "CloudWatch_cpu alrm",
        "alert_type": "sls_pub",
        "alert_name": "cpu alrm",
        "region": "{region of the alert center project}",
        "project": "{project to which the alert center belongs}",
        "project_id": 0,
        "next_eval_interval": 120,
        "alert_time": 1628152727,
        "fire_time": 1628152727,
        "fire_results": null,
        "fire_results_count": 0,
        "resolve_time": 0,
        "status": "firing",
        "results": null,
        "labels":
        {
            "__comparison_operator__": "GreaterThanUpperThreshold",
            "__threshold_metricId__": "ad1"
        },
        "annotations":
        {
            "__alarm_arn__": "arn:aws:cloudwatch:us-east-2:123456:alarm:cpu alrm",
            "__aws_accountId__": "123456",
            "__aws_region__": "US East (Ohio)",
            "__cloud_watch_alert_type__": "AnomalyDetection",
            "__config_app__": "sls_pub_alert",
            "__pub_alert_app__": "{alert ingestion application ID}",
            "__pub_alert_protocol__": "cloud_watch",
            "__pub_alert_region__": "{region of the endpoint that receives the alarm message}",
            "__pub_alert_service__": "{alert ingestion service ID}",
            "desc": "this is a cpu alarm",
            "title": "Threshold Crossed: no datapoints were received for 2 periods and 2 missing datapoints were treated as [Breaching]."
        },
        "severity": 8,
        "policy":
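        {
            "alert_policy_id": "{ID of the alert policy configured in the alert ingestion application}",
            "action_policy_id": "{ID of the action policy configured in the alert ingestion application}",
            "use_default": false,
            "repeat_interval": "{repeat interval configured in the alert ingestion application}"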
        },
        "template": null,
        "drill_down_query": "https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#alarmsV2:alarm/cpu%20alrm"
    }

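The following table describes how the fields of a Simple Log Service alert map to the fields of a CloudWatch alarm message. A hyphen (-) in the CloudWatch field column means that no CloudWatch field corresponds directly.

Simple Log Service field CloudWatch field Description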
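aliuid - The ID of the Alibaba Cloud account that owns the alert ingestion application used to ingest the alarm.
alert_id - The ID of the alert monitoring rule. The value is CloudWatch_{$alert_name}, where {$alert_name} is the name of the alert monitoring rule.
alert_type - The alert type. The value is fixed to sls_pub.
alert_name AlarmName The name of the alert monitoring rule.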
status NewStateValue The alert status.
  • If the value of NewStateValue in the CloudWatch alarm message is ALARM or INSUFFICIENT_DATA, the value of status is firing.
  • If the value of NewStateValue in the CloudWatch alarm message is OK, the value of status is resolved.
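next_eval_interval Period, EvaluationPeriods The alert evaluation interval. The value is the product of the Period and EvaluationPeriods values in the CloudWatch alarm message. For example, a Period of 60 and an EvaluationPeriods of 2 give 120.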
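alert_time StateChangeTime The time when the alert is triggered.
fire_time StateChangeTime The time when the alert is first triggered.
resolve_time StateChangeTime The time when the alert is resolved.
  • If the value of status is firing, the value of resolve_time is 0.
  • If the value of status is resolved, the value of resolve_time is the value of StateChangeTime in the CloudWatch alarm message.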
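labels - The label information.
  • Static threshold alarm message
    • The following fields and their values are added to labels under new names:
      • ComparisonOperator is renamed __comparison_operator__
      • MetricName is renamed __metric_name__
      • StatisticType is renamed __statistic_type__
      • Statistic is renamed __statistic__
      • Threshold is renamed __threshold__
    • For each entry in the Dimensions field, the value of name is used as a field name and the value of value as its field value, and the pair is added to labels.
  • Anomaly detection alarm message
    The following fields and their values are added to labels under new names:
    • ComparisonOperator is renamed __comparison_operator__
    • ThresholdMetricId is renamed __threshold_metricId__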
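annotations - The annotation information. The following fields are added to the annotations field in Simple Log Service:
  • desc: the description of the alert. Corresponds to the value of AlarmDescription in the CloudWatch alarm message.
  • title: the title of the alert message. Corresponds to the value of NewStateReason in the CloudWatch alarm message.
  • __cloud_watch_alert_type__: the CloudWatch alarm type.
    • For a static threshold alarm, the value is StaticThreshold.
    • For an anomaly detection alarm, the value is AnomalyDetection.
  • All fields outside the Trigger field that are not otherwise used are added to annotations.

    Each of these fields is renamed: the name is converted to lowercase and wrapped in two underscores (__) on each side. A name that consists of several words is split into words, and the words are joined with single underscores (_). For example, AlarmArn is renamed __alarm_arn__.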

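severity NewStateValue The alert severity.
  • If the value of NewStateValue in the CloudWatch alarm message is ALARM, the value of severity is 10, which indicates critical.
  • If the value of NewStateValue in the CloudWatch alarm message is INSUFFICIENT_DATA, the value of severity is 8, which indicates high.
  • If the value of NewStateValue in the CloudWatch alarm message is OK, the value of severity is determined by the value of OldStateValue in the CloudWatch alarm message.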
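policy - The alert policy that you configured in the alert ingestion application. For more information, see Policy structure.
project - The project to which the alert center belongs. For more information, see Project.
drill_down_query - The URL of the corresponding CloudWatch alarm.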
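
The rules for status, severity, next_eval_interval, and the time fields can be sketched in code. This is an illustrative sketch based only on the table and the example messages in this document, not the Simple Log Service implementation. The names map_core_fields, to_epoch_seconds, and SEVERITY_BY_STATE are hypothetical. Two points are assumptions: timestamps are Unix seconds (as the example values of alert_time suggest), and for an OK message the severity is looked up from OldStateValue using the same ALARM and INSUFFICIENT_DATA values. The labels, annotations, and fire_time fields are left out.

```python
from datetime import datetime

# Severity values taken from the severity row of the mapping table.
SEVERITY_BY_STATE = {"ALARM": 10, "INSUFFICIENT_DATA": 8}


def to_epoch_seconds(state_change_time):
    """Convert a timestamp such as 2021-08-04T03:10:10.215+0000 to Unix seconds."""
    parsed = datetime.strptime(state_change_time, "%Y-%m-%dT%H:%M:%S.%f%z")
    return int(parsed.timestamp())


def map_core_fields(message):
    """Map a parsed CloudWatch alarm message to a few Simple Log Service fields."""
    new_state = message["NewStateValue"]
    if new_state not in ("ALARM", "INSUFFICIENT_DATA", "OK"):
        raise ValueError(f"unexpected NewStateValue: {new_state!r}")

    firing = new_state != "OK"
    trigger = message["Trigger"]
    changed_at = to_epoch_seconds(message["StateChangeTime"])
    # Assumption: a recovered alert reuses the severity of the state it left.
    severity_state = new_state if firing else message.get("OldStateValue")

    return {
        "alert_id": "CloudWatch_" + message["AlarmName"],
        "alert_type": "sls_pub",
        "alert_name": message["AlarmName"],
        "status": "firing" if firing else "resolved",
        "next_eval_interval": trigger["Period"] * trigger["EvaluationPeriods"],
        "alert_time": changed_at,
        "resolve_time": 0 if firing else changed_at,
        "severity": SEVERITY_BY_STATE.get(severity_state),
    }
```

Applied to the static threshold example above (StateChangeTime 2021-08-04T03:10:10.215+0000, Period 60, EvaluationPeriods 1, NewStateValue ALARM), the sketch yields alert_time 1628046610, next_eval_interval 60, status firing, and severity 10, which matches the mapped example.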