SignalFx Developers Guide

Detectors, Events, and Alerts

Detectors watch incoming data for anomalous conditions specified by SignalFlow calculations and other settings. In response to an anomalous condition, detectors record an event, trigger an alert, and optionally send off notifications using third-party services. Detectors can also record events, alerts, and notifications when the anomalous condition clears (is no longer present).

You can also use detectors to monitor Microservices APM (µAPM) metrics. SignalFx provides a library of advanced SignalFlow functions that help you create µAPM detectors. To learn more about these detectors and the SignalFlow library, see the section Microservices APM detectors.

Detectors

Specifically, detectors define the following:

  • A trigger condition, specified in a SignalFlow program

  • A severity to set when the trigger condition occurs

  • Where and how notifications are sent

  • The content included in notifications

Detector actions

When SignalFx detects that a trigger condition exists, it does the following:

  • Generates an event

  • Sets off an alert

  • Sends one or more notifications to people to inform them of the alert

When SignalFx detects that the condition no longer exists, it does the following:

  • Generates a second event

  • Clears the alert

  • Sends a second set of notifications

Considerations for detector properties

Tags

You can have no more than 50 tags per detector.

Microservices APM detectors

µAPM detectors monitor incoming µAPM metrics for anomalies. The built-in SignalFlow library provides functions that help you construct latency and error rate detectors for your services and endpoints. You can find reference documentation for this library in the APM module documentation hosted on GitHub.

µAPM metrics are only available through the SignalFx µAPM product. To learn more about µAPM tracing, see the topic Working with APM Tracing.

µAPM detectors use features common to all detectors as well as features designed specifically for µAPM.

Features common with other detectors:

  • In response to an anomaly, they trigger alerts, record alert events, and send out alert notifications.

  • When the anomaly clears, they turn off alerts, record clear events, and send out clear notifications.

  • µAPM detectors use SignalFlow programs.

Features specific to µAPM detectors

µAPM detectors use tailored alerting strategies implemented by built-in functions in the SignalFlow library apm module. These functions use a pre-determined set of span metrics and properties to identify anomalies. This results in a specific SignalFlow design pattern:

  • You don’t need to use stream constructors such as data() to define the data you want to monitor. The built-in µAPM functions are already set up to do this.

  • In most cases, you don’t need to call detect():

    • Most of the built-in functions return detect blocks, so all you need to do is call publish() on the function result.

    • Some functions return when() blocks. Pass them to detect(), then call publish() on the detect() result.

    • Some functions return a stream. You can call when() on the results of these functions, call detect() on the when() results, then call publish() on the detect() result.
      You can also create a chart with these functions, by calling publish() on the stream in a SignalFlow chart program.

  • You don’t have to write complex statistical calculations, because error rate and latency calculations are part of the built-in functions.

The apm built-in functions are also the basis of the built-in µAPM alert conditions offered by the web UI. To learn more, see the topic Using Built-in µAPM Alert Conditions in the product documentation.

The submodules of apm are organized by the type of anomalies they detect.

Microservices APM errors submodule

This section summarizes the errors submodule. The full reference documentation for each function is available in the APM errors library on GitHub.

The errors submodule provides functions for error rates:

  • streams.error_rate() returns an error rate stream. To use the function in a detector, pass the stream to a when() function, pass the result to detect(), then call publish() on the resulting detect block, as shown in the sketch after this list.

  • conditions.error_rate_static() monitors spans for an incoming error rate that exceeds a static threshold. It returns a dictionary that includes the following items:

    • on: when() block that returns True when the error rate exceeds the trigger threshold

    • off: when() block that returns True when the error rate falls below the clear threshold.

      In a µAPM detector SignalFlow program, pass these items to a detect() call to create a detect block, then call publish() on the block.

  • errors.detector() returns a detect block that triggers when both of the following are true:

    • The static error rate exceeds the static trigger threshold

    • The percent growth in error rate between the current window and the previous window exceeds the percent growth trigger threshold.

      The block clears when both the static and percent growth error rates fall below their clear thresholds. To use the detect block in a µAPM detector SignalFlow program, call publish(<label>).

  • errors.static.detector() calculates the error rate by calling streams.error_rate(). It returns a detect block that:

    • Triggers when the rate exceeds the specified trigger threshold

    • Clears when the rate falls below the specified clear threshold

  • sudden_change.detector() calculates the error rate in the current window and compares it to the rate in a baseline window. The function returns a detect block that:

    • Triggers when the percent growth error rate exceeds the specified trigger threshold

    • Clears when the rate falls below the specified clear threshold
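
The following sketch shows the streams.error_rate() pattern described in the first item of this list. The filter, threshold, and duration values are illustrative, and the filter_ parameter name is an assumption; check the APM errors library reference on GitHub for the exact signature.

from signalfx.detectors.apm.errors import streams

#Error rate stream for one service; the filter_ parameter name is assumed here
f = filter('service', 'my_svc')
err = streams.error_rate(filter_=f)

#Fires when the error rate stays above 10% for 10 minutes
detect(when(err > 0.1, '10m')).publish('error_rate_detector')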

Microservices APM latency submodule

This section summarizes the latency submodule. The full reference documentation for each function is available in the APM latency library on GitHub.

The latency submodule provides functions for latency:

  • The static.detector() function returns a detect block that’s triggered when latency exceeds a static limit.

  • The sudden_change.growth_rate() function returns a detect block that’s triggered when latency is growing too quickly compared to a recent baseline.

  • The sudden_change.deviations_from_norm() function returns a detect block that’s triggered when latency in the current window is too many deviations from the baseline established in the preceding window.

  • The historical_anomaly.growth_rate() function returns a detect block that’s triggered when latency is growing too quickly compared to a historical baseline.

  • The historical_anomaly.deviations_from_norm() function returns a detect block that’s triggered when latency is too many deviations from a historical norm.

Microservices APM SignalFlow library

The SignalFlow library, including the apm module, is built into SignalFx. To use it in a SignalFlow program, import the functions you want to use with the Python from <module> import <object> syntax.
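
For example, the detector examples in the next section use imports such as the following:

from signalfx.detectors.apm.errors import streams
from signalfx.detectors.apm.errors import conditions
from signalfx.detectors.apm.latency.static import static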

The next section contains examples of µAPM SignalFlow programs as well as a curl command that creates a µAPM detector.

Microservices APM detector examples

  1. Microservices APM - Detect error rate defines a detector that’s triggered by error rate in incoming spans.

  2. Microservices APM - Detect error rate growth defines a detector that’s triggered by the growth in error rate in the current window of spans, compared to a previous baseline window.

  3. Microservices APM - Detect static latency defines a detector that’s triggered by latency that exceeds a static threshold.

Microservices APM - Detect error rate

This SignalFlow program uses the streams and conditions objects in the errors submodule of apm to create a detect block that’s triggered when the error rate exceeds a specified limit. The program does the following:

  • Creates a filter that selects spans that have the properties "service: my_svc" and "operation: do_thing".

  • Inside conditions.error_rate_static(), calls streams.error_rate() to calculate the error rate from endpoint spans, based on arguments passed to the main function. Endpoint spans have kind='CONSUMER' or 'SERVER'.

  • Sets the trigger condition to be an error rate > 10% of all spans, and the clear condition to be an error rate < 5% of all spans.

  • Creates the detect block and publishes it.

from signalfx.detectors.apm.errors import streams
from signalfx.detectors.apm.errors import conditions

f = filter('service', 'my_svc') and filter('operation', 'do_thing')

#Calculates the error rate over a 15 minute window.
#Returns a dictionary containing when() functions
c = conditions.error_rate_static(current_window=duration('15m'), filter_=f, fire_rate_threshold=0.1, clear_rate_threshold=0.05)

#Fires when error rate over 15 minutes is > 10%, clears when error rate is < 5%
detect(c['on'], off=c['off']).publish('det')

Microservices APM - Detect error rate growth

This SignalFlow program uses the sudden_change submodule in the errors submodule of apm to create a detect block that’s triggered when the growth in error rate exceeds a specified limit. The program does the following:

  • Creates a filter that selects spans that have the properties "service: my_svc" and "operation: my_op".

  • Calls sudden_change.detector(), passing in the filter. This function returns a detect block that triggers on error rate growth using default parameters except for a span filter.

from signalfx.detectors.apm.errors.sudden_change import sudden_change

filter_ = filter('service', 'my_svc') and filter('operation', 'my_op')

#Fires when the growth in error rate exceeds 50%, for
#a window of 5 minutes compared to a baseline window of the previous hour
sudden_change.detector(filter_).publish('my_det')

Microservices APM - Detect static latency

This SignalFlow program uses the static submodule in the latency submodule of apm to create a detect block that’s triggered when the 90th percentile of latency (the default) in incoming spans exceeds a specified limit. It does the following:

  • Detects latencies greater than 100 ms

  • To trigger, "latency > 100 milliseconds" must be true for more than 80% of 3 minutes.

  • To clear, "latency < 80 milliseconds" must be true for more than 90% of 2 minutes.

from signalfx.detectors.apm.latency.static import static

static.detector(100, lasting('3m', 0.8), 80, lasting('2m', 0.9)).publish('my_static_detector')

Detector trigger conditions

Every detect() function must have a trigger or "on" condition. SignalFlow has three types:

Immediate conditions

Conditions that only care about the value of the input streams each time the condition is evaluated. As soon as condition x is true, SignalFlow triggers an alert.

Duration conditions

Monitor the value of the input streams for a specified duration prior to the point that the condition is evaluated. To define a duration condition, use a when(<predicate>, <lasting>) function that triggers an alert if <predicate> is continuously true for the duration <lasting>.

Percentage of duration conditions

Monitor the value of the input streams for a specified percentage of a duration prior to the point that the condition is evaluated. To define a percentage of duration condition, use a when(<predicate>, <lasting>, <percent>) function that triggers an alert if <predicate> is true for at least <percent> of the <lasting> duration.
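
As a sketch, the three types of trigger conditions look like this in SignalFlow (the metric name and thresholds are illustrative):

#Immediate: true as soon as the value exceeds 80
when(data('cpu.utilization') > 80)

#Duration: true only after the value has exceeded 80 continuously for 10 minutes
when(data('cpu.utilization') > 80, '10m')

#Percentage of duration: true when the value has exceeded 80 for at least 90% of 10 minutes
when(data('cpu.utilization') > 80, '10m', 0.9)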

For more information about how to define conditions to trigger an alert, see the detect() function and the when() function documentation.

Clearing detectors

In general, an alert condition clears as soon as the trigger condition in the detect() is no longer true; that is, as soon as the not condition is met. Each type of trigger condition has its own not condition criteria:

Evaluate immediately

Not condition is met as soon as x is false.

Duration trigger

Not condition is met as soon as <predicate> is false. Notice that <predicate> does not have to be false for the entire value of <lasting>.

Percentage trigger

Not condition isn’t met until <predicate> is false for longer than (100% - <percent>) of <lasting>. For example, if <predicate> is cpu.utilization >= .50, <lasting> is 50 seconds, and <percent> is .50 (50%), the alert condition doesn’t clear until CPU utilization is less than 50 percent for 25 seconds.

If you don’t want any of these clear behaviors, you can set the clear condition as the second argument to detect(). For example, an alert condition that triggers when a > b continuously for the most recent 10 minutes clears as soon as a <= b. If you prefer, you can require that a <= b for more than 30 seconds before clearing the alert.
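
For example, the following sketch (the stream names are illustrative) requires a <= b for 30 seconds before the alert clears:

a = data('cache.hits')    #Illustrative streams; substitute your own metrics
b = data('cache.misses')

#Triggers after a > b continuously for 10 minutes; clears only after a <= b for 30 seconds
detect(when(a > b, '10m'), off=when(a <= b, '30s')).publish('a_above_b')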

For more information about clear conditions and how they interact with trigger conditions, see the detect() function documentation.

To specify a detector using the API, you have to specify a SignalFlow program as one of the request properties. In general, a SignalFlow program is simply one or more SignalFlow statements, but a SignalFlow detector program has specific requirements. The following sections describe the SignalFlow program requirements for detectors.

Detector SignalFlow programs

The SignalFlow program for a detector must have:

  • One or more calls to detect().

  • A trigger condition for each detect() call.

  • A publish() call that publishes the results of the detector, with a label that’s unique within the program.

In addition, the program can optionally have:

  • Clear conditions

  • A mode designation that determines when and how the conditions are evaluated.
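
For example, the following minimal program meets the requirements listed above:

detect(data('cpu.utilization').mean() > 50).publish('highCpu')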

Detector program requirements

The SignalFlow program must have at least one stream constructor that provides the base data. This is usually a call to the data() function.

You can use the resulting stream in the following ways:

  • Without change

  • Modified by one or more chained stream methods

  • Transformed into new streams using operators that perform a calculation on the input stream, such as returning the square root or natural log of input values, the mean of all current values, or dropping values that aren’t within a certain range.

After you modify and transform streams, you can construct boolean expressions from streams by comparing streams or by comparing streams to a constant value. You can further modify streams using when() functions if you want to apply duration or percentage of duration conditions.

The result of these operations is a simple condition that you can use as it stands or combine with other simple conditions using the boolean logic operators and, or, and not to construct a compound condition.

For example, this simple condition triggers an alert if the amount of deviation in CPU usage is high, using a base data stream modified by the stddev() function and compared to 2:

data('cpu.utilization').stddev() > 2

The following compound condition considers both the mean and the standard deviation. The base data stream is cpu.utilization, modified by stddev() function and mean() function. The resulting data streams are compared to constants, resulting in four boolean expressions that are combined into two. The overall expression is true if either of the two child expressions are true. This example also uses the when() function to create a duration requirement:

(data('cpu.utilization').stddev() > 3 and data('cpu.utilization').mean() > 50) or when(data('cpu.utilization').stddev() > 3 and data('cpu.utilization').mean() > 30,'5m')

SignalFlow calendar window transformations for detectors

SignalFlow programs for detectors let you use calendar window transformations, which perform a computation over calendar intervals or windows, such as days, weeks, and months.

SignalFlow provides calendar window transformations for several stream methods.

To learn more about SignalFlow calendar window transformations, see Calendar window transformations.

Timezones and SignalFlow calendar window transformations

By default, SignalFlow interprets calendar window transformations relative to the Coordinated Universal Time (UTC) time zone. To change this, use the timezone property when you create or update a detector. The property value is a string that denotes the geographic region associated with the time zone. For example, the following JSON is part of a detector request body:

{
  "name": "European Detector",
  "timezone": "Europe/London"
}

The section Supported SignalFlow time zones lists the supported time zones.

Requirements for custom notifications

To support custom notification messages that include input data, you have to assign the data to variables in the detector trigger and clear conditions in the SignalFlow program. You can then retrieve the data using the variables.

For example, this SignalFlow expression creates a valid detector:

detect(data('cpu.utilization').mean() > 50).publish('highCpu')

If you want to include the mean CPU utilization value in notifications when the alert is triggered, you have to use this equivalent SignalFlow instead:

highMean = data('cpu.utilization').mean(); detect(highMean > 50).publish('highCpu')

This lets you refer to the value of highMean in your custom notification message definition.

Working with nulls and missing data

Data points in a stream evaluate to null for several reasons, such as performing an operation that results in dividing by 0. In addition, data points in a stream get delayed or dropped. When any of these conditions occur, SignalFx defaults to replacing the data with a null after waiting for the duration specified in the maxDelay property of the detector.

When SignalFlow evaluates a stream to determine if it should trigger or clear an alert, the null values aren’t considered valid, but they’re also not considered anomalous. Instead, for conditions that require specific criteria to be met for a set duration, SignalFlow resets and restarts the clock.

To avoid this behavior, set an explicit value for missing data using one of these common methods:

Last value extrapolation

Set the extrapolation for each call to data() to last_value. This option tells SignalFlow to wait for maxDelay to expire, and then sets missing data points to the value that immediately preceded the gap. As a result, a single missing data point doesn’t interrupt the duration measurement for the condition. By default, an extrapolation value of last_value causes SignalFlow to fill in up to 100 consecutive missing data points. To allow fewer missing data points, set the maxExtrapolations argument. As soon as SignalFlow receives a valid data point, it resets the count of extrapolations. For example, the following SignalFlow program lets a detector ignore up to three consecutive missing data points before resetting the time for ongoing durations:

detect(when(data('cpu.utilization', extrapolation='last_value', maxExtrapolations=3).mean() > 50, '10m')).publish('highCpu')

fill stream

This method is similar to last value extrapolation, but it lets you set an explicit replacement value for missing data points and uses a time-based rather than a number-based limitation to determine when to stop replacing null values. After maxDelay expires, SignalFlow replaces null values with the value in the fill stream method call for the duration specified in the call. SignalFlow resets the fill duration whenever it receives valid data. Using the fill stream method lets you ensure that the fill value matches or doesn’t match the detector condition. For example, the following SignalFlow statement sets any missing values to 75 for 15 seconds:

detect(when(data('cpu.utilization').fill(value=75,duration='15s').mean() > 50, '10m')).publish('highCpu')

Detector data resolutions

Detectors have two distinct types of resolutions:

Detector job resolution

The interval at which SignalFlow analyzes data to determine if it should trigger an alert. The type of data being analyzed and the transformations applied to the data determine this resolution. It remains constant throughout the life of the detector.

Data display resolution

Rate at which data populates the detector visualization. SignalFx sets this resolution to the coarsest resolution among all the publish() function calls associated with the detector, because all the results are displayed in the same visualization. The data display resolution may change when someone modifies the time window for the display.

Rules control how triggered and cleared alerts are processed. Each detect function is mapped to a severity and a set of notifications using the unique label inside the associated publish method.

Alert severity

Severity indicates the relative impact of an alert. In the web UI, you can filter alerts by severity, which helps you focus on the most important alerts first.

SignalFx provides five severity levels as an enumerated type in the API:

  • Critical

  • Warning

  • Major

  • Minor

  • Info

SignalFx doesn’t assign any special meaning to these values, so you can use them however you want.

Table 1. SignalFx to PagerDuty severity mapping

SignalFx severity    PagerDuty severity

Critical             Critical
Major                Critical
Minor                Error
Warning              Warning
Info                 Info

Notifications

Each rule can include one or more notification definitions indicating where and how to send notifications.

Notification recipients

Detector alerts can send notifications to individual users or teams using email, third-party messaging services, or third-party incident management services.

Detector alerts can go to the following generally-available services:

  • Email address for a single user

  • Email address for one or more SignalFx teams

  • Team page in SignalFx

In addition, detector alerts can go to the following third-party notification services:

  • Amazon EventBridge

  • BigPanda

  • Jira

  • Microsoft Office 365

  • Opsgenie

  • PagerDuty

  • ServiceNow

  • Slack

  • VictorOps

  • Webhook

  • xMatters

To use a third-party notification service, you first have to integrate the service with SignalFx.

Integrating with notification services

A quick way to see if the notification service is already set up is to use the web UI. To learn more, see the topic Integration with Notification Services in the product documentation.

To use the API:

  1. Use the operation GET https://api.{REALM}.signalfx.com/v2/integration, setting the type query parameter to the service you’re looking for. For example, to see if you have integrated the Slack service, use GET /integration?type=Slack.

  2. If the response is 200 OK, then the service is already integrated. Note the id property in the response body; this is a SignalFx-assigned identifier for the integration that you specify when you set up a notification for an alert.

  3. If you receive the response 404 Not Found, you have to integrate the notification service. Because each integration has its own request body for Create Integration (POST https://api.{REALM}.signalfx.com/v2/integration) you need to refer to the Integrations API reference page for more information.

Considerations for using notification services

Jira

Use the Jira integration to automatically record hardware and software issues. When an alert generates a Jira notification, Jira creates a new ticket from the information in the notification. When the alert clears, Jira adds a comment to the ticket.

To learn more about integrating Jira with SignalFx, see the topic Integrate with Jira in the product documentation.

To learn more about Jira integration using the API, see the topic Integrating with Jira.

Custom notification messages

If you prefer, you can specify custom notification messages. They are only available via the API and only for v2 detectors. Also, you can’t display or modify them in the web application.

Custom messages include separate subject and body sections, specified as parameterizedSubject and parameterizedBody properties when you create a rule for the notification. All notification types accept plain text containing valid ASCII characters, as well as any of the variables provided by SignalFx. Some notification types also render Markdown text in their messages. See the sections below for more information.

Variables in custom notification messages

Custom notification variables contain the following types of information:

  • Information about the detector

  • Current state of the detector at the time an alert is triggered or cleared

  • Detector job data, including conditions and dimensions

  • Detector program information

Insert custom notification variables into a message definition by surrounding the variable name with curly braces { and }:

  • Double curly braces indicate a variable the API substitutes in place as is. Some characters in the value may be interpreted by the API rather than passed through and rendered in the output.

  • Triple curly braces indicate a variable for which the API escapes characters as needed. For example, use triple braces to escape quotation marks and angle brackets.

The following list is a sample of the available variables:

  • detectorName - the name of the detector (as specified in the name property)

  • detectorId - the ID of the detector (as specified in the id property); permits notification recipient to use API calls to obtain further information about the detector

  • ruleSeverity - the severity defined for the rule (as specified in the rules[x].severity property corresponding to the rule controlling the notification)

  • runbookUrl - a link to more information about how to process the notification (as specified in the rules[x].runbookUrl property corresponding to the rule controlling the notification)

  • tip - a quick first step to try upon receipt of the notification (as specified in the rules[x].tip property corresponding to the rule controlling the notification)

  • anomalous - true if the alert is currently triggered

  • normal - true if the alert is currently cleared

  • #if, else, /if - let you create conditional text; most useful for providing alternate text in alert triggered and alert cleared messages

  • inputs.variable.value - the value of the detector condition indicated by the specified variable corresponding to a defined data stream. See the SignalFlow Syntax section above for more information.

A more complete list of supported variables is available in the table Detector and Rule Details in the product documentation.

Markdown support in custom notification messages

GitHub-flavored Markdown is also supported in the body of custom notification messages, but not in the message subject or message tip. However, it may or may not be rendered depending on the intended output format of the messages. Markdown isn’t supported in notifications sent to any third-party incident management system or in notifications sent to third-party messaging systems other than Slack.

Email notifications support all Markdown formatting except tables. This includes:

  • Character styles.

  • Headers on a single line.

  • Links that use the syntax [description](url).

  • Images that use the syntax ![alt text](image-url).

  • Ordered and unordered lists.

  • Horizontal rules. Three or more consecutive underscores, hyphens, or asterisks on their own line render as a horizontal rule.

Slack notifications have limited markdown support:

  • Character styles are supported

  • Header notation is stripped out

  • The link constructs for URLs and images are replaced by the URL itself.

  • Unordered lists are displayed without the bullets

  • Ordered lists are statically numbered.

  • Horizontal rules are removed entirely.

Markup is stripped out entirely from other types of notifications:

  • [text](url) is rendered as text url

  • ![alt](image-url) is rendered as alt image-url.

  • Headers and character styles are ignored; the notation symbols are removed.

  • Lists are rendered as in Slack.

  • Horizontal rules are removed entirely.

Alert muting

To stop sending notifications from alerts generated by a detector, you mute the alert by creating alert muting rules. You can still view and track muted alerts and events in the web UI, and you can retrieve them using the API. Muting stops notifications while you’re making changes or performing tests.

For example, suppose you monitor a server with a detector that’s triggered when cpu utilization falls below a certain level. Normally, you use this detector to notify you that the server may have crashed.

Now you decide to shut down the server to do maintenance. If you don’t set alert muting rules, the detector starts sending out alert notifications. Recipients who didn’t know about the planned shutdown think that a real emergency is occurring.

If you do set alert muting rules, the detector continues to issue alerts and events, but it doesn’t send out notifications. When you’re finished with maintenance you can unmute the detector.

The following table compares disabling the detector with muting its notifications.

Disabling detector versus alert muting

Effect on alerts and notifications

  • Disabling the detector: Alerts aren’t triggered, so events aren’t created.

  • Muting alerts: Events are created, but notifications are affected as described in the next table.

Number of API calls

  • Disabling the detector: Separate call for each detector.

  • Muting alerts: One call for several detectors.

Prerequisites

  • Disabling the detector: Internal detector ID retrieved using the API.

  • Muting alerts: Name of dimensions (visible in the web UI).

Duration

  • Disabling the detector: Permanent, unless you re-enable the detector.

  • Muting alerts: Can be a specific time span or indefinite.

    • Specific time span: Notifications resume at the end of the span, except as noted in Considerations for alert muting.

    • Indefinite: You need to unmute the notifications in the web UI or using the API.

Considerations for alert muting

  • An alert muting rule may take up to one minute to go into effect.

  • SignalFx may send notifications during an alert muting period, as described in the following table:

Notifications during alert muting

  • Alert is active before the muting period, and is still active at the end of the muting period: no new notification is sent.

  • Alert is active before the muting period, and clears during the muting period: a clear notification is sent immediately (during the muting period).

  • Alert is triggered during the last two weeks of the muting period, and is still active at the end of the muting period: an alert notification is sent after the end of the muting period.

  • Alert is triggered more than two weeks before the end of the muting period, and is still active at the end of the muting period: no notification is sent during or after the muting period.

  • Alert is triggered during the muting period, and is cleared during the muting period: no notification is sent during or after the muting period.

To learn more about alert muting, refer to the topic Mute notifications in the product documentation.

Examples

Create a detector

You can create and manage detectors using REST API calls. The following example shows you how to create a detector which monitors the jvm.cpu.load metric and notifies person@example.org when it crosses a static threshold of 60.

$ curl --request POST \
    --header "X-SF-TOKEN: <YOUR_ACCESS_TOKEN>" \
    --header "Content-Type: application/json" \
    --data \
    '{
        "name": "CPU load too high",
        "programText": "detect(data(\"jvm.cpu.load\") > 60).publish(\"Load above 60%\")",
        "rules": [
            {
                "severity": "Critical",
                "detectLabel": "Load above 60%",
                "notifications": [
                    {
                        "type": "Email",
                        "email": "person@example.org"
                    }
                ]
            }
        ]
    }' \
    https://api.<REALM>.signalfx.com/v2/detector

The response body is similar to the following:

{
    "created": <CREATED_TIMESTAMP>,
    "creator": "<CREATOR_ID>",
    "customProperties": {},
    "description": null,
    "id": "<DETECTOR_ID>",
    "lastUpdated": <UPDATED_TIMESTAMP>,
    "lastUpdatedBy": "<UPDATER_ID>",
    "maxDelay": null,
    "name": "CPU load too high",
    "programText": "detect(data(\"jvm.cpu.load\") > 60).publish(\"Load above 60%\")",
    "rules": [
        {
            "description": null,
            "detectLabel": "Load above 60%",
            "disabled": false,
            "notifications": [
                {
                    "type": "Email",
                    "email": "person@example.org"
                }
            ],
            "severity": "Critical"
        }
    ],
    "tags": [],
    "visualizationOptions": null
}

Enable or disable a detector

By default, detectors you create are enabled, so they trigger an alert when their conditions are met.

You can disable and then re-enable a detector using the API.

Disable a detector

To disable a detector, use the operation PUT https://api.{REALM}.signalfx.com/v2/detector/{DETECTOR_ID}/disable. The value of DETECTOR_ID is returned in the response to the operation POST https://api.{REALM}.signalfx.com/v2/detector that creates the detector.

For example:

$ curl \
    --request PUT \
    --header "X-SF-TOKEN: <YOUR_ORG_TOKEN>" \
    --header "Content-Type: application/json" \
    --data \
    '
        [
            "Load above 60%"
        ]
    ' \
    https://api.<REALM>.signalfx.com/v2/detector/<DETECTOR_ID>/disable

Enable a detector

To re-enable a disabled detector, use the operation PUT https://api.{REALM}.signalfx.com/v2/detector/{DETECTOR_ID}/enable. The value of {DETECTOR_ID} is returned in the response to the operation POST https://api.{REALM}.signalfx.com/v2/detector that creates the detector.

$ curl \
    --request PUT \
    --header "X-SF-TOKEN: <YOUR_ORG_TOKEN>" \
    --header "Content-Type: application/json" \
    --data \
    '
       [
           "Load above 60%"
       ]
    ' \
    https://api.<REALM>.signalfx.com/v2/detector/<DETECTOR_ID>/enable

Read, update, or delete a detector

You can also read, update, and delete detectors.

  • Read detector:

    • One or more detectors based on search criteria:
      GET https://api.{REALM}.signalfx.com/v2/detector?query

  • Update detector: PUT https://api.{REALM}.signalfx.com/v2/detector/{DETECTOR_ID}

  • Delete detector: DELETE https://api.{REALM}.signalfx.com/v2/detector/{DETECTOR_ID}

Detector events and incidents

You can retrieve the events and incidents that a detector has produced using the operations GET https://api.{REALM}.signalfx.com/v2/detector/{DETECTOR_ID}/events and GET https://api.{REALM}.signalfx.com/v2/detector/{DETECTOR_ID}/incidents.

The following curl command demonstrates how to find active incidents (alerts) for a detector.

$ curl \
    --request GET \
    --header "X-SF-TOKEN: YOUR_ACCESS_TOKEN" \
    https://api.<REALM>.signalfx.com/v2/detector/<DETECTOR_ID>/incidents

To manually clear an incident, use the operation PUT https://api.{REALM}.signalfx.com/v2/incident/{INCIDENT_ID}/clear. The following curl command demonstrates how to do this.

$ curl \
    --request PUT \
    --header "X-SF-TOKEN: <YOUR_ACCESS_TOKEN>" \
    --header "Content-Type: application/json" \
    https://api.<REALM>.signalfx.com/v2/incident/<INCIDENT_ID>/clear

Detector notifications

You can specify a detector that checks for mean CPU usage over 60%. When it’s triggered, the detector sends alerts with Major severity to both a Slack channel and the "on call" email address. Create the detector using the following request body:

{
    "name": "High Mean CPU",
    "programText": "A = data(\"cpu.utilization\").mean(); detect(A > 60).publish(\"highMeanCpu\")",
    "tags": ["CPU"],
    "rules": [
        {
            "detectLabel":"highMeanCpu",
            "notifications": [
                {
                    "type":"Email",
                    "email":"oncall@example.com"
                },
                {
                    "type":"Slack",
                    "credentialId":"_id_",
                    "channel":"detector-alerts"
                }
            ],
        "runbookUrl":"http://runbook.example.com",
        "tip":"Add more machines!",
        "severity":"Major",
        "parameterizedSubject": "{{ruleSeverity}} Alert: {{{ruleName}}} {{{detectorName}}}",
        "parameterizedBody": "{{#if anomalous}}\nRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered\n{{else}}\nRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared{{/if}}\n\n{{#if anomalous}}\nTriggering condition: CPU utilization mean > 60\n{{/if}}\n\n{{#if anomalous}}Signal value: {{inputs.A.value}}\n{{else}}Current signal value: {{inputs.A.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}"
        }
    ]
}

When the alert triggers, you see the following notification delivered to the "on call" email address:

Alert triggered notification with custom email message

The alert also delivers the following message to Slack:

Alert triggered notification with custom Slack message

When you click the message, you see the following expanded text:

Alert triggered notification with expanded custom Slack message

When the detector condition is no longer met, the alert "clears" and its status reverts to OK. You now see the following notification delivered to the "on call" email address:

Alert cleared notification with custom email message

The corresponding notification appears in Slack:

Alert cleared notification with custom Slack message

Custom notification messages

To specify a detector with a simple custom notification message containing markdown, create a new detector with JSON that’s similar to the following request body:

{
    "name": "Low Mean CPU - Minor Alert",
    "programText": "A = data(\"cpu.utilization\").mean(); detect(A < 15).publish(\"lowMeanCpu\")",
    "tags": ["CPU"],
    "rules": [
        {
            "detectLabel": "lowMeanCpu",
            "notifications": [
                {
                    "type": "Email",
                    "email": "oncall@example.com"
                },
                {
                    "type": "Slack",
                    "credentialId": "_id_",
                    "channel": "detector-alerts"
                }
            ],
        "tip":"Consolidate Machines if possible to save money",
        "severity":"Minor",
        "parameterizedBody": "{{#if anomalous}}\n *Low CPU Usage* triggered \n\n {{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}"
        }
    ]
}

The notification message in email has the following form:

Email alert with custom Markdown

In Slack, the notification message has this form:

Slack alert with custom Markdown

Full detector example

Create the stream

Streams are instantiated with the data() SignalFlow function, which takes as its main argument a query that selects the metric time series you are interested in. The query string supports non-leading wildcards with the * character. You can also specify the optional filter argument. For example:

data('cpu.utilization').publish()
data('cpu.*').publish()

#With a filter
data('cpu.utilization', filter=filter('env', 'QA') and filter('serverType', 'API')).publish()

#Specifies the filter as a separate variable
qa_api_servers = filter('env', 'QA') and filter('serverType', 'API')
data('cpu.utilization', filter=qa_api_servers).publish()

Apply analytics to the stream

To make calculations on the data in the stream, call analytics methods on it:

#Find the p90 of the 1h mean of total cache hits by datacenter
data('cache.hits').sum(by='datacenter').mean(over='1h').percentile(90)

Create a detector

Detectors are functions that continually examine a stream, looking for a condition that you specify. When they detect the condition in the stream, they issue an alert and send out notifications. They can send notifications to incident management platforms such as PagerDuty, to messaging systems such as Slack, to email, or to any combination of these.

To create a detector, call detect() on a data stream that you publish. For example:

#Send events when cpu.utilization is above 50 (and when it falls below again)
detect(data('cpu.utilization') > 50).publish('cpu_too_high')

Specify detector conditions

You specify conditions with predicates, which are expressions that evaluate to true or false. The input to a predicate is a data stream or streams; the predicate compares each value in the stream against a threshold value. If the comparison evaluates to true, then SignalFlow replaces the input stream value with True (Python style); otherwise it is replaced with False. If the result is True, SignalFlow keeps the metadata for the input value.

To combine simple predicates into more complex expressions, use the logical operators and, or, and not. You can also use parentheses ( ) to isolate an expression and evaluate it before any outer expressions.

The following code shows some examples of SignalFlow predicates for detectors:

#True when any cpu.utilization timeseries' value is greater than 50
data('cpu.utilization') > 50

#True when cpu.utilization is greater than 50 and memory.utilization is less than 40
data('cpu.utilization') > 50 and data('memory.utilization') < 40

#Complex example of a condition. Checking if the moving average of the memory
#utilization is greater than 2 times the stddev
mem = data('memory.utilization')
mem_mean = mem.mean(over='1h')
mem_stddev = mem.stddev()
mem_mean > 2 * mem_stddev

Use when() to specify predicate durations

By default, the state of the predicate changes when its predicate value changes from False to True or from True to False. To alter this behavior, use the when() function to require that the condition is true for a specified amount of time or a percentage of a specified amount of time. Use duration controls to create detectors that ignore temporary spikes. For example:

#Returns True at the moment cpu.utilization is above 50
#Returns False at the moment cpu.utilization is less than or equal to 50
when(data('cpu.utilization') > 50)

#Returns True when cpu.utilization has been above 50 for 5 minutes
#continuously;
#Returns False if cpu.utilization drops below 50 at
#any point during the next 5 minutes, that is, even if it dips instantaneously
#below 50.
when(data('cpu.utilization') > 50, '5m')

#Returns True when cpu.utilization has been above 50 for at least 75% of the
#last 5 minutes; returns False if cpu.utilization drops below 50 for
#more than 25% of 5 minutes.
when(data('cpu.utilization') > 50, '5m', .75)

Conditions are like filters in that you can define one without using it. To use a condition, pass it to detect().

Specify a detector using detect()

The detect() function takes the output of predicate objects as inputs and sends out events based on the predicate states and the specific arguments you pass to detect().

The detect() function has this syntax:

detect(<on_predicate>, off=<off_predicate>, mode=<evaluation_mode>)

  • The only required argument is <on_predicate>, which specifies a predicate that causes the Anomalous event to fire.

  • The optional off=<off_predicate> specifies a predicate that causes the Ok event to fire.

  • The optional mode=<evaluation_mode> controls how detect() evaluates predicates.

Using <on_predicate> and off=<off_predicate>

Table 2. Result of detect() for different numbers of arguments

detect(<on_predicate>)

Specifies when to fire Anomalous:

  • When <on_predicate> changes from False to True, detect() fires the Anomalous event

  • When <on_predicate> changes from True to False, detect() fires the Ok event

detect(<on_predicate>, off=<off_predicate>)

Specifies when to fire Anomalous and when to fire Ok

  • When <on_predicate> changes from False to True, detect() fires the Anomalous event

  • When <off_predicate> changes from False to True, detect() fires the Ok event
    This form lets you specify separate conditions for sending the events

detect(<on_predicate>, off=<off_predicate>, mode=<evaluation_mode>)

Controls how detect() evaluates predicates

  • paired: If <on_predicate> is True and <off_predicate> is False simultaneously, detect() fires the Anomalous event; otherwise it doesn’t fire an alert.

  • split: detect() only evaluates <on_predicate> if no alert is currently set. If <on_predicate> is true, detect() fires Anomalous. Similarly, detect() only evaluates <off_predicate> if an alert is currently set. If <off_predicate> is true, detect() fires Ok.
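
The following sketch shows the full form; it assumes the evaluation mode is passed as the string 'paired' or 'split':

cpu = data('cpu.utilization')

#Fires Anomalous when cpu > 80 for 5 minutes; fires Ok when cpu < 60 for 5 minutes.
#With mode='split', detect() evaluates the off predicate only while an alert is set.
detect(when(cpu > 80, '5m'), off=when(cpu < 60, '5m'), mode='split').publish('cpu_high_split')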

Publish detect() events

To send events, you need to call publish() on the results of detect(). For example:

#Send events when cpu.utilization is above 50 (and when it falls below again)
detect(data('cpu.utilization') > 50).publish('cpu_too_high')

#Send events when cpu.utilization is above 50
#Clear events when the cpu.utilization is below 40 and memory.utilization is below 20 for 5 minutes
cpu = data('cpu.utilization')
mem = data('memory.utilization')
detect(cpu > 50, when(cpu < 40 and mem < 20,'5m')).publish('cpu_too_high')

© Copyright 2019 SignalFx.
