
Building Resilient Detection Suppressions

When a false positive occurs, how do you approach tuning your detector? Do you allowlist that one event? That can lead to another false positive tomorrow. This article will provide an example of how we build resilient suppressions.

Daniel Wyleczuk-Stern

A Cybersecurity Engineer Playing Whack-a-Mole

Intro

Raise your hand if you’ve worked on suppressing a false positive for a detector, only to find it back in your “fix” queue the very next day with another false positive. I’d expect almost everyone who’s worked in detection engineering for any period of time has their hand raised or is nodding along. Why does this happen? Sometimes the detector itself isn’t very good and alerts on a lot of benign behavior. But other times, the suppression that was implemented was too specific and wasn’t generalized enough to capture the benign behavior.

At Snowflake, when we write our suppressions, we strive to understand the root cause that drove the false positive so that our detections are resilient and don’t immediately pop back up in the queue for another tweak. In this article, I’ll walk through my thought process on a recent suppression request that came into our queue.

The Detection

Like most cloud-based companies, we have numerous rules that monitor IAM activity in our cloud environments. One particular rule focuses on the behavior of some very important roles that orchestrate a lot of backend services. During development of this system, we worked with our product security and cloud engineering teams to build detections around how these roles are used. We built a static list of roles and the actions they were expected to perform and created an “allowlist” style detection. Essentially, we said that we expect this set of roles to perform only these actions; anything that deviates from this behavior should generate an alert. We understood that allowlist detections like this are more fragile, but considering the importance of the service, we decided the balance tipped in favor of security.
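To make that concrete, here is a minimal sketch of what an allowlist-style check like this might look like. The role names, actions, and event shape are illustrative placeholders, not our actual detection logic:

```python
# Minimal sketch of an allowlist-style detection. Role names, actions, and the
# event shape are illustrative placeholders, not production logic.

EXPECTED_ACTIONS = {
    "orchestrator-role": {"ec2:RunInstances", "ec2:TerminateInstances"},
    "backup-role": {"s3:GetObject", "s3:PutObject"},
}

def should_alert(event: dict) -> bool:
    """Alert when a monitored role performs an action outside its allowlist."""
    role, action = event.get("role"), event.get("action")
    if role not in EXPECTED_ACTIONS:
        return False  # unmonitored roles are covered by other detections
    return action not in EXPECTED_ACTIONS[role]

# A new, unlisted action from a monitored role generates an alert:
print(should_alert({"role": "orchestrator-role", "action": "iam:CreateUser"}))  # True
```

The fragility is visible right in the sketch: any legitimate new action has to be hand-added to the static map, or it fires.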

The False Positive

The benign activity turned out to be a change to the backend application: someone had modified it to use the role to perform a new action. We were able to verify with the engineers and the pull request that the behavior we were seeing in the logs was, in fact, benign. We also verified, by examining the permissions granted to the role, that the new action was authorized by the IAM policy.
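The post doesn’t show how that authorization check was performed; as one hedged illustration, AWS’s IAM policy simulator can answer the “is this role actually allowed to do this?” question. The role ARN and action below are placeholders:

```python
# Illustrative only: confirming via the IAM policy simulator that the role is
# authorized for the newly observed action. The ARN and action are placeholders.
import boto3

iam = boto3.client("iam")

response = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::123456789012:role/orchestrator-role",  # placeholder
    ActionNames=["ec2:CreateLaunchTemplate"],                            # placeholder
)

for result in response["EvaluationResults"]:
    print(result["EvalActionName"], result["EvalDecision"])  # "allowed" or a deny
```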

The Suppression

Now it came time for us to decide how to alter the detector so that we wouldn’t see this type of false positive again.

Approach 1

The most obvious approach is to extend our list of allowed methods to include the new method. However, we quickly realized we’d run into more benign alerts if someone wrote another module with new capabilities, which certainly happens.

Approach 2

The next idea was to see if we could dynamically build an allowlist of methods based on the permissions granted to the role. We took a look at the grants to the role and realized that wouldn’t be feasible. Considering the role was used to orchestrate the environment, it had quite a range of permissions granted to it.
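To illustrate why this fell apart, here is a sketch (assuming AWS IAM roles; the role name is a placeholder) of deriving an action allowlist from the role’s attached policies. For a broad orchestration role, this tends to surface wildcard grants, so the derived “allowlist” degenerates into nearly everything:

```python
# Sketch: deriving an action allowlist from a role's attached managed policies.
# For a broad orchestration role this tends to return wildcard entries like "ec2:*",
# which is why the approach isn't feasible. The role name is a placeholder;
# pagination and inline policies are omitted for brevity.
import boto3

iam = boto3.client("iam")
role_name = "orchestrator-role"  # placeholder
allowed_actions = set()

for attached in iam.list_attached_role_policies(RoleName=role_name)["AttachedPolicies"]:
    policy = iam.get_policy(PolicyArn=attached["PolicyArn"])["Policy"]
    document = iam.get_policy_version(
        PolicyArn=attached["PolicyArn"], VersionId=policy["DefaultVersionId"]
    )["PolicyVersion"]["Document"]
    statements = document.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") == "Allow":
            actions = stmt.get("Action", [])
            allowed_actions.update([actions] if isinstance(actions, str) else actions)

print(sorted(allowed_actions))
```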

Approach 3

We decided to pivot: let’s assume the role can do anything. What if we focus our efforts on making sure the role is being used from the correct compute nodes? (We have separate monitoring that validates the security of those nodes.) Then our assumptions become:

  • These roles are orchestrated by a service running in these compute nodes
  • This service is controlled by our CI/CD systems with code living in GitHub
  • The compute nodes are secure via monitoring and various other controls
  • Our repo is secure via CODEOWNERS, protected branches, etc.

Thus, we can state that so long as the role is being used from those compute nodes, it’s being used correctly. To accomplish this, we can look at our asset inventory at detection run time and build a list of internal IPs associated with the cluster running this service. Furthermore, we realized that our static list of roles should also become dynamic, so we used the asset inventory to pull the list of roles as well. To keep that list accurate, we filtered the roles based on the tags applied to them.
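Here’s a rough sketch of what that looks like. The asset-inventory client, tag values, and cluster name are hypothetical stand-ins for whatever inventory system you can query at detection run time:

```python
# Sketch of the final approach: assume the role may do anything, but verify it is
# used from the expected compute nodes, with both the IP list and the role list
# built dynamically at detection run time. The `inventory` client, tag values,
# and cluster name are hypothetical placeholders.

def cluster_internal_ips(inventory, cluster: str) -> set:
    """Internal IPs of the compute nodes currently running the service."""
    return {node["private_ip"] for node in inventory.nodes(cluster=cluster)}

def orchestration_roles(inventory, tag_key: str, tag_value: str) -> set:
    """Roles selected by tag instead of a hard-coded static list."""
    return {role["name"] for role in inventory.roles()
            if role["tags"].get(tag_key) == tag_value}

def should_alert(event: dict, inventory) -> bool:
    roles = orchestration_roles(inventory, "service", "orchestrator")  # placeholder tag
    if event.get("role") not in roles:
        return False  # not a role this detection is responsible for
    # The role is assumed to be allowed to perform any action; what matters is
    # where it is being used from.
    ips = cluster_internal_ips(inventory, cluster="orchestration")  # placeholder name
    return event.get("source_ip") not in ips
```

The key difference from the original detector is that nothing here is hard-coded: if the cluster scales or a newly tagged role appears, the detection picks it up on its next run.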

After implementing these changes, we updated our detector, tested it, and pushed it back to production.

Conclusion

When building dynamic suppressions, it’s important to consider several factors, chief among them making sure that every element that can be made dynamic actually is. In our example, we realized that our lists of IPs and roles should be dynamic to yield a more resilient detector. In addition, by understanding our threat model and other compensating controls, we were able to remove a condition that was prone to producing false positives on benign behavior. Hopefully this short article will help you the next time you come across a false positive.
