Real Time vs Scheduled Query Detections - A Guide For Detection Engineers
Many SIEM tools nowadays let you write rules against streaming data or run queries on a schedule. But when should you use which, and why? This blog post is designed to serve as a guide for those designing their detection architecture.
Most modern SIEMs offer two primary methods for running detections: real time rules and scheduled queries. Each option has pros and cons that you should weigh as you develop your detections. Before we dive into that, let's clarify what we mean by each.
Definitions
A real time rule, often called a streaming rule, runs on a stream of data. Essentially, as data enters the SIEM, each event is processed against the rules that correspond to that log type. It's important to note that these are more accurately "near" real time rules: there are delays from log generation, ingestion, parsing, running the rule logic, and delivering the results. Let's take a look at some examples.
OKTA_SUPPORT_ACCESS_EVENTS = [
    "user.session.impersonation.grant",
    "user.session.impersonation.initiate",
]


def rule(event):
    return event.get("eventType") in OKTA_SUPPORT_ACCESS_EVENTS


def title(event):
    return f"Okta Support Access Granted by {event.udm('actor_user')}"


def alert_context(event):
    context = {
        "user": event.udm("actor_user"),
        "ip": event.udm("source_ip"),
        "event": event.get("eventType"),
    }
    return context
The above rule is from Panther and can be found at okta account support access.
The rule is configured to run on Okta logs. For every event, it checks whether the event type is in a list of defined Okta support access events. The rule runs as the Okta logs are ingested into Panther and is near real time: as soon as the logs are shipped to Panther, the rule runs.
Now, let’s take a look at a scheduled query.
AnalysisType: scheduled_query
QueryName: Okta Investigate MFA and Password resets
Enabled: true
Description: >
  Investigate Password and MFA resets for the last 1 hour
SnowflakeQuery: >
  SELECT
    p_event_time,
    actor:alternateId as actor_user,
    target[0]:alternateId as target_user,
    eventType,
    client:ipAddress as ip_address
  FROM panther_logs.public.okta_systemlog
  WHERE eventType IN (
    'user.mfa.factor.reset_all',
    'user.mfa.factor.deactivate',
    'user.mfa.factor.suspend',
    'user.account.reset_password',
    'user.account.update_password',
    'user.mfa.factor.update'
  )
  AND p_occurs_since('1 hour')
  ORDER BY p_event_time DESC
Schedule:
  RateMinutes: 60
  TimeoutMinutes: 1
The above query can be found here: okta mfa reset (though it was modified for this blog post).
This query operates on data that has already been ingested into the SIEM. Every hour, it runs a SQL query over the Okta system logs and looks for MFA factor changes and password resets. If any rows match, an alert is generated.
Types of Cost
I believe that nearly any rule you can write as a streaming rule you can also write as a scheduled query, and vice versa. So why choose one over the other? It comes down to cost, in more ways than one. First, let's identify the types of cost.
- Real Time Analytics - some SIEMs charge for running real time analytics on streaming data.
- Batch Processing - some SIEMs charge for running scheduled/batch analytics, and pricing may differ between real time and scheduled rules.
- Enrichment - nearly every alert nowadays requires enrichment. With a real time rule, this can often be done with lookup tables or by calling an external API (see the lookup table sketch after this list). With a scheduled query, one can use joins.
- Development - time is money, friends! If a rule takes longer to develop and test, that's a cost that needs to be accounted for.
- Error Potential - failures can be disastrous for a SOC, since they can cause false negatives that lead to an extended breach. What can cause errors for these rules? A streaming rule that relies on an API for enrichment can hit rate limits or general API failures if the API goes down. A scheduled query that relies on joins can run into sync issues caused by data ingestion delays. Let's dive into this more.
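To make the enrichment point concrete, here is a minimal sketch of lookup-table-style enrichment in a streaming rule. The lookup table is modeled as an in-memory dict and the values are hypothetical; in a real SIEM the table would be managed by the platform and refreshed outside the rule. Compare this with the API-based version later in the post.
# Hypothetical lookup table mapping IPs to threat intel verdicts.
# In a real deployment this would be a SIEM-managed lookup table or an
# enrichment feed, not a hard-coded dict.
KNOWN_BAD_IPS = {
    "203.0.113.10": "botnet c2",
    "198.51.100.7": "tor exit node",
}


def rule(event):
    # Alert if either side of the connection matches the lookup table.
    return (
        event.udm("source_ip") in KNOWN_BAD_IPS
        or event.udm("dest_ip") in KNOWN_BAD_IPS
    )


def alert_context(event):
    # Attach the verdicts so analysts can see why the rule fired.
    return {
        "source_ip_verdict": KNOWN_BAD_IPS.get(event.udm("source_ip")),
        "dest_ip_verdict": KNOWN_BAD_IPS.get(event.udm("dest_ip")),
    }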
Data Ingestion Failures
Consider a scheduled query that joins several sources, like the one we'll look at later in this post. In this scenario, we're ingesting CloudTrail, Okta, and Jamf into our data lake. Now imagine our Jamf ingestion gets delayed for whatever reason, and during this delay a user starts working from a coffee shop with a new public IP address. We could very easily generate a false positive, because the laptop's new IP would not be in our data lake yet. We can also imagine similar false negatives, where an inner join misses rows because of ingestion delays. How do we solve this issue? We have a couple of options.
- Staggered data lookbacks. Instead of looking back from -1 hour to the present, we could look back from -2 hours to -1 hour. This gives data a buffer window to land in our data lake, but it automatically increases our time to detect by 1 hour.
- Overlapping lookbacks. Now imagine we look back from -2 hours to the present but still run every hour. This keeps our time to detect at a more reasonable threshold, but we're roughly doubling the volume of data we examine each run, and we have to plan carefully for alert deduplication (see the sketch after this list).
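Here is a minimal sketch of how the deduplication side of an overlapping lookback might look in a Panther-style scheduled rule. The field names follow the earlier query and the dedup key format is an assumption; the key point is that the dedup string is derived from the matched row itself, so a row returned by two consecutive overlapping runs collapses into a single alert.
# Scheduled rule sketch for a query that looks back 2 hours but runs hourly.
# Each row returned by the scheduled query is passed to rule() as `event`.

def rule(event):
    # The SQL already did the filtering; every returned row is a hit.
    return True


def title(event):
    return f"MFA/password reset for {event.get('actor_user')}"


def dedup(event):
    # Key the alert on who/what/when rather than on the run that found it,
    # so the same row matched by two overlapping runs produces one alert.
    return f"{event.get('actor_user')}:{event.get('eventType')}:{event.get('p_event_time')}"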
Example Costs
Let’s take a look at some examples of common detections and see where their costs are.
- Streaming rule on firewall logs looking for IoC IPs
Let’s look at some example code.
from ioc_library import lookup_ioc


def rule(event):
    return lookup_ioc(event.udm("source_ip")) or lookup_ioc(event.udm("dest_ip"))


def title(event):
    if lookup_ioc(event.udm("source_ip")):
        return f"Source IP ({event.udm('source_ip')}) matched a known IoC"
    else:
        return f"Dest IP ({event.udm('dest_ip')}) matched a known IoC"
In this sample code, we have a library that allows us to pass in an IP and query an API to see if the IP matches a known IoC. What potential costs could we have?
We're going to have real time analytics and enrichment costs. Especially concerning is that we're calling the lookup_ioc function up to three times per event. (Note that this is a naive implementation to serve as an example; a sketch of one way to reduce those calls follows.) Depending on the volume of our firewall traffic, it's very likely we could hit the API's rate limit.
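One common mitigation is to memoize the lookup so each unique IP is resolved at most once per rule process. Here is a minimal sketch, assuming the same hypothetical ioc_library from the example above and that cached results are acceptable for the lifetime of the process.
from functools import lru_cache

from ioc_library import lookup_ioc  # assumed helper from the example above


@lru_cache(maxsize=10_000)
def cached_lookup_ioc(ip):
    # Each unique IP hits the external API at most once per process,
    # so repeated rule()/title() calls don't multiply API traffic.
    return lookup_ioc(ip)


def rule(event):
    return bool(
        cached_lookup_ioc(event.udm("source_ip"))
        or cached_lookup_ioc(event.udm("dest_ip"))
    )


def title(event):
    if cached_lookup_ioc(event.udm("source_ip")):
        return f"Source IP ({event.udm('source_ip')}) matched a known IoC"
    return f"Dest IP ({event.udm('dest_ip')}) matched a known IoC"
Caching doesn't remove the enrichment cost, but it bounds it by the number of unique IPs rather than the number of log events.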
- Scheduled query matching AWS console logins against Okta usernames and Jamf IPs
AnalysisType: scheduled_query
QueryName: Okta AWS Console Logins from Non-Laptops
Enabled: true
Description: >
  Investigate AWS console logins from Non-Laptops
SnowflakeQuery: >
  SELECT *
  FROM panther_logs.public.cloudtrail ct
  INNER JOIN panther_logs.public.okta_users okta
    ON okta:email LIKE CONCAT('%', lower(ct:userIdentity.userName), '%')
  LEFT OUTER JOIN panther_logs.public.jamf_inventory jamf
    ON jamf:externalIP = ct:sourceIPAddress
  WHERE ct:eventName = 'ConsoleLogin'
    AND p_occurs_since('1 hour')
    AND jamf:externalIP IS NULL
  ORDER BY p_event_time DESC
Schedule:
  RateMinutes: 60
  TimeoutMinutes: 1
In the above query, we're joining our CloudTrail data with Okta user data, and then joining that against Jamf data to find users logging in from devices that aren't their laptops. How would we write this as a streaming rule? Similar to the first example, we'd need an API where we could pass in an IP and check whether it's associated with one of our devices in Jamf.
What are the costs here?
We're going to have processing costs in our SIEM: if we have a lot of CloudTrail logs, we could be scanning a lot of data. Additionally, while not a major concern in this example, joins can bring their own cost and performance issues depending on how much data each side contributes.
Deciding on a Style
Now that we’re familiar with some of the costs, what questions should we ask ourselves when designing a rule?
Factors that favor real time rules
How quickly do I need to be alerted?
A streaming rule will notify your SOC/IR team in near-real time. If the rule doesn’t require an immediate response, one could use a scheduled query.
How much data do I need to look at?
Let's envision a scheduled query that uses data sources A and B, joining the past hour of A against all of B. If B is a very voluminous data source, we could be looking at a very slow query. In this case, it may be better to run a streaming rule on A and then query B as needed in the next step of the pipeline (sketched below).
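As a rough sketch of that pattern, the streaming rule on source A can surface the fields a downstream step would need, and the much rarer lookups against B happen later in the pipeline, for example from a SOAR playbook. The action list and field names here are hypothetical.
# Streaming rule on source A. Instead of joining against the voluminous
# source B up front, we surface the keys a downstream step needs to run a
# targeted query against B only when an alert actually fires.

SUSPICIOUS_ACTIONS = {"credential.export", "admin.role.grant"}  # hypothetical


def rule(event):
    return event.get("action") in SUSPICIOUS_ACTIONS


def alert_context(event):
    return {
        # Keys a downstream playbook can use to query source B for context.
        "user": event.udm("actor_user"),
        "source_ip": event.udm("source_ip"),
        "action": event.get("action"),
    }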
How many rules do I have running on this data set and how large is the data set?
Let's say we're writing rules on VPC flow logs, which are extremely high volume, and we want five different rules looking for IoCs, unexpected ports, and so on (all very simple checks). If we wrote five separate scheduled queries, we would be scanning the same (large) data set multiple times, which can get expensive. We could combine them into one scheduled query, but it would be complicated and hard to read. In this case, writing streaming rules makes more sense, and we can rely on downstream enrichment. A minimal example of one such rule follows.
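For example, one of those five rules might be as simple as flagging accepted traffic to ports we never expect to see. The field names and port list below are assumptions for illustration.
# Simple streaming rule on VPC flow logs: flag traffic to unexpected ports.
# Field names (action, dstport) are illustrative; adjust to your schema.

UNEXPECTED_PORTS = {23, 3389, 5900}  # telnet, RDP, VNC - hypothetical policy


def rule(event):
    return event.get("action") == "ACCEPT" and event.get("dstport") in UNEXPECTED_PORTS


def title(event):
    return f"Traffic accepted to unexpected port {event.get('dstport')}"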
Factors that favor scheduled queries
Do I need enrichment to generate an alert?
In our first example, we needed to enrich the data just to decide whether to alert, which means calling an API on every data record. On the other hand, if we only need the enrichment after we've decided to generate an alert, we call the API far less often and largely avoid rate limit concerns. The sketch below shows that second case.
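Here the expensive lookup only happens once an alert has already been triggered by cheap, local logic. lookup_geo and geo_library are hypothetical names used for illustration.
# Hypothetical enrichment helper; imagine it calls an external geo-IP API.
from geo_library import lookup_geo  # assumed for illustration


def rule(event):
    # The alert decision uses only fields already on the event - no API calls.
    return event.get("eventType") == "user.session.impersonation.grant"


def alert_context(event):
    # Enrichment happens only for events that already matched, so the API
    # is called per alert rather than per ingested record.
    return {
        "user": event.udm("actor_user"),
        "ip": event.udm("source_ip"),
        "geo": lookup_geo(event.udm("source_ip")),
    }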
How many data points do I need to make a conclusion?
In our second example, we needed CloudTrail, Okta, and Jamf to determine whether the login was malicious. This would be difficult to do with a real time rule, as it would likely require building specific infrastructure to support those lookups.
How well does my infrastructure handle failure?
Let's say you hit a rate limit and your API lookup fails. What happens? Does your infrastructure retry the rule later, or does it silently drop the event? If you suspect your rule may hit a rate limit and your infrastructure fails silently, you should consider redesigning the rule or running it as a scheduled query. The sketch below shows one way to make the failure handling explicit in the rule itself.
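If you do keep the API call in a streaming rule, decide explicitly what a lookup failure means. Here is a minimal sketch, again assuming the hypothetical ioc_library from earlier; choosing between failing open (suppress) and failing closed (alert anyway) is a policy decision, not something the code can make for you.
from ioc_library import lookup_ioc  # assumed helper from the earlier example

# Policy decision: if the lookup fails, do we alert anyway ("fail closed")
# or suppress ("fail open")? Either way, make the choice explicit and visible.
FAIL_CLOSED = False


def rule(event):
    try:
        return bool(lookup_ioc(event.udm("source_ip")))
    except Exception:
        # Rate limit, timeout, or outage. Surface the failure rather than
        # silently dropping the event - e.g. log it or emit a health metric.
        return FAIL_CLOSED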
Do I need to consider system state?
Let's imagine a scenario where we have an endpoint security tool and we want to alert when it's uninstalled. However, our IT team has a script that triggers false positives, because it temporarily disables the agent while applying updates. So we want to know what state the system was in when the disable command ran. You could build some sort of state analysis pipeline, but that can get prohibitively expensive. It's much simpler to use joins or something like MATCH_RECOGNIZE in a scheduled query; the sketch below shows the core of that check.
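The core of the check is just "did this disable event fall inside a maintenance window for that host?" In a scheduled query that's a join between the disable events and the IT script's log. As a simplified illustration, here is the same logic expressed in Python over rows such a query might return; the timestamps, field names, and grace period are hypothetical.
from datetime import datetime, timedelta

# Hypothetical rows: agent-disable events and IT maintenance-script runs.
disable_events = [
    {"host": "laptop-42", "time": datetime(2023, 5, 1, 14, 3)},
    {"host": "laptop-77", "time": datetime(2023, 5, 1, 16, 45)},
]
maintenance_runs = [
    {"host": "laptop-42", "start": datetime(2023, 5, 1, 14, 0), "end": datetime(2023, 5, 1, 14, 30)},
]


def is_expected(disable, runs, grace=timedelta(minutes=5)):
    # A disable is expected if it happened inside (or just around) a
    # maintenance window on the same host - i.e. the state we care about.
    return any(
        r["host"] == disable["host"]
        and r["start"] - grace <= disable["time"] <= r["end"] + grace
        for r in runs
    )


# Alert only on disables that were not part of a maintenance window.
alerts = [d for d in disable_events if not is_expected(d, maintenance_runs)]
print(alerts)  # laptop-77's disable would alert; laptop-42's would not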
Other Factors
What’s my team familiar with?
If your team is more familiar with real time rules, you should lean toward those; it's likely your existing infrastructure and tooling better support developing and testing them. That doesn't mean you should ignore the alternative altogether, though!
What infrastructure and supporting automation and tooling have we already built?
Are you architected around ingesting all your data into your security data lake? Or do you have a robust set of APIs for querying third-party endpoints? This could drive your decision one way or the other; though with Snowflake's UDF support, you could call these APIs from your scheduled queries if you wish.
Conclusion
There are some great blog posts on detection engineering out there, but I often see them focus too much on theory or on the nitty-gritty of a particular technology. My hope is that this article will be useful to practitioners, especially those charged with architecting their detection systems.
Now that you're equipped with this information and these questions, you're prepared to make better decisions when designing your rules. Are there other things you consider during your design process? Please let me know!