Let me restate the title: SIEM deployments don’t fail. The technology to accept logs and to parse and correlate events has existed in a mature state for some time now. The space is so mature that we even see a slight divide between SIEMs better suited for user tracking and compliance and those better at pure security events, depending on how the technology “grew up”. So the purely technical aspects of the deployment are generally not the reason your SIEM deployment fails to bring the value you, or your organization, envisioned (no pun intended).
Remember that old ISO thing about people, process, AND technology? It seems we often forget the first two and focus too much on the technology itself. While I’d like to say this is limited to smaller organizations, the fact is that it is not. The technology simply supports the people who deal with the output (read: security or compliance alerts) and the process they follow to ensure that the response is consistent, repeatable, tracked, and reported. That being said, we also seem to forget to plan out a few things before we start down the SIEM path in the first place. This post aims to provide you with the “lessons learned” from both my own journey and what I see my clients go through, in a Q and A format.
Question 1. Why are we deploying SIEM or a log management/monitoring solution?
The answer to this is most likely going to drive the initial development of your overall strategy. The drivers tend to vary, but they generally fall into one or more of the following categories: compliance or user-activity monitoring requirements, network or system security monitoring, or simply the need for centralized log management and reporting.
Just as important are the goals of the overall program. Are we more concerned with network or system security events? Are we focused on user activity or compliance monitoring? Is it both? What do we need to get out of this program at a minimum, and what would be a nice-to-have? Where does this program need to be in the next 12 months? The next 3 years? Answering these questions helps answer the question of “why”. The purpose and mission must be defined before we even think about looking at the technology to support the program. While this seems like a logical first step, most people start by evaluating technology solutions and then backing into the purpose and mission based on the tool they like the most. Remember, technology is rarely the barrier.
Question 2. Now that we are moving forward with the program, how do we start?
The answer to this one will obviously depend on the answers to some of the questions above. Let’s assume for a moment, and for simplicity of this post, that you have chosen security monitoring as the emphasis of the program. Your first step is NOT to run out to every system, application, security control, and network device and point all of their logs, at the highest (i.e. debugging) level, at the SIEM. Sure, during a response having every log imaginable to sort through may be of great benefit; at this stage, however, I’m more concerned that I have the “right” logs as opposed to “all” logs. This “throw everything at the SIEM and see what sticks” approach may be partially driven by the vendors themselves or by an overzealous security guy. I could imagine a sales rep saying, “yes, point everything at us and we’ll tell you what is important, as we have magical gnomes under the hood who correlate 10 times faster and better than our competition”. Great, as long as what is important to you lines up exactly with what the vendor thinks, go for it (joking, of course).
The step that seems most logical here is to define which events, if they occur, are most important given your organization, business, structure, and the type and criticality of the data you store or value most. If we define our top 10, 20, 30, etc. and rank these events by criticality, we have started to define a few things about our program without even knowing it. First, with a list of events we can match each one up to the log sources we would need in order to trigger an alert in the system. Do we need one event source and a threshold to trigger? Or is it multiple sources that we can correlate? Don’t be surprised if your list is a mixture of both types. Vendors would love for us to believe that all events are the result of their correlation magic, but in reality that just isn’t true. We can take that one step further and define the logs we would need to further investigate an alert as well. Second, we have started to define an order of criticality for both investigation and response. Given the number of potential events per day and a lack of staff to investigate every one, we need to get to what matters most: our critical or higher-risk events first.
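To make that concrete, here is a minimal sketch (in Python, purely illustrative) of what such a “top x” catalog might look like: each event carries a criticality rank, the sources needed to trigger it, whether it is a threshold or a correlation, and the sources an analyst would pull to investigate. The event names and log source names are assumptions, not a prescription.

```python
# Minimal sketch of a "top x" event catalog. Each entry records criticality,
# the log sources needed to trigger the alert, and the additional sources an
# analyst would pull to investigate it. All names here are illustrative.

EVENT_CATALOG = [
    {
        "name": "Multiple failed admin logons followed by success",
        "criticality": 1,                      # 1 = highest
        "trigger_sources": ["domain_controller_security_log"],
        "trigger_type": "threshold",           # single source plus a threshold
        "investigation_sources": ["vpn_logs", "web_proxy_logs", "endpoint_av_logs"],
    },
    {
        "name": "Outbound traffic to known-bad host after IDS alert",
        "criticality": 2,
        "trigger_sources": ["ids_alerts", "firewall_logs"],
        "trigger_type": "correlation",         # multiple sources correlated
        "investigation_sources": ["web_proxy_logs", "dns_logs", "dhcp_logs"],
    },
]

def required_log_sources(catalog):
    """Return every log source the program depends on, for trigger or investigation."""
    sources = set()
    for event in catalog:
        sources.update(event["trigger_sources"])
        sources.update(event["investigation_sources"])
    return sorted(sources)

if __name__ == "__main__":
    for event in sorted(EVENT_CATALOG, key=lambda e: e["criticality"]):
        print(f'[{event["criticality"]}] {event["name"]} ({event["trigger_type"]})')
    print("Log sources needed:", ", ".join(required_log_sources(EVENT_CATALOG)))
```

Even something this simple gives you the complete list of log sources the program depends on, which feeds directly into the asset discussion below.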
One thing to keep in mind here as well is not to develop your top “x” list in a vacuum. As part of good project planning you should have identified the business units, lines, and resources that need to be involved in this process. Security people are good at thinking about security, but maybe not so much about how someone could misuse a mainframe, SAP, the financial apps, and so on. Those who are closer to the application, BU, or function may end up being a great resource during this phase.
And finally, events shouldn’t be confined to only perimeter systems. If we look at security logging and are concerned about attacks, we need to build signatures for the entire attack process, not just for our perimeter defenses, which fail us 50% of the time. Ask yourself: if we missed the attack at the perimeter, how long would the attacker have access to our network and systems until we noticed? If the Verizon DBIR report is any indication, the answer may be weeks to months.
Question 3. I’ve defined my events, prioritized them, and linked them to both trigger log sources and investigation log requirements. Now what?
Hate to say it, but this may be the hardest part of the process. Hard because it assumes your company has asset management under control. And I don’t mean being able to answer where a particular piece of hardware may be at a given moment; I mean being able to match an asset up to its business function, use, application, support, and ownership information, from both the underlying services layer (i.e. OS, web server, etc.) and the application owner. All of this is in addition to the standard tracking of a decent asset management program, such as location, status, network addressing, etc. If you lack this information, you may be able to start gathering the necessary asset metadata from sources that (hopefully) already exist. Most companies have some rudimentary asset tracking system, but you could also leverage the output from a recent business impact analysis (BIA) or even the output from the vulnerability assessment process…assuming you perform periodic discovery of assets. Tedious? Yes.
Let’s assume we were able to cobble together something reasonable for asset management. Using our top “x” list we can identify all of the log sources and match those up to the required assets. Once we know all of the sources, we need to make sure each one is logging at the level our events require, confirm those logs are actually arriving at the SIEM, and keep tracking the sources over time so we notice when one drops off or changes.
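If you do have to cobble the asset metadata together yourself, a rough sketch along these lines may help. It assumes hypothetical CSV exports from an asset tracking system, a BIA, and the vulnerability assessment discovery process (the file names and column names are made up) and simply merges them into one record per hostname.

```python
import csv
from collections import defaultdict

# Hypothetical export files; in practice these come from whatever asset
# tracking, BIA, and vulnerability assessment tooling you already have.
SOURCES = {
    "asset_tracking": "asset_inventory.csv",   # hostname, ip, location, status
    "bia": "bia_output.csv",                   # hostname, business_function, criticality
    "vuln_discovery": "vuln_discovery.csv",    # hostname, ip, os, open_services
}

def build_asset_index(sources=SOURCES):
    """Merge rows from each export into a single metadata record per hostname."""
    assets = defaultdict(dict)
    for source_name, path in sources.items():
        try:
            with open(path, newline="") as handle:
                for row in csv.DictReader(handle):
                    host = row.get("hostname", "").strip().lower()
                    if not host:
                        continue
                    # Later sources fill in gaps rather than overwrite earlier data.
                    for key, value in row.items():
                        assets[host].setdefault(key, value)
                    assets[host].setdefault("seen_in", []).append(source_name)
        except FileNotFoundError:
            print(f"warning: {source_name} export not found at {path}")
    return dict(assets)

if __name__ == "__main__":
    index = build_asset_index()
    print(f"consolidated metadata for {len(index)} assets")
```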
One client I had called this a Monitored Asset Management program, or something to that effect, which I thought was a fitting way to describe the process. This isn’t as difficult as one may think, given that the systems logging into our SIEM tend to be noisy, so a system that goes dead quiet for a period of time is an indicator of a potential issue (i.e. it was decommissioned and we didn’t know, someone changed the logging configuration, or it is live yet has an issue sending (or us receiving) the logs). One thing that does slip by this process is someone changing the logging level to less than what is required for our event to trigger, thus blinding the SIEM until the level is changed back to the required setting.
In addition to the asset management, we should test our events for correctness at this point. We should be able to manually trigger each event type and watch as it comes into the SIEM or dashboard. I can admit I have made this mistake in the past, believing that there was no way we could have screwed up a query or correlation so badly that the event would never trigger…but we did. You should also have a plan to test these periodically, especially for low-volume, high-impact events, to ensure that nothing has changed and the system is working as designed.
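A minimal sketch of that “quiet source” check, assuming you can pull a last-seen timestamp per source out of your SIEM (the thresholds and source names here are invented for illustration):

```python
from datetime import datetime, timedelta

# Flag any expected log source that has been silent longer than its allowed
# window. Per-source thresholds let legitimately low-volume sources stay quiet
# longer without alerting.

SILENCE_THRESHOLDS = {
    "domain_controller_security_log": timedelta(minutes=15),
    "web_proxy_logs": timedelta(minutes=5),
    "badge_system_logs": timedelta(hours=24),   # legitimately low volume
}
DEFAULT_THRESHOLD = timedelta(hours=1)

def find_silent_sources(last_seen, now=None, thresholds=SILENCE_THRESHOLDS):
    """Return (source, silence duration) for every source past its threshold."""
    now = now or datetime.utcnow()
    silent = []
    for source, seen_at in last_seen.items():
        allowed = thresholds.get(source, DEFAULT_THRESHOLD)
        gap = now - seen_at
        if gap > allowed:
            silent.append((source, gap))
    return silent

if __name__ == "__main__":
    # Example input; in reality this would be queried from the SIEM itself.
    last_seen = {
        "domain_controller_security_log": datetime.utcnow() - timedelta(minutes=3),
        "web_proxy_logs": datetime.utcnow() - timedelta(hours=2),
    }
    for source, gap in find_silent_sources(last_seen):
        print(f"ALERT: no logs from {source} for {gap}")
```

Note that, as mentioned above, this check says nothing about a source whose logging level was quietly lowered; it only catches sources that stop sending entirely.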
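One low-tech way to run that periodic test is to inject a clearly labelled synthetic log line and then confirm the expected alert actually fires. A small sketch, assuming a syslog collector at a hypothetical host and port (substitute whatever your collector expects):

```python
import socket
from datetime import datetime

# Send a synthetic, clearly marked syslog message for one catalog event, then
# verify (manually or via a follow-up query) that the expected alert fired.
# The host, port, and message format below are assumptions.

SIEM_SYSLOG_HOST = "siem-collector.example.internal"   # hypothetical collector
SIEM_SYSLOG_PORT = 514

def send_test_event(event_name):
    """Send one synthetic trigger message over UDP syslog."""
    timestamp = datetime.utcnow().strftime("%b %d %H:%M:%S")
    message = f"<13>{timestamp} siem-selftest TEST-EVENT: {event_name}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), (SIEM_SYSLOG_HOST, SIEM_SYSLOG_PORT))
    print(f"sent synthetic trigger for '{event_name}'; verify the alert fires")

if __name__ == "__main__":
    # Low-volume, high-impact events are the ones most worth testing regularly.
    send_test_event("Multiple failed admin logons followed by success")
```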
Question 4. To MSSP, or not to MSSP, that is the question. Do you need an MSSP, and if so, what is their value?
This is also a tough question to answer as it always “depends”. Most companies don’t have the necessary people, skills, or availability to monitor the environment in a way which accomplishes the mission we set for ourselves in step 1. That tends to lead to the MSSP discussion of outsourcing this to a 3rd party who has the people and time (well, you’re paying for it so they better) to watch the events pop up on the console and then do “something”.
Let me start with the positive aspects of using an MSSP before I say anything negative. First, they offer “staff on demand”, which may be a good way to get the program off the ground, assuming you require a 24×7 capability. That is a question that needs to be answered in step 1 as well: if we received an alert at 3 a.m., do we have the capability to respond, or would it be handled by the first security analyst on our team in the morning? 24×7 monitoring is great, assuming you have the response capability as well. Second, they do offer some level of comfort in “having someone to call” during an event or incident. They tend to offer not only monitoring services but may also have response capabilities, threat intelligence (I’ll leave the value of that one up to you), and investigation support.
Now on to the negatives of using an MSSP. First, they are “a SOC looking at a SIEM console”, not “your SOC who cares about your business”. The MSSP doesn’t view the events in the same business context as you, unless you give them that context and then demand that they care. Believe me, I’ve tried this route and it leads to frustrating phone calls with MSSP SOC managers and then the sales guy who offers some “money back” for your troubles. Even if you provide the context of the system, the network architecture, and all of the necessary information, there is no guarantee they will use it. To give you a personal example, we used an unnamed MSSP and would constantly receive alerts from them stating that a “system” was infected, as it was seen browsing and then downloading something bad (i.e. a JavaScript NOOP sled or an infected PDF). That “system” turned out to be the web proxy 99.9% of the time. To show how ridiculous this issue was, all you had to do was look at the actual proxy log record, which was sent to them, to determine the network address (and host name) of the internal system involved in the event. Side note: they had a copy of the network diagram and a system list which showed each system by name, network address, and function. Any analyst who has ever worked in a corporate environment would understand the stupidity of telling us that the web proxy was potentially infected.
Second, MSSPs, unless contractually obligated, may not be storing all of the logs you need during an incident or investigation. Think back to the answer to question 2 for a moment, where we defined our events, trigger logs, and the logs required to further investigate an event. What happens if you receive an event from the MSSP and go back to the sources to pull the necessary logs, only to find they were overwritten? As an example from my past (and this depends on traffic and log settings), the Active Directory logs at my previous employer rolled over every 4 hours. If I wasn’t storing those elsewhere, I may have been missing a necessary piece of information. There are ways around this issue which I plan on addressing in a follow-up post on SOC/SIEM/IR design.
Question 5. Anything else that I need to consider? What do others miss the first time around, or even after deploying a SIEM?
To close this post I’d offer a few additional suggestions beyond the (what I feel are obvious) ones above. People are very important in this process, so regardless of the technology you’re going to need some solid security analysts with skills ranging from log management to forensics and investigations. One of the initial barriers to launching this type of program tends to be a lack of qualified resources in this area. It may be in your best interest to go the MSSP route and keep a 3rd party on retainer to scale your team during an actual verified incident. Also, one other key aspect of the program must be a way to measure the success, or failure, of the program and its processes. Most companies start with the obvious metric of “acknowledged” time…or the time between receiving the event and acknowledging that someone saw it and said “hey, I see that”. While that is a start, I’d be more concerned that the resolution of the event was within the SLAs we defined as part of the program in its early stages. There is a lot more I could, but won’t, go into here on metrics, which I’ll save for a follow-up post. In my next post I’ll also talk about “tiering” the events so that events with a defined response can take an alternate workflow, while more interesting events which require analysis are routed to those best equipped to deal with them. And finally, ensure that the development, or modification, of the overall incident response process is considered when implementing a SIEM program. Questions such as how SIEM monitoring will differ from DLP monitoring, and how SIEM complements (or not) our existing investigative or forensics tool kit, will need to be answered.
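For the two metrics mentioned above, here is a small sketch of how they might be computed from closed alerts; the field names and SLA values are assumptions rather than any standard.

```python
from datetime import datetime, timedelta

# Compute mean time-to-acknowledge, mean time-to-resolve, and the fraction of
# alerts resolved within a per-criticality SLA. SLA values are illustrative.

RESOLUTION_SLAS = {1: timedelta(hours=4), 2: timedelta(hours=24), 3: timedelta(days=3)}

def summarize(alerts):
    """Summarize acknowledge/resolve times and SLA compliance for closed alerts."""
    ack_deltas, res_deltas, within_sla = [], [], 0
    for alert in alerts:
        ack_deltas.append(alert["acknowledged"] - alert["received"])
        resolution = alert["resolved"] - alert["received"]
        res_deltas.append(resolution)
        if resolution <= RESOLUTION_SLAS[alert["criticality"]]:
            within_sla += 1
    count = len(alerts)
    return {
        "mean_time_to_acknowledge": sum(ack_deltas, timedelta()) / count,
        "mean_time_to_resolve": sum(res_deltas, timedelta()) / count,
        "sla_compliance": within_sla / count,
    }

if __name__ == "__main__":
    t0 = datetime(2012, 6, 1, 9, 0)
    alerts = [
        {"criticality": 1, "received": t0, "acknowledged": t0 + timedelta(minutes=10),
         "resolved": t0 + timedelta(hours=3)},
        {"criticality": 2, "received": t0, "acknowledged": t0 + timedelta(hours=1),
         "resolved": t0 + timedelta(hours=30)},
    ]
    print(summarize(alerts))
```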
Conclusion
To recap the simple steps presented here:
1. Define why you are deploying a SIEM or log monitoring program, along with its mission and goals, before evaluating any technology.
2. Define and prioritize your top “x” events with input from the business, and map each one to the log sources required both to trigger an alert and to investigate it.
3. Get asset management under control, track that your monitored sources keep reporting, and test that each event actually triggers as designed.
4. Decide whether an MSSP fits your mission, and if so, give them the business context and contractual obligations (including log retention) they need to be useful.
5. Plan for the people, metrics, and incident response process that will support the program over time.
While I think the list above and this post are quite rudimentary, I can admit that I made some of the mistakes I mentioned the first time I went through this process myself. My excuse is that we tried this for ourselves back in 2007, but I find little excuse for larger organizations making these mistakes some 5 years later. Hopefully you can take something from this post, even if it is to disagree with my views…I just hope it encourages some thought around these programs before starting on the deployment of one.