ITOA: Differences and Similarities
By Vishnuteja Nanduri,
Practice Leader for Data Science and Engineering in IT Operations Analytics, IBM
With all the hype and buzz around the catchall phrase Big Data Analytics, many of us are bound to get lost in the terminology. Every industry has now begun to co-opt the word analytics and has created a silo for its specific analyses. I have heard of and worked in some of the following: healthcare analytics, fraud analytics, business analytics, manufacturing analytics, risk analytics, marketing analytics, digital analytics, software analytics, and pricing analytics, to name a few. A recent addition to that list is "IT Operations Analytics (ITOA)," and the goal of this series of articles is to bring this exciting and budding area to the fore.
I wrote a simple R script to fetch tweets from the premier data science hub, Data Science Central, and created a quick word cloud using www.wordle.net. Here's what we see (for full disclosure, I do not work for DSC or Wordle):
Figure 1: Word cloud created using latest tweets from @DataScienceCtrl
The only reason I did this quick exercise was to see whether terms related to IT Operations Analytics appear in the word cloud. Of course, the 275 tweets I was able to pull are a very small sample; nevertheless, terms related to ITOA, such as Internet of Things (IoT), do appear, but their occurrence is rather limited.
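The R script itself is not shown here, but the counting step behind a word cloud can be sketched in a few lines. The Python sketch below is only an illustration: the tweet texts, the stop-word list, and the `word_frequencies` helper are all hypothetical, and fetching real tweets is left out.

```python
import re
from collections import Counter

# Hypothetical tweet texts; the actual tweets and the original R script are not shown.
tweets = [
    "Big Data and the Internet of Things #IoT",
    "Machine learning for big data analytics",
    "Why analytics matters for IoT data",
]

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"and", "the", "of", "for", "why"}

def word_frequencies(texts):
    """Count lowercase alphabetic words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

freqs = word_frequencies(tweets)
print(freqs.most_common(3))
```

A word-cloud tool then sizes each word in proportion to counts like these.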
In this article, I will discuss how ITOA is different from the other domains and in what ways it is similar. First off, let's talk about the differences, beginning with the key ingredient: data. ITOA data are vastly different from those one would use in marketing or advertising analytics. Each time you log on to a website or your favorite mobile app, you leave a trail of breadcrumbs that can help an analyst figure out patterns and trends. This data is called a log: a record of an event that has happened in the past. ITOA, in most cases, deals with log data that has been generated automatically by machines. At a time when an estimated 90% of all the data in the world has been generated in just the last few years, data collection is ubiquitous. This is even more the case with IT data, where every second (sometimes even more often) a log or a system heartbeat is generated to indicate a system's 'health.'
Think about all the data we generate in an interconnected world with our every click. For instance, consider Jane, who walks into the office on a given day.
She clocks in to work: a log is generated
She opens her laptop and signs in: a log is generated
She goes to her favorite news website: a log is generated
She opens her email: a log is generated
She opens her favorite text editing tool: a log is generated
She steps out for a coffee and steps back in with her ID card: a log is generated
You get the idea, don't you?
Figure 2: A sample log file (source: http://bit.ly/1dqduk0)
Sure, most of this data may never be stored for the long term or analyzed, but the important logs, such as those generated when she logs in to a business-critical application, are recorded, stored, and nowadays, with the help of ITOA, analyzed and acted upon if needed. So what does a log file look like? See Figure 2. The first thing that comes to mind is the sheer mess that is the data. In analytics terminology, this is known as semi-structured data: slightly more structured than completely unstructured text data and significantly less structured than the 'structured' data shown in Figure 3.
Figure 3: Example of Structured Data
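To make the jump from semi-structured to structured data concrete, here is a minimal Python sketch. It assumes a log line in the Apache-style Common Log Format; the sample line, the regular expression, and the `parse_line` helper are illustrative assumptions and are not taken from the log in Figure 2.

```python
import re
from datetime import datetime

# Assumed format: Apache-style Common Log Format (hypothetical, not the log in Figure 2).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert text fields into typed values so they behave like table columns.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical sample line.
sample = '192.0.2.10 - jane [10/Oct/2015:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_line(sample)
print(record["host"], record["user"], record["status"], record["size"])
```

Each parsed record has the same named fields, so a list of such records can be loaded into a table like the one in Figure 3. Lines that do not match return `None`, which makes it easy to count how much of a file the pattern fails to cover.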
Now, let's try to dig a little deeper using a simple table. The table below is admittedly at a high level, but the intent here is to showcase the typical key differences between ITOA data and other data, say, from a marketing setting. Let's compare both using the 3, 4, or 5 V's of Big Data (frankly, I have lost count…let's just stick with 4, shall we?).
Table 1: Comparison between ITOA Data and Marketing Data using the 4 V's of Big Data (for a similar time period)
Similarities with other fields
Now, let's talk about the similarities between ITOA and another field, such as marketing, when it comes to the methodologies that can be used to analyze the data. ITOA data is typically analyzed using time series models, since the data are indexed in time. Marketing data, such as the buying patterns of individuals, can also be analyzed using time series modeling. Clustering is another technique that can be used to find similar items (servers, configuration items, etc.) after parsing log files. Marketing data lends itself quite easily to clustering as well, where buyers with similar purchase patterns can be clustered together. Yet another similarity can be drawn with respect to the analysis of text data: some basic text analytics techniques can be applied in both domains to generate actionable insights. I wish to bring up a caveat here. So far, I have discussed ITOA data only as machine-generated data, which does not tell the entire story. ITOA data also contains a lot of human-generated data, such as incident, problem, and change tickets with comments and feedback entered manually.
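As a small illustration of what "indexed in time" means in practice, the Python sketch below turns event timestamps into a simple time series by counting events per hour. The timestamps and the two helper functions are hypothetical; a real analysis would use timestamps extracted from parsed logs and a proper time series model.

```python
from collections import Counter
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

# Hypothetical event timestamps, as if already extracted from parsed logs.
timestamps = [
    "2015-10-10 09:05:12",
    "2015-10-10 09:47:03",
    "2015-10-10 10:15:44",
    "2015-10-11 09:30:00",
    "2015-10-11 14:02:19",
]

def hourly_counts(stamps):
    """Count events per (date, hour) bucket to form a simple time series."""
    counts = Counter()
    for stamp in stamps:
        moment = datetime.strptime(stamp, STAMP_FORMAT)
        counts[moment.replace(minute=0, second=0)] += 1
    return counts

def hour_of_day_profile(stamps):
    """Count events by hour of day across all dates, a crude view of daily patterns."""
    return Counter(datetime.strptime(s, STAMP_FORMAT).hour for s in stamps)

series = hourly_counts(timestamps)
for bucket in sorted(series):
    print(bucket.isoformat(), series[bucket])

profile = hour_of_day_profile(timestamps)
print("events in the 09:00 hour across all days:", profile[9])
```

The same bucketing would apply to marketing data, for example purchases per hour, which is why the two fields can share time series methods.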
In this series of articles, I will discuss in detail some of the advanced machine-learning techniques that can be used to analyze IT operations data. These will include discussions on parsing machine-generated data and converting the semi-structured data to structured data for subsequent analyses. I will also discuss statistical modeling techniques that can be used to identify seasonal and cyclical patterns within log files, and I will cover some of the advanced text analytics algorithms as applied to the text data within log files and incident, problem, and change records in an IT setting. Stay tuned…
About Vishnuteja Nanduri
Vishnuteja Nanduri is the Practice Leader for Data Science and Engineering in IT Operations Analytics at IBM. He currently leads a team of talented data scientists and data engineers in ITOA. He holds a Ph.D. in Industrial Engineering from the University of South Florida and has been an active researcher in the domain of analytics and machine learning for over a decade. Follow him on Twitter @drvnanduri.