Week 7 – Day 2

Today was peaceful 😉

I learned about the elbow method for finding the optimal value of K for the KMeans algorithm (https://bl.ocks.org/rpgove/0060ff3b656618e9136b). I went through many resources online and was a bit confused at some point because of slight differences between them. Anyway, I have it implemented now in my ECL code.
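
The core of the elbow method is just computing the within-cluster sum of squares (WSS) for each candidate K and looking for the bend in the resulting curve. Below is a minimal ECL sketch of that computation for a single K, assuming the points have already been labeled by a KMeans run; the record layouts and sample values are made up for illustration.

// Points already assigned to clusters by a KMeans run (illustrative values)
PointRec := RECORD
  UNSIGNED4 id;
  UNSIGNED2 cluster;  // cluster id assigned by KMeans
  REAL8     x;        // feature 1
  REAL8     y;        // feature 2
END;
CentroidRec := RECORD
  UNSIGNED2 cluster;
  REAL8     cx;
  REAL8     cy;
END;
points := DATASET([{1, 1, 120, 80}, {2, 1, 110, 95},
                   {3, 2, 10, 400}, {4, 2, 15, 390}], PointRec);
centroids := DATASET([{1, 115.0, 87.5}, {2, 12.5, 395.0}], CentroidRec);

// Squared distance from each point to its own centroid
DistRec := RECORD
  UNSIGNED4 id;
  REAL8     sqDist;
END;
DistRec dist(PointRec p, CentroidRec c) := TRANSFORM
  SELF.id := p.id;
  SELF.sqDist := POWER(p.x - c.cx, 2) + POWER(p.y - c.cy, 2);
END;
dists := JOIN(points, centroids, LEFT.cluster = RIGHT.cluster, dist(LEFT, RIGHT));

// The WSS for this K; repeat for each candidate K and plot the curve
OUTPUT(SUM(dists, sqDist), NAMED('WSS'));

The elbow is the K beyond which the WSS stops dropping sharply.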

Tomorrow, I will focus on KMeans++, which is an approach to better initialize the centroids.
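
For reference, the KMeans++ seeding rule is short: the first centroid is chosen uniformly at random from the data, and each subsequent centroid is drawn from the remaining points with probability proportional to the squared distance to the nearest centroid chosen so far,

P(x) = \frac{D(x)^2}{\sum_{x'} D(x')^2}

where D(x) is the distance from point x to the nearest already-chosen centroid. This spreads the initial centroids out and makes the clustering much less sensitive to a bad random start.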

See you tomorrow 🙂

Week 7 – Day 1

First day of the second part of the internship!

I had a meeting this morning with my supervisors to wrap up the first part, discuss the project’s current state, and decide what to do next. For the first part, I can say I have mostly been learning: learning ECL, learning about the HPCC logs, and designing an approach to tackle the project. I also started coding.

However, one of my main issues, which was raised again today, is the availability of a large dataset. For this project, I have to work with a large HPCC log file. Unfortunately, the logs I generate myself are not enough, and the company, certainly due to security concerns, cannot provide me with log files. Hence, last week, I found a similar log file which I am currently using for testing.

Just as a reminder, one of my target anomalies is suspicious users. Hence, I am working on clustering users to find outliers. I am going with KMeans, which is available in the HPCC Systems production ML bundles. Tomorrow, I will implement the elbow method to find an optimal number of clusters and analyze the resulting clusters to adjust the parameters.
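
For anyone curious, the bundle’s interface, as I understand it from its README, looks roughly like the sketch below; the iteration cap, tolerance, and sample values are placeholders, and the data must be in ML_Core’s NumericField format (wi, id, number, value).

IMPORT KMeans;
IMPORT ML_Core;

// Samples: two points with two features each, as NumericField rows
d01 := DATASET([{1, 1, 1, 120}, {1, 1, 2, 80},
                {1, 2, 1, 10},  {1, 2, 2, 400}], ML_Core.Types.NumericField);
// Initial centroids in the same format; their count determines K
d02 := d01;

Model   := KMeans.KMeans(30, 0.001).Fit(d01, d02); // max iterations, tolerance
Centers := KMeans.KMeans().Centers(Model);          // final cluster centers
Labels  := KMeans.KMeans().Predict(Model, d01);     // cluster index per sample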

See you tomorrow 🙂

Week 6 – Day 5

Last day of Week 6!

I believe I made some progress. I found the reason why my requests were taking so long to complete. I was joining tables in a weird way, directly in the RECORD structure, without using the JOIN function. Anyway, I am glad I found the issue, and now my workunits terminate in a reasonable amount of time.
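
To make the fix concrete, here is a minimal sketch of the proper pattern, with made-up record layouts: instead of referencing a second dataset from inside a RECORD definition, the two datasets are matched explicitly with JOIN.

UserRec := RECORD
  UNSIGNED4 uid;
  STRING20  name;
END;
ActivityRec := RECORD
  UNSIGNED4 uid;
  UNSIGNED4 requestCount;
END;
users    := DATASET([{1, 'alice'}, {2, 'bob'}], UserRec);
activity := DATASET([{1, 120}, {2, 15}], ActivityRec);

// Match the two datasets explicitly on uid
Combined := RECORD
  UNSIGNED4 uid;
  STRING20  name;
  UNSIGNED4 requestCount;
END;
Combined doJoin(UserRec u, ActivityRec a) := TRANSFORM
  SELF.uid := u.uid;
  SELF.name := u.name;
  SELF.requestCount := a.requestCount;
END;
joined := JOIN(users, activity, LEFT.uid = RIGHT.uid, doJoin(LEFT, RIGHT));
OUTPUT(joined);

Letting JOIN do the matching gives the platform a chance to distribute and optimize the work, instead of evaluating a lookup for every output row.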

I have slowly worked my way up to the point where I need to cluster users according to their activity (the number of requests they submit daily). For that, I am using the HPCC KMeans and ML_Core production bundles. I am still working on it; more precisely, on finding the best value for K and the right initial centroids.
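
For reference, getting a plain ECL dataset into the NumericField format the bundles expect is handled by the ML_Core.ToField macro. A minimal sketch, with made-up per-user features:

IMPORT ML_Core;

// One row per user: a record id first, then the numeric features
FeatRec := RECORD
  UNSIGNED8 id;        // user id; must be the first field
  REAL8     morning;   // e.g. requests submitted in the first time slot
  REAL8     afternoon;
  REAL8     night;
END;
features := DATASET([{1, 120, 80, 5}, {2, 10, 400, 2}], FeatRec);

// Pivots each feature column into NumericField rows (wi, id, number, value)
ML_Core.ToField(features, featuresNF);
OUTPUT(featuresNF);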

Have a nice weekend 🙂

Week 6 – Day 4

Today was okay!

I have the code set up to generate the dataset I will be using for the ML algorithms. I used the HPCC ML library in the version of the code which is currently on GitHub: https://github.com/vzeufack/log-based-anomaly-detection.git. However, my supervisor told me that I am using an outdated version, so I will update the code at that level tomorrow. I could have done it today but, as the figure shows, some requests are taking quite long to finish now that I am using a larger dataset of about 2 million log lines. The figure shows a workunit which runs an ECL file whose job is to divide the log file into time windows. It has been running for an hour and a half now 😩. I will leave it running and hopefully it will be done by tomorrow. I may also try to optimize the code to improve the running time.
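
In case it is useful to anyone, assigning log lines to fixed time windows is a one-liner once the timestamp is parsed. A minimal sketch, with an invented record layout and 8-hour windows:

// Assumed layout for an already-parsed log line
LogRec := RECORD
  STRING30  host;
  UNSIGNED4 date;  // e.g. 19950701
  UNSIGNED1 hour;  // 0..23
END;
logs := DATASET([{'199.72.81.55', 19950701, 0},
                 {'199.72.81.55', 19950701, 9}], LogRec);

// Tag each line with a window id: 0 -> [00-08), 1 -> [08-16), 2 -> [16-24)
Windowed := PROJECT(logs,
  TRANSFORM({LogRec, UNSIGNED1 windowId},
            SELF.windowId := LEFT.hour DIV 8,
            SELF := LEFT));
OUTPUT(Windowed);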

Week 6 – Day 2

At last, I found a dataset which is similar to the HPCC logs and, I believe, sufficient for testing. Here are the specifications of the dataset, which I freely got from: http://bytequest.net/index.php/2017/01/03/freely-available-large-datasets-to-try-out-hadoop/

Dataset: NASA-HTTP Web Server Log
Description: The NASA-HTTP Web Server Log data set contains all HTTP requests to the NASA Kennedy Space Center WWW server in Florida for the month of July 1995. Each record consists of the following fields: the host making the request, a timestamp in the format “DAY MON DD HH:MM:SS YYYY”, the request, the HTTP reply code, and the bytes in the reply. Detailed information about this dataset can be accessed at
http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
Download URL: ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
File Type: ASCII
File Size: Compressed GZ archive: 19.7 MB
Uncompressed ASCII: 205.2 MB

As the description mentions, the dataset contains HTTP requests. Hence, I will use my code to monitor the number of requests submitted per IP on a daily basis and use that feature to group users with my implementation.
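
Counting requests per IP per day maps naturally onto ECL’s crosstab form of TABLE. A minimal sketch, with an assumed record layout and a couple of sample hosts:

// Assumed layout for a parsed log line
LogRec := RECORD
  STRING30  host;
  UNSIGNED4 date;  // e.g. 19950701
END;
logs := DATASET([{'199.72.81.55', 19950701},
                 {'199.72.81.55', 19950701},
                 {'unicomp6.unicomp.net', 19950701}], LogRec);

// One row per (host, date) with the number of requests submitted
DailyCounts := TABLE(logs,
                     {logs.host, logs.date,
                      UNSIGNED4 requests := COUNT(GROUP)},
                     host, date);
OUTPUT(DailyCounts);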

Tomorrow, I will start by cleaning and preparing the data.

See you then 🙂

Week 6 – Day 1

Today I got my code set up for monitoring users’ submitted queries. More importantly, I discussed the project with my supervisors. From our discussion, I got two things to work on: my anomaly detection approach and my dataset.

So far, my approach has consisted in finding anomalies per user. In other words, I was treating each user’s activity individually. My current anomaly detection approach would detect abnormal behavior only if a user deviates from his or her own known pattern. My supervisor suggested that I also consider the users as a group, so as to form groups of users and find those who do not belong to any group and could therefore be potential threats.
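
One simple way to turn the group view into an outlier signal, sketched below with invented layouts and an arbitrary cutoff, is to compute each user’s distance to the nearest cluster centroid and flag users whose distance exceeds a threshold.

UserVec := RECORD
  UNSIGNED4 uid;
  REAL8     x;  // feature 1
  REAL8     y;  // feature 2
END;
CentVec := RECORD
  REAL8 cx;
  REAL8 cy;
END;
userVecs := DATASET([{1, 120, 80}, {2, 118, 85}, {3, 900, 5}], UserVec);
cents    := DATASET([{119.0, 82.5}], CentVec);

// Distance of each user to every centroid, keeping only the nearest
DistRec := RECORD
  UNSIGNED4 uid;
  REAL8     d;
END;
allDists := JOIN(userVecs, cents, TRUE,
                 TRANSFORM(DistRec,
                           SELF.uid := LEFT.uid,
                           SELF.d := SQRT(POWER(LEFT.x - RIGHT.cx, 2) +
                                          POWER(LEFT.y - RIGHT.cy, 2))), ALL);
nearest := DEDUP(SORT(allDists, uid, d), uid);

// Arbitrary cutoff purely for illustration
outliers := nearest(d > 100);
OUTPUT(outliers);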

Concerning the dataset, I have an issue at this level because, at the beginning of this project, I intended to use HPCC logs. However, I have not generated enough logs, and I am also the only user on my HPCC VM. Hence, I will either have to create a dataset or use an existing one which has some similarities with the HPCC logs. I did some research on the availability of such logs, and I think I will use an existing one. I should have the selection done by tomorrow.

See you then 🙂

Week 5 – Day 5

Week 5 could be summarized by the following picture:

The four figures represent the same data in different ways. They show my vectors, constructed by counting the number of requests submitted by users. The x-axis shows numbers from 1 to 6, respectively representing user #1 to user #6. The y-axis shows the number of requests submitted. The counts were made over three time slots: [00:00–08:00), [08:00–16:00), and [16:00–24:00). With this, the number of requests submitted by users can be monitored, and an unusual peak or drop can be detected.

Next week, I will be working on monitoring the average running time per query and updating my design document to include mechanisms for finding the root causes of detected anomalies.

Have a nice weekend 🙂

Week 5 – Day 4

Nothing much to say today. It was Independence Day, celebrated calmly at home. I made a tiny bit of progress though.

I spent almost 50% of my time today setting up my laptop. Why? As mentioned in my early posts, I work from “home” but not exactly: I work from campus (Kennesaw State University), where my thesis supervisor allocated me a desktop. I am used to working on it, and it is a lot faster than my laptop. As the school was closed today, I had to really work from home and use my slow laptop 😓. I had to download the ECL tools again. That setup should have been easy, but I got some weird errors with the latest VM, on which ThorMaster would not work properly (it would not let me spray files). I had to get a less recent version. Anyway, I was able to get my slow laptop ready to continue the project.

I did start working on implementing the code for detecting my second target anomaly, related to query running time. The only thing I did on that front was to extract my features from the logs: date, time, query run time, and query name. I will continue tomorrow with the analysis.

See you tomorrow 🙂

Week 5 – Day 3

Happy day 😊

As the image shows, I finally got to my first target result. Yesterday’s design on Excel has been extremely helpful. Now, I can monitor the number of queries submitted by users. The table shows a “uid” column representing the users. In the next column, we have their respective vectors, starting with “vid”, the vector ID, followed by the vector’s components.

Starting tomorrow, I will be targeting query running times.

Happy Independence Day 🎉