Week 9 – Day 2

Today, I continued working on setting up my features.

Also, I had a meeting with my supervisor about the project’s progress. Since I am having trouble finding a dataset from which I can detect my second target anomaly, we agreed to change it: I will now be targeting a workflow anomaly. Apart from the second anomaly, we also discussed the deliverables, since the internship is nearing its end. I really need to focus on producing an output: visuals showing some anomalies detected through my analysis.

Week 9 – Day 1

Welcome to day 1 of week 9.

Today, I worked on expanding my features, as advised by my supervisor. Initially, my dataset had the following attributes:

  • Record ID
  • Date
  • Time
  • Timezone
  • IP of the user making the request
  • Request
  • Reply code
  • Bytes in the reply

From these attributes, I applied some transformations to derive the following features (a sketch of the aggregation follows the list):

  • Average number of requests submitted within the first 8 hours of a day
  • Average number of requests submitted within the second 8 hours of a day
  • Average number of requests submitted within the last 8 hours of a day
  • The number of days the user submitted a request
  • The number of different requests submitted by the user
  • The average number of bytes in reply
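
The sketch below shows the kind of per-user aggregation I have in mind, in ECL. The field names match my cleaned layout, but the logs handle, the filename and the hourBand() helper are hypothetical stand-ins, and the per-day averaging is left out for brevity:

    // Hypothetical handle to the cleaned log file; the filename is made up
    logs := DATASET('~anomaly::cleaned_logs', layout, THOR);

    // Map an 'HH:MM:SS' time string to one of three 8-hour bands (0, 1, 2)
    hourBand(STRING8 t) := (UNSIGNED1)t[1..2] DIV 8;

    // One record per user: request counts per 8-hour band and the
    // average number of bytes in the reply
    features := TABLE(logs,
       {
          ip;
          UNSIGNED4 reqsBand0 := SUM(GROUP, IF(hourBand(time) = 0, 1, 0));
          UNSIGNED4 reqsBand1 := SUM(GROUP, IF(hourBand(time) = 1, 1, 0));
          UNSIGNED4 reqsBand2 := SUM(GROUP, IF(hourBand(time) = 2, 1, 0));
          REAL8     avgBytes  := AVE(GROUP, bytesInReply);
       },
       ip);

    // Distinct active days per user, for turning the counts into averages
    activeDays := TABLE(DEDUP(SORT(logs, ip, date), ip, date),
                        {ip, UNSIGNED4 nDays := COUNT(GROUP)}, ip);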

Week 8 – Day 5

I will name week 8 the “Dumb week” 😓

Yes, I have been dumb. I did write a full pipeline to detect the first anomaly (training, prediction and anomaly detection). I have spent the whole week coding. However, I have mostly been doing “extreme programming”. At the end of the week, after a meeting with my supervisor, I realized I was coding in the wrong direction.

My number of features was small, I added some useless functions, my separation of training and test sets was biased, and I was testing on a test set which was not similar to the training data. That is to say, I have done things, but not the right things.

Now, I have to do a little design work and clarify the approach before going back to the code.

Week 8 – Day 4

At last I was able to make some predictions today!

My training and prediction pipelines are fully set up. The figure on the right shows the elbow plot, which relates the number of centroids to the error sum of squares. We clearly have our elbow at 2 centroids, so I went with two centroids. It should be noted that I added a standardization step. Indeed, I am working with four features:

  • the number of requests submitted between 00:00 am (inclusive) and 08:00 am (exclusive)
  • the number of requests submitted between 08:00 am (inclusive) and 04:00 pm (exclusive)
  • the number of requests submitted between 04:00 pm (inclusive) and 00:00 am (exclusive)
  • the number of days the user submitted a request

The first three features have similar ranges of values, but the fourth does not. Therefore, I had to standardize all the features before training (a sketch follows).
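
Below is a minimal sketch of the z-score standardization, shown for the fourth feature only; featRec, feats and the inline values are hypothetical stand-ins for my real feature dataset:

    // Hypothetical feature layout; only the day-count feature is shown
    featRec := RECORD
       UNSIGNED4 userID;
       REAL8     nDays;   // the 4th feature, on a different scale
    END;
    feats := DATASET([{1, 12.0}, {2, 87.0}, {3, 3.0}], featRec);

    // Mean and standard deviation of the feature across all users
    mu := AVE(feats, nDays);
    sd := SQRT(VARIANCE(feats, nDays));

    // Replace each value with its z-score
    standardized := PROJECT(feats, TRANSFORM(featRec,
       SELF.nDays := (LEFT.nDays - mu) / sd,
       SELF := LEFT));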

Tomorrow, I will insert the anomaly detection code and upload the code to GitHub.

Week 8 – Day 3

Today was better than yesterday πŸ™‚

Thanks to my mentor, I found a solution to the problem I had yesterday. She gave me this link https://hpccsystems.com/training/documentation/ecl-language-reference/html/DATASET_from_TRANSFORM.html, on which I found a special way to use the DATASET functionality. It does everything I need to implement my method for setting the initial centroids.

Now, I am almost fully set to perform my first training runs. 🙂 A sketch of the idea follows.
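
Here is roughly how the DATASET-from-TRANSFORM form generates the evenly spaced centroid IDs; n and sz are hypothetical stand-ins for the number of centroids and the dataset size:

    n  := 3;    // number of centroids
    sz := 20;   // dataset size

    idRec := RECORD
       UNSIGNED4 id;
    END;

    // COUNTER runs from 1 to n; the first ID is 1, the rest are
    // multiples of sz DIV n (integer division)
    centroidIDs := DATASET(n, TRANSFORM(idRec,
       SELF.id := IF(COUNTER = 1, 1, (sz DIV n) * (COUNTER - 1))));

    OUTPUT(centroidIDs);   // yields IDs 1, 6, 12 for n = 3, sz = 20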

Week 8 – Day 2

Today I just got stuck on a method.

I am having trouble writing code to automate the process of setting up the initial centroid IDs for K-Means. The input would be the number of centroids n, and the output should be a set of IDs evenly spaced across the dataset.

Hence, I would like to get SET = [1, (size/n), (size/n)*2, (size/n)*3, … , (size/n)*(n-1)], where size is the dataset size and size/n uses integer division.

As an example, if n = 3 and the dataset size is 20, the output should be [1, 6, 12].

I spent half of the day on that without success. πŸ€•

Hope the night will help me find the solution 🤞

Week 8 – Day 1

Welcome to week 8!

Today, I did two things:

  • I cleaned and divided my dataset into a training set and a test set (see the sketch after this list). My cleaned dataset has the following data structure:

    EXPORT layout := RECORD
       UNSIGNED4 recID;
       STRING11 date;
       STRING8 time;
       UNSIGNED2 timezone;
       STRING53 ip;
       STRING333 msg;
       UNSIGNED2 replyCode;
       INTEGER4 bytesInReply;
    END;

  • Secondly, I started writing some reusable code. Indeed, for both my training and test data, I will use the same code to extract the features that will be fed to the ML algorithms.
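
A rough sketch of both steps; the filename and the modulus-based 80/20 split are assumptions, and requestsPerIP only illustrates the kind of shared feature extraction I mean:

    // Hypothetical handle to the cleaned file, using the layout above
    rawLogs := DATASET('~anomaly::cleaned_logs', layout, THOR);

    // Simple 80/20 split on the record ID
    trainSet := rawLogs(recID % 5 != 0);   // ~80% of the records
    testSet  := rawLogs(recID % 5 = 0);    // ~20% of the records

    // Reusable feature extraction: the same function runs on both sets
    requestsPerIP(DATASET(layout) logs) := FUNCTION
       RETURN TABLE(logs, {ip, UNSIGNED4 reqCount := COUNT(GROUP)}, ip);
    END;

    trainFeatures := requestsPerIP(trainSet);
    testFeatures  := requestsPerIP(testSet);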

I should be ready to train and test my algorithm tomorrow.

Take care πŸ™‚

Week 7 – Day 5

Last day of week 7!

Week 7 has been a Machine Learning week, or rather a K-Means week. I have mostly been tuning the clustering algorithm. Today, I realized that I had forgotten to put aside a test set for testing my clusters. That is not a big deal.

Moreover, with the precious help of my mentor Xu Lili, I fixed a mistake in my code related to the K-Means centroids. When I was running the algorithm, my final centroids’ positions were in 2D (x, y) while the dataset was in 4D. That was due to a small mistake I made while setting up the model. Anyway, I now have a full and clear line to follow to set up my anomaly detection system.

If everything goes well, I should be able to detect abnormal users by next week 🙂 Moreover, I have to write a paper related to my project for the 4th International Conference on Computational Systems and Information Technology for Sustainable Solutions [CSITSS – 2019] [http://csitss.rvce.edu.in/csitss2019/].

Yeah, many things to do, and that is what we live for: “to make good changes through positive actions”. Have a good weekend 😉

Week 7 – Day 4

[Figure: elbow plot]

For now, I have given up on K-Means++. I have used the elbow method to set my number of clusters to 2.

So, I have run the K-Means algorithm using 2 centroids. Next, I have to extract meaning from the output and see whether it makes sense or not. If not, I may have to feed more attributes to K-Means.

Week 7 – Day 3

This was one of those head-aching days 😩

I read an infinite number of resources discussing K-Means++ and, more precisely, its centroid initialization approach.

I ended up with this algorithm from https://medium.com/machine-learning-algorithms-from-scratch:

The steps involved in K-Means++ initialization are:

  1. Randomly select the first cluster center from the data points and append it to the centroid matrix.
  2. Loop over the number of centroids that need to be chosen (K):
  3. For each data point, calculate the squared Euclidean distance from the already chosen centroids and append the minimum distance to a Distance array.
  4. Calculate the probability of choosing each particular data point as the next centroid by dividing the Distance array elements by the sum of the Distance array. Let’s call this probability distribution PD.
  5. Calculate the cumulative probability distribution from this PD distribution. We know that the cumulative probability distribution ranges from 0 to 1.
  6. Select a random number between 0 and 1, get the index (i) of the cumulative probability distribution which is just greater than the chosen random number, and take the data point corresponding to the selected index (i) as the next centroid.
  7. Repeat the process until we have K cluster centers.

However, I am a bit stuck at the implementation level because I cannot generate a random number between 0 and 1 using ECL. I may have to find a way around it 🤔
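
One workaround I am considering, sketched below for steps 4 to 6: ECL’s RANDOM() returns a pseudo-random UNSIGNED4, so dividing it by 2^32 gives a value in [0, 1). The distRec layout and the inline distance values are made up for illustration:

    // Squared distances from step 3 (values made up for the sketch)
    distRec := RECORD
       UNSIGNED4 id;   // data point ID
       REAL8     d2;   // squared distance to the nearest chosen centroid
    END;
    dists := DATASET([{1, 4.0}, {2, 1.0}, {3, 9.0}, {4, 16.0}], distRec);

    total := SUM(dists, d2);

    // Step 4: per-point probability; step 5: cumulative distribution
    cumRec := RECORD
       UNSIGNED4 id;
       REAL8     cum;
    END;
    probs := PROJECT(dists, TRANSFORM(cumRec,
       SELF.cum := LEFT.d2 / total, SELF := LEFT));
    cdf := ITERATE(probs, TRANSFORM(cumRec,
       SELF.cum := LEFT.cum + RIGHT.cum, SELF := RIGHT));

    // Step 6: pseudo-random REAL in [0, 1); the first point whose
    // cumulative probability exceeds it becomes the next centroid
    r := RANDOM() / 4294967296.0;
    nextCentroidID := cdf(cum > r)[1].id;

    OUTPUT(nextCentroidID);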