Week 5 – Day 2

Today was better, though I still do not have my first final result!

The following figure shows my current state (on the left) and my target (on the right) for the user-monitoring implementation. The resulting vectors will be clustered with HPCC ML clustering algorithms.

Yesterday, I had a hard time reaching my target. During the night, I thought of a workaround and went for it this morning. However, I think I took it a bit too easy. I made the same mistake I made yesterday: I went straight to coding, with the idea that the gap (the transformations to make) between my current state and my target was not that big. While coding, I found it hard to name tables, variables, records, or anything else, because the exact transformations I had to make to reach my target were not clear. I forced myself to keep coding without getting anywhere.

So, I stopped coding and opened an Excel file. I could not reach my target by thinking while coding; I needed to write down the steps to see clearly what to do and how to do it. That is what I did in the afternoon, and I came up with the following:

Now the process is clear. Coding tomorrow should be less stressful 🙂

Week 5 – Day 1

I am just having headaches now 🤕

The code for cleaning the data is done. The dates have been fixed (HPCC Systems recorded them with a 4-hour difference from the actual time 🤷‍♂️). I have also successfully extracted my features: the usernames I will be monitoring in terms of how many requests they submit within a defined time interval. In addition, I have divided my dataset into time windows, as suggested by the “window id” column in the table below.
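In ECL terms, the windowing and counting step looks roughly like this. It is only a sketch with made-up names (LogRec, cleanLogs, one-hour windows), not my final code:

    // Sketch of the windowing idea: assign each request to a window,
    // then count requests per user per window. All names are placeholders.
    LogRec := RECORD
        STRING userName;
        UNSIGNED8 ts;   // seconds since epoch, already shifted to the right time
    END;

    WindowSize := 3600; // one-hour windows, as an example

    WithWindow := PROJECT(cleanLogs,
        TRANSFORM({LogRec, UNSIGNED4 windowId},
                  SELF.windowId := LEFT.ts DIV WindowSize,
                  SELF := LEFT));

    // One row per (user, window) with the number of submitted requests
    UserCounts := TABLE(WithWindow,
        {userName, windowId, UNSIGNED4 requestCount := COUNT(GROUP)},
        userName, windowId);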

However, since then, I have been struggling to arrange my features into a data structure that the HPCC-ML algorithms can handle easily. I did this data preparation before, but the difference is that I am now working with the full dataset. Before, I was using a single log file (i.e., one date); the full dataset has records for multiple dates. Put that way, it sounds like a simple grouping by date would solve the problem. But the ML algorithms and their inputs seem to be complex, and they are easier to run on simple data structures. For example, I tried to insert the “GROUP” keyword into the ToField function call and got an error. Anyway, I should be able to find a way around it.
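To make the problem concrete: the ML_Core bundle's ToField macro wants one flat, ungrouped dataset in which every numeric field becomes a cell of the NumericField matrix. So instead of grouping by date, the plan is to fold the date/window information into the rows themselves. A minimal sketch reusing the UserCounts shape from above (the field names are mine, and I have not run this exact code):

    IMPORT ML_Core;

    // Flat layout: one row per (user, window); the multi-date structure
    // lives in the rows, so no GROUP is needed.
    FlatRec := RECORD
        UNSIGNED8 id;            // sequential record id for ToField
        UNSIGNED4 requestCount;  // the feature to vectorize
    END;

    Flat := PROJECT(UserCounts,
        TRANSFORM(FlatRec,
                  SELF.id := COUNTER,
                  SELF.requestCount := LEFT.requestCount));

    // ToField converts each numeric field into the NumericField
    // cell format that the ML bundles expect.
    ML_Core.ToField(Flat, NF);
    // NF should now be usable as input to the clustering algorithms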

Hope you kickstarted your July without a headache but with a lot of joy 🎈

See you tomorrow 🙂

Week 4 – Day 5

Final day of week 4!

The week was kind of productive even though I did not meet my objectives. I wanted to have a full implementation of the current version of the design document, but I couldn’t. I can say I am 40% done with the clean code. I could have done more if I had not made that weird mistake yesterday.

Indeed, yesterday, I got almost nothing done because I could not output my dataset. I was getting this error: “Error: System error: 1301: Pool memory exhausted: pool id 4194304 exhausted, requested 3085 heap(1/4294967295) global(1/1216) WM(0..38) (in Disk Read G1 E2) (0, 0), 1301,”. This was because I had, for a reason I do not know yet, changed the file type of my dataset from CSV to THOR. It started working fine again when I switched back to CSV.
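For anyone curious, the difference is just the file-type flag of the DATASET declaration. A rough illustration with a made-up filename:

    Layout := RECORD
        STRING line;
    END;

    // Correct here: the logical file really contains CSV text
    logs := DATASET('~intern::esp::logs', Layout, CSV);

    // What I had by mistake: THOR makes the platform expect the flat
    // binary format, so it misreads the record boundaries
    // logs := DATASET('~intern::esp::logs', Layout, THOR);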

If nothing like that happens again, I should have version 0.1 of my project ready and uploaded next week.

Have a wonderful weekend 🙂

Week 4 – Day 4

It seems like having beautiful code that works perfectly on well-structured data is harder than I thought 🤔

Yesterday, I had my modules done (or so I thought!) except for generating vectors. I did find a workaround for vectorization, but somehow my code is still not ready. Whyyyy?

Well, I realized I had skipped some important steps just to have something running. I would even say one important step: !!! checking my log files !!! I thought that, since they were programmatically generated, they were clean, so I skipped the profiling part. But today, since I wanted to upload the code to GitHub with all the steps, I added a profile module and tested it on some log files. Surprise!!!

I found lines (usually in pairs) that were supposed to be a single line, lines full of “null” objects, quotation marks alone on their own lines, etc. That was enough to make me understand that I lied to myself when I said I was done with cleaning. Anyway, I did remove all that noise.
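The filter itself is nothing fancy; it is essentially a sketch like the one below (the patterns are illustrative, my real ones differ). Re-joining the lines that were split in pairs is a separate step, e.g. with ROLLUP, which I leave out here:

    // Trim each line, then drop the noise described above
    Trimmed := PROJECT(rawLines,
        TRANSFORM(RECORDOF(rawLines),
                  SELF.line := TRIM(LEFT.line, LEFT, RIGHT),
                  SELF := LEFT));

    CleanLines := Trimmed(
        line <> '' AND                                // empty lines
        line <> '"' AND                               // lone quotation marks
        NOT REGEXFIND('^("?null"?[, ]*)+$', line));   // all-null lines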

There is one more issue I will have to fix tomorrow before rerunning the process, cleanly this time. Indeed, I realized that, for some reason, the times in the log files are 4 hours off from my actual time. So, for example, when I worked past 8:00 PM, HPCC Systems recorded the logs in the log file of the next day. Hence, I have to bring everything back to the right time and date.
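The fix itself should be simple once everything is parsed: subtract the offset from each timestamp and re-derive the date from the shifted value. A minimal sketch, assuming the parsing step produces a seconds-since-epoch field named ts (hypothetical):

    Offset := 4 * 3600;   // the logs run 4 hours ahead of my actual time

    Shifted := PROJECT(parsedLogs,
        TRANSFORM(RECORDOF(parsedLogs),
                  SELF.ts := LEFT.ts - Offset,
                  SELF := LEFT));
    // The correct date then falls out of the shifted timestamp, so
    // late-evening work lands back on the right day.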

See you tomorrow, with the clean code 🙂

Week 4 – Day 3

Today was mmmmmmmmmmm ☹

My plan was to finish coding the current version of the design. Unfortunately, I couldn’t, due to some coding difficulties. However, I did complete almost every step: cleaning, feature extraction (user info), and windowing, but I got stuck a bit on vectorization.

I wanted to implement a reusable vectorization module, with the number of vectors as a parameter: a number that can be set without having to update the code logic. That was hard to achieve. I spent almost two hours on it without success. For now, the number of vectors is preset; if I want more or fewer vectors, I will have to add or remove lines of code 😔
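For what it is worth, the general ECL pattern I am aiming for is NORMALIZE with a runtime count, which fans each record out COUNTER times instead of needing one hand-written line per vector entry. A sketch under my own made-up names, not the code I fought with today:

    UserRec := RECORD
        UNSIGNED8 id;          // vector id
        SET OF REAL8 counts;   // per-window values gathered earlier
    END;

    SlotRec := RECORD
        UNSIGNED8 id;
        UNSIGNED4 slot;        // position within the vector
        REAL8     value;
    END;

    MakeVectors(DATASET(UserRec) users, UNSIGNED4 numSlots) := FUNCTION
        RETURN NORMALIZE(users, numSlots,
            TRANSFORM(SlotRec,
                      SELF.id := LEFT.id,
                      SELF.slot := COUNTER,
                      SELF.value := LEFT.counts[COUNTER]));
    END;

    // vectors := MakeVectors(userWindows, 24);  // e.g. 24 hourly slots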

With that step done, even though not the way I wanted, I will move on to the ML algorithms tomorrow.

See you tomorrow 🙂

Week 4 – Day 2

I spent most of this day “thinking” 🤔

Yesterday, I quickly wrote some code to go through the process of extracting my features (users’ info) and generating vectors for the ML algorithms, and I even tested AggloN (agglomerative hierarchical clustering) on my vectors. For this purpose, I used a single log file.

Today, I had to think about the whole training and testing process. I selected a dataset: 16 ESP log files that I generated while taking the online ECL classes. They are equivalent to 16 days of records. I will get 4 more files for testing purposes.

Apart from selecting my training and testing datasets, I started structuring the code to handle those files. For now, I have the code that gets the raw data and cleans it. Tomorrow, I will go ahead with all the other steps: feature extraction, time windowing, grouping, counting, vector generation, and training.
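The structure I have in mind is a single module whose exports follow the pipeline. Only the first two exports exist so far, and all the names here are provisional:

    EXPORT LogPipeline := MODULE
        EXPORT Raw := DATASET('~intern::esp::logs', {STRING line}, CSV);
        EXPORT Clean := Raw(line <> '');   // real cleaning logic goes here
        // Still to come:
        // EXPORT Features := ...;  // user info extraction
        // EXPORT Windows  := ...;  // time windowing, grouping, counting
        // EXPORT Vectors  := ...;  // vector generation for training
    END;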

See you tomorrow 🙂

Week 4 – Day 1

Today was OK!

My objectives for this week are:

  • Implement the current version of the design document
  • Improve the design document by adding approaches to find root causes of detected anomalies

By the way, here are some images of my target anomalies:

On the left, we have the average elapsed time per query, and on the right, the number of queries per user. The system I am implementing tries to detect whenever either of those two parameters does not behave as usual, meaning a query not running as usual or a user submitting an abnormal number of queries.
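Both quantities are plain crosstab aggregates in ECL. A sketch with hypothetical field names for the parsed log records:

    // Average elapsed time per query
    AvgElapsed := TABLE(parsedLogs,
        {queryName, REAL8 avgMs := AVE(GROUP, elapsedMs)}, queryName);

    // Number of queries per user
    QueriesPerUser := TABLE(parsedLogs,
        {userName, UNSIGNED4 n := COUNT(GROUP)}, userName);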

Today, I was able to write the code that generates the vectors to be fed to the ML algorithms.

Tomorrow, I will mostly be testing and refining the code. I hardcoded many parameters that should instead be passed as function arguments; that was just to have something testable. After testing, I will make sure to have well-structured, modular code.
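The refactor I mean is the usual one: a hardcoded constant becomes a parameter with a default, so callers can override it without touching the logic. Sketched with made-up names:

    LogRec := RECORD
        STRING userName;
        UNSIGNED8 ts;   // seconds since epoch
    END;

    // Before: WindowSize := 3600; baked into the windowing code.
    // After: the window size is an argument with a sensible default.
    AssignWindows(DATASET(LogRec) logs, UNSIGNED4 windowSize = 3600) := FUNCTION
        RETURN PROJECT(logs,
            TRANSFORM({LogRec, UNSIGNED4 windowId},
                      SELF.windowId := LEFT.ts DIV windowSize,
                      SELF := LEFT));
    END;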

Hopefully, I will get some good results.

See you tomorrow 😉

Week 3 – Day 5

Last working day of the week!

I got the log parsing code clean and ready to move forward. I started writing a bit of code to prepare my data for the ML algorithms, but I am not done yet.

I can say the week was productive. I am very grateful to my supervisors because they helped give my project a direction. Before this week, I was completely in the dark about the final output of the project. Now, even though it is not 100% clear, I do have a direction, since I have defined my target anomalies.

Next week, I will have to move further with implementing the current design document. Moreover, although I have defined my target anomalies and found an approach to detect them, I also need to update the design doc to include an approach for finding the root cause of the detected anomalies.

And yeah, that was the third week!

Wish you a wonderful weekend 🙂

Week 3 – Day 4

Today was cool 😎

In the morning, my supervisor replied about the updated design I sent her. The updated version contains the target anomalies (an anomalous number of requests emitted by a user, and queries taking an abnormally long time to process) and the approach to detect them. The reply was positive ✔, meaning I am heading in the right direction. She allowed me to start coding with the current design as a baseline.

However, while implementing the design, I will have to keep updating it. Indeed, apart from detecting the target anomalies, I have to find ways to also help engineers determine the root cause of the anomalies found.

So, I did start the implementation today, with the parsing step, which has changed with the new design. I had to play a bit with regular expressions to extract my features from the log messages. I ran some tests, which seemed to work.
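As a rough example of what I mean (the pattern and the sample line below are illustrative, not my real ones), REGEXFIND pulls out the n-th capture group of a match:

    SampleLine := '2020-06-24 14:02:11 ... user=jdoe ... elapsed=153ms';

    userName  := REGEXFIND('user=([a-zA-Z0-9_]+)', SampleLine, 1);
    elapsedMs := (UNSIGNED) REGEXFIND('elapsed=([0-9]+)ms', SampleLine, 1);

    OUTPUT(userName);   // 'jdoe'
    OUTPUT(elapsedMs);  // 153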

Tomorrow, I will clean the code (rename variables, remove unnecessary lines and variables, etc.) so the parsing phase is done. It should be noted, however, that I was having some doubts about my parsing approach. I was wondering whether I should design an automated parsing function, which was the plan before the new design. I sent an email to my supervisor about that. If I need to, I will have to change my current approach, which is rather manual.

See you tomorrow and stay blessed 🙂

Week 3 – Day 3

I cannot really say much about today.

I did make some progress, but I do not know yet whether it was in the right direction. As objectives, I decided to target the following anomalies:

  • Anomalous number of requests emitted by a user
  • Queries taking an abnormally long time to process

Hence, the anomaly detection system I am implementing should help HPCC Systems engineers find strange behaviors, both in the requests emitted by a user and in queries’ running times, without having to read the log files line by line. For this purpose, I defined an approach based on some queries I tested on WsECL. I wrote the approach down in a file that I sent to my mentors.
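The write-up itself went to my mentors, but to give a flavor of the kind of rule involved (this is one simple illustration, not necessarily what is in the document): a user's count for a window can be flagged when it strays too far from that user's usual behavior.

    // Illustrative rule: flag windows where a user's request count is
    // more than k standard deviations away from that user's mean.
    // UserCounts is the (userName, windowId, requestCount) shape
    // sketched in an earlier post above.
    Stats := TABLE(UserCounts,
        {userName,
         REAL8 mu := AVE(GROUP, requestCount),
         REAL8 sd := SQRT(VARIANCE(GROUP, requestCount))},
        userName);

    k := 3;
    Flagged := JOIN(UserCounts, Stats,
        LEFT.userName = RIGHT.userName AND
        ABS(LEFT.requestCount - RIGHT.mu) > k * RIGHT.sd,
        TRANSFORM(LEFT));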

I really hope I am heading in the right direction.

See you tomorrow for their reply 🤞