First day of the second part of the internship!
This morning I met with my supervisors to wrap up the first part, discuss the project’s current state, and decide what to do next. The first part was mostly about learning: learning ECL, learning about the HPCC logs, and designing an approach to tackle the project. I also started coding.
However, one of my main issues, raised again today, is the availability of a large dataset. For this project, I need to work with a large HPCC log file. Unfortunately, the logs I generate myself are not large enough, and the company, presumably for security reasons, cannot provide me with log files. So last week I found a similar log file, which I am currently using for testing.
As a reminder, one of the anomalies I am targeting is suspicious users, so I am clustering users to find outliers. I am going with KMeans, which is available in the HPCC Systems ML-Production bundles. Tomorrow, I will implement the elbow method to find a suitable number of clusters, then analyze the resulting clusters and adjust the parameters.
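To make the elbow method concrete, here is a minimal sketch in plain Python. It is not ECL and does not use the HPCC Systems KMeans bundle. The function names, the farthest-first initialization, the "largest bend" rule for picking the elbow, and the toy two-feature data are all assumptions chosen for illustration, not the actual project code or log data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-first start (an assumption; real KMeans
    implementations often use random or k-means++ initialization)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_inertia(points, k, max_iters=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = init_centroids(points, k)
    for _ in range(max_iters):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def elbow_curve(points, k_values):
    """Map each candidate k to the inertia reached by KMeans."""
    return {k: kmeans_inertia(points, k) for k in k_values}


def pick_elbow(curve):
    """Pick the k where the drop in inertia slows down the most
    (largest second difference). This is one simple heuristic among several."""
    ks = sorted(curve)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = (curve[prev] - curve[k]) - (curve[k] - curve[nxt])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Toy data: three tight groups of hypothetical per-user feature pairs.
offsets = [(-0.4, -0.4), (-0.4, 0.4), (0.4, -0.4), (0.4, 0.4), (0.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

curve = elbow_curve(points, range(1, 7))
for k in sorted(curve):
    print(k, round(curve[k], 2))
print("elbow at k =", pick_elbow(curve))
```

On this toy data the inertia drops sharply up to three clusters and then flattens, which is the bend the elbow method looks for. On real user features the curve is usually much less clear-cut, so the chosen k is only a starting point for inspecting the clusters by hand.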
See you tomorrow 🙂