Sonali Syngal

Machine Learning Specialist at Mastercard

Sonali Syngal is an Applied Scientist working in the fintech domain at Mastercard. Her diverse background in Statistics, Machine Learning and Economics allows her to approach problems in a multi-disciplinary way. Having studied at top colleges around the world, including Shri Ram College of Commerce, DU and Imperial College London, she has an exceptional academic background. Previously, she worked as an ML Researcher in Reinforcement Learning at Vodafone, London. Through her work, she aims to add value in business and industry by applying the mathematics of Machine Learning and data manipulation techniques that may not have been tried before on a given problem.

WATCH LIVE: December 8 @ 4:00PM – 4:20PM ET

Server Failure Detection using Deep Learning: Moving from Research Datasets to Real-World Industry Server Data

As industrial systems continue to grow in scale and complexity, an effective and proactive failure management approach helps mitigate the impact of server failure. Supervised methods perform poorly on real-world servers because of label noise in log data and their inability to detect unseen failures, while unsupervised techniques are often too simplistic to differentiate between complex log structures. Additionally, current log pre-processing techniques account only for textual similarities between logs when clustering them. We propose a semi-supervised solution that learns complex healthy and failure log patterns using an ensemble of deep-learning-based density and sequential models, along with statistical distribution modelling. We also propose a solution that clusters logs by taking into account both contextual and token similarities between them. Experimental evaluations on real-world log data show that our proposed solution outperforms other existing log-based anomaly detection methods in real-world applications. The solution was implemented for 3000 servers on 6 months of log data, and was able to pick up server failures up to 2 weeks in advance without raising an excess of false alarms.