And the Winner Is…: Insights from a Gradient Boosting (GBM) Benchmark
Szilard Pafka, PhD
Chief Scientist, Epoch
With all the hype about deep learning and “AI”, it is not well publicized that for the structured/tabular data widely encountered in business applications, it is actually another machine learning algorithm, the gradient boosting machine (GBM), that most often achieves the highest accuracy in supervised learning/prediction tasks. In this talk we’ll review some of the main open source GBM implementations, such as xgboost, h2o, lightgbm, catboost and Spark MLlib, and discuss their main performance characteristics. We’ll go more in depth than in any of my previous talks on the topic, and discuss in detail training speed, memory footprint, scaling to multiple CPU cores, performance degradation on hyperthreaded cores and multi-socket CPUs, performance on the latest Intel and AMD CPUs, GPU implementations, GPU utilization patterns etc., as well as several recent (2020) updates such as improved multi-core performance in xgboost and speedups in catboost.
Szilard studied Physics in the 90s and obtained a PhD by using statistical methods to analyze the risk of financial portfolios. He worked in finance, then more than a decade ago moved to become the Chief Scientist of a tech company in Santa Monica, California, doing “everything data”. He is the founder/organizer of several meetups in the Los Angeles area (R, data science etc.) and of the data science community website datascience.la. He is the author of a well-known machine learning benchmark on GitHub (1000+ stars), a frequent speaker at conferences (keynote/invited at KDD, R-finance, Crunch, eRum etc.), and he has developed and taught graduate data science courses at two universities (UCLA and CEU in Europe).