{"id":29923,"date":"2022-07-27T06:00:00","date_gmt":"2022-07-27T13:00:00","guid":{"rendered":"https:\/\/insidebigdata.com\/?p=29923"},"modified":"2023-06-23T12:40:35","modified_gmt":"2023-06-23T19:40:35","slug":"research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data","status":"publish","type":"post","link":"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/","title":{"rendered":"Research Highlights: Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data?"},"content":{"rendered":"\n<p><strong>Title of paper:<\/strong> <a href=\"https:\/\/arxiv.org\/pdf\/2207.08815.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Why do tree-based models still outperform deep learning on tabular data?<\/a><\/p>\n\n\n\n<p><strong>Author(s):<\/strong> L\u00e9o Grinsztajn, et al<\/p>\n\n\n\n<p><strong>Abstract: <\/strong>While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains with clear characteristics of tabular data and a benchmarking methodology accounting for both fitting models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data (\u223c10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide researchers aiming to build tabular-specific NNs: 1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions. To stimulate research on tabular architectures, we contribute a standard benchmark and raw data for baselines: every point of a 20 000 compute hours hyperparameter search for each learner.<\/p>\n\n\n<div class=\"wp-block-image is-style-default\">\n<figure class=\"aligncenter size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"691\" height=\"399\" src=\"https:\/\/insidebigdata.com\/wp-content\/uploads\/2022\/07\/Research_highlights_4.png\" alt=\"\" class=\"wp-image-29924\" srcset=\"https:\/\/insidebigdata.com\/wp-content\/uploads\/2022\/07\/Research_highlights_4.png 691w, https:\/\/insidebigdata.com\/wp-content\/uploads\/2022\/07\/Research_highlights_4-300x173.png 300w, https:\/\/insidebigdata.com\/wp-content\/uploads\/2022\/07\/Research_highlights_4-150x87.png 150w\" sizes=\"(max-width: 691px) 100vw, 691px\" \/><\/figure><\/div>\n\n\n<p><em>Sign up for the free insideBIGDATA&nbsp;<a rel=\"noreferrer noopener\" href=\"http:\/\/insidebigdata.com\/newsletter\/\" target=\"_blank\">newsletter<\/a>.<\/em><\/p>\n\n\n\n<p><em>Join us on Twitter:&nbsp;@InsideBigData1 \u2013 <a href=\"https:\/\/twitter.com\/InsideBigData1\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/twitter.com\/InsideBigData1<\/a><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this regular column we take a look at highlights for breaking research topics of the day in the areas of big data, data science, machine learning, AI and deep learning. For data scientists, it\u2019s important to keep connected with the research arm of the field in order to understand where the technology is headed. Enjoy!<\/p>\n","protected":false},"author":37,"featured_media":22835,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":""},"categories":[184,526,115,182,87,180,67,56,84,1303,1],"tags":[133,264,277,1126,1166,96,857],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.6 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Research Highlights: Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data? - insideBIGDATA<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Research Highlights: Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data? - insideBIGDATA\" \/>\n<meta property=\"og:description\" content=\"In this regular column we take a look at highlights for breaking research topics of the day in the areas of big data, data science, machine learning, AI and deep learning. For data scientists, it\u2019s important to keep connected with the research arm of the field in order to understand where the technology is headed. Enjoy!\" \/>\n<meta property=\"og:url\" content=\"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/\" \/>\n<meta property=\"og:site_name\" content=\"insideBIGDATA\" \/>\n<meta property=\"article:publisher\" content=\"http:\/\/www.facebook.com\/insidebigdata\" \/>\n<meta property=\"article:published_time\" content=\"2022-07-27T13:00:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-06-23T19:40:35+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/insidebigdata.com\/wp-content\/uploads\/2019\/06\/Data-Scientist-shutterstock_768047488.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"300\" \/>\n\t<meta property=\"og:image:height\" content=\"200\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Daniel Gutierrez\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@AMULETAnalytics\" \/>\n<meta name=\"twitter:site\" content=\"@insideBigData\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Daniel Gutierrez\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/\",\"url\":\"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/\",\"name\":\"Research Highlights: Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data? - insideBIGDATA\",\"isPartOf\":{\"@id\":\"https:\/\/insidebigdata.com\/#website\"},\"datePublished\":\"2022-07-27T13:00:00+00:00\",\"dateModified\":\"2023-06-23T19:40:35+00:00\",\"author\":{\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed\"},\"breadcrumb\":{\"@id\":\"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/insidebigdata.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Research Highlights: Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/insidebigdata.com\/#website\",\"url\":\"https:\/\/insidebigdata.com\/\",\"name\":\"insideBIGDATA\",\"description\":\"Your Source for AI, Data Science, Deep Learning &amp; Machine Learning Strategies\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/insidebigdata.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed\",\"name\":\"Daniel Gutierrez\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g\",\"caption\":\"Daniel Gutierrez\"},\"description\":\"Daniel D. Gutierrez is a Data Scientist with Los Angeles-based AMULET Analytics, a service division of AMULET Development Corp. He's been involved with data science and Big Data long before it came in vogue, so imagine his delight when the Harvard Business Review recently deemed \\\"data scientist\\\" as the sexiest profession for the 21st century. Previously, he taught computer science and database classes at UCLA Extension for over 15 years, and authored three computer industry books on database technology. He also served as technical editor, columnist and writer at a major computer industry monthly publication for 7 years. Follow his data science musings at @AMULETAnalytics.\",\"sameAs\":[\"http:\/\/www.insidebigdata.com\",\"https:\/\/twitter.com\/@AMULETAnalytics\"],\"url\":\"https:\/\/insidebigdata.com\/author\/dangutierrez\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Research Highlights: Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data? - insideBIGDATA","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/","og_locale":"en_US","og_type":"article","og_title":"Research Highlights: Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data? - insideBIGDATA","og_description":"In this regular column we take a look at highlights for breaking research topics of the day in the areas of big data, data science, machine learning, AI and deep learning. For data scientists, it\u2019s important to keep connected with the research arm of the field in order to understand where the technology is headed. Enjoy!","og_url":"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/","og_site_name":"insideBIGDATA","article_publisher":"http:\/\/www.facebook.com\/insidebigdata","article_published_time":"2022-07-27T13:00:00+00:00","article_modified_time":"2023-06-23T19:40:35+00:00","og_image":[{"width":300,"height":200,"url":"https:\/\/insidebigdata.com\/wp-content\/uploads\/2019\/06\/Data-Scientist-shutterstock_768047488.jpg","type":"image\/jpeg"}],"author":"Daniel Gutierrez","twitter_card":"summary_large_image","twitter_creator":"@AMULETAnalytics","twitter_site":"@insideBigData","twitter_misc":{"Written by":"Daniel Gutierrez","Est. reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/","url":"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/","name":"Research Highlights: Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data? - insideBIGDATA","isPartOf":{"@id":"https:\/\/insidebigdata.com\/#website"},"datePublished":"2022-07-27T13:00:00+00:00","dateModified":"2023-06-23T19:40:35+00:00","author":{"@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed"},"breadcrumb":{"@id":"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/insidebigdata.com\/2022\/07\/27\/research-highlights-why-do-tree-based-models-still-outperform-deep-learning-on-tabular-data\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/insidebigdata.com\/"},{"@type":"ListItem","position":2,"name":"Research Highlights: Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data?"}]},{"@type":"WebSite","@id":"https:\/\/insidebigdata.com\/#website","url":"https:\/\/insidebigdata.com\/","name":"insideBIGDATA","description":"Your Source for AI, Data Science, Deep Learning &amp; Machine Learning Strategies","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/insidebigdata.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed","name":"Daniel Gutierrez","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g","caption":"Daniel Gutierrez"},"description":"Daniel D. Gutierrez is a Data Scientist with Los Angeles-based AMULET Analytics, a service division of AMULET Development Corp. He's been involved with data science and Big Data long before it came in vogue, so imagine his delight when the Harvard Business Review recently deemed \"data scientist\" as the sexiest profession for the 21st century. Previously, he taught computer science and database classes at UCLA Extension for over 15 years, and authored three computer industry books on database technology. He also served as technical editor, columnist and writer at a major computer industry monthly publication for 7 years. Follow his data science musings at @AMULETAnalytics.","sameAs":["http:\/\/www.insidebigdata.com","https:\/\/twitter.com\/@AMULETAnalytics"],"url":"https:\/\/insidebigdata.com\/author\/dangutierrez\/"}]}},"jetpack_featured_media_url":"https:\/\/insidebigdata.com\/wp-content\/uploads\/2019\/06\/Data-Scientist-shutterstock_768047488.jpg","jetpack_shortlink":"https:\/\/wp.me\/p9eA3j-7MD","jetpack-related-posts":[{"id":28903,"url":"https:\/\/insidebigdata.com\/2022\/04\/09\/research-highlights-deep-neural-networks-and-tabular-data-a-survey\/","url_meta":{"origin":29923,"position":0},"title":"Research Highlights: Deep Neural Networks and Tabular Data: A Survey","date":"April 9, 2022","format":false,"excerpt":"In this regular column, we take a look at highlights for important research topics of the day for big data, data science, machine learning, AI and deep learning. It\u2019s important to keep connected with the research arm of the field in order to see where we\u2019re headed. In this edition,\u2026","rel":"","context":"In &quot;AI Deep Learning&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2022\/04\/Research_highlights_2.png?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":26685,"url":"https:\/\/insidebigdata.com\/2021\/07\/19\/best-of-arxiv-org-for-ai-machine-learning-and-deep-learning-june-2021\/","url_meta":{"origin":29923,"position":1},"title":"Best of arXiv.org for AI, Machine Learning, and Deep Learning \u2013 June 2021","date":"July 19, 2021","format":false,"excerpt":"In this recurring monthly feature, we will filter all the recent research papers appearing in the arXiv.org preprint server for subjects relating to AI, machine learning and deep learning \u2013 from disciplines including statistics, mathematics and computer science \u2013 and provide you with a useful \u201cbest of\u201d list for the\u2026","rel":"","context":"In &quot;AI Deep Learning&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2013\/12\/arxiv.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":30739,"url":"https:\/\/insidebigdata.com\/2023\/02\/23\/book-review-tree-based-methods-for-statistical-learning-in-r\/","url_meta":{"origin":29923,"position":2},"title":"Book Review: Tree-based Methods for Statistical Learning in R","date":"February 23, 2023","format":false,"excerpt":"Here's a new title that is a \"must have\" for any data scientist who uses the R language. It's a wonderful learning resource for tree-based techniques in statistical learning, one that's become my go-to text when I find the need to do a deep dive into various ML topic areas\u2026","rel":"","context":"In &quot;Big Data&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2022\/10\/Tree-based-Methods-book.png?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":26235,"url":"https:\/\/insidebigdata.com\/2021\/05\/17\/best-of-arxiv-org-for-ai-machine-learning-and-deep-learning-april-2021\/","url_meta":{"origin":29923,"position":3},"title":"Best of arXiv.org for AI, Machine Learning, and Deep Learning \u2013 April 2021","date":"May 17, 2021","format":false,"excerpt":"In this recurring monthly feature, we will filter all the recent research papers appearing in the arXiv.org preprint server for subjects relating to AI, machine learning and deep learning \u2013 from disciplines including statistics, mathematics and computer science \u2013 and provide you with a useful \u201cbest of\u201d list for the\u2026","rel":"","context":"In &quot;AI Deep Learning&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2013\/12\/arxiv.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":25642,"url":"https:\/\/insidebigdata.com\/2021\/02\/18\/best-of-arxiv-org-for-ai-machine-learning-and-deep-learning-january-2021\/","url_meta":{"origin":29923,"position":4},"title":"Best of arXiv.org for AI, Machine Learning, and Deep Learning \u2013 January 2021","date":"February 18, 2021","format":false,"excerpt":"In this recurring monthly feature, we will filter all the recent research papers appearing in the arXiv.org preprint server for subjects relating to AI, machine learning and deep learning \u2013 from disciplines including statistics, mathematics and computer science \u2013 and provide you with a useful \u201cbest of\u201d list for the\u2026","rel":"","context":"In &quot;AI Deep Learning&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2013\/12\/arxiv.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":26882,"url":"https:\/\/insidebigdata.com\/2021\/08\/16\/best-of-arxiv-org-for-ai-machine-learning-and-deep-learning-july-2021\/","url_meta":{"origin":29923,"position":5},"title":"Best of arXiv.org for AI, Machine Learning, and Deep Learning \u2013 July 2021","date":"August 16, 2021","format":false,"excerpt":"In this recurring monthly feature, we will filter all the recent research papers appearing in the arXiv.org preprint server for subjects relating to AI, machine learning and deep learning \u2013 from disciplines including statistics, mathematics and computer science \u2013 and provide you with a useful \u201cbest of\u201d list for the\u2026","rel":"","context":"In &quot;AI Deep Learning&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2013\/12\/arxiv.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/29923"}],"collection":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/users\/37"}],"replies":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/comments?post=29923"}],"version-history":[{"count":0,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/29923\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/media\/22835"}],"wp:attachment":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/media?parent=29923"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/categories?post=29923"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/tags?post=29923"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}