{"id":12428,"date":"2014-11-26T06:00:52","date_gmt":"2014-11-26T14:00:52","guid":{"rendered":"http:\/\/insidebigdata.com\/?p=12428"},"modified":"2022-09-09T10:55:39","modified_gmt":"2022-09-09T17:55:39","slug":"ask-data-scientist-data-leakage","status":"publish","type":"post","link":"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/","title":{"rendered":"Ask a Data Scientist: Data Leakage"},"content":{"rendered":"<p>Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d Once a week you\u2019ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist \u2013 sometimes by me and other times by an Intel data scientist. Think of this new insideBIGDATA feature as a valuable resource for you to get up to speed in this flourishing area of technology. If you have a data science question you\u2019d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidehpc.com. This week\u2019s question is from a reader who asks for an explanation of data leakage.<\/p>\n<p><strong>Q: What is data leakage?<\/strong><\/p>\n<p><strong>A: <\/strong>As a data scientist, you should always be aware of circumstances that may cause your machine learning algorithms to underestimate their generalization error, as this may render them useless in the solution of real-world problems. One such potential problem is called <em>data leakage<\/em> \u2013 when the data you are using to train a machine learning algorithm happens to have the information you are trying to predict. It is undesirable because it leads to poor generalization and over-estimation of expected performance. Data leakage often occurs subtly and inadvertently and may result in overfitting. 
A leading text in the field names data leakage as one of the top ten machine learning mistakes.[1]<\/p>\n<p>Data leakage can manifest in many ways, including:<\/p>\n<ul>\n<li>Leaking data from the test set into the training set.<\/li>\n<li>Leaking the correct prediction or ground truth into the test data.<\/li>\n<li>Leaking information from the future into the past.<\/li>\n<li>Reversing obfuscation, randomization or anonymization that was intentionally applied to the data.<\/li>\n<li>Information from data samples outside the scope of the algorithm&#8217;s intended use.<\/li>\n<li>Any of the above existing in external data coupled with the training set.<\/li>\n<\/ul>\n<p>In general, data leakage comes from two sources in a machine learning algorithm \u2013 the feature variables and the training set. A trivial example of data leakage would be a model that uses the response variable itself as a predictor, thus concluding, for example, that \u201cit is sunny on sunny days.\u201d<\/p>\n<p>As a more concrete example, consider the use of a \u201ccustomer service rep name\u201d feature variable in a SaaS company churn prediction algorithm. Using the name of the rep who interviewed a customer when they churned might seem innocent enough until you find out that a specific rep was assigned to take over customer accounts where customers had already indicated they intended to churn. In this case, the resulting algorithm would be highly predictive of whether the customer had churned but would be useless for making predictions on new customers. This is an extreme example &#8211; many more instances of data leakage occur in subtle and hard-to-detect ways. There are war stories of algorithms with data leakage running in production systems for years before the bugs in the data creation or training scripts were detected.<\/p>\n<p>Identifying data leakage beforehand and correcting for it is an important part of improving the definition of a machine learning problem. 
Many forms of leakage are subtle and are best detected by trying to extract features and train state-of-the-art algorithms on the problem. Here are several strategies to find and eliminate data leakage:<\/p>\n<ul>\n<li>Exploratory data analysis (EDA) can be a powerful tool for identifying data leakage. EDA allows you to become more familiar with the raw data by examining it through statistical and visualization tools. This kind of examination can reveal data leakage as patterns in the data that are surprising.<\/li>\n<li>If the performance of your algorithm is too good to be true, data leakage may be the reason. Compare your results against prior or competing documented results for the problem at hand. A substantial divergence from this expected performance merits testing the algorithm more closely to establish legitimacy.<\/li>\n<li>Perform early in-the-field testing of algorithms. Any significant data leakage would be reflected as a difference between estimated and realized out-of-sample performance. This is perhaps the best approach to identifying data leakage, but it is also the most expensive to implement. It can also be challenging to isolate the cause of such a performance discrepancy as data leakage, since the cause could actually be classical over-fitting, sampling bias, etc.<\/li>\n<\/ul>\n<p>Once data leakage has been identified, the next step is to figure out how to fix it (or even whether you want to try). For some problems, living with data leakage without attempting to fix it could be acceptable. But if you decide to fix the leakage, care must be taken not to make matters worse. Usually, when there is one leaking feature variable, there are others. Removing the obvious leaks that are detected may exacerbate the effect of undetected ones, and engaging in feature modification in an attempt to plug obvious leaks could create others. 
The idea is to try to figure out the legitimacy of specific observations and\/or feature variables and work to plug the leak and hopefully seal it completely. Rectifying data leakage is an active field of research that will likely yield effective results in the near future.<\/p>\n<p>[1] Nisbet, R., Elder, J. and Miner, G. 2009. <em>Handbook of Statistical Analysis and Data Mining Applications<\/em>. Academic Press.<\/p>\n<p>If you have a question you\u2019d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidehpc.com.<\/p>\n<p><strong>Data Scientist:<\/strong> Daniel D. Gutierrez \u2013 Managing Editor, insideBIGDATA<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d Once a week you\u2019ll see reader submitted questions of varying levels of technical detail answered by a practicing data scientist \u2013 sometimes by me and other times by an Intel data scientist. 
This week\u2019s question is from a reader who asks for an explanation of data leakage.<\/p>\n","protected":false},"author":37,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":""},"categories":[239,87,180,210,67,56,1],"tags":[95],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.6 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Ask a Data Scientist: Data Leakage - insideBIGDATA<\/title>\n<meta name=\"description\" content=\"Ask a Data Scientist: Data Leakage\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Ask a Data Scientist: Data Leakage - insideBIGDATA\" \/>\n<meta property=\"og:description\" content=\"Ask a Data Scientist: Data Leakage\" \/>\n<meta property=\"og:url\" content=\"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/\" \/>\n<meta property=\"og:site_name\" content=\"insideBIGDATA\" \/>\n<meta property=\"article:publisher\" content=\"http:\/\/www.facebook.com\/insidebigdata\" \/>\n<meta property=\"article:published_time\" content=\"2014-11-26T14:00:52+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-09-09T17:55:39+00:00\" \/>\n<meta name=\"author\" content=\"Daniel Gutierrez\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@AMULETAnalytics\" \/>\n<meta name=\"twitter:site\" content=\"@insideBigData\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Daniel Gutierrez\" \/>\n\t<meta 
name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/\",\"url\":\"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/\",\"name\":\"Ask a Data Scientist: Data Leakage - insideBIGDATA\",\"isPartOf\":{\"@id\":\"https:\/\/insidebigdata.com\/#website\"},\"datePublished\":\"2014-11-26T14:00:52+00:00\",\"dateModified\":\"2022-09-09T17:55:39+00:00\",\"author\":{\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed\"},\"description\":\"Ask a Data Scientist: Data Leakage\",\"breadcrumb\":{\"@id\":\"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/insidebigdata.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Ask a Data Scientist: Data Leakage\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/insidebigdata.com\/#website\",\"url\":\"https:\/\/insidebigdata.com\/\",\"name\":\"insideBIGDATA\",\"description\":\"Your Source for AI, Data Science, Deep Learning &amp; Machine Learning Strategies\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/insidebigdata.com\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed\",\"name\":\"Daniel Gutierrez\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g\",\"caption\":\"Daniel Gutierrez\"},\"description\":\"Daniel D. Gutierrez is a Data Scientist with Los Angeles-based AMULET Analytics, a service division of AMULET Development Corp. He's been involved with data science and Big Data long before it came in vogue, so imagine his delight when the Harvard Business Review recently deemed \\\"data scientist\\\" as the sexiest profession for the 21st century. Previously, he taught computer science and database classes at UCLA Extension for over 15 years, and authored three computer industry books on database technology. He also served as technical editor, columnist and writer at a major computer industry monthly publication for 7 years. Follow his data science musings at @AMULETAnalytics.\",\"sameAs\":[\"http:\/\/www.insidebigdata.com\",\"https:\/\/twitter.com\/@AMULETAnalytics\"],\"url\":\"https:\/\/insidebigdata.com\/author\/dangutierrez\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Ask a Data Scientist: Data Leakage - insideBIGDATA","description":"Ask a Data Scientist: Data Leakage","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/","og_locale":"en_US","og_type":"article","og_title":"Ask a Data Scientist: Data Leakage - insideBIGDATA","og_description":"Ask a Data Scientist: Data Leakage","og_url":"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/","og_site_name":"insideBIGDATA","article_publisher":"http:\/\/www.facebook.com\/insidebigdata","article_published_time":"2014-11-26T14:00:52+00:00","article_modified_time":"2022-09-09T17:55:39+00:00","author":"Daniel Gutierrez","twitter_card":"summary_large_image","twitter_creator":"@AMULETAnalytics","twitter_site":"@insideBigData","twitter_misc":{"Written by":"Daniel Gutierrez","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/","url":"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/","name":"Ask a Data Scientist: Data Leakage - insideBIGDATA","isPartOf":{"@id":"https:\/\/insidebigdata.com\/#website"},"datePublished":"2014-11-26T14:00:52+00:00","dateModified":"2022-09-09T17:55:39+00:00","author":{"@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed"},"description":"Ask a Data Scientist: Data Leakage","breadcrumb":{"@id":"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/insidebigdata.com\/2014\/11\/26\/ask-data-scientist-data-leakage\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/insidebigdata.com\/"},{"@type":"ListItem","position":2,"name":"Ask a Data Scientist: Data Leakage"}]},{"@type":"WebSite","@id":"https:\/\/insidebigdata.com\/#website","url":"https:\/\/insidebigdata.com\/","name":"insideBIGDATA","description":"Your Source for AI, Data Science, Deep Learning &amp; Machine Learning Strategies","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/insidebigdata.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed","name":"Daniel 
Gutierrez","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g","caption":"Daniel Gutierrez"},"description":"Daniel D. Gutierrez is a Data Scientist with Los Angeles-based AMULET Analytics, a service division of AMULET Development Corp. He's been involved with data science and Big Data long before it came in vogue, so imagine his delight when the Harvard Business Review recently deemed \"data scientist\" as the sexiest profession for the 21st century. Previously, he taught computer science and database classes at UCLA Extension for over 15 years, and authored three computer industry books on database technology. He also served as technical editor, columnist and writer at a major computer industry monthly publication for 7 years. Follow his data science musings at @AMULETAnalytics.","sameAs":["http:\/\/www.insidebigdata.com","https:\/\/twitter.com\/@AMULETAnalytics"],"url":"https:\/\/insidebigdata.com\/author\/dangutierrez\/"}]}},"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p9eA3j-3es","jetpack-related-posts":[{"id":20645,"url":"https:\/\/insidebigdata.com\/2018\/06\/30\/insidebigdata-ask-data-scientist-series\/","url_meta":{"origin":12428,"position":0},"title":"insideBIGDATA &#8220;Ask a Data Scientist&#8221; Series","date":"June 30, 2018","format":false,"excerpt":"Welcome to the series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist\u201d from insideBIGDATA's popular Data Science 101 channel. These articles constitute many of our site's most popular resources for newbie data scientists. 
The 12 articles listed below were from reader submitted questions of varying levels of technical\u2026","rel":"","context":"In &quot;Data Science&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2018\/06\/data-scientist-300x300_insidebigdata.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":12045,"url":"https:\/\/insidebigdata.com\/2014\/10\/15\/ask-data-scientist-curse-dimensionality\/","url_meta":{"origin":12428,"position":1},"title":"Ask a Data Scientist: Curse of Dimensionality","date":"October 15, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d Once a week you\u2019ll see reader submitted questions of varying levels of technical detail answered by a practicing data scientist \u2013 sometimes by me and other times by an Intel data scientist. This week\u2019s question\u2026","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12345,"url":"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/","url_meta":{"origin":12428,"position":2},"title":"Ask a Data Scientist: The Data Science Process","date":"November 12, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d This week\u2019s question is from a reader who wonders if there is a general process for conducting data science projects.","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"DataScienceProcess","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/DataScienceProcess.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":12352,"url":"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/","url_meta":{"origin":12428,"position":3},"title":"Ask a Data Scientist: Unsupervised 
Learning","date":"November 19, 2014","format":false,"excerpt":"Welcome back to the \u201cAsk a Data Scientist\u201d article series. This week\u2019s question is from a reader who asks for an overview of unsupervised machine learning.","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/awwicker.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":12294,"url":"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/","url_meta":{"origin":12428,"position":4},"title":"Ask a Data Scientist: Handling Missing Data","date":"October 29, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d This week\u2019s question is from a reader who seeks a discussion of missing data handling methods such as imputation.","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12481,"url":"https:\/\/insidebigdata.com\/2014\/12\/10\/ask-data-scientist-confounding-variables\/","url_meta":{"origin":12428,"position":5},"title":"Ask a Data Scientist: Confounding Variables","date":"December 10, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d This week\u2019s question is from a reader who asks for an explanation of confounding variables and why they're important in data science projects.","rel":"","context":"In &quot;Ask a Data 
Scientist&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2018\/06\/data-scientist-300x300_insidebigdata.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/12428"}],"collection":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/users\/37"}],"replies":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/comments?post=12428"}],"version-history":[{"count":0,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/12428\/revisions"}],"wp:attachment":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/media?parent=12428"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/categories?post=12428"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/tags?post=12428"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}