{"id":12294,"date":"2014-10-29T12:56:55","date_gmt":"2014-10-29T19:56:55","guid":{"rendered":"http:\/\/insidebigdata.com\/?p=12294"},"modified":"2018-05-22T14:45:21","modified_gmt":"2018-05-22T21:45:21","slug":"ask-data-scientist-handling-missing-data","status":"publish","type":"post","link":"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/","title":{"rendered":"Ask a Data Scientist: Handling Missing Data"},"content":{"rendered":"<p><img decoding=\"async\" loading=\"lazy\" class=\"alignright size-full wp-image-8806\" src=\"https:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/04\/datascientist2_featured.jpg\" alt=\"\" width=\"195\" height=\"166\" \/>Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d Once a week you\u2019ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist \u2013 sometimes by me and other times by an Intel data scientist. Think of this new insideBIGDATA feature as a valuable resource for you to get up to speed in this flourishing area of technology. If you have a big data question you\u2019d like answered, please just enter a comment below, or send an e-mail to me at: <a href=\"mailto:daniel@insidebigdata.com\" target=\"_blank\" rel=\"noopener\">daniel@insidebigdata.com<\/a>. This week\u2019s question is from a reader who seeks a discussion of missing data handling methods such as imputation.<\/p>\n<p><strong>Q:<\/strong> <strong>How do you handle missing data? What imputation techniques do you recommend?<\/strong><\/p>\n<p><strong>A: <\/strong>Handling missing data is an important part of the data munging process that is integral to all data science projects. Incomplete observations can adversely affect the operation of machine learning algorithms, so the data scientist must have procedures in place to properly manage this situation. 
Data imputation is one such procedure \u2013 it is the process of filling in missing values based on other data.<\/p>\n<p>Common methods of dealing with unknown or missing values include:<\/p>\n<ol start=\"1\">\n<li>Removing entire observations containing one or more unknown values<\/li>\n<li>Filling in unknown values with the most frequent values<\/li>\n<li>Filling in unknown values by exploring correlations<\/li>\n<li>Filling in unknown values by exploring similarities between cases<\/li>\n<\/ol>\n<p>How can we systematically fill in missing values? Some machine learning algorithm implementations automatically remove observations containing missing values (which may introduce bias or affect the representativeness of the results), but in many cases you have to impute the data manually before running the function. As outlined above, there are several options here. For example, you might delete all incomplete observations, even though this decreases the power of the analysis by reducing the effective sample size.<\/p>\n<p>The simplest imputation method treats every variable individually, ignoring any interrelationships with other variables. One option is to replace each missing value with the mean or median of that variable over the observations where it was recorded. Mean imputation has the benefit of leaving the sample mean of that variable unchanged; median imputation does not guarantee this, but it is less sensitive to outliers. As an example, consider the following sequence of values:<\/p>\n<p>1,2,3,1,3,1,,,,3<\/p>\n<p>The sequence has three missing values, all of which could be replaced by 2, which is both the mean and the median of the seven non-missing values.<\/p>\n<p>Mean imputation, however, attenuates any correlations involving the variable(s) that are imputed. This is because, in the observations that were imputed, the filled-in value is a constant, so there is guaranteed to be no relationship between the imputed variable and any other measured variables. 
Thus, mean imputation has some attractive properties for univariate analysis but becomes problematic for multivariate analysis.<\/p>\n<p>In machine learning, it is sometimes possible to train a classifier directly over the original data without imputing it first. This approach has been reported to yield better performance when the missing data are structurally absent (the value does not exist for that observation) rather than missing because of measurement noise.<\/p>\n<p>There are many more advanced imputation methods designed to address the problem of missing data. These methods exploit interrelationships between variables and may impute multiple values rather than a single value (an approach known as multiple imputation). In general, the treatment of missing data is an important data quality issue for data scientists. A suitable solution depends on the available computational resources, especially for big data scale data sets, as well as on the tolerance for errors in approximating missing values.<\/p>\n<p>If you have a question you\u2019d like answered, please just enter a comment below, or send an e-mail to me at:\u00a0<a href=\"mailto:daniel@insidebigdata.com\" target=\"_blank\" rel=\"noopener\">daniel@insidebigdata.com<\/a>.<\/p>\n<p>Daniel D. 
Gutierrez \u2013 Managing Editor &amp; Resident Data Scientist, insideBIGDATA<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d This week\u2019s question is from a reader who seeks a discussion of missing data handling methods such as imputation.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":""},"categories":[239,189,182,170,60],"tags":[133,261],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.6 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Ask a Data Scientist: Handling Missing Data - insideBIGDATA<\/title>\n<meta name=\"description\" content=\"How do you handle missing data? What imputation techniques do you recommend?\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Ask a Data Scientist: Handling Missing Data - insideBIGDATA\" \/>\n<meta property=\"og:description\" content=\"How do you handle missing data? 
What imputation techniques do you recommend?\" \/>\n<meta property=\"og:url\" content=\"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/\" \/>\n<meta property=\"og:site_name\" content=\"insideBIGDATA\" \/>\n<meta property=\"article:publisher\" content=\"http:\/\/www.facebook.com\/insidebigdata\" \/>\n<meta property=\"article:author\" content=\"https:\/\/www.facebook.com\/rich.brueckner.insideHPC\" \/>\n<meta property=\"article:published_time\" content=\"2014-10-29T19:56:55+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-05-22T21:45:21+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/04\/datascientist2_featured.jpg\" \/>\n<meta name=\"author\" content=\"Rich Brueckner\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@insidehpc\" \/>\n<meta name=\"twitter:site\" content=\"@insideBigData\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Rich Brueckner\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/\",\"url\":\"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/\",\"name\":\"Ask a Data Scientist: Handling Missing Data - insideBIGDATA\",\"isPartOf\":{\"@id\":\"https:\/\/insidebigdata.com\/#website\"},\"datePublished\":\"2014-10-29T19:56:55+00:00\",\"dateModified\":\"2018-05-22T21:45:21+00:00\",\"author\":{\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/5694baa4fa557b2df36513cd3099b649\"},\"description\":\"How do you handle missing data? 
What imputation techniques do you recommend?\",\"breadcrumb\":{\"@id\":\"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/insidebigdata.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Ask a Data Scientist: Handling Missing Data\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/insidebigdata.com\/#website\",\"url\":\"https:\/\/insidebigdata.com\/\",\"name\":\"insideBIGDATA\",\"description\":\"Your Source for AI, Data Science, Deep Learning &amp; Machine Learning Strategies\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/insidebigdata.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/5694baa4fa557b2df36513cd3099b649\",\"name\":\"Rich Brueckner\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/71ad1630f386966fe810adeb2aee1eb0?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/71ad1630f386966fe810adeb2aee1eb0?s=96&d=mm&r=g\",\"caption\":\"Rich Brueckner\"},\"sameAs\":[\"http:\/\/insidebigdata.com\",\"https:\/\/www.facebook.com\/rich.brueckner.insideHPC\",\"https:\/\/twitter.com\/insidehpc\"],\"url\":\"https:\/\/insidebigdata.com\/author\/rich\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Ask a Data Scientist: Handling Missing Data - insideBIGDATA","description":"How do you handle missing data? What imputation techniques do you recommend?","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/","og_locale":"en_US","og_type":"article","og_title":"Ask a Data Scientist: Handling Missing Data - insideBIGDATA","og_description":"How do you handle missing data? What imputation techniques do you recommend?","og_url":"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/","og_site_name":"insideBIGDATA","article_publisher":"http:\/\/www.facebook.com\/insidebigdata","article_author":"https:\/\/www.facebook.com\/rich.brueckner.insideHPC","article_published_time":"2014-10-29T19:56:55+00:00","article_modified_time":"2018-05-22T21:45:21+00:00","og_image":[{"url":"https:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/04\/datascientist2_featured.jpg"}],"author":"Rich Brueckner","twitter_card":"summary_large_image","twitter_creator":"@insidehpc","twitter_site":"@insideBigData","twitter_misc":{"Written by":"Rich Brueckner","Est. reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/","url":"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/","name":"Ask a Data Scientist: Handling Missing Data - insideBIGDATA","isPartOf":{"@id":"https:\/\/insidebigdata.com\/#website"},"datePublished":"2014-10-29T19:56:55+00:00","dateModified":"2018-05-22T21:45:21+00:00","author":{"@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/5694baa4fa557b2df36513cd3099b649"},"description":"How do you handle missing data? 
What imputation techniques do you recommend?","breadcrumb":{"@id":"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/insidebigdata.com\/"},{"@type":"ListItem","position":2,"name":"Ask a Data Scientist: Handling Missing Data"}]},{"@type":"WebSite","@id":"https:\/\/insidebigdata.com\/#website","url":"https:\/\/insidebigdata.com\/","name":"insideBIGDATA","description":"Your Source for AI, Data Science, Deep Learning &amp; Machine Learning Strategies","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/insidebigdata.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/5694baa4fa557b2df36513cd3099b649","name":"Rich Brueckner","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/71ad1630f386966fe810adeb2aee1eb0?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/71ad1630f386966fe810adeb2aee1eb0?s=96&d=mm&r=g","caption":"Rich 
Brueckner"},"sameAs":["http:\/\/insidebigdata.com","https:\/\/www.facebook.com\/rich.brueckner.insideHPC","https:\/\/twitter.com\/insidehpc"],"url":"https:\/\/insidebigdata.com\/author\/rich\/"}]}},"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p9eA3j-3ci","jetpack-related-posts":[{"id":20645,"url":"https:\/\/insidebigdata.com\/2018\/06\/30\/insidebigdata-ask-data-scientist-series\/","url_meta":{"origin":12294,"position":0},"title":"insideBIGDATA &#8220;Ask a Data Scientist&#8221; Series","date":"June 30, 2018","format":false,"excerpt":"Welcome to the series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist\u201d from insideBIGDATA's popular Data Science 101 channel. These articles constitute many of our site's most popular resources for newbie data scientists. The 12 articles listed below were from reader submitted questions of varying levels of technical\u2026","rel":"","context":"In &quot;Data Science&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2018\/06\/data-scientist-300x300_insidebigdata.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":20464,"url":"https:\/\/insidebigdata.com\/2018\/05\/31\/data-science-101-handling-missing-data-revisited\/","url_meta":{"origin":12294,"position":1},"title":"Data Science 101: Handling Missing Data (Revisited)","date":"May 31, 2018","format":false,"excerpt":"I recently received the following question on data science methods from an avid reader of insideBIGDATA who hails from Taiwan. I think the topics are very relevant to many folks in our audience so I decided to run it here in our Data Science 101 channel. 
The issue of missing\u2026","rel":"","context":"In &quot;Data Science&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2018\/05\/Reader_question_fig3.png?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":12345,"url":"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/","url_meta":{"origin":12294,"position":2},"title":"Ask a Data Scientist: The Data Science Process","date":"November 12, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d This week\u2019s question is from a reader who wonders if there is a general process for conducting data science projects.","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"DataScienceProcess","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/DataScienceProcess.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":12317,"url":"https:\/\/insidebigdata.com\/2014\/11\/09\/ask-data-scientist-importance-exploratory-data-analysis\/","url_meta":{"origin":12294,"position":3},"title":"Ask a Data Scientist: The Importance of Exploratory Data Analysis","date":"November 9, 2014","format":false,"excerpt":"Q: What is the role of exploratory data analysis in data science?","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12352,"url":"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/","url_meta":{"origin":12294,"position":4},"title":"Ask a Data Scientist: Unsupervised Learning","date":"November 19, 2014","format":false,"excerpt":"Welcome back to the \u201cAsk a Data Scientist\u201d article series. 
This week\u2019s question is from a reader who asks for an overview of unsupervised machine learning.","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/awwicker.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":12045,"url":"https:\/\/insidebigdata.com\/2014\/10\/15\/ask-data-scientist-curse-dimensionality\/","url_meta":{"origin":12294,"position":5},"title":"Ask a Data Scientist: Curse of Dimensionality","date":"October 15, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d Once a week you\u2019ll see reader submitted questions of varying levels of technical detail answered by a practicing data scientist \u2013 sometimes by me and other times by an Intel data scientist. This week\u2019s question\u2026","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/12294"}],"collection":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/comments?post=12294"}],"version-history":[{"count":0,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/12294\/revisions"}],"wp:attachment":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/media?parent=12294"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/categories?post=12294"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/tags?post=12294"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}
","templated":true}]}}