{"id":12345,"date":"2014-11-12T06:00:39","date_gmt":"2014-11-12T14:00:39","guid":{"rendered":"http:\/\/insidebigdata.com\/?p=12345"},"modified":"2014-11-12T08:38:20","modified_gmt":"2014-11-12T16:38:20","slug":"ask-data-scientist-data-science-process","status":"publish","type":"post","link":"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/","title":{"rendered":"Ask a Data Scientist: The Data Science Process"},"content":{"rendered":"<p><a href=\"http:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/04\/datascientist2_featured.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignright size-full wp-image-8806\" src=\"https:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/04\/datascientist2_featured.jpg\" alt=\"datascientist2_featured\" width=\"195\" height=\"166\" \/><\/a>Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d Once a week you\u2019ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist \u2013 sometimes by me and other times by an Intel data scientist. Think of this new insideBIGDATA feature as a valuable resource for you to get up to speed in this flourishing area of technology. If you have a big data question you\u2019d like answered, please just enter a comment below, or send an e-mail to me at: <a href=\"mailto:daniel@insidehpc.com\" target=\"_blank\">daniel@insidehpc.com<\/a>. This week\u2019s question is from a reader who wonders if there is a general process for conducting data science projects.<\/p>\n<p><strong>Q:<\/strong> <strong>Is there a typical &#8220;data science process?&#8221;<\/strong><\/p>\n<p><strong>A:<\/strong> Great question! There is most definitely a general formula that data scientists follow in striving to apply best practices to a data science project. 
The figure below encapsulates the high-level steps in the so-called \u201cdata science pipeline\u201d that contribute to the success of a project: understanding the goal of the project, data access, data munging, exploratory data analysis, feature engineering, model selection, model validation, deployment, visualization, and communication of the results.<\/p>\n<p><a href=\"http:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/DataScienceProcess.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-12351\" src=\"https:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/DataScienceProcess.jpg\" alt=\"DataScienceProcess\" width=\"582\" height=\"601\" srcset=\"https:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/DataScienceProcess.jpg 582w, https:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/DataScienceProcess-290x300.jpg 290w\" sizes=\"(max-width: 582px) 100vw, 582px\" \/><\/a><\/p>\n<p>Data science involves understanding and preparing the data, defining the statistical learning model, and following the data science process. Statistical learning models can assume many shapes and sizes, depending on their complexity and the application for which they are designed. The first step is to understand what questions you are trying to answer for your organization. The level of detail and complexity of your questions will increase as you become more comfortable with the analytic process. The most important steps in the data science process are as follows:<\/p>\n<ul>\n<li>Define the project outcomes and deliverables, state the scope of the effort, establish business objectives, and identify the data sets to be used.<\/li>\n<li>Undertake data collection and data understanding. 
Some data scientists believe that domain knowledge is superfluous, but in my experience, having a domain expert available to consult with can be an important factor for success.<\/li>\n<li>Perform data munging \u2013 the process of inspecting, cleaning, and transforming the data.<\/li>\n<li>Utilize techniques of exploratory data analysis (EDA) \u2013 use graphical techniques to discover useful information and arrive at conclusions. Validate assumptions and test hypotheses using standard statistical methods.<\/li>\n<li>Apply statistical modeling principles to automatically create accurate predictive models.<\/li>\n<li>Evaluate the model to verify the robustness of the chosen model and make mid-course corrections. Test models on existing data and apply predictions to new data.<\/li>\n<li>Select a deployment option to open up the analytical results to everyday decision making and to get results by automating the decisions based on the modeling.<\/li>\n<\/ul>\n<p>Each of the above steps can be considered iterative and may be revisited as needed. The data munging step is often very time-consuming, depending on the cleanliness of the incoming data, and can take up to 70% of the overall project timeline.<\/p>\n<p>If you have a question you\u2019d like answered, please just enter a comment below, or send an e-mail to me at:\u00a0<a href=\"mailto:daniel@insidehpc.com\" target=\"_blank\">daniel@insidehpc.com<\/a>.<\/p>\n<p><strong>Data Scientist:<\/strong>\u00a0Daniel D. 
Gutierrez \u2013 Managing Editor, insideBIGDATA<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d This week\u2019s question is from a reader who wonders if there is a general process for conducting data science projects.<\/p>\n","protected":false},"author":37,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"footnotes":""},"categories":[239,182,87,180,210,67,56,1],"tags":[96],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.6 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Ask a Data Scientist: The Data Science Process - insideBIGDATA<\/title>\n<meta name=\"description\" content=\"Ask a Data Scientist: The Data Science Process\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Ask a Data Scientist: The Data Science Process - insideBIGDATA\" \/>\n<meta property=\"og:description\" content=\"Ask a Data Scientist: The Data Science Process\" \/>\n<meta property=\"og:url\" content=\"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/\" \/>\n<meta property=\"og:site_name\" content=\"insideBIGDATA\" \/>\n<meta property=\"article:publisher\" content=\"http:\/\/www.facebook.com\/insidebigdata\" \/>\n<meta property=\"article:published_time\" content=\"2014-11-12T14:00:39+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2014-11-12T16:38:20+00:00\" \/>\n<meta property=\"og:image\" 
content=\"http:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/04\/datascientist2_featured.jpg\" \/>\n<meta name=\"author\" content=\"Daniel Gutierrez\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@AMULETAnalytics\" \/>\n<meta name=\"twitter:site\" content=\"@insideBigData\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Daniel Gutierrez\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/\",\"url\":\"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/\",\"name\":\"Ask a Data Scientist: The Data Science Process - insideBIGDATA\",\"isPartOf\":{\"@id\":\"https:\/\/insidebigdata.com\/#website\"},\"datePublished\":\"2014-11-12T14:00:39+00:00\",\"dateModified\":\"2014-11-12T16:38:20+00:00\",\"author\":{\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed\"},\"description\":\"Ask a Data Scientist: The Data Science Process\",\"breadcrumb\":{\"@id\":\"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/insidebigdata.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Ask a Data Scientist: The Data Science 
Process\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/insidebigdata.com\/#website\",\"url\":\"https:\/\/insidebigdata.com\/\",\"name\":\"insideBIGDATA\",\"description\":\"Your Source for AI, Data Science, Deep Learning &amp; Machine Learning Strategies\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/insidebigdata.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed\",\"name\":\"Daniel Gutierrez\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g\",\"caption\":\"Daniel Gutierrez\"},\"description\":\"Daniel D. Gutierrez is a Data Scientist with Los Angeles-based AMULET Analytics, a service division of AMULET Development Corp. He's been involved with data science and Big Data long before it came in vogue, so imagine his delight when the Harvard Business Review recently deemed \\\"data scientist\\\" as the sexiest profession for the 21st century. Previously, he taught computer science and database classes at UCLA Extension for over 15 years, and authored three computer industry books on database technology. He also served as technical editor, columnist and writer at a major computer industry monthly publication for 7 years. Follow his data science musings at @AMULETAnalytics.\",\"sameAs\":[\"http:\/\/www.insidebigdata.com\",\"https:\/\/twitter.com\/@AMULETAnalytics\"],\"url\":\"https:\/\/insidebigdata.com\/author\/dangutierrez\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Ask a Data Scientist: The Data Science Process - insideBIGDATA","description":"Ask a Data Scientist: The Data Science Process","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/","og_locale":"en_US","og_type":"article","og_title":"Ask a Data Scientist: The Data Science Process - insideBIGDATA","og_description":"Ask a Data Scientist: The Data Science Process","og_url":"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/","og_site_name":"insideBIGDATA","article_publisher":"http:\/\/www.facebook.com\/insidebigdata","article_published_time":"2014-11-12T14:00:39+00:00","article_modified_time":"2014-11-12T16:38:20+00:00","og_image":[{"url":"http:\/\/insidebigdata.com\/wp-content\/uploads\/2014\/04\/datascientist2_featured.jpg"}],"author":"Daniel Gutierrez","twitter_card":"summary_large_image","twitter_creator":"@AMULETAnalytics","twitter_site":"@insideBigData","twitter_misc":{"Written by":"Daniel Gutierrez","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/","url":"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/","name":"Ask a Data Scientist: The Data Science Process - insideBIGDATA","isPartOf":{"@id":"https:\/\/insidebigdata.com\/#website"},"datePublished":"2014-11-12T14:00:39+00:00","dateModified":"2014-11-12T16:38:20+00:00","author":{"@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed"},"description":"Ask a Data Scientist: The Data Science Process","breadcrumb":{"@id":"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/insidebigdata.com\/2014\/11\/12\/ask-data-scientist-data-science-process\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/insidebigdata.com\/"},{"@type":"ListItem","position":2,"name":"Ask a Data Scientist: The Data Science Process"}]},{"@type":"WebSite","@id":"https:\/\/insidebigdata.com\/#website","url":"https:\/\/insidebigdata.com\/","name":"insideBIGDATA","description":"Your Source for AI, Data Science, Deep Learning &amp; Machine Learning Strategies","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/insidebigdata.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed","name":"Daniel 
Gutierrez","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g","caption":"Daniel Gutierrez"},"description":"Daniel D. Gutierrez is a Data Scientist with Los Angeles-based AMULET Analytics, a service division of AMULET Development Corp. He's been involved with data science and Big Data long before it came in vogue, so imagine his delight when the Harvard Business Review recently deemed \"data scientist\" as the sexiest profession for the 21st century. Previously, he taught computer science and database classes at UCLA Extension for over 15 years, and authored three computer industry books on database technology. He also served as technical editor, columnist and writer at a major computer industry monthly publication for 7 years. Follow his data science musings at @AMULETAnalytics.","sameAs":["http:\/\/www.insidebigdata.com","https:\/\/twitter.com\/@AMULETAnalytics"],"url":"https:\/\/insidebigdata.com\/author\/dangutierrez\/"}]}},"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p9eA3j-3d7","jetpack-related-posts":[{"id":20645,"url":"https:\/\/insidebigdata.com\/2018\/06\/30\/insidebigdata-ask-data-scientist-series\/","url_meta":{"origin":12345,"position":0},"title":"insideBIGDATA &#8220;Ask a Data Scientist&#8221; Series","date":"June 30, 2018","format":false,"excerpt":"Welcome to the series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist\u201d from insideBIGDATA's popular Data Science 101 channel. These articles constitute many of our site's most popular resources for newbie data scientists. 
The 12 articles listed below were from reader submitted questions of varying levels of technical\u2026","rel":"","context":"In &quot;Data Science&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2018\/06\/data-scientist-300x300_insidebigdata.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":12352,"url":"https:\/\/insidebigdata.com\/2014\/11\/19\/ask-data-scientist-unsupervised-learning\/","url_meta":{"origin":12345,"position":1},"title":"Ask a Data Scientist: Unsupervised Learning","date":"November 19, 2014","format":false,"excerpt":"Welcome back to the \u201cAsk a Data Scientist\u201d article series. This week\u2019s question is from a reader who asks for an overview of unsupervised machine learning.","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2014\/11\/awwicker.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":12045,"url":"https:\/\/insidebigdata.com\/2014\/10\/15\/ask-data-scientist-curse-dimensionality\/","url_meta":{"origin":12345,"position":2},"title":"Ask a Data Scientist: Curse of Dimensionality","date":"October 15, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d Once a week you\u2019ll see reader submitted questions of varying levels of technical detail answered by a practicing data scientist \u2013 sometimes by me and other times by an Intel data scientist. 
This week\u2019s question\u2026","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12294,"url":"https:\/\/insidebigdata.com\/2014\/10\/29\/ask-data-scientist-handling-missing-data\/","url_meta":{"origin":12345,"position":3},"title":"Ask a Data Scientist: Handling Missing Data","date":"October 29, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d This week\u2019s question is from a reader who seeks a discussion of missing data handling methods such as imputation.","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12317,"url":"https:\/\/insidebigdata.com\/2014\/11\/09\/ask-data-scientist-importance-exploratory-data-analysis\/","url_meta":{"origin":12345,"position":4},"title":"Ask a Data Scientist: The Importance of Exploratory Data Analysis","date":"November 9, 2014","format":false,"excerpt":"Q: What is the role of exploratory data analysis in data science?","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":12555,"url":"https:\/\/insidebigdata.com\/2014\/12\/31\/ask-data-scientist-becoming-data-scientist\/","url_meta":{"origin":12345,"position":5},"title":"Ask a Data Scientist: Becoming a Data Scientist","date":"December 31, 2014","format":false,"excerpt":"Welcome back to our series of articles sponsored by Intel \u2013 \u201cAsk a Data Scientist.\u201d This is the last of our reader submitted questions of varying levels of technical detail answered by a practicing data scientist. 
This week\u2019s question is from a reader who asks about becoming a data scientist.","rel":"","context":"In &quot;Ask a Data Scientist&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2018\/06\/data-scientist-300x300_insidebigdata.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/12345"}],"collection":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/users\/37"}],"replies":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/comments?post=12345"}],"version-history":[{"count":0,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/12345\/revisions"}],"wp:attachment":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/media?parent=12345"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/categories?post=12345"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/tags?post=12345"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}