{"id":29230,"date":"2022-05-06T06:00:00","date_gmt":"2022-05-06T13:00:00","guid":{"rendered":"https:\/\/insidebigdata.com\/?p=29230"},"modified":"2023-06-23T12:41:24","modified_gmt":"2023-06-23T19:41:24","slug":"research-highlights-transformer-feed-forward-layers-are-key-value-memories","status":"publish","type":"post","link":"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/","title":{"rendered":"Research Highlights: Transformer Feed-Forward Layers Are Key-Value Memories"},"content":{"rendered":"\n<p><strong>Title of Paper:<\/strong> <a href=\"https:\/\/arxiv.org\/pdf\/2012.14913.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Transformer Feed-Forward Layers Are Key-Value Memories<\/a><\/p>\n\n\n\n<p><strong>Author(s):<\/strong> Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy<\/p>\n\n\n\n<p><strong>Abstract<\/strong>: Feed-forward layers constitute two-thirds of a transformer model&#8217;s parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. Our experiments show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones. The values complement the keys&#8217; input patterns by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern, particularly in the upper layers. Finally, we demonstrate that the output of a feed-forward layer is a composition of its memories, which is subsequently refined throughout the model&#8217;s layers via residual connections to produce the final output distribution.<\/p>\n\n\n<div class=\"wp-block-image is-style-default\">\n<figure class=\"aligncenter size-large\"><img decoding=\"async\" loading=\"lazy\" width=\"462\" height=\"669\" src=\"https:\/\/insidebigdata.com\/wp-content\/uploads\/2022\/05\/Research_highlights_3.png\" alt=\"\" class=\"wp-image-29243\" srcset=\"https:\/\/insidebigdata.com\/wp-content\/uploads\/2022\/05\/Research_highlights_3.png 462w, https:\/\/insidebigdata.com\/wp-content\/uploads\/2022\/05\/Research_highlights_3-104x150.png 104w, https:\/\/insidebigdata.com\/wp-content\/uploads\/2022\/05\/Research_highlights_3-207x300.png 207w\" sizes=\"(max-width: 462px) 100vw, 462px\" \/><\/figure><\/div>\n\n\n<p><em>Sign up for the free insideBIGDATA&nbsp;<a rel=\"noreferrer noopener\" href=\"http:\/\/insidebigdata.com\/newsletter\/\" target=\"_blank\">newsletter<\/a>.<\/em><\/p>\n\n\n\n<p><em>Join us on Twitter:&nbsp;@InsideBigData1 \u2013 <a href=\"https:\/\/twitter.com\/InsideBigData1\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/twitter.com\/InsideBigData1<\/a><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this regular column, we take a look at highlights for important research topics of the day for big data, data science, machine learning, AI and deep learning. It\u2019s important to keep connected with the research arm of the field in order to see where we\u2019re headed. In this edition, if you (like me) have wondered what the feed-forward layers in transformer models are actually doing, this is a pretty interesting paper on that topic. 
Sign up for the free insideBIGDATA newsletter: http://insidebigdata.com/newsletter/

Join us on Twitter: @InsideBigData1 – https://twitter.com/InsideBigData1
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/\",\"url\":\"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/\",\"name\":\"Research Highlights: Transformer Feed-Forward Layers Are Key-Value Memories - insideBIGDATA\",\"isPartOf\":{\"@id\":\"https:\/\/insidebigdata.com\/#website\"},\"datePublished\":\"2022-05-06T13:00:00+00:00\",\"dateModified\":\"2023-06-23T19:41:24+00:00\",\"author\":{\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed\"},\"breadcrumb\":{\"@id\":\"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/insidebigdata.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Research Highlights: Transformer Feed-Forward Layers Are Key-Value Memories\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/insidebigdata.com\/#website\",\"url\":\"https:\/\/insidebigdata.com\/\",\"name\":\"insideBIGDATA\",\"description\":\"Your Source for AI, Data Science, Deep Learning &amp; Machine Learning Strategies\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/insidebigdata.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed\",\"name\":\"Daniel Gutierrez\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g\",\"caption\":\"Daniel Gutierrez\"},\"description\":\"Daniel D. Gutierrez is a Data Scientist with Los Angeles-based AMULET Analytics, a service division of AMULET Development Corp. He's been involved with data science and Big Data long before it came in vogue, so imagine his delight when the Harvard Business Review recently deemed \\\"data scientist\\\" as the sexiest profession for the 21st century. Previously, he taught computer science and database classes at UCLA Extension for over 15 years, and authored three computer industry books on database technology. He also served as technical editor, columnist and writer at a major computer industry monthly publication for 7 years. Follow his data science musings at @AMULETAnalytics.\",\"sameAs\":[\"http:\/\/www.insidebigdata.com\",\"https:\/\/twitter.com\/@AMULETAnalytics\"],\"url\":\"https:\/\/insidebigdata.com\/author\/dangutierrez\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Research Highlights: Transformer Feed-Forward Layers Are Key-Value Memories - insideBIGDATA","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/","og_locale":"en_US","og_type":"article","og_title":"Research Highlights: Transformer Feed-Forward Layers Are Key-Value Memories - insideBIGDATA","og_description":"In this regular column, we take a look at highlights for important research topics of the day for big data, data science, machine learning, AI and deep learning. It\u2019s important to keep connected with the research arm of the field in order to see where we\u2019re headed. In this edition, if you (like me) have wondered what the feed-forward layers in transformer models are actually doing, this is a pretty interesting paper on that topic. Enjoy!","og_url":"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/","og_site_name":"insideBIGDATA","article_publisher":"http:\/\/www.facebook.com\/insidebigdata","article_published_time":"2022-05-06T13:00:00+00:00","article_modified_time":"2023-06-23T19:41:24+00:00","og_image":[{"width":300,"height":200,"url":"https:\/\/insidebigdata.com\/wp-content\/uploads\/2019\/06\/Data-Scientist-shutterstock_768047488.jpg","type":"image\/jpeg"}],"author":"Daniel Gutierrez","twitter_card":"summary_large_image","twitter_creator":"@AMULETAnalytics","twitter_site":"@insideBigData","twitter_misc":{"Written by":"Daniel Gutierrez","Est. reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/","url":"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/","name":"Research Highlights: Transformer Feed-Forward Layers Are Key-Value Memories - insideBIGDATA","isPartOf":{"@id":"https:\/\/insidebigdata.com\/#website"},"datePublished":"2022-05-06T13:00:00+00:00","dateModified":"2023-06-23T19:41:24+00:00","author":{"@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed"},"breadcrumb":{"@id":"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/insidebigdata.com\/2022\/05\/06\/research-highlights-transformer-feed-forward-layers-are-key-value-memories\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/insidebigdata.com\/"},{"@type":"ListItem","position":2,"name":"Research Highlights: Transformer Feed-Forward Layers Are Key-Value Memories"}]},{"@type":"WebSite","@id":"https:\/\/insidebigdata.com\/#website","url":"https:\/\/insidebigdata.com\/","name":"insideBIGDATA","description":"Your Source for AI, Data Science, Deep Learning &amp; Machine Learning 
Strategies","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/insidebigdata.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/2540da209c83a68f4f5922848f7376ed","name":"Daniel Gutierrez","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/insidebigdata.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/5780282e7e567e2a502233e948464542?s=96&d=mm&r=g","caption":"Daniel Gutierrez"},"description":"Daniel D. Gutierrez is a Data Scientist with Los Angeles-based AMULET Analytics, a service division of AMULET Development Corp. He's been involved with data science and Big Data long before it came in vogue, so imagine his delight when the Harvard Business Review recently deemed \"data scientist\" as the sexiest profession for the 21st century. Previously, he taught computer science and database classes at UCLA Extension for over 15 years, and authored three computer industry books on database technology. He also served as technical editor, columnist and writer at a major computer industry monthly publication for 7 years. Follow his data science musings at @AMULETAnalytics.","sameAs":["http:\/\/www.insidebigdata.com","https:\/\/twitter.com\/@AMULETAnalytics"],"url":"https:\/\/insidebigdata.com\/author\/dangutierrez\/"}]}},"jetpack_featured_media_url":"https:\/\/insidebigdata.com\/wp-content\/uploads\/2019\/06\/Data-Scientist-shutterstock_768047488.jpg","jetpack_shortlink":"https:\/\/wp.me\/p9eA3j-7Bs","jetpack-related-posts":[{"id":23163,"url":"https:\/\/insidebigdata.com\/2019\/08\/28\/best-of-arxiv-org-for-ai-machine-learning-and-deep-learning-july-2019\/","url_meta":{"origin":29230,"position":0},"title":"Best of arXiv.org for AI, Machine Learning, and Deep Learning \u2013 July 2019","date":"August 28, 2019","format":false,"excerpt":"In this recurring monthly feature, we will filter all the recent research papers appearing in the arXiv.org preprint server for subjects relating to AI, machine learning and deep learning \u2013 from disciplines including statistics, mathematics and computer science \u2013 and provide you with a useful \u201cbest of\u201d list for the\u2026","rel":"","context":"In &quot;AI Deep Learning&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2013\/12\/arxiv.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":24721,"url":"https:\/\/insidebigdata.com\/2020\/07\/11\/research-highlights-exbert\/","url_meta":{"origin":29230,"position":1},"title":"Research Highlights: ExBERT","date":"July 11, 2020","format":false,"excerpt":"In the insideBIGDATA Research Highlights column we take a look at new and upcoming results from the research community for data science, machine learning, AI and deep learning. 
Our readers need to get a glimpse for technology coming down the pipeline that will make their efforts more strategic and competitive.\u2026","rel":"","context":"In &quot;AI Deep Learning&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2020\/07\/exBERT_fig.png?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":29034,"url":"https:\/\/insidebigdata.com\/2022\/04\/13\/habana-labs-and-hugging-face-partner-to-accelerate-transformer-model-training\/","url_meta":{"origin":29230,"position":2},"title":"Habana Labs and Hugging Face Partner to Accelerate Transformer Model Training","date":"April 13, 2022","format":false,"excerpt":"Habana\u00ae Labs, a pioneer in high-efficiency, purpose-built deep learning processors, and Hugging Face, the home of Transformer models, announced that they\u2019re joining forces to make it easier and quicker to train high-quality transformer models. Thanks to the integration of Habana\u2019s SynapseAI software suite with the Hugging Face Optimum open-source library,\u2026","rel":"","context":"In &quot;AI Deep Learning&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":31764,"url":"https:\/\/insidebigdata.com\/2023\/03\/03\/research-highlights-sparsegpt-prune-llms-accurately-in-one-shot\/","url_meta":{"origin":29230,"position":3},"title":"Research Highlights: SparseGPT: Prune LLMs Accurately in One-Shot","date":"March 3, 2023","format":false,"excerpt":"A new research paper shows that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive\u2026","rel":"","context":"In &quot;AI Deep Learning&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2023\/02\/Neural_Magic_paper_fig.png?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":31639,"url":"https:\/\/insidebigdata.com\/2023\/02\/15\/video-highlights-attention-is-all-you-need-paper-explained\/","url_meta":{"origin":29230,"position":4},"title":"Video Highlights: Attention Is All You Need &#8211; Paper Explained","date":"February 15, 2023","format":false,"excerpt":"In this video presentation, Mohammad Namvarpour presents a comprehensive study on Ashish Vaswani and his coauthors' renowned paper, \u201cAttention Is All You Need.\u201d This paper is a major turning point in deep learning research. The transformer architecture, which was introduced in this paper, is now used in a variety of\u2026","rel":"","context":"In &quot;AI Deep Learning&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2019\/10\/NLP_shutterstock_299138114.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":32878,"url":"https:\/\/insidebigdata.com\/2023\/07\/25\/video-highlights-generative-ai-with-large-language-models\/","url_meta":{"origin":29230,"position":5},"title":"Video Highlights: Generative AI with Large Language Models","date":"July 25, 2023","format":false,"excerpt":"At an unprecedented pace, Large Language Models like GPT-4 are transforming the world in general and the field of data science in particular. 
This two-hour training video presentation by Jon Krohn, Co-Founder and Chief Data Scientist at the machine learning company Nebula, introduces deep learning transformer architectures including LLMs.","rel":"","context":"In &quot;AI Deep Learning&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/insidebigdata.com\/wp-content\/uploads\/2023\/06\/GenerativeAI_shutterstock_2313909647_special.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/29230"}],"collection":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/users\/37"}],"replies":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/comments?post=29230"}],"version-history":[{"count":0,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/posts\/29230\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/media\/22835"}],"wp:attachment":[{"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/media?parent=29230"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/categories?post=29230"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/insidebigdata.com\/wp-json\/wp\/v2\/tags?post=29230"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}