{"id":364,"date":"2014-08-12T18:17:29","date_gmt":"2014-08-12T17:17:29","guid":{"rendered":"http:\/\/www.sebcharrot.com\/blog\/?p=364"},"modified":"2014-08-12T18:17:29","modified_gmt":"2014-08-12T17:17:29","slug":"removing-duplicate-rows-sas","status":"publish","type":"post","link":"http:\/\/www.sebcharrot.com\/blog\/removing-duplicate-rows-sas\/","title":{"rendered":"Removing duplicate rows in base SAS"},"content":{"rendered":"<p>If you ever need to remove duplicate rows from a SAS dataset, here&#8217;s an approach I use quite often.<\/p>\n<h2>Get your data.<\/h2>\n<p>Let&#8217;s assume it&#8217;s in the following format:<br \/>\n<span style=\"color: rgb(255, 255, 255);\">&#8211;<\/span><\/p>\n<table style=\"height: 90px;\" border=\"1\" width=\"287\">\n<tbody>\n<tr>\n<td><strong>ID<\/strong><\/td>\n<td><strong>Name<\/strong><\/td>\n<\/tr>\n<tr>\n<td>123<\/td>\n<td>John<\/td>\n<\/tr>\n<tr>\n<td>456<\/td>\n<td>Bob<\/td>\n<\/tr>\n<tr>\n<td>123<\/td>\n<td>John<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Sort your\u00a0data.<br \/>\n<span style=\"color: rgb(255, 255, 255);\">&#8211;<\/span><\/h2>\n<pre><span style=\"color: #339966;\">\/* Step 1 - Sort data *\/<\/span>\nproc sort data=my_lib.my_dataset;\n\n<span style=\"color: #339966;\">   \/* Sort by a field which you want to be unique, \n   and which will be the same for duplicate rows *\/<\/span>\n   by id; \n\nrun;<\/pre>\n<p>Which should give you the following:<br \/>\n<span style=\"color: rgb(255, 255, 255);\">&#8211;<\/span><\/p>\n<table style=\"height: 90px;\" border=\"1\" width=\"287\">\n<tbody>\n<tr>\n<td><strong>ID<\/strong><\/td>\n<td><strong>Name<\/strong><\/td>\n<\/tr>\n<tr>\n<td>123<\/td>\n<td>John<\/td>\n<\/tr>\n<tr>\n<td>123<\/td>\n<td>John<\/td>\n<\/tr>\n<tr>\n<td>456<\/td>\n<td>Bob<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2><span style=\"color: rgb(255, 255, 255);\">&#8211;<\/span><br \/>\nRemove Duplicates<\/h2>\n<p>Now that the data is in order, we can remove the duplicates, by only ever keeping the first entry which matches our unique ID.<br \/>\n<span style=\"color: #ffffff;\">-.<\/span><\/p>\n<pre><span style=\"color: #339966;\">\/* Step 2 - Get rid of duplicates *\/<\/span>\ndata my_lib.my_dataset;\n\n<span style=\"color: #339966;\">   \/* Iterate through this dataset row by row *\/<\/span>\n   set my_lib.my_dataset;\n<span style=\"color: #339966;\">   \/* Grouping each row by the field we sorted on *\/<\/span> \n   by id; \n<span style=\"color: #339966;\">   \/* And only keep a row if it\u2019s the first *\/<\/span>\n   if first.id; \n\nrun;<\/pre>\n<p><span style=\"color: rgb(255, 255, 255);\">&#8211;<\/span><\/p>\n<table style=\"height: 90px;\" border=\"1\" width=\"287\">\n<tbody>\n<tr>\n<td><strong>ID<\/strong><\/td>\n<td><strong>Name<\/strong><\/td>\n<\/tr>\n<tr>\n<td>123<\/td>\n<td>John<\/td>\n<\/tr>\n<tr>\n<td>456<\/td>\n<td>Bob<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"color: rgb(255, 255, 255);\">&#8211;<br \/>\n<\/span>Tadaa!<\/p>\n<h2>What happened there?<\/h2>\n<p>This approach has 3 facets:<\/p>\n<ol>\n<li>Grouping<\/li>\n<li>SAS&#8217; special first.variable<\/li>\n<li>SAS&#8217; feature of only appending (or &#8220;outputting&#8221;) a row to a dataset if there are no non-assignment statements which evaluate to false.<\/li>\n<\/ol>\n<p><span style=\"color: #ffffff;\">&#8211;<\/span><br \/>\n<strong>Grouping:\u00a0<\/strong>So we effectively rearranged our data so that all identical IDs were grouped together. Given that the rows are identical and you only want to keep one of them, we choose to keep the first of each group.<\/p>\n<p><strong>First.variable:\u00a0<\/strong>During execution, SAS will iterate through each row of my_data_set and adding it to a new dataset (which it will eventually overwrite my_data_set with). During each iteration, if it hits a row with the first use of an ID value (for example, 123), it will set first.id to true. On the next run, because it&#8217;s already seen the value 123 before, first.id is set to false. This gives us a handy flag which will only ever be toggled on unique rows.<\/p>\n<p><strong>Funny Statement Stuff:\u00a0<\/strong>So how do we flag to SAS that when this value is set to true, to keep the row?\u00a0When evaluating a data row if at any point we make any floating statement (i.e. not assigning a variable, or in an if or do loop) which evaluates to false, SAS will take that as a sign that it shouldn&#8217;t output that row i.e. in this case, it shouldn&#8217;t keep it.<\/p>\n<p>So in simple terms, we&#8217;re saying &#8211; if you&#8217;ve seen this value before, don&#8217;t save it again.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you ever need to remove duplicate rows from a SAS dataset, here&#8217;s an approach I use quite often. Get your data. Let&#8217;s assume it&#8217;s in the following format: &#8211; ID Name 123 John 456 Bob 123 John Sort your\u00a0data. &#8211; \/* Step 1 &#8211; Sort data *\/ proc sort data=my_lib.my_dataset; \/* Sort by a&hellip;&nbsp;<a href=\"http:\/\/www.sebcharrot.com\/blog\/removing-duplicate-rows-sas\/\" class=\"\" rel=\"bookmark\">Read More &raquo;<span class=\"screen-reader-text\">Removing duplicate rows in base SAS<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"neve_meta_sidebar":"","neve_meta_container":"","neve_meta_enable_content_width":"","neve_meta_content_width":0,"neve_meta_title_alignment":"","neve_meta_author_avatar":"","neve_post_elements_order":"","neve_meta_disable_header":"","neve_meta_disable_footer":"","neve_meta_disable_title":""},"categories":[23,24],"tags":[70,147,151,152],"_links":{"self":[{"href":"http:\/\/www.sebcharrot.com\/blog\/wp-json\/wp\/v2\/posts\/364"}],"collection":[{"href":"http:\/\/www.sebcharrot.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.sebcharrot.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.sebcharrot.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.sebcharrot.com\/blog\/wp-json\/wp\/v2\/comments?post=364"}],"version-history":[{"count":0,"href":"http:\/\/www.sebcharrot.com\/blog\/wp-json\/wp\/v2\/posts\/364\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.sebcharrot.com\/blog\/wp-json\/wp\/v2\/media?parent=364"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.sebcharrot.com\/blog\/wp-json\/wp\/v2\/categories?post=364"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.sebcharrot.com\/blog\/wp-json\/wp\/v2\/tags?post=364"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}