If you ever need to remove duplicate rows from a SAS dataset, here’s an approach I use quite often.
Get your data.
Let’s assume it’s in the following format:
–
ID | Name |
123 | John |
456 | Bob |
123 | John |
Sort your data.
–
/* Step 1 - Sort data */
proc sort data=my_lib.my_dataset;
    /* Sort by a field which you want to be unique, and which will be the same for duplicate rows */
    by id;
run;
Which should give you the following:
–
ID | Name |
123 | John |
123 | John |
456 | Bob |
–
Remove Duplicates
Now that the data is in order, we can remove the duplicates, by only ever keeping the first entry which matches our unique ID.
–
/* Step 2 - Get rid of duplicates */
data my_lib.my_dataset;
    /* Iterate through this dataset row by row */
    set my_lib.my_dataset;
    /* Grouping each row by the field we sorted on */
    by id;
    /* And only keep a row if it’s the first */
    if first.id;
run;
–
ID | Name |
123 | John |
456 | Bob |
–
Tadaa!
What happened there?
This approach has 3 facets:
- Grouping
- SAS’ special first.variable
- SAS’ subsetting IF: a bare if with a condition and no then, which only lets a row be appended (or “outputted”) to the dataset when the condition is true.
–
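Grouping: Sorting rearranged our data so that all identical IDs sit next to each other. Given that the rows are identical and you only want to keep one of them, we choose to keep the first of each group. Bear in mind that this keeps one row per ID, so if two rows share an ID but differ in another column, only the first one survives.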
First.variable: During execution, SAS iterates through each row of my_dataset, adding it to a new dataset (which eventually replaces my_dataset). Because of the by id statement, SAS sets first.id to true (1) whenever a row starts a new group of ID values (for example, the first 123). On the next iteration the ID is still 123, so first.id is set to false (0). This gives us a handy flag which is only ever true on the first row of each ID.
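To see the flag for yourself, here is a minimal sketch that copies first.id into an ordinary column. The WORK library, the dataset names people and flags, and the is_first variable are all made up for illustration. The copy is needed because first.id is a temporary variable that SAS does not write to the output dataset.

```sas
/* Hypothetical sample data matching the table at the top of this post */
data work.people;
    input id name $;
    datalines;
123 John
456 Bob
123 John
;
run;

/* BY-group processing needs the data sorted by the BY variable */
proc sort data=work.people;
    by id;
run;

data work.flags;
    set work.people;
    by id;
    /* first.id is temporary, so copy it into a real variable to keep it */
    is_first = first.id;
run;

proc print data=work.flags;
run;
```

With the sample rows above, is_first should come out as 1 for the first 123 row, 0 for the second 123 row, and 1 for the 456 row.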
Funny Statement Stuff: So how do we tell SAS to keep the row only when this flag is true? Normally SAS outputs the current row automatically when it reaches the end of the data step. An if with a condition but no then (SAS calls this a subsetting IF) changes that: if the condition evaluates to false, SAS stops processing that row straight away and moves on to the next one without outputting it, i.e. in this case, it doesn’t keep it.
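If the bare if reads as too magical, the same filter can be spelled out with an explicit delete. This is only a sketch of an equivalent form, again using made-up WORK dataset names (people, deduped) rather than anything from the steps above.

```sas
/* Hypothetical sample data, already in ID order so no sort is needed here */
data work.people;
    input id name $;
    datalines;
123 John
123 John
456 Bob
;
run;

data work.deduped;
    set work.people;
    by id;
    /* Same effect as "if first.id;": drop every row that is not first in its group */
    if not first.id then delete;
run;
```

Writing if first.id then output; instead would also work, since an explicit output statement switches off the automatic output at the end of the step.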
So in simple terms, we’re saying – if you’ve seen this value before, don’t save it again.
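For anyone who wants to try the whole thing without touching real data, here is a rough end-to-end sketch. It assumes the WORK library and invented dataset names (my_dataset, sorted, deduped), and unlike the steps above it writes to new datasets rather than overwriting the original, so you can compare before and after.

```sas
/* Hypothetical sample data matching the table at the top of this post */
data work.my_dataset;
    input id name $;
    datalines;
123 John
456 Bob
123 John
;
run;

/* Step 1 - Sort, writing to a separate dataset via out= */
proc sort data=work.my_dataset out=work.sorted;
    by id;
run;

/* Step 2 - Keep only the first row of each ID */
data work.deduped;
    set work.sorted;
    by id;
    if first.id;
run;

/* Should list one row for 123 and one for 456 */
proc print data=work.deduped;
run;
```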