Programming SAS

Removing duplicate rows in base SAS

If you ever need to remove duplicate rows from a SAS dataset, here’s an approach I use quite often.

Get your data.

Let’s assume it’s in the following format:

ID Name
123 John
456 Bob
123 John
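
If you want to follow along, here is a minimal sketch for building that sample table. It assumes a throwaway dataset called work.my_dataset in the WORK library; that name is my own stand-in, so swap in your own library and dataset.

```sas
/* Hypothetical sample data; work.my_dataset is an assumed name */
data work.my_dataset;
   /* id is read as numeric, name as character (the $ marks it) */
   input id name $;
   datalines;
123 John
456 Bob
123 John
;
run;
```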

Sort your data.

/* Step 1 - Sort data */
proc sort data=my_lib.my_dataset;

   /* Sort by a field which you want to be unique, 
   and which will be the same for duplicate rows */
   by id;
run;

Which should give you the following:

ID Name
123 John
123 John
456 Bob

Remove the duplicates.

Now that the data is in order, we can remove the duplicates by keeping only the first row for each ID.

/* Step 2 - Get rid of duplicates */
data my_lib.my_dataset;

   /* Iterate through this dataset row by row */
   set my_lib.my_dataset;
   /* Grouping each row by the field we sorted on */ 
   by id; 
   /* And only keep a row if it’s the first */
   if first.id;
run;

Which leaves you with:

ID Name
123 John
456 Bob


What happened there?

This approach has 3 facets:

  1. Grouping
  2. SAS’ special first.variable
  3. SAS’ subsetting IF statement, which only appends (or “outputs”) a row to a dataset when a bare condition (an if with no then) evaluates to true.

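Grouping: Sorting rearranged our data so that all rows with the same ID sit next to each other. The by statement in the data step then tells SAS to treat each run of identical IDs as a group. Given that the rows in a group are identical and you only want to keep one of them, we choose to keep the first of each group.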

First.variable: During execution, SAS iterates through each row of my_dataset and adds it to a new dataset (which eventually replaces my_dataset). Because of the by id statement, SAS also maintains a temporary variable called first.id. When a row is the first in its group, i.e. the first row with a given ID value (for example, 123), first.id is set to true (1). On the next row, because the previous row had the same value 123, first.id is set to false (0). This gives us a handy flag which is only ever switched on for the first row of each ID.
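
To see the flag for yourself, a small sketch like the one below can help. SAS doesn’t write first.id (or its sibling last.id) to the output dataset, so this copies them into ordinary variables. The dataset names work.my_dataset and work.flags and the variable names is_first and is_last are all made up for illustration.

```sas
/* Hypothetical sample data, sorted by the key */
data work.my_dataset;
   input id name $;
   datalines;
123 John
456 Bob
123 John
;
run;

proc sort data=work.my_dataset;
   by id;
run;

/* Copy the temporary BY-group flags into real variables */
data work.flags;
   set work.my_dataset;
   by id;
   is_first = first.id;   /* 1 on the first row of each id, else 0 */
   is_last  = last.id;    /* 1 on the last row of each id, else 0 */
run;

proc print data=work.flags;
run;
```

With the three sample rows, is_first should come out as 1, 0, 1 and is_last as 0, 1, 1, in sorted order.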

Funny Statement Stuff: So how do we tell SAS to keep the row when first.id is true? A data step normally outputs every row automatically at the end of each iteration. If we write a bare condition (an if with no then, which SAS calls a subsetting IF) and it evaluates to false, SAS stops processing that row and doesn’t output it, i.e. in this case, it doesn’t keep it.
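
If the bare if looks too magic, it can be spelled out with an explicit delete instead. The sketch below shows both forms side by side; as far as I know they keep exactly the same rows. All dataset names here (work.my_dataset, work.keep_first, work.keep_first_explicit) are invented for the example.

```sas
/* Hypothetical sample data, sorted by the key */
data work.my_dataset;
   input id name $;
   datalines;
123 John
456 Bob
123 John
;
run;

proc sort data=work.my_dataset;
   by id;
run;

/* Subsetting IF: the row is only output when the condition is true */
data work.keep_first;
   set work.my_dataset;
   by id;
   if first.id;
run;

/* Same idea written explicitly: drop every row that is not the first */
data work.keep_first_explicit;
   set work.my_dataset;
   by id;
   if not first.id then delete;
run;
```

The explicit version is longer but arguably easier to read if you’re new to SAS.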

So in simple terms, we’re saying – if you’ve seen this value before, don’t save it again.


How to remove security from a PDF

I recently sat through an eLearning course which involved reading through a 1300-page PDF file. Instead of deforesting the Glasgow area to provide the requisite amount of paper for printing, I thought I’d read and annotate it on-screen (something I wouldn’t recommend unless it’s to save the environment).

Annoyingly, annotation is prevented on PDFs with security applied, so I had to find a way around it.

First, note that if you Google it, you’ll find several tools and methods for removing security from a PDF, but they are a slow, frustrating hassle. I wasted a lot of time on crazy methods involving re-printing the whole thing to virtual PDF printers, etc. Trust me, the method below is the only sane way to go about it.

So… how do I remove security from a PDF?

  1. Open your PDF in Google Chrome
  2. Save the PDF from Google Chrome

Tadaa. Unbelievably simple, but very useful.