Friday, May 25, 2012

SAS and VBA (1): Conditional formatting

“Traffic lighting” applies distinctive colors to any numeric variables to indicate the ranges, which is particularly important for Excel reporting. In SAS, it can be easily realized by a user defined format. For example, if I want to add yellow color attribute to all numeric variables, which are great than 60, I can create a color format and cast it toward the target Excel file created by ODS destination.

data class;
   set sashelp.class;

* Create a user defined format
proc format; 
   value range 
      60 - high = 'yellow'
      other = 'white'; 

* Apply the color format for all numeric variables
ods html file = "c:\tmp\label_excel.xls" style = minimal;
proc print data = class nobs;
   var _character_ ;
   var _numeric_ / style = [background = range.];
ods html close;
Similarly, a VBA subroutine can do the global search based on the specified ranges with a looping structure around the numeric variables. The looking of the resulting Excel files by either SAS or VBA are essentially identical.

Sub global_label()
    Dim Cell As Object
    Dim myCell As Range
    Dim myRange As Range
'   Specify selection ranges
    If TypeName(Selection) <> "Range" Then Exit Sub
    If Selection.CountLarge = 1 Then
       Set myRange = ActiveSheet.UsedRange
       Set myRange = Application.Intersect(Selection, ActiveSheet.UsedRange)
    End If
'   Only search numeric cells 
    On Error Resume Next
    Set myRange = myRange.SpecialCells(xlConstants, xlNumbers)
    If myRange Is Nothing Then Exit Sub
    On Error GoTo 0
'   Aggregate cells
    For Each Cell In myRange
        If Cell.Value > 60 Then
            If myCell Is Nothing Then
                Set myCell = Cell
                Set myCell = Application.Union(myCell, Cell)
            End If
        End If
    Next Cell

'   Label qualified cells
    If myCell Is Nothing Then
        MsgBox "No matching cell is found"
        With Selection.Interior
            .Pattern = xlSolid
            .Color = 65535
        End With
    End If
End Sub

Friday, May 18, 2012

Use the set operator UNION in PROC SQL

SQL syntax contains a few set operators, such as UNION, EXCEPT and INTERSECT. The UNION operator concatenates the results of multiple SQL queries vertically into a single table for all matching rows, which I found particularly useful in PROC SQL while using SAS to manage data. Here come two examples.

Example 1 – Transpose data
PROC SQL can transform a dataset to any desired structure, without referring to DATA step or PROC TRANSPOSE. For example, SASHELP.CLASS can be transposed from wide to long by the UNION ALL clause, and reversely from long to wide by the MAX function and the GROUP clause.

From wide to long

data wide;
   set sashelp.class;
   drop sex;

proc sql;
   create table long as
   select name, 'Age' as var, age as col1
   from wide
   union all
   select name, 'Weight' as var, weight as col1
   from wide
   union all
   select name, 'Height' as var, height as col1
   from wide
From long to wide

proc sql;
   create table wide as
   select name,
      max(case when var = 'Age' then col1 end) as Age,
      max(case when var = 'Weight' then col1 end) as Weight,
      max(case when var = 'Height' then col1 end) as Height
   from long
   group by name;
Example 2 – Aggregate data into a cube
In SAS, PROC SQL doesn’t support the ROLLUP or CUBE clause. However, we can apply multiple UNION operators to simulate such functionality. For example, we can create a cube table to list all possible summations of the ACTUAL variable by STATE, PRODUCT, YEAR in the SASHELP.PRDSAL2 dataset. Afterward we can easily query this multi-dimensional data structure to look for interesting aggregation information, without running any other aggregating procedure again.

data prdsal2;
   set sashelp.prdsal2;

proc sql;
   create table cube as
   select state,product, year, 'total by state, prodcut and year' as category,
   sum(actual) as actual
   from prdsal2
   group by state, product, year
   select state, product, ., 'total by state and prodcuct', sum(actual)
   from prdsal2
   group by state, product
   select state,'', year, 'total by state and year', sum(actual)
   from prdsal2
   group by state, year
   select '',product, year, 'total by product and year', sum(actual)
   from prdsal2
   group by product, year
   select '' ,'', year, 'total by year', sum(actual)
   from prdsal2
   group by year
   select state, '',. , 'total by state', sum(actual)
   from prdsal2   
   group by state
   select '',  product, ., 'total by product', sum(actual)
   from prdsal2
   select '', '', ., 'grand total', sum(actual)
   from prdsal2
   order by state, product, year

Thursday, May 10, 2012

Transform a SAS data set to an Excel pivot table by VBA

Pivot Table in Excel is the popular data report format, which is similar to an OLAP cube that aggregates data at any dimensions. To create a pivot table for a table with lots of columns, it usually takes 100+ drags and clicks to get job done, which is somehow annoying.

I didn't try the SAS’s Add-in for Microsoft Office or SAS Enterprise Guide yet. However, an easy solution to transform a SAS data set toward an Excel pivot table is possibly to use some VBA scripts under Excel. For example, SASHELP.PRDSAL2, which is a free data set shipped with SAS, records the furniture sales in 64 states of the three countries from 1995 to 1998, and has total 23,040 observations and 11 variables. This data set can be transformed to an Excel pivot table very quickly by two simple steps.

Step 1

In SAS, a data set can be exported toward an XLS file through ODS destination. Although it is still based on HTLM format, it can be opened by Excel.
ods html file = 'c:\tmp\prdsal2.xls' style = minimal;
proc print data = sashelp.prdsal2 noobs;
ods html close;

Step 2

Next step we click on this file, press ALT + F11 to enter VBA editor, paste the VBA code below and run it. Then the pivot table is created. The good thing about this method is that the pivot table can be replicated anywhere by such a VBA subroutine, and it is customizable for particular needs. The example was finished in Excel 2007.

Sub createPT()
'   Set storage path for the pivot table 
    myDataset = "sashelp.prdsal2"
    myFilepath = "c:\tmp\" & myDataset & "_" & Format(Date, "dd-mm-yyyy") & ".xlsx"
    Dim myPTCache As PivotCache
    Dim myPT As PivotTable

'   Delete the sheet containing the previous pivot table   
    Application.ScreenUpdating = False
    On Error Resume Next
    Application.DisplayAlerts = False
    On Error GoTo 0
'   Create the cache
    Set myPTCache = ActiveWorkbook.PivotCaches.Create( _
        SourceType:=xlDatabase, SourceData:=Range("A1").CurrentRegion)
'   Add a new sheet for the pivot table
    ActiveSheet.Name = "Pivot_Table_Sheet"
'   Create the pivot table
    Set myPT = ActiveSheet.PivotTables.Add( _
        PivotCache:=myPTCache, TableDestination:=Range("A5"))
    With myPT
        .PivotFields("COUNTRY").Orientation = xlPageField
        .PivotFields("STATE").Orientation = xlRowField
        .PivotFields("PRODTYPE").Orientation = xlRowField
        .PivotFields("PRODUCT").Orientation = xlRowField
        .PivotFields("YEAR").Orientation = xlColumnField
        .PivotFields("QUARTER").Orientation = xlColumnField
        .PivotFields("MONTH").Orientation = xlColumnField
        .PivotFields("ACTUAL").Orientation = xlDataField
        .PivotFields("PREDICT").Orientation = xlDataField
        .DataPivotField.Orientation = xlRowField
        '  Add a calculated field to compare the predicted value and the actual value
        .CalculatedFields.Add "DIFF", "=PREDICT-ACTUAL"
        .PivotFields("DIFF").Orientation = xlDataField
        ' Specify a number format
        .DataBodyRange.NumberFormat = "$#, ##0.00"
        ' Apply a style for pivot table
        .TableStyle2 = "PivotStyleLight18"
    End With
    Range("A1").FormulaR1C1 = "Pivot table made from data set" & " " & myDataset
    Range("A2").FormulaR1C1 = "Prepared by WWW.SASANALYSIS.COM on " & Date
    ActiveWorkbook.SaveAs Filename:=myFilepath, _
        FileFormat:=xlOpenXMLWorkbook, CreateBackup:=False
End Sub

Thursday, May 3, 2012

Top 10 tips and tricks about PROC SQL


PROC SQL is the implementation of the SQL syntax in SAS. It first appeared in SAS 6.0, and since then has been very popular for SAS users. SAS ships with a few sample data sets in its HELP library, and SASHELP.CLASS is one of them. This dataset contains 5 variables including name, weight, height, sex and age for 19 simulated teenagers, and in this paper I primarily use it for the demonstration purpose. Here I summarize the 10 interesting tricks and tips using PROC SQL. At the beginning, I first make a copy of SASHELP.CLASS at the WORK library and transform the row number of the data set to a new variable obs.
data class;
   set sashelp.class;
   /* Give an index for each child*/
   obs = _n_;

1. Calculate the median of a variable

With the aggregating HAVING clause and some self-join techniques, PROC SQL can easily calculate the median for a variable.

proc sql;
   select avg(weight) as Median
   from (select e.weight
   from class e, class d
   group by e.weight
   having sum(case when e.weight = d.weight then 1 else 0 end)
      >= abs(sum(sign(e.weight - d.weight))));

2. Draw a horizontal histogram
A histogram visualizes the distribution pattern of a variable. PROC SQL can draw a horizontal histogram by showing the frequency bars with a few asterisks for each level of the variable age.

proc sql;
   select age, repeat('*',count(*)*4) as Frequency
   from class
   group by age
   order by age;

3. Return the running total for a variable
A running total is the summation of a sequence of numbers which is updated each time with the increase of the observations. In the example below, I calculate the running total and save them as a new variable Running_total by the SUM function and a conditional statement, which logically is similar to an example in SAS/IML[1]. 

proc sql;
   select name, weight,
      (select sum(a.weight) from class as
      a where a.obs <= b.obs) as Running_total
   from class as b;

4. Report the total number for a variable
PROC SQL is a flexible way to find the total number for any variable by its set operator UNION and the SUM function. In the example, the total number of the variable weight is reported at the bottom of the output table.

proc sql;
   select name, weight
   from class
   union all
   select 'Total', sum(weight)
   from class;
5. Retrieve the metadata for a data set
SAS stores the metadata at its DICTIONARY data sets. PROC SQL can visit the directory, retrieve the column detail, and return the information to the users.
proc sql;
   select name, type, varnum
   from sashelp.vcolumn
   where libname = 'WORK' and memname = 'CLASS';
6. Rank a variable 
Besides the designated ranking procedure PROC RANK in SAS, PROC SQL can also do some simple ranking as well.

proc sql;
   select name, a.weight, (select count(distinct b.weight)
   from class b
   /* Rank by the ascending order for the weight variable*/
   where b.weight <= a.weight) as rank
   from class a;
7. Simple random sampling 
PROC SQL is widely used in simple random sampling. For example, I randomly choose 8 observations by the OUTOBS option at the PROC statement. The randomization process is realized by the RANUNI function at the ORDER BY statement with a seed 1234.

proc sql outobs = 8;
   select *
   from class
   order by ranuni(1234);
8. Replicate a data set without data
In PROC SQL, it is a fairly straightforward one-line statement to create a new empty data set while keeps all the structure of the original data set.
proc sql;
   create table class2 like class;

9. Transpose data
Usually DATA step ARRAY and PROC TRANSPOSE allow SAS users to restructure the data set, while PROC SQL sometimes is an alternative solution. For instance, if we need a wide-to-long operation to list the names of the children by their gender in the CLASS date set, then PROC SQL can fulfill the functionality through the combinations of some queries and subqueries.

proc sql;
   select max(case when sex='F'
      then name else ' ' end) as Female,
      max(case when sex='M'
      then name else ' ' end) as Male
   from (select,,
      (select count(*) from class d
      where and e.obs < d.obs) as level
      from class e)
   group by level;
10. Count the missing values
Another advantage of PROC SQL is that its NMISS function works for both numeric and character variables [2], which makes PROC SQL an ideal tool for missing value detection.

proc sql;
   select count(*) 'Total', nmiss(weight)
      'Number of missing values for weight'
   from class;
The combination of SAS’s powerful functions and the SQL procedure will benefit SAS users in data management and descriptive statistics.

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...