Tuesday, November 30, 2010

Why I like Mainframe SAS


I was horrified by Mainframe SAS first time when I saw it. I struggled many times to let SAS output ‘Hello World’ to the log and failed, and then I started to sweat. Finally my colleague came to rescue me by teaching me to how to insert JCL before the SAS codes. In a world where every programmer is spoiled by shinning GUI or IDE, I felt fairly uncomfortable to code flashing characters in a floating window by ISPF, a seemly unfriendly text editor. To make things even worse, SAS’s interactive mode was forbidden in my team. Then to test a program is like fumbling for the end in a dark tunnel: you don’t know what your result will be like until the system told you it is ready after minutes’ waiting. As the result, I just found myself totally lost. Eventually I survived in the mainframe kingdom and now I start to like MF SAS. Even though it is a relic from ancient ‘punch-card’ time, MF SAS still has some unique strength, comparing with its more popular peers, such as UNIX SAS and PC SAS.

The power of MF SAS comes from its combination with z/OS, ISPF and JCL as a whole. (1) Nature of sharing. Since we are dealing with a distant server through 3270 terminal, all datasets and program codes can be easily shared. Thus, it is not necessary to set up any FTP or file sharing server. Thousands of coworkers can seamlessly share their progress and keep synchronized. (2) Native sorting and concatenation. JCL can concatenate multiple files together simply by naming them in Data Definition step. System-wise Sort program is also very handy. (3) Quick setup of system options. JCL can help SAS specify some system options, which cannot be switched on/off within SAS itself. For example, the SASLIST and SASLOG files can be outputted to somewhere we want to check later. (4) Scheduling of a job. The code in JES, part of JCL, can order SAS to run at certain time without human’s intervention. That is a really cool feature: how about that SAS starts to run from 2am and I check the overnight result in the morning. (5) Pipeline statement in DD. From SAS 9.1, SAS supports pipeline, by which a library in MF can be translated into an equivalent library in SAS. It will save time to define datasets. (6) Subroutines in calling ISPF commands. SAS provides some subroutines, such as CALL ISPEXEC and CALL ISPLINK, to integrate ISPF. Besides those advantages above, another good thing is that in ISPF the datasets can be manually modified. That is a blessing for people tends to change small mistakes by themselves, instead of running program once again. In conclusion, SAS thrived from the mainframe period, and MF SAS is still visible and alive. Let us enjoy it and hope it to flourish again.

Reference:1. SAS®9.1.3 Companion for z/OS. SAS Publishing. 2004
2. Introduction to the New Mainframe: z/OS Basics. IBM Press. 2009

**********************AUTHOR(DAPANGMAO)----HCHAO8@GMAIL.COM***********************************;
//************MERGE THREE FILES AND SORT**************************
//MYSTEP01 JOB 1 'DAPANGMAO', CLASS=C, MSGCLASS=X, MSGLEVEL=(1,1),
// NOTIFY=*
// EXEC PGM=SORT
//SORTIN DD DSN=MYUNIT.MYLIB.MYDATA(CAT),DISP=SHR
// DD DSN=MYUNIT.MYLIB.MYDATA(DOG),DISP=SHR
// DD DSN=MYUNIT.MYLIB.MYDATA(PIG),DISP=SHR
//SORTOUT DD DSN=MYUNIT.MYLIB.MYDATA(MERGE),UNIT=STORAGE
// DISP=(NEW,CATLG),SPACE=(TRK,(25,5),RLSE),
// DCB=(RECFM=FB,LRECL=250, BLKSIZE=25000)
//SYSOUT DD SYSOUT=*
//SYSIN DD *
SORT FILEDS=(1,10,CH,A)
/*
//*******************SAS PROCEDURE TO CALCULATE MEANS*************
//MYSTEP02 JOB 1 'DAPANGMAO', CLASS=C, MSGCLASS=X,MSGCLEVEL=(1,1)
// NOTIFY=*
//*MAIN DEADLINE=(0400,A,12152010)
// EXEC SAS,
// OPTIONS='LOG=OUTLOG PRINT=OUTPRINT'
//*ALTERNAT EXEC SAS, CONFIG= MYUNIT.MYLIB.MYCONFIG
//MYINPUT DD DSN=&&MERGE,DISP=SHR
//OUTLOG DD DSN=MYUNIT.MYLIB.LOGDATA.OUTLOG,DISP=OLD
//OUTPRINT DD DSN=MYUNIT.MYLIB.LOGDATA.OUTPRINT,DISP=OLD

DATA MERGE1;
INFILE MYINPUT;
INPUT ANIMAL &9. TYPE &5. NUMBER 5.;
RUN;

PROC MEANS DATA=MERGE1;
CLASS TYPE;
VAR NUMBER;
RUN;
/*
//*******************BUILD A PIPE LIBRARY FOR SAS*************
//MYSTEP03 JOB 1 'DAPANGMAO', CLASS=C, MSGCLASS=X, MASGLEVEL=(1,1),
// NOTIFY=*
// EXEC SAS
//PIPELIB DD DSN=MYUNIT.MYLIB.MYDATA,LRECL=6144,RECFM=F,DSORG=PS,
// SUBSYS=(BP01,CLOSESYNC,ERC=DUMMY),LABEL=(,,,OUT)
//ADDON DD DSN=MYUNIT.MYLIB.MYDATA(FOX),DISP=SHR

DATA PIPELIB.CAT;
INFILE ADDON;
OUTPUT;
RUN;

DATA _NULL_;
CALL ISPEXEC(’SELECT PANEL(MAINDEV)’);
IF PLIRETV = 0 THEN PUT PLIRETV=;
RUN;
/*

Sunday, November 28, 2010

The efficiency of five SAS methods in multi-dataset merging


Introduction: Merging two or multiple datasets is essential for many ‘data people’. Yes, it is a dirty and routine job. Everyone wants to get it done quick and accurate. Actually, SAS has many ways to tackle this job[3]. In two competing papers from SAS Global Conference 2009, Qinfeng Liang[1] described five ways to marge a base table and a lookup table regarding the healthcare industry, while David Franklin[2] pictured eight methods to combine patient and effect datasets in a typical pharmaceutical scenario. Here I would like to extend the discussion further: one base table and two lookup tables. I would like to see which one of the solutions would cost less hardware resource and, most importantly, system time.

Method: The base table was generated with 10 million sequential numbers. Two subset tables were randomly chosen from the base table and kept unsorted until a method was applied, and each contains 1 million records. Five methods, Proc SQL, Data step Merge, Proc Format, Data step Hash object and Data step key-set were utilized and compared. System time was summed for each method. The requirement for memory was recorded.

Result: As expected, methods related to data step consume less memory. Key-data structure methods, including Proc SQL, Proc Format and Data step Hash object, ask much more memory. However, in-memory processing does not deliver much help to the time. Proc Format and Data step Hash object still spend above-average time, while obviously Data step set-key costs most time in waiting.

Discussion: Amazingly, even as an ancient technology (maybe 40 years old), Data step Merge is the winner in this competition with both satisfied time and least memory usage. It is also code-efficient. Proc SQL is the second choice. Hash object doesn’t show its edge as other authors suggested. Proc format and Data step key-set are the least favored ones. I also tried Data step Array, and I found that it was very difficult to load the lookup table and eventually I gave up the attempts. The solution by Proc Format is hard to code for multiple table joining, since each has to build an individual format and Proc Format has no batch mode. In conclusion, the choice of the best method depends on specific needs or situation. Old methods, like Data step Merge, can still perform as well as others do, sometimes even better.

References: 1. Qingfeng Liang. Choosing the Right Technique to Merge Large Data Sets Efficiently. SAS Global Forum 2009
2. David Franklin. Merging Data Eight Different Ways. SAS Global Forum 2009
3. SAS® Certiļ¬cation Prep Guide: Advanced Programming for SAS®9. SAS Institute Inc. 2007

********************GENERATE THREE TABLES TO JOIN BY SIMULATION****************;
********************1. GENERATE THE BASE TABLE;
data base;
do number=1 to 1E7;
output;
end;
run;

********************2. GENERATE THE SUB TABLE TO BE JOINED;
proc sql outobs=1000000;
create table sub1 as
select number 
from base 
order by ranuni(43234);
create table sub2 as
select number 
from base 
order by ranuni(45954);
quit;
********************END OF SIMULATION ****************;


*******************MERGE STEPS******************;
*******************1. PROC SQL;
proc sql;
create table sqlmerge as
select a.number, b.number as var1, c.number as var2
from base as a left join sub1 as b on a.number = b.number
left join sub2 as c on  a.number=c.number;
quit;

*******************2. DATA STEP MERGE;
proc sort data=sub1 out=sub1_s;
by number;
run;
proc sort data=sub2 out=sub2_s;
by number;
run;
data datastepmerge;
merge base(in=a) sub1_s(in=b) sub2_s(in=c);
by number;
if b then var1=number;
if c then var2=number;
if a;
run;
/*ALTERNATIVE 1 --  USE DATA SET INDEX TO APPLY DATA STEP MERGE*/
/*
proc sql;
create index number on base;
create index number on sub1;
create index number on sub2;
quit;
data datastepmerge;
merge base(in=a) sub1(in=b) sub2(in=c);
by number;
if b then var1=number;
if c then var2=number;
if a;
run;
*/
/*ALTERNATIVE 2 --  USE DATA SET VIEW TO APPLY DATA STEP MERGE*/
/*
proc sql;
create view  Vbase as select * from base order by number ;
create view  Vsub1 as select * from sub1 order by number;
create view  Vsub2 as select * from sub2 order by number;
quit;
data datastepmerge;
merge Vbase(in=a) Vsub1(in=b) Vsub2(in=c);
by number;
if b then var1=number;
if c then var2=number;
if a;
run;
*/

*******************3. PROC FORMAT;
data sub1_fmt;
length start $12; 
retain fmtname "sub1_fmt";
set sub1 end=lastobs;
do _n_=1 until (lastobs);
start=input(number, $12.);
label=start;
drop number;
output;
end;
if lastobs then do;
start ='other' ;
label= '.';
output;
end;
run;
proc format cntlin=sub1_fmt;
run;

data sub2_fmt;
length start $12; 
retain fmtname "sub2_fmt";
set sub2 end=lastobs;
do _n_=1 until (lastobs);
start=input(number, $12.);
label=start;
drop number;
output;
end;
if lastobs then do;
start ='other' ;
label= '.';
output;
end;
run;
proc format cntlin=sub2_fmt;
run;

data formatmerge;
set base;
var1=put(number, sub1_fmt.);
var2=put(number, sub2_fmt.);
run;

*******************4. DATA STEP HASH OBJECT;
data hashobjmerge;
if _n_=1 then do;
declare hash sub1_h(dataset: 'sub1');
sub1_h.defineKey('number');
sub1_h.defineDone();
declare hash sub2_h(dataset: 'sub2');
sub2_h.defineKey('number');
sub2_h.defineDone();
end;
set base;
var1=number; 
var2=number;
if sub1_h.find() then call missing(var1); 
if sub2_h.find() then call missing(var2);
run;

**********5. DATA STEP KEY-SET;
proc sql;
create table sub1_ix as
select *, number as var1
from sub1;
create index number on sub1_ix;
quit;
data sub2_ix(index=(number /unique /nomiss));
set sub2;
var2=number;
run;
data keysetmerge;
set base;
set sub1_ix KEY=number /unique;
if _IORC_ then do;
_ERROR_=0;
var1='';
end;
set sub2_ix key=number/unique;
if _IORC_ then do;
_ERROR_=0;
var2='';
end;
run;
quit;
*******************END OF COMPARISION************;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...