Saturday, December 31, 2011

Using SQL for data management


Every weekday I pull out data from IBM DB2 database and do analysis in SAS. I use SQLite to record my personal chores, such as credit card transactions. And I am taking courses of database development by Microsoft SQL Server. Then I have pretty enough time to compare the 4 softwares. Recently I have the rough feeling -- the relational database end and the analytics end look more and more similar, while the databases brings more analysis procedures and every statically packages is eager to introduce SQL feature as enhancement. Thus, Structured Query Language (SQL) tends to be the universal language in the data world.

SQLs in DB2, Microsoft SQL Server (T-SQL), SQLite and SAS (PROC SQL) follow the same logic but contain different characteristics. We can treat them as the branches originated from the ANSI SQL standard. I play around their own SQL dialects and try to learn how to manage data in different ways.

Data
Using the SQL syntax below,  the SASHELP.CLASS data set, which contains the gender, name, age, weight and height information of 18 fake teenagers, will be able to be transferred to any database system.
create table class (name char(8), sex char(1), 
   age numeric, height numeric, weight numeric );
insert into class values('ALFRED','M',14,69,112.5);
insert into class values('ALICE','F',13,56.5,84);
insert into class values('BARBARA','F',13,65.3,98);
insert into class values('CAROL','F',14,62.8,102.5);
insert into class values('HENRY','M',14,63.5,102.5);
insert into class values('JAMES','M',12,57.3,83);
insert into class values('JANE','F',12,59.8,84.5);
insert into class values('JEFFREY','M',13,62.5,84);
insert into class values('JOHN','M',12,59,99.5);
insert into class values('JOYCE','F',11,51.3,50.5);
insert into class values('JUDY','F',14,64.3,90);
insert into class values('LOUISE','F',12,56.3,77);
insert into class values('MARY','F',15,66.5,112);
insert into class values('PHILIP','M',16,72,150);
insert into class values('ROBERT','M',12,64.8,128);
insert into class values('RONALD','M',15,67,133);
insert into class values('THOMAS','M',11,57.5,85);
insert into class values('WILLIAM','M',15,66.5,112);

Random sampling
I randomly select 8 out of the 18 persons. The realization of this purpose mainly relies on the software's' random function and output control.

/********************** a. SQL Server **********************/
select top (8) *
from class
order by NEWID();

/********************** b. DB2 **********************/
select *
from class
order by rand() 
fetch first 8 rows only;

/********************** c. SQLite **********************/
select *
from class
order by random() 
limit 8;

/********************** d. SAS **********************/
proc sql outobs = 8;
select *
from class
order by ranuni(0);

Report subtotal of a variable as the last observation
It is a quite common task to make a table with subtotal for reporting. DB2 and SQL Server both have ROLLUP operator/function, while SAS may utilize its macro facility.

/********************** a. SQL Server **********************/
select coalesce(name, 'Total') as name, SUM(weight) as weight
from class
group by name with rollup;

/********************** b. DB2 **********************/
select case grouping(name) 
      when 0 then name
      else 'Total'
   end name,
sum(weight) as weight
from class
group by rollup(name);

/********************** c. SQLite **********************/
select name, sum(weight) as weight
from class
group by name
union all
select 'Total', sum(weight) 
from class;

/********************** d. SAS **********************/
select sum(weight) into: sum_weight from class;
create table class2 as select name, weight from class;
insert into class2 values('Total', &sum_weight);

Retrieve column information of a table
Each database tries to store the metadata somewhere, so that we are able to retrieve the information of the columns/variables, such as their name, data type and position at the table. Only SQLite, a portable database, has no designated table to store this information. However, running the command line shell of SQLite has the same effect.

/********************** a. SQL Server **********************/
select column_name, data_type, ordinal_position
from information_schema.columns
where table_name = 'class';

/********************** b. DB2 **********************/
select colname, typename, colno
from syscat.columns
where tabname = 'class';

/********************** c. SQLite **********************/
pragma table_info(class);

/********************** d. SAS **********************/
select name, type, varnum
from sashelp.vcolumn
where libname = 'WORK' and memname = 'CLASS';

Obtain the running total of a variable
It seems like a difficult job for native SQL syntax. SQL Server and DB2 evolve with their ranking functions -- DB2 has sum() over() and SQL Server has row_number() over(), which can be used to solve this problem. In SAS, the better way is to use DATA step instead.

/********************** a. SQL Server **********************/
select name, weight, sum(weight) over (order by weight) as running_total
from class
order by weight

/********************** b. DB2 **********************/
create view v_class as
select ROW_NUMBER() OVER(ORDER BY name) as rownum, *
from class;
select a.name, a.weight, 
   (select SUM(b.weight) from v_class b
   where b.rownum <= a.rownum) as running_total
from v_class as a

/********************** d. SAS **********************/
data class2;
   class;
   retain running_total;
   running_total + weight;
   keep name weight running_total; 
run;

Good math, bad engineering

As a formal statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s abilit...