Monday, December 18, 2006

Reading compressed data

Large datasets are sometimes available as ASCII files that are too big to store on the HSPH Unix system, where disk space is limited. Such files can be kept on the Unix system in compressed form and uncompressed on the fly as they are read. While 7Z and gzip compressed files can be read as well, files compressed with the plain compress utility are the easiest to read. There are two ways to compress a file:

compress filename &

Replace "filename" with the name of the data file you wish to compress. This creates a new file with the extension ".Z" and removes the original. For example, compressing a file called "unit.dat" creates "unit.dat.Z" in place of unit.dat. The trailing "&" runs the command in the background, so the prompt returns while a large file is still being compressed.

gzip filename &

This creates a new file with the extension ".gz" and removes the original. For example, compressing a file called "unit.dat" creates "unit.dat.gz" in place of unit.dat.
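The replace-in-place behaviour described above can be tried safely on a throwaway file first. The following is only a minimal sketch: it assumes gzip is installed, and the scratch directory, the file name unit.dat, and its contents are made up for illustration.

```shell
# Hypothetical example: use a scratch directory so no real data is touched.
workdir=$(mktemp -d)

# Create a small made-up ASCII data file (two records, three fields each).
printf '1 2 3\n4 5 6\n' > "$workdir/unit.dat"

# Compress it; gzip replaces unit.dat with unit.dat.gz.
gzip "$workdir/unit.dat"

# List the directory: only the compressed file should remain.
listing=$(ls "$workdir")
echo "$listing"

# Decompress to standard output only; unit.dat.gz stays on disk unchanged.
first_record=$(gzip -dc "$workdir/unit.dat.gz" | head -n 1)
echo "$first_record"

# Remove the scratch directory.
rm -rf "$workdir"
```

Decompressing to standard output is the step the SAS programs below rely on: the data is read from the stream and no uncompressed copy is written to disk.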

To read the .Z compressed file into SAS without uncompressing it beforehand, add the following to your program:

FILENAME pipedata PIPE 'zcat /usr2/users/student/mkaushik/ne/unit.z' ; /* pipedata is a user-chosen name */
DATA weights;
INFILE pipedata ; /* the same user-chosen name as in the FILENAME statement */
input cogscore race hosmokin htn hochf age female charlson hrtdis pf36 mh36 pn36 enr36 sf36 nyha;
run;

/* zcat decompresses the input file and writes the result to standard output. SAS reads that output through the pipe. */
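Before running the SAS step, it can help to look at the stream the pipe will deliver and to confirm that each record has as many fields as the INPUT statement names (15 here). The sketch below is an illustration under stated assumptions: the sample file and its values are made up, and gzip -dc stands in for zcat so that the example does not depend on the compress utility being installed.

```shell
# Hypothetical scratch file standing in for the real data; values are made up.
sample=$(mktemp)
printf '1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\n15 14 13 12 11 10 9 8 7 6 5 4 3 2 1\n' > "$sample"
gzip "$sample"    # leaves "$sample.gz" in place of "$sample"

# Preview the first records of the decompressed stream. This is the same
# kind of output that the PIPE device hands to INFILE.
gzip -dc "$sample.gz" | head -n 2

# Collect the distinct field counts across all records. The INPUT statement
# names 15 variables, so a single value of 15 is the hoped-for result.
field_counts=$(gzip -dc "$sample.gz" | awk '{ print NF }' | sort -u)
echo "fields per record: $field_counts"

# Remove the scratch file.
rm -f "$sample.gz"
```

If more than one field count is reported, some records are shorter or longer than the INPUT statement expects and are worth inspecting before the data is read in.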

Gzipped files can be read in the same way, using the following commands.

FILENAME pipedata1 PIPE 'gzcat /usr2/users/student/mkaushik/ne/page.gz' ; /* pipedata1 is a user-chosen name */
DATA weights;
INFILE pipedata1;
input cogscore race hosmokin htn hochf age female charlson hrtdis pf36 mh36 pn36 enr36 sf36 nyha;
run;
/* This works only on the HSPH system, not on the Channing system */

 

A more portable form uses gzip -dc, which also writes the decompressed data to standard output:

FILENAME pipedata1 PIPE 'gzip -dc /usr2/users/student/mkaushik/ne/page.gz' ; /* pipedata1 is a user-chosen name */
DATA weights;
INFILE pipedata1;
input cogscore race hosmokin htn hochf age female charlson hrtdis pf36 mh36 pn36 enr36 sf36 nyha;
run;
/* This works on both the HSPH and Channing systems */
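
Since the available commands differ between systems, as the HSPH and Channing notes above show, one way to decide which command to put in the FILENAME statement is to ask the shell which of them exist on the machine in question. This is a rough sketch and not a site-specific recipe; the variable name available is made up for the example.

```shell
# Report which of the decompression commands used above exist on this machine.
available=""
for cmd in zcat gzcat gzip; do
    if command -v "$cmd" >/dev/null 2>&1; then
        echo "$cmd: found"
        available="$available $cmd"
    else
        echo "$cmd: not found"
    fi
done

# Summary of the commands that can be used inside the PIPE string.
echo "usable here:$available"
```

If gzcat is reported as not found but gzip is present, the gzip -dc form of the FILENAME statement is the one to use.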