Downloading Data From The Web
Different websites provide data in several formats: ASCII, SAS, SPSS, Stata and others. Not all data are available in all formats, though, so you need to choose the one that best suits your needs. Sometimes you will find that the data you want are not available in the format you want. Don't worry about this; your data can always be converted. Here are some tips:
- Data and related files are often bundled together and compressed into what is often called a "zip" file. Common file extensions for these are ".zip" on Windows and ".gz" or ".tar" on Unix (often combined as ".tar.gz"). You will need to "unzip" the files before you can do anything else with them. WinZip is a good Windows program, and the "gunzip" and "tar" commands can be used on Unix.
- If there is an ASCII (i.e., plain text) data set with a program file for the statistical package you intend to use, then select that option. Sometimes there is an option for data files already in the format you want ("system," "portable," "transport"), but these may have some "glitches" due to differences in the type of machine they were created on and the type you are using. It's rare, but it does happen.
- If there is not an ASCII data set and setup file in the package you want to use, but there is one for another package, then use the other package to create a system file and then convert it. For example, if you like to use SPSS, but there is only an option for SAS, use SAS to read and create a SAS data file, then convert the SAS data file to SPSS using StatTransfer.
- If you are downloading data from a geospatial data site, the file may be in "Dbase" format and have an extension of ".dbf". This is the format used by ArcView (a "shape" file is actually a set of files, one or more of which is a .dbf file). These files can be read directly into SAS, Stata, SPSS and Excel. We have more information on using the statistical packages with ArcView.
- The setup files are written to read the entire data set, which you may not need. Rather than editing the program to read only the variables/observations you want, let the program read the entire data set, then add drop and/or keep statements in the appropriate place to retain what you want. Make absolutely sure that you select all identification and weighting variables. If you are not sure whether you want a particular variable, keep it. It's easier to ignore or drop a variable later than it is to go back and add it to your dataset.
- Sometimes the programs have large sections "commented out" so those statements are not executed. If you do want these statements to be executed, then be sure to un-comment them. Typically, these are statements to convert missing value codes (such as "999") to system-missing codes.
- If possible, run some descriptive statistics on your data and compare them to the codebook or some other source to make sure you have read the data correctly.
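
As an illustration of the "unzip" step in the first tip, here is a minimal Python sketch that unpacks the common bundle types using only the standard library. The function name unpack and the file names in the usage note are hypothetical, and the sketch assumes the archive comes from a source you trust, since extracting an untrusted archive can write files outside the destination folder.

```python
import gzip
import shutil
import tarfile
import zipfile
from pathlib import Path


def unpack(archive, dest):
    """Unpack a .zip, .tar, .tar.gz, or plain .gz file into the folder dest."""
    archive, dest = Path(archive), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    elif tarfile.is_tarfile(archive):
        # tarfile.open detects compression, so .tar and .tar.gz both work.
        with tarfile.open(archive) as tf:
            tf.extractall(dest)
    elif archive.suffix == ".gz":
        # A bare .gz holds a single file; write it out without the suffix.
        with gzip.open(archive, "rb") as src, open(dest / archive.stem, "wb") as out:
            shutil.copyfileobj(src, out)
    else:
        raise ValueError(f"unrecognized archive type: {archive}")
    # Return the names now present in dest so you can see what was bundled.
    return sorted(p.name for p in dest.iterdir())
```

A call such as unpack("study.zip", "study_files") would return the names of the extracted files, which is a quick way to confirm that the data file, the setup file, and the codebook all arrived.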
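
The tips about keeping variables, converting missing value codes, and checking descriptive statistics can be sketched together in a few lines of Python. This is only an illustration and does not replace the setup files for SAS, SPSS or Stata: the variable names (CASEID, WEIGHT, AGE, INCOME, REGION), the sample values, and the use of "999" as the missing code for AGE and INCOME are all assumptions made up for the example.

```python
import csv
import io
import statistics

# Hypothetical extract; real data would come from the file you downloaded.
RAW = """CASEID,WEIGHT,AGE,INCOME,REGION
1,1.2,34,52000,3
2,0.8,999,41000,1
3,1.0,58,999,2
"""

# Keep the identification and weighting variables along with the ones you need.
KEEP = ["CASEID", "WEIGHT", "AGE", "INCOME"]
# Assumed codebook entries: "999" means missing for these two variables.
MISSING_CODES = {"AGE": {"999"}, "INCOME": {"999"}}


def load(text, keep, missing_codes):
    """Read every column, then keep a subset and recode missing codes to None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {}
        for name in keep:
            value = record[name]
            is_missing = value in missing_codes.get(name, ())
            row[name] = None if is_missing else float(value)
        rows.append(row)
    return rows


def describe(rows, name):
    """Count, mean, minimum, and maximum of the non-missing values."""
    values = [r[name] for r in rows if r[name] is not None]
    return {"n": len(values), "mean": statistics.mean(values),
            "min": min(values), "max": max(values)}


rows = load(RAW, KEEP, MISSING_CODES)
for name in ("AGE", "INCOME"):
    print(name, describe(rows, name))
```

If the missing codes were left unconverted, the 999 values would inflate the mean and maximum of AGE, which is exactly the kind of discrepancy that comparing against the codebook is meant to catch.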