Introduction to SAS #
What is SAS? #
SAS (Statistical Analysis System) is a comprehensive software suite used for advanced analytics, business intelligence, data management, and predictive modeling. It’s a powerful tool employed across various industries, including healthcare, finance, and research, to process, analyze, and visualize large datasets. SAS offers a programming language (also called SAS) along with a graphical user interface (GUI) for ease of use. While it’s known for its statistical capabilities, its functionalities extend far beyond basic statistics, encompassing data warehousing, data mining, and creating reports and dashboards.
Why learn SAS? #
Learning SAS offers numerous benefits for aspiring data scientists, analysts, and anyone working with large datasets. These include:
- High demand: SAS professionals are in high demand across many industries, leading to excellent career prospects and competitive salaries.
- Powerful analytical capabilities: SAS provides a robust set of tools for performing complex statistical analyses, data manipulation, and predictive modeling.
- Extensive data management features: SAS excels at managing and manipulating large, complex datasets with ease, handling both structured and unstructured data.
- Industry-standard software: Many organizations rely on SAS for their data analysis needs, making SAS proficiency a valuable skill.
- Comprehensive documentation and support: SAS provides extensive documentation, tutorials, and community support, making it easier to learn and troubleshoot issues.
- Integration with other tools: SAS integrates well with other software and programming languages, enhancing its versatility and power.
SAS System Architecture #
The SAS system consists of several key components working together:
- SAS DATA Step: This is the core programming language used to manipulate data, perform calculations, and create new datasets.
- SAS Procedures (PROCS): These are pre-written programs that perform specific statistical analyses, data reporting, and other tasks. They provide a higher-level interface compared to the DATA step.
- SAS Server: This manages the processing and storage of data. It’s especially important when working with very large datasets.
- SAS Library: A location where SAS datasets are stored and accessed.
- SAS Workspace: The environment where users interact with the SAS software. This encompasses the editor, log, output, and other windows.
- SAS User Interface (GUI): A graphical interface for users who prefer a visual approach to data analysis (e.g., SAS Studio).
Setting up your SAS environment #
Setting up your SAS environment depends on whether you’re using a locally installed version or a cloud-based version (e.g., SAS OnDemand for Academics or SAS Viya). Generally, the process involves:
Obtaining access: This might involve installing the software on your computer, obtaining a license, or accessing a cloud-based platform through a web browser. Instructions will vary depending on your access method.
Installation (if applicable): Follow the installation instructions provided by SAS. This will typically involve downloading the installation files and running the installer.
License activation (if applicable): After installation, you may need to activate your SAS license using a license key.
Configuration (if applicable): You may need to configure your SAS environment to connect to specific databases or servers.
Launching SAS: Once the installation and configuration are complete, you can launch the SAS software. The specific method will depend on the version and your installation method.
Familiarization: After launching SAS, explore the different components of the software. Get familiar with the editor, log, output windows and the available user interface elements. Many versions come with introductory tutorials or help files.
Remember to consult the official SAS documentation for detailed, version-specific instructions on setting up your environment.
SAS Programming Basics #
The SAS DATA step #
The SAS DATA step is the fundamental building block of SAS programming. It’s a procedural programming language used to create, modify, and manage SAS datasets. The DATA step executes one observation at a time, reading data, performing calculations, and writing the results to a new dataset. The basic structure of a DATA step is as follows:
DATA new_dataset;
* statements to process data;
RUN;
DATA new_dataset;
This statement begins the DATA step and specifies the name of the new dataset to be created.
* statements to process data;
This section contains the SAS statements that perform operations on the data.
RUN;
This statement signals the end of the DATA step and initiates the execution.
Each DATA step processes data in an iterative manner, processing each observation individually until all observations have been processed.
Inputting data #
Data can be inputted into a SAS DATA step from various sources, including:
- External files: Data can be read from text files (e.g., CSV, txt), Excel spreadsheets, or databases. The
INPUT
statement is used to read data from these sources, specifying the variable names and formats. For example:
DATA mydata;
INPUT Name $ Age Height Weight;
DATALINES;
John 30 72 180
Jane 25 65 140
;
RUN;
Direct input: The
DATALINES
statement allows you to directly input data within the DATA step.SAS datasets: Data can be read from existing SAS datasets using a
SET
statement.
Data manipulation with SAS DATA step #
The SAS DATA step offers powerful capabilities for data manipulation. Common operations include:
- Creating new variables: Use assignment statements (
=
) to create new variables based on existing ones. For example:
DATA mydata;
INPUT Name $ Age;
BMI = Weight / (Height/100)**2; /* Calculate BMI */
DATALINES;
John 30 70 180
Jane 25 65 140
;
RUN;
Modifying existing variables: Change values, apply transformations (e.g., logarithmic transformations), and recode variables.
Filtering data: Use
WHERE
statement to select specific observations based on conditions.Sorting data: Use the
PROC SORT
procedure to sort a dataset based on one or more variables.
Understanding SAS variables #
SAS variables have a name, a type (numeric or character), and a length.
Variable names: Must start with a letter and can contain letters, numbers, and underscores. They are case-insensitive.
Variable types: Numeric variables store numbers, while character variables store text. Character variables are identified by a dollar sign ($) at the end of the variable name in the
INPUT
statement.Variable length: Specifies the maximum number of characters for character variables. Numeric variables have a default length.
Working with SAS datasets #
SAS datasets are tabular data structures organized into observations (rows) and variables (columns). They are stored in a proprietary format. Key operations include:
Creating datasets: The
DATA
statement creates a new SAS dataset.Accessing datasets: Use
SET
,MERGE
, or other statements to access data from existing datasets within a DATA step.Modifying datasets: Change data, add or delete variables, filter observations within a DATA step.
Outputting data #
After processing data within the DATA step, the results are typically written to a new SAS dataset. You can also output data to:
External files: Export data to text files (CSV, TXT), Excel spreadsheets, or databases using procedures like
PROC EXPORT
.Printed output: Generate reports and tables using procedures like
PROC PRINT
.SAS datasets: The processed data is automatically saved as a new SAS dataset unless explicitly otherwise specified.
Essential SAS Procedures #
PROC PRINT #
PROC PRINT
is a fundamental SAS procedure used to display the contents of a SAS dataset. It provides a simple and quick way to view the data, check for errors, and understand the structure of the dataset.
Basic Syntax:
PROC PRINT DATA=dataset_name;
RUN;
Replace dataset_name
with the name of the SAS dataset you want to display. Options can be added to customize the output (e.g., specifying variables to display, controlling the format).
Example:
PROC PRINT DATA=mydata;
RUN;
This will print all variables and observations from the dataset mydata
.
PROC MEANS #
PROC MEANS
is a powerful procedure used to calculate descriptive statistics for numeric variables in a SAS dataset. It can compute summary statistics such as the mean, standard deviation, minimum, maximum, and more.
Basic Syntax:
PROC MEANS DATA=dataset_name MEAN STD MIN MAX;
VAR variable1 variable2 ...;
RUN;
Replace dataset_name
with your dataset name, and list the variables you want to calculate statistics for in the VAR
statement. You can specify other statistics beyond the ones shown (e.g., median, percentiles).
Example:
PROC MEANS DATA=sales_data MEAN STD MIN MAX;
VAR Sales Revenue Profit;
RUN;
This will compute the mean, standard deviation, minimum, and maximum for the Sales
, Revenue
, and Profit
variables in the sales_data
dataset.
PROC FREQ #
PROC FREQ
is used to generate frequency tables and perform chi-square tests. It’s particularly useful for categorical variables, showing the number of observations for each category and calculating percentages.
Basic Syntax:
PROC FREQ DATA=dataset_name;
TABLES variable1 variable2 ...;
RUN;
List the categorical variables for which you want frequency tables in the TABLES
statement. Options can be added to control the output (e.g., to calculate percentages, perform chi-square tests).
Example:
PROC FREQ DATA=customer_data;
TABLES Gender*Region / CHISQ; /* Chi-square test for association between Gender and Region */
RUN;
PROC SQL #
PROC SQL
allows you to use SQL (Structured Query Language) statements to access and manipulate SAS datasets. It’s a very powerful tool for complex data manipulations that might be difficult or cumbersome to achieve using only the DATA step.
Basic Syntax:
PROC SQL;
SELECT statement;
QUIT;
The SELECT
statement follows standard SQL syntax. You can use WHERE
clauses for filtering, JOIN
statements to combine datasets, and other SQL commands.
Example:
PROC SQL;
CREATE TABLE HighSales AS
SELECT *
FROM SalesData
WHERE Sales > 10000;
QUIT;
This creates a new table HighSales
containing only the observations from SalesData
where Sales
is greater than 10000.
PROC SORT #
PROC SORT
is used to sort the observations in a SAS dataset based on one or more variables. Sorting is crucial for many data analysis tasks, such as preparing data for merging or specific analyses.
Basic Syntax:
PROC SORT DATA=dataset_name OUT=sorted_dataset_name;
BY variable1 variable2 ...;
RUN;
Specify the input dataset (dataset_name
) and the output dataset (sorted_dataset_name
). List the variables to sort by in the BY
statement. The observations will be sorted in ascending order by default; you can use the DESCENDING
option to sort in descending order.
Example:
PROC SORT DATA=customer_data OUT=sorted_customer_data;
BY LastName FirstName;
RUN;
This sorts the customer_data
dataset by LastName
(ascending) and then by FirstName
(ascending) within each LastName
group, creating a new sorted dataset called sorted_customer_data
.
Data Manipulation Techniques #
Data Cleaning #
Data cleaning is a crucial step in any data analysis project. It involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in the data. Common data cleaning tasks in SAS include:
Handling missing values: Missing values can be represented differently depending on the data source. SAS represents missing numeric values as a dot (.) and missing character values as a blank. Techniques for handling missing values include:
- Deletion: Removing observations with missing values (using a
WHERE
statement in the DATA step). This is only suitable if the missing data is a small percentage and not systematically biased. - Imputation: Replacing missing values with estimated values (e.g., using the mean, median, or a more sophisticated imputation method).
- Indicator variables: Creating a new variable that indicates whether a value was missing.
- Deletion: Removing observations with missing values (using a
Identifying and correcting outliers: Outliers are extreme values that differ significantly from other values in the dataset. They can be identified through visual inspection (e.g., using PROC UNIVARIATE or PROC GPLOT) or statistical methods. Corrections might involve removing outliers (if they are due to errors) or transforming the data.
Dealing with inconsistent data: This includes inconsistencies in data formats (e.g., dates, numbers), spelling errors, and duplicate entries. In SAS, you can use functions like
INPUT
with various formats to standardize data formats, and procedures likePROC COMPARE
to identify discrepancies between datasets. Duplicate entries can be identified and removed using thePROC SQL
DISTINCT
keyword.Data validation: Checking data against known constraints (e.g., ensuring values are within a valid range, checking data types). This can be done through data step programming using
IF-THEN-ELSE
statements or by using validation rules within data input procedures.
Data Transformation #
Data transformation involves modifying the data to make it suitable for analysis. Common transformations in SAS include:
Creating new variables: Calculating new variables from existing ones (e.g., creating a BMI variable from height and weight, calculating ratios or percentages). This is usually done within a DATA step using assignment statements.
Recoding variables: Changing the values of a variable (e.g., recoding categorical variables into numerical codes, grouping values into categories). The
IF-THEN-ELSE
statements or theRETAIN
statement can be used in the DATA step.Scaling variables: Adjusting the scale of variables, such as standardizing or normalizing variables to have a mean of 0 and a standard deviation of 1. Procedures like
PROC STANDARD
can be used.Data aggregation: Summarizing data at a higher level (e.g., calculating the average sales per region, summing values across groups).
PROC MEANS
orPROC SUMMARY
can be used for this.
Data Merging #
Data merging involves combining data from multiple datasets based on common variables (keys). In SAS, several approaches exist:
One-to-one merge: Combining two datasets where each observation in one dataset corresponds to exactly one observation in the other dataset.
One-to-many merge: Combining datasets where one observation in one dataset can match multiple observations in the other dataset (e.g., merging customer data with order data).
Many-to-many merge: Combining datasets where multiple observations in one dataset can match multiple observations in the other dataset.
The MERGE
statement in the DATA step is typically used for merging datasets. The BY
statement specifies the common variables used for matching.
Data Subsetting #
Data subsetting involves selecting a portion of a dataset based on specified criteria. This can be achieved through:
- WHERE statement: Using a
WHERE
statement in the DATA step to select observations based on conditions. For example:
DATA subset_data;
SET original_data;
WHERE Age > 30 AND Gender = 'Male';
RUN;
IF-THEN-ELSE statements: Using
IF-THEN-ELSE
statements in a DATA step to conditionally select and process observations.PROC SQL: Using
WHERE
clauses inPROC SQL
statements to select observations.Using
PROC SURVEYSELECT
: For creating random samples from a dataset. This is useful for working with a large dataset where analysis on a representative subset is sufficient.
Working with SAS Libraries #
Creating Libraries #
SAS libraries are collections of SAS datasets stored in a specific location on your computer or server. Creating a library allows you to organize your datasets efficiently and manage them more effectively. There are several ways to create SAS libraries:
- Using the LIBNAME statement: This is the most common method. The
LIBNAME
statement assigns a libref (library reference) to a physical location on your file system. The libref is a shorthand name you’ll use to refer to the library in your SAS code.
LIBNAME mylib 'C:\MySASData'; /* For Windows */
LIBNAME mylib '/home/user/MySASData'; /* For Linux/macOS */
Replace 'C:\MySASData'
or '/home/user/MySASData'
with the actual path to the directory where you want to store your datasets. The directory must exist before you run the LIBNAME
statement.
Using SAS Enterprise Guide or SAS Studio: These graphical user interfaces provide menus and dialogs to create libraries without writing code. Consult your specific software’s documentation for details.
Using the
PROC CATALOG
procedure: This procedure provides more advanced options for managing libraries, including creating libraries with specific permissions and attributes.
Managing Libraries #
Managing libraries involves tasks such as:
- Listing library contents: The
PROC CONTENTS
procedure lists the datasets within a specified library.
PROC CONTENTS DATA=mylib.mydata;
RUN;
This would show the contents of the mydata
dataset in the library mylib
. Omitting the dataset name will list all datasets in the library.
- Deleting datasets: Use the
DELETE
statement within aPROC DATASETS
to delete datasets from a library.
PROC DATASETS LIBRARY=mylib;
DELETE mydata;
QUIT;
- Renaming datasets: Use the
PROC DATASETS
to rename datasets within a library.
PROC DATASETS LIBRARY=mylib;
CHANGE mydata=new_mydata;
QUIT;
- Deleting Libraries: To delete a library you must first remove all datasets within the library and then remove the library reference. Note that this does not delete the physical directory on your hard drive.
Accessing Data in Libraries #
Once a library is created and a libref is assigned, you can easily access datasets within that library by using the libref followed by a period (.) and the dataset name.
- In DATA steps:
DATA new_dataset;
SET mylib.mydata;
RUN;
This reads data from the mydata
dataset in the mylib
library.
- In PROC steps:
PROC PRINT DATA=mylib.mydata;
RUN;
This prints the contents of the mydata
dataset in the mylib
library.
- In PROC SQL:
PROC SQL;
SELECT * FROM mylib.mydata;
QUIT;
This selects all data from the mydata
dataset in the mylib
library using PROC SQL. Using librefs makes your code more portable and easier to manage, as you only need to change the LIBNAME
statement to point to a different directory if you move or copy your data. Avoid hardcoding file paths directly into your code.
Advanced SAS Concepts (Brief Overview) #
Macros #
SAS macros are powerful tools that allow you to automate repetitive tasks and create reusable code blocks. A macro is essentially a piece of code that can be called multiple times with different parameters. They enhance code modularity and readability, especially when dealing with similar operations on multiple datasets or variables.
Basic Structure:
A macro definition begins with a %MACRO
statement, followed by the macro code, and ends with a %MEND
statement. Macros are invoked using a %
symbol followed by the macro name.
%MACRO mymacro(param1, param2);
/* Macro code here, using ¶m1 and ¶m2 */
%MEND mymacro;
%mymacro(value1, value2); /* Invoking the macro */
Macros are particularly useful for creating loops, generating code dynamically, and performing conditional processing within SAS programs. They can significantly reduce the amount of code required for complex tasks and improve maintainability. However, mastering macro programming involves a steeper learning curve compared to basic DATA step programming.
Arrays #
SAS arrays provide a way to store and manipulate multiple variables as a single unit. This allows for efficient processing of variables with similar characteristics, avoiding repetitive code. Arrays can be used within DATA steps to perform calculations, manipulations, or conditional processing on groups of variables simultaneously.
Example:
DATA mydata;
INPUT var1 var2 var3 var4;
ARRAY vars(4) var1-var4;
DO i = 1 TO 4;
sum_vars + vars(i);
END;
DATALINES;
1 2 3 4
5 6 7 8
;
RUN;
This code creates an array vars
containing four variables (var1
to var4
) and calculates their sum. Arrays greatly simplify the process of working with many variables that share a common structure or purpose, making code more concise and efficient.
Formats and Informats #
Formats control how data is displayed, while informats determine how data is read into SAS. They are crucial for ensuring data is displayed appropriately, converting data types, and providing user-friendly output. SAS provides many pre-defined formats and informats, but you can also create custom ones to meet specific needs.
Example:
The DATE9.
format displays dates in the YYYYMMDD format. You can create a custom format to display dates in a different style, or to represent numeric codes with descriptive labels.
PROC FORMAT;
VALUE genderfmt 1='Male' 2='Female';
RUN;
DATA mydata;
INPUT gender code;
format gender genderfmt.; /* Applying the custom format */
DATALINES;
1 101
2 102
;
RUN;
This creates a custom format genderfmt
mapping numeric codes to gender labels. Proper use of formats improves the readability and interpretability of your results, especially when presenting data in reports or dashboards.
Custom Functions #
SAS allows you to create your own custom functions to encapsulate specific operations or algorithms. These functions enhance code reusability and make programs easier to understand and maintain. Custom functions can be created using the PROC FCMP
procedure.
Example:
You might create a custom function to calculate a complex statistical measure, to apply a specific data transformation, or to perform a repeated sequence of operations. This enhances code organization and readability by abstracting away complex logic into smaller, more manageable units. A well-designed set of custom functions makes your code more modular and easier to debug and maintain, especially in large projects.
The creation and use of custom functions represent a more advanced stage of SAS programming, but offer significant advantages in terms of code quality and reusability.
Practical Examples and Exercises #
Example 1: Analyzing a Simple Dataset #
This example demonstrates basic data analysis using a small dataset. We’ll use a dataset containing information about students’ scores in different subjects.
Dataset (student_scores.csv):
StudentID,Math,Science,English
1,85,92,78
2,76,88,85
3,90,85,95
4,82,79,80
5,70,80,75
SAS Code:
proc import datafile="student_scores.csv"
out=student_scores
dbms=csv replace;
getnames=yes;
run;
proc means data=student_scores mean std min max;
var Math Science English;
run;
proc print data=student_scores;
run;
This code first imports the CSV data into a SAS dataset. Then, PROC MEANS
calculates descriptive statistics (mean, standard deviation, minimum, maximum) for each subject. Finally, PROC PRINT
displays the entire dataset. This example showcases data import, descriptive statistics calculation, and data display.
Example 2: Data Cleaning and Transformation #
This example shows how to clean and transform data. We’ll use a dataset with missing values and inconsistent data formats.
Dataset (sales_data.csv):
Salesperson,Region,Sales,Date
John,North,10000,2023-10-26
Jane,South,15000,2023-10-27
Peter,North,.,2023-10-28
Mary,East,12000,2023/10/29
David,South,8000,2023-10-30
SAS Code:
proc import datafile="sales_data.csv"
out=sales_data
dbms=csv replace;
getnames=yes;
run;
data clean_sales;
set sales_data;
if missing(Sales) then Sales = 0; /* Impute missing sales with 0 */
Date = input(Date,anydtdtm.); /* Standardize date format */
format Date date9.;
run;
proc print data=clean_sales;
run;
This code imports the data, imputes missing sales values with 0, standardizes the date format, and displays the cleaned data. This demonstrates data imputation and format standardization, common data cleaning tasks.
Example 3: Creating Reports Using SAS Procedures #
This example demonstrates creating reports using PROC PRINT
and PROC MEANS
with formatting options. We’ll use the student_scores
dataset from Example 1.
SAS Code:
proc print data=student_scores label;
title "Student Scores";
var StudentID Math Science English;
label Math="Math Score" Science="Science Score" English="English Score";
run;
proc means data=student_scores mean std;
var Math Science English;
output out=summary_stats mean=;
title "Summary Statistics";
run;
proc print data=summary_stats;
title "Mean and Standard Deviation";
run;
This generates a formatted report with titles and labels, showing both the raw data and summary statistics. It demonstrates how to enhance reports using titles, labels, and outputting summary statistics to a separate dataset for further manipulation or reporting.
Practice Exercises #
Data Analysis: Create a SAS program to analyze a dataset containing customer demographics (age, gender, income, location) and purchase history (product category, purchase amount, purchase date). Calculate the average purchase amount per customer, the number of purchases per product category, and the average income per region.
Data Cleaning: Obtain a dataset with missing values and outliers. Write a SAS program to handle missing values (using imputation or deletion) and identify and handle outliers (using a method of your choice).
Data Transformation: Transform a dataset by creating new variables (e.g., calculate a customer’s total expenditure, create an age group variable from age), standardizing variables, and recoding variables.
Report Generation: Create a report summarizing the results of your data analysis (from exercise 1), using appropriate SAS procedures and formatting options for clear presentation. Remember to include descriptive titles, labels, and formatted output. Consider incorporating graphs.
These exercises encourage applying the concepts learned to real-world data analysis scenarios, fostering a deeper understanding of SAS programming. Remember to consult the SAS documentation and online resources for help and further exploration.
Conclusion and Further Learning #
Key Takeaways #
This beginner’s guide has provided a foundational understanding of SAS programming and data analysis. Key takeaways include:
Fundamentals of SAS programming: You’ve learned the basic structure of SAS DATA steps, how to input, manipulate, and output data, and how to work with SAS datasets and libraries.
Essential SAS procedures: You’ve gained familiarity with core SAS procedures like
PROC PRINT
,PROC MEANS
,PROC FREQ
,PROC SQL
, andPROC SORT
, enabling you to perform common data analysis tasks.Data manipulation techniques: You’ve acquired skills in data cleaning, transformation, merging, and subsetting, crucial for preparing data for analysis.
Working with SAS libraries: You’ve learned how to manage and access data stored in SAS libraries, promoting efficient data organization and code reusability.
Advanced SAS concepts (overview): You have a basic understanding of macros, arrays, formats/informats, and custom functions, paving the way for more advanced SAS programming.
Resources for Further Learning #
To continue your SAS learning journey, explore these resources:
Official SAS documentation: The official SAS documentation is a comprehensive resource covering all aspects of the SAS system. It provides detailed explanations, examples, and tutorials.
SAS online learning: SAS offers various online courses and tutorials catering to different skill levels. These courses provide structured learning paths and hands-on exercises.
SAS communities and forums: Engage with the SAS community through online forums and communities. These platforms offer support, discussion, and opportunities to learn from experienced SAS users.
Books and publications: Numerous books and publications are available on SAS programming and data analysis. These resources provide in-depth coverage of specific topics and advanced techniques.
Online tutorials and courses (external): Several websites and online platforms offer SAS tutorials and courses, providing alternative learning pathways. Always check the credibility and up-to-dateness of these resources.
SAS Certifications #
SAS offers various certifications to validate your SAS skills and enhance your career prospects. These certifications demonstrate your proficiency in SAS and can significantly improve your job prospects. Consider pursuing certifications relevant to your career goals. Some popular SAS certifications include:
Base Programming: This certification validates fundamental SAS programming skills.
Advanced Programming: This demonstrates proficiency in advanced SAS programming techniques.
Statistical Business Analyst: This focuses on applying SAS for business analysis and reporting.
Data Mining and Machine Learning: This specialization verifies expertise in using SAS for data mining and machine learning tasks.
Obtaining SAS certifications can significantly boost your resume and demonstrates your commitment to mastering this powerful analytical tool. The specific certification path you choose will depend on your career aspirations and current skill level. Check the official SAS website for the most up-to-date information on available certifications and their requirements.