Chapter 8: Validating and Cleaning Data
- Data errors occur when data values are not appropriate for the SAS statements that are specified in a program. SAS detects data errors during program execution.
- The `freq` procedure can show whether any genders are not `F` or `M` and whether any countries are not `AU` or `US`.
- The `means` procedure can show whether any salaries fall outside the range of 24000 to 500000.
- The `univariate` procedure can also show whether any salaries fall outside the range of 24000 to 500000.

```sas
data work.nonsales;
    /* Salary is given an explicit numeric length so that it is not declared as character along with Job_Title */
    length Employee_ID 8 First $ 12
           Last $ 18 Gender $ 1
           Salary 8 Job_Title $ 25
           Country $ 2 Birth_Date
           Hire_Date 8;
    infile 'nonsales.csv' dlm=',';
    input Employee_ID First $ Last $
          Gender $ Salary Job_Title $
          Country $ Birth_Date :date9.
          Hire_Date :date9.;
    format Birth_Date Hire_Date ddmmyy10.;
run;

/* List observations with a missing job title or a birth date after the hire date */
proc print data=work.nonsales;
    var Employee_ID Job_Title Birth_Date Hire_Date;
    where Job_Title = ' ' or Birth_Date > Hire_Date;
run;

/* Check for unexpected Gender and Country values */
proc freq data=work.nonsales;
    tables Gender Country;
run;

/* Check the Salary range and the number of missing values */
proc means data=work.nonsales n nmiss min max;
    var Salary;
run;

/* Check the extreme Salary observations */
proc univariate data=work.nonsales;
    var Salary;
run;
```

During the processing of every `data` step, SAS automatically creates the following temporary variables:
- The `_N_` variable, which counts the number of times the `data` step begins to iterate.
- The `_ERROR_` variable, which signals the occurrence of an error caused by the data during execution. A value of 0 indicates that no error exists; a value of 1 indicates that one or more errors occurred.
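A minimal sketch of how the two automatic variables behave, assuming a small made-up data set: the name `work.check` and the three data lines are hypothetical and only serve the illustration. The second line carries a non-numeric salary, so SAS sets `Salary` to missing and `_ERROR_` to 1 for that iteration.

```sas
data work.check;   /* hypothetical data set name */
    input Employee_ID Salary;
    /* _N_ and _ERROR_ exist only during execution and are not written to work.check */
    if _ERROR_ = 1 then putlog 'Data error on iteration ' _N_;
    datalines;
120101 26960
120102 abc
120103 24015
;
run;
```

The log should show the message for iteration 2 only, alongside the note about invalid data for `Salary` that SAS writes on its own.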
- Which statement best describes the invalid data? Answer: (b)
  - (a) The data in the raw data file is bad.
  - (b) The programmer incorrectly read the data.
To write a SAS date constant, enclose a date in quotation marks in the form `ddmmmyyyy` (two-digit day, three-letter month abbreviation, four-digit year) and immediately follow the final quotation mark with the letter `d`. Example: January 1, 1974 is `'01JAN1974'd`.

```sas
proc print data=orion.nonsales;
    var Employee_ID Birth_Date Hire_Date;
    where Hire_Date < '01JAN1974'd;
run;
```

The `freq` procedure produces one-way to n-way frequency tables.
- The `tables` statement specifies the frequency tables to produce. Without it, `proc freq` produces a one-way frequency table for each variable in the data set.
- The `nlevels` option displays a table that provides the number of distinct values for each variable named in the `tables` statement.

```sas
proc freq data=orion.nonsales nlevels;
    tables Gender Country Employee_ID;
run;
```
- The `means` procedure produces summary reports displaying descriptive statistics.
  - The `var` statement specifies the analysis variables and their order in the results.
  - By default, the `means` procedure creates a report with `n`, `mean`, `stddev`, `min`, and `max`.

```sas
proc means data=orion.nonsales n nmiss min max;
    var Salary;
run;
```

- The `univariate` procedure produces summary reports displaying descriptive statistics.
  - The `var` statement specifies the analysis variables and their order in the results.
  - Without the `var` statement, SAS analyzes all numeric variables.

```sas
proc univariate data=orion.nonsales;
    var Salary;
run;
```
- Interactively cleaning data: the `Viewtable` window enables you to browse, edit, or create SAS data sets interactively.
- Programmatically cleaning data: the `data` step can be used to clean invalid data programmatically.
- The assignment statement evaluates an expression and assigns the resulting value to a variable. Its general form is `variable = expression;`, for example:

```sas
Salary = 26960;
Hire_Date = '21JAN1995'd;
Country = upcase(Country);
```
The `if-then-else` statement executes a SAS statement for observations that meet specific conditions.

```sas
data work.clean;
    set orion.nonsales;
    Country=upcase(Country);
    if Employee_ID=120106 then Salary=26960;
    else if Employee_ID=120115 then Salary=26500;
    else if Employee_ID=120191 then Salary=24015;
    else if Employee_ID=120107 then Hire_Date='21JAN1995'd;
    else if Employee_ID=120111 then Hire_Date='01NOV1978'd;
    else if Employee_ID=121011 then Hire_Date='01JAN1998'd;
run;
```

- What are the two phases of DATA step processing?: Compilation and execution
- What is a program data vector (PDV)?: A logical area in memory where SAS holds the current observation
- What is an instruction that SAS uses to read data values into a variable?: An informat
- When would you use a : modifier?: You use a : modifier with nonstandard raw data that requires list input and an informat
Chapter 9: Manipulating Data
- If an operand of an arithmetic operator is missing, the result is missing. Example: if `var1 = .` and `var2 = 10`, then `num = var1 + var2 / 2` sets `num` to `.` (missing).
- `sum`: returns the sum of all arguments, ignoring missing values.
- `year`, `qtr`, `month`, `day`, `weekday`: extract pieces from a SAS date.
- `today()`: returns the current date as a SAS date value.
- `mdy(month, day, year)`: returns a SAS date value built from the given month, day, and year. Example: `AnnivBonus=mdy(month(Hire_Date),15,2008);`
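The difference between the `+` operator and the `sum` function when a value is missing can be sketched as follows. This is an illustrative `data _null_` step that reuses the `var1` and `var2` values from the example above; the names `num_plus` and `num_sum` are made up for the comparison.

```sas
data _null_;
    var1 = .;
    var2 = 10;
    num_plus = var1 + var2 / 2;      /* missing operand, so the result is missing */
    num_sum  = sum(var1, var2 / 2);  /* sum() ignores the missing argument */
    putlog num_plus= num_sum=;       /* writes: num_plus=. num_sum=5 */
run;
```

This is why `Compensation=sum(Salary,Bonus)` still yields a value when one of the two arguments is missing, whereas `Salary + Bonus` would not.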
Given the following code, are the correct results produced when the `drop` statement is placed after the `set` statement?

```sas
data work.comp;
    set orion.sales;
    drop Gender Salary Job_Title Country Birth_Date Hire_Date;
    Bonus=500;
    Compensation=sum(Salary,Bonus);
    BonusMonth=month(Hire_Date);
run;
```

Yes. The `drop` statement only specifies the names of the variables to omit from the output data set, so `Salary` and `Hire_Date` are still available for the calculations in the `data` step.
- The `drop` and `keep` statements select variables after they are brought into the program data vector.

Alternatives to the `drop` and `keep` statements are the `drop=` and `keep=` data set options. They can be placed on the output data set in the `data` statement to control which variables are written, or on the input data set in the `set` statement to control which variables are read.

```sas
data work.comp(drop=Salary Hire_Date);
    set orion.sales(keep=Employee_ID First_Name Last_Name Salary Hire_Date);
    Bonus=500;
    Compensation=sum(Salary,Bonus);
    BonusMonth=month(Hire_Date);
run;
```

Multiple executable statements are allowed in an `if-then` / `else` statement when they are grouped with `do` and `end`.

```sas
data work.bonus;
    set orion.sales;
    length Freq $ 12;
    if Country='US' then do;
        Bonus=500;
        Freq='Once a Year';
    end;
    else do;
        Bonus=300;
        Freq='Twice a Year';
    end;
run;
```

`if-then delete`: an alternative to the subsetting `if` statement is the `delete` statement in an `if-then` statement. `if BonusMonth ne 12 then delete;` is equivalent to `if BonusMonth = 12;`.
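The equivalence of the two forms can be checked with a short sketch. It reuses `orion.sales` and `Hire_Date` from the examples above; the output names `work.december_a` and `work.december_b` are hypothetical.

```sas
/* Version 1: explicitly delete observations that are not wanted */
data work.december_a;   /* hypothetical output name */
    set orion.sales;
    BonusMonth = month(Hire_Date);
    if BonusMonth ne 12 then delete;
run;

/* Version 2: the subsetting IF continues only when the condition is true */
data work.december_b;   /* hypothetical output name */
    set orion.sales;
    BonusMonth = month(Hire_Date);
    if BonusMonth = 12;
run;

/* Compare the two results; no differences are expected */
proc compare base=work.december_a compare=work.december_b;
run;
```

Both versions also drop observations where `BonusMonth` is missing, because a missing value is not equal to 12.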
Chapter 10: Combining SAS Data Sets
1.