Research methods in education using SPSS (Notes)
This lecture will help you learn the basics of how to analyze your survey data using a statistical package (SPSS).
When designing data collection instruments (e.g. questionnaire), it is important to remember that the information collected will need to be processed and analysed when it is completed and returned.
The following need to be taken into consideration:
In most cases, the information contained in the questionnaire will need to be entered into a computer package which allows it to be analysed.
Before the questionnaires are entered into the package, a data template needs to be designed so that each questionnaire is entered in the same way.
Also, a coding frame will need to be developed which gives the rules for data entry.
During the design of the questionnaire, the processing and analysis need to be considered, to help them run more smoothly.
Data analysis packages
Commonly used packages include:
- R
- MATLAB
- SciPy
- Excel and other spreadsheet programs
- SAS (Statistical Analysis System)
- SPSS
- PSPP
- Stata
Opening PSPP
There is an icon on the desktop for PSPP, or it can be selected from the list of programs available.
Double-clicking the icon takes you to the opening screen.
SPSS
SPSS stands for Statistical Package for the Social Sciences. It comes in various versions.
Analysis packages
Some of these packages are simple to use, but have limitations in terms of statistical analysis unless you can use more complicated programming.
Many statisticians use SPSS, PSPP or SAS.
The Process
- Code your data
- Enter your data
- Analyze your data
Preparing a Codebook
SPSS-format data files are organized by cases (rows) and variables (columns).
Cases represent individual respondents to a survey.
Variables represent responses to each question asked in the survey.
Every response item in the questionnaire needs to be entered as a numbered code (except narrative text).
To do this, assign numbers to your responses prior to entering your data.
Use ID Numbers
Each questionnaire is given an ID number so that it can be easily identified and filed.
It is helpful to leave a place on the questionnaire so that this can be added in the same place for each questionnaire.
You may use incremental numbers such as 001, 002, 003.
Write these numbers on the corner of each paper questionnaire.
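As a small illustration, zero-padded ID numbers like those above can be generated in Python. The function name make_ids and the width of three digits are assumptions made for this sketch, not part of SPSS or PSPP.

```python
def make_ids(count, width=3):
    """Return zero-padded questionnaire IDs such as '001', '002', ..."""
    return [f"{number:0{width}d}" for number in range(1, count + 1)]


ids = make_ids(5)
print(ids)
```

Keeping the IDs as text preserves the leading zeros, so they match what is written on the paper questionnaires.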
Labeling variables
For narrow columns, use the question number, e.g. Q1, Q2, Q3, Q4.
Alternatively, use a descriptive header that captures each question’s meaning, e.g. for “Do you smoke?” the column header could be “Smokes”.
Keep track of the header you give to every question.
A good way is to take a blank questionnaire and write the header next to each question.
This is your codebook.
Rules for naming of variables
Variable names:
Must be unique (that is, each variable in a data set must have a different name);
Can only have a limited number of characters (the maximum depends on the package and version);
Must begin with a letter (not a number);
Cannot include full stops, blanks or other characters (!, ?, *, “); and
Cannot include words used as commands by PSPP/SPSS (all, ne, eq, to, le, lt, by, or, gt, and, not, ge, with).
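The naming rules above can be checked before data entry with a short script. This is only a sketch in Python, not something built into SPSS or PSPP. The function name check_variable_name, the 64-character limit, and the choice to accept only letters, digits and underscores are assumptions made for illustration.

```python
import re

# Reserved words listed in the notes above.
RESERVED_WORDS = {"all", "ne", "eq", "to", "le", "lt", "by", "or",
                  "gt", "and", "not", "ge", "with"}

# Assumption: the notes only say "a specified number of characters";
# 64 is a placeholder, so check the limit for your own version.
MAX_LENGTH = 64


def check_variable_name(name, existing_names=()):
    """Return a list of rule violations (an empty list means the name is fine)."""
    if not name:
        return ["name is empty"]
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("too long")
    if not name[0].isalpha():
        problems.append("must begin with a letter")
    # Assumption: only letters, digits and underscores are accepted here.
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        problems.append("contains a disallowed character")
    if name.lower() in RESERVED_WORDS:
        problems.append("is a reserved word")
    if name.lower() in {other.lower() for other in existing_names}:
        problems.append("is not unique")
    return problems


for candidate in ["Smokes", "Q1", "1stQ", "do you smoke?", "to"]:
    print(candidate, "->", check_variable_name(candidate) or "ok")
```

Running the check over every header in the codebook before entering data catches naming problems while they are still cheap to fix.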
Coding
Consistency
Be consistent with the use of codes.
Codes should be consistent across all questions, for example Yes=1, No=2, Don’t know=3.
This is particularly important when using scales. For example, if 1 = very satisfied and 5 = very dissatisfied for one question, it should be the same for all questions about satisfaction where this is appropriate.
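One way to keep codes consistent is to define each answer scale once and reuse it for every question that shares it. The sketch below is in Python. The question names Q1 to Q4 and the three middle satisfaction labels are hypothetical, since the notes only fix the two end points of the scale.

```python
# Each answer scale is defined once and shared by all questions that use it.
YES_NO = {"Yes": 1, "No": 2, "Don't know": 3}

# Assumption: the notes give only 1 and 5; the middle labels are invented.
SATISFACTION = {
    "Very satisfied": 1,
    "Satisfied": 2,
    "Neutral": 3,
    "Dissatisfied": 4,
    "Very dissatisfied": 5,
}

# Hypothetical coding frame: question name -> the scale it uses.
CODING_FRAME = {
    "Q1": YES_NO,
    "Q2": YES_NO,
    "Q3": SATISFACTION,
    "Q4": SATISFACTION,
}


def code_response(question, answer):
    """Look up the numeric code for an answer to a given question."""
    return CODING_FRAME[question][answer]


print(code_response("Q1", "No"))
print(code_response("Q4", "Very dissatisfied"))
```

Because Q1 and Q2 point at the same dictionary, "No" cannot be coded as 2 in one question and as 0 in another.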
Check returned questionnaires
Common problems, particularly on self-completion questionnaires, include questions being completed incorrectly, questions being missed (either accidentally or deliberately), and routing not being followed properly.
Where possible, the researcher should check the missing or incorrect information, or ask the respondent directly.
If it is not possible to correct the missing information, the coding frame should include instructions on how to handle these types of problems.
Format of the Data and Data Validation
Some packages require the format to be preset, e.g. Access and SPSS, although others allow us to enter data without setting the format, e.g. Excel and other spreadsheets.
For example, the data may be:
- Date
- Numeric
- Dollar
- String etc.
It is also possible to set ‘validation’ in some packages. This means that the package only accepts data within the correct range.
For example, it can require that the answer to a question is 1, 2, or 3, and not allow any other answer to be input.
Setting the format can therefore assist us in reducing data entry errors and keeping data consistent.
However, if validation is set on certain variables, it must still allow codes for missing data or incorrect responses.
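The idea of validation can be sketched in Python as a simple range check that still lets a missing-data code through. The names ALLOWED_CODES, MISSING_CODE and validate_entry are hypothetical, and using 999 as the missing code is an assumption made for this example.

```python
ALLOWED_CODES = {1, 2, 3}   # valid answers to the question
MISSING_CODE = 999          # assumption: code reserved for missing answers


def validate_entry(value):
    """Classify a data-entry value as 'ok', 'missing' or 'invalid'."""
    if value == MISSING_CODE:
        return "missing"
    if value in ALLOWED_CODES:
        return "ok"
    return "invalid"


entries = [1, 3, 999, 7, 2]
for row_number, value in enumerate(entries, start=1):
    print(row_number, value, validate_entry(value))
```

Here the fourth entry is flagged as invalid so that it can be checked against the paper questionnaire, while the missing code is accepted rather than rejected.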
Care should also be taken with some packages. For example, if numeric data is entered as text and sorted in SPSS, it will sort as 1, 100, 2, 200, which causes real problems with analysis.
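The same sorting problem is easy to reproduce outside SPSS. The short Python sketch below sorts the same four values first as text and then as numbers. The variable names are illustrative only.

```python
as_text = ["1", "2", "100", "200"]
as_numbers = [int(value) for value in as_text]

# Text is compared character by character, so "100" comes before "2".
sorted_text = sorted(as_text)
# Numbers are compared by size.
sorted_numbers = sorted(as_numbers)

print(sorted_text)
print(sorted_numbers)
```

The text version comes out in the order 1, 100, 2, 200, while the numeric version comes out as 1, 2, 100, 200.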
Coding open ended questions
For open-ended questions (where the respondents can provide their own answers), coding is slightly more complicated.
Example: What do you think are the effects of smoking?
To code responses to this question, you will need to go through the questionnaires and look for common themes.
You might notice that many respondents give effects such as being avoided, loss of money, health problems, etc.
You can then list these major groups of responses under one variable name, say “effects”, and assign a number to each item (being avoided = 1, loss of money = 2, health problems = 3, etc.).
You may also add another code for responses that do not fall into the listed categories (call this Other = 4).
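Once the researcher has read each answer and decided which theme it belongs to, turning the theme into a number is a simple lookup. The Python sketch below uses the codes from the smoking example. The function name code_effect is hypothetical, and deciding the theme of a free-text answer remains a human judgement that the code does not attempt.

```python
# Codes taken from the example above.
EFFECT_CODES = {
    "being avoided": 1,
    "loss of money": 2,
    "health problems": 3,
}
OTHER_CODE = 4  # for themes that fit none of the listed categories


def code_effect(theme):
    """Return the numeric code for a theme, or the 'Other' code if unlisted."""
    return EFFECT_CODES.get(theme.strip().lower(), OTHER_CODE)


themes = ["Health problems", "loss of money", "bad smell"]
print([code_effect(theme) for theme in themes])
```

If many answers end up coded as Other, that is usually a sign that another theme should be added to the list.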
Analysis
Now that the data are entered, you are ready to analyze.
Think about what you would like to do with your results:
Who will read or use the data?
What do they want to know?
What type of analysis will they want? What will be of most interest?
Will you want charts or graphs to illustrate your findings?
Frequencies and Percents
Data Editor
The Data Editor displays the contents of the active data file. The information in the Data Editor consists of variables and cases.
In Data View, columns represent variables, and rows represent cases (observations).
In Variable View, each row is a variable, and each column is an attribute that is associated with that variable.
Variables are used to represent the different types of data that you have compiled.
A common analogy is that of a survey.
The response to each question on a survey is equivalent to a variable.
Variables come in many different types, including numbers, strings, currency, and dates.
Ways to Enter the Data
Variable types
The types which we will consider are:
Numeric
This type will be used for any number data.
Date
This type is used for entering dates or dates and times.
String
This type can be used for any other data.
Variable Size
We define the variable size using
Width
This is the total width of the field for numeric or string types. It is not used for dates.
Decimals
This is only used for numeric types and is the number of decimal places which should be included in the width.
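How width and decimals interact can be seen with an analogy in Python, whose number formatting also takes a total width and a number of decimal places. This is only an analogy and not SPSS or PSPP syntax. The function name format_numeric is hypothetical.

```python
def format_numeric(value, width, decimals):
    """Format a number in a field of `width` characters with `decimals` places."""
    return f"{value:{width}.{decimals}f}"


# Width 8 with 2 decimals: the decimal point and both decimal digits
# count towards the 8 characters.
short_value = format_numeric(3.14159, 8, 2)
full_value = format_numeric(12345.678, 8, 2)

print(repr(short_value))
print(repr(full_value))
```

With a width of 8 and 2 decimals, the largest value that fits without overflowing the field has five digits before the decimal point.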
Entering Data Directly (Define the Variable)
Variable labels
Variable names are typically short and indicate what the variable is to be used for.
PSPP allows us to use variable labels which can be longer and give more information about the variable. E.g. we might have a variable called ‘Age’ and give it a label of ‘Age at onset of disease’ so that we know exactly what ‘Age’ means in our study.
Define the Variable - Name
Define the Variable - Type
Define the Variable - Labels
Variable Values
SPSS provides value labels so that, if we choose to code a particular variable, we can give a description of each code within the SPSS file.
Unfortunately, the package does not use this information to help when entering the data.
It does use the labels in the analysis reports so that all values can be explained.
Define the Variable (Column Format)
Align
Align is used within a variable definition to determine how it is to be displayed on the screen.
The information may be aligned
Left
So that it is in the leftmost part of the field
Right
So that it is in the rightmost part of the field
Center
So that it is exactly in the middle of the field
Missing
Missing or invalid data are generally too common to ignore.
Survey respondents may refuse to answer certain questions, may not know the answer, or may answer in an unexpected format.
If you don't filter or identify these data, your analysis may not provide accurate results.
Define the Variable (Missing Values)
For numeric data, empty data fields or fields containing invalid entries are converted to system missing, which is identifiable by a single period.
The reason a value is missing may be important to your analysis.
For numeric variables
Select Discrete missing values.
Type 999 in the first text box and leave the other two text boxes empty.
However, unlike numeric variables, empty fields in string variables are not designated as system missing.
Rather, they are interpreted as an empty string.
Select Discrete missing values.
Type NR in the first text box.
NR stands for No Response
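To see why missing codes must be declared, the Python sketch below compares a mean computed with and without the 999 code. The sample ages are invented for illustration, None stands in for a system-missing (blank) field, and the function name valid_values is hypothetical.

```python
NUMERIC_MISSING = 999   # discrete missing value for numeric variables
STRING_MISSING = "NR"   # discrete missing value for string variables


def valid_values(values, missing_code):
    """Drop blanks (None) and the declared missing code."""
    return [v for v in values if v is not None and v != missing_code]


ages = [23, 999, 31, None, 45]          # invented sample data
answers = ["Yes", "NR", "No", "Yes"]    # invented sample data

valid_ages = valid_values(ages, NUMERIC_MISSING)
valid_answers = valid_values(answers, STRING_MISSING)

# Mean with the missing code wrongly treated as a real age.
entered = [v for v in ages if v is not None]
naive_mean = sum(entered) / len(entered)
# Mean after the missing code has been excluded.
correct_mean = sum(valid_ages) / len(valid_ages)

print(naive_mean, correct_mean)
print(valid_answers)
```

Treating 999 as a real age pushes the mean from 33.0 up to 274.5, which is the kind of error that declaring missing values prevents.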
Measurements: Nominal
Categorical. Data with a limited number of distinct values or categories (for example, gender or marital status).
A nominal level is one where the variable values do not have a natural ranking, e.g. names of countries.
Categorical data where there is no inherent order to the categories. For example, a job category of sales is not higher or lower than a job category of marketing or research.
Measurements: Ordinal
An ordinal level is one where the variable values have a natural order but differences between values are not meaningful, e.g. importance of political position coded “low”, “medium”, and “high”.
Categorical data where there is a meaningful order of categories, but there is not a measurable distance between categories.
For example, there is an order to the values high, medium, and low, but the "distance" between the values cannot be calculated.
Measurements: Scale
A scale level is one where the differences between variable values are comparable, e.g. age in years.
Data measured on an interval or ratio scale, where the data values indicate both the order of values and the distance between values.
For example, a salary of $72,195 is higher than a salary of $52,398, and the distance between the two values is $19,797.
SPSS can suggest measurements for variables.
The suggested measurement level is based on empirical rules and is not a substitute for user judgment.
SPSS uses measurement level in some cases to determine whether the variable defines categories in a table or graph or is to be summarized.
Summary measures
For categorical data, the most typical summary measure is the number or percentage of cases in each category.
The mode is the category with the greatest number of cases.
For ordinal data, the median (the value at which half of the cases fall above and below) may also be a useful summary measure if there is a large number of categories.
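The summary measures for categorical and ordinal data can be sketched with the Python standard library. The responses below are invented satisfaction codes (1 to 5), and the function name frequency_table is hypothetical. The median is taken with median_low so that the result is always one of the actual categories.

```python
from collections import Counter
from statistics import median_low


def frequency_table(codes):
    """Map each code to a (count, percentage) pair, in code order."""
    counts = Counter(codes)
    total = len(codes)
    return {code: (count, 100 * count / total)
            for code, count in sorted(counts.items())}


satisfaction = [1, 2, 2, 3, 2, 1, 3, 2, 5, 4]  # invented sample data
table = frequency_table(satisfaction)

for code, (count, percent) in table.items():
    print(f"{code}: {count} cases ({percent:.1f}%)")

# The mode is the category with the greatest number of cases.
mode = max(table, key=lambda code: table[code][0])
# For ordinal codes the median is the middle value of the sorted responses.
middle = median_low(satisfaction)
print("mode:", mode, "median:", middle)
```

The median only makes sense here because the satisfaction codes are ordered; for a nominal variable such as country, only the counts, percentages and mode would be reported.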
Scale variables
There are many summary measures available for scale variables, including:
Measures of central tendency.
The most common measures of central tendency are the mean (arithmetic average) and median (value at which half the cases fall above and below).
Measures of dispersion.
Statistics that measure the amount of variation or spread in the data include the standard deviation, minimum, and maximum.
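For a scale variable, the same measures can be computed with the Python statistics module. The ages below are invented for illustration. The standard deviation shown is the sample standard deviation (n - 1 in the denominator), which is an assumption about which version is wanted.

```python
from statistics import mean, median, stdev

ages = [21, 25, 25, 30, 34]  # invented sample data

# Measures of central tendency.
average = mean(ages)
middle = median(ages)

# Measures of dispersion.
spread = stdev(ages)      # sample standard deviation (n - 1 denominator)
lowest = min(ages)
highest = max(ages)

print(f"mean={average}, median={middle}")
print(f"sd={spread:.2f}, min={lowest}, max={highest}")
```

In this sample the mean (27) is slightly above the median (25) because the two oldest respondents pull the average upwards.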
Cross tabulation
The purpose of a cross tabulation is to show the relationship (or lack thereof) between two variables.
A number of tests are available to determine if the relationship between two cross tabulated variables is significant.
One of the more common tests is Chi square.
One of the advantages of Chi square is that it is appropriate for almost any kind of categorical data, provided the expected number of cases in each cell of the table is not too small.
The significance value has the information we're looking for.
The lower the significance value, the less likely it is that the two variables are independent (unrelated).
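The calculation behind the test can be sketched in plain Python. This computes the Pearson Chi square statistic without any continuity correction, so it may not match every figure in an SPSS output table. The counts are invented, the function names are hypothetical, and the significance value is computed only for a 2 x 2 table (one degree of freedom), where it reduces to a formula available in the standard library.

```python
import math


def chi_square(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees = (len(row_totals) - 1) * (len(column_totals) - 1)
    return statistic, degrees


def significance_df1(statistic):
    """Significance value for a Chi square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Invented cross tabulation: rows = smokes (yes, no), columns = two groups.
observed = [[30, 10],
            [20, 40]]
statistic, degrees = chi_square(observed)
significance = significance_df1(statistic)

print(f"chi-square={statistic:.3f}, df={degrees}, sig={significance:.5f}")
```

For these invented counts the significance value is far below 0.05, so the two variables would be judged to be related. A table in which every cell held the same count would give a statistic of 0 and a significance value of 1.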