Guide to Basic Statistical Analysis with SPSS
David B. Marshall
November 2004

Here it is: a GUIDE written by David, with highlights, underlining, and a few amendments by Helena. To really LEARN what's in here, read each paragraph with the survey data set in front of you and work out how that paragraph applies to the actual data. Questions? E-mail either of us. Helena

First: Examine

First, examine your data: what types of Variables do you have in your data set? How are the data distributed across the range of values of the variables?

The three general types of variables are Nominal, Ordinal, and Scale. The tests and analyses you can perform will depend on which type(s) of variable(s) you have. (Variable types are set by clicking on the Variable View tab at the bottom left when you have a data set loaded into the SPSS Data Editor, then setting the type in the column on the far right, called Measure.)

Scale variables, also called Interval variables, are numerical variables in a regular sequence with equal intervals between points, such as distance, time, and temperature. Scale variables are relatively uncommon in social science, but age is often treated as a scale variable. (Other common physiological scale variables are height, weight, blood pressure, etc.)

Ordinal variables are those with an ordered sequence of numerical values, but where the interval between values is not necessarily equal. In other words, a higher value of an ordinal variable indicates more of something than a lower value, but how much more cannot be quantitatively defined. The most common examples of this in social science are psychological variables as represented by a Likert scale, for instance, degrees of satisfaction (very unsatisfied, unsatisfied, neutral, satisfied, very satisfied). While it is clear that someone marking "unsatisfied" on a survey item is less satisfied than someone marking "very satisfied," we cannot say that the amount of the difference in satisfaction is more, less, or the same as the difference in satisfaction between someone marking "very unsatisfied" and someone marking "satisfied." So, while we use numbers to represent the points on the satisfaction scale, such as -2 for "very unsatisfied" up to +2 for "very satisfied," the numbers merely represent the order of values, not the distance or difference between them.

Nominal variables, also called categorical variables, are those where there is no quantitative meaning whatsoever, neither distance nor order, such as gender, race, geographical region, etc. While we often use numbers as labels for these categories, this is just for convenience. The different categories of nominal variables can thus be represented in data sets by strings (words or letters) or by numbers (e.g., F or 1 for females, M or 2 for males in a gender variable). To run certain tests, such as the nonparametric tests for independent samples, we sometimes need to label the values of nominal variables with sequential integers (e.g., 1 for Asians, 2 for Native Americans, etc.), but be careful not to later confuse these number-labeled nominal variables with a scale or ordinal variable in a particular test. For this reason, it is especially important to define the variable type correctly in your data set.

Once you have determined or set correctly the variable types, you need to examine the Distributions of each variable you are interested in exploring.
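(If you prefer typing commands to clicking through menus, the Measure setting just described can also be assigned with SPSS syntax, at least in recent versions of the program. The following is only a sketch with made-up variable names: age, satisf, and gender stand in for variables in your own data set.)

    * Hypothetical variable names; assigns the Measure type for each.
    VARIABLE LEVEL age (SCALE) /satisf (ORDINAL) /gender (NOMINAL).

These are the same settings you would otherwise make in the Measure column of Variable View.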
There are a variety of display tools available in SPSS, but the best thing to do first is just to look at the raw distribution of the values, that is, how many cases in your data set have each particular value of a variable: the frequencies. You can print out the frequencies for any variable by clicking on the SPSS menus at the top of the Data Editor: Analyze → Descriptive Statistics → Frequencies. In the Frequencies dialog box, highlight the variables of interest, click the right arrow to move them into the Variable(s) box, then click OK. A table of frequencies and percentages should appear in the Output viewer. Another very useful thing to do is to view the bar charts of these frequencies as well, which you can do by clicking the Charts button at the bottom of the Frequencies dialog box, making sure the Bar charts radio button is selected, then clicking Continue, then OK.

If you want to explore the distributions of your variables further, play with the options in (menu choices) Analyze → Descriptive Statistics → Descriptives, and Analyze → Descriptive Statistics → Explore. The Explore dialog is particularly useful, because it allows you to numerically and visually compare the distributions of variables you put into the Dependent List among different subsets of your data, defined by the variables you put into the Factor List (for instance, the difference in the values of a Dependent variable such as a satisfaction question between different categories of a factor variable such as gender).

Another thing you will probably want to do is modify some of your variables or construct new ones. A simple and common example is, having collected the age of respondents, to construct an age group variable: to put people into age buckets based on their reported age. To create a new variable based on the values of an existing variable, you can use the menu Transform → Recode → Into Different Variables. The existing variable is the Input variable that you move into the dialog box, and then you name your Output variable (remember: eight or fewer characters, no spaces, for SPSS variable names) and click Change. You should also enter a more descriptive Label. Clicking on Old and New Values takes you to a dialog box where you tell the program how to translate the old values into the new ones. You need to click Add to enter each choice, then Continue to return to the previous box, then OK to actually create the new variable. If you want to simply change the values of an existing variable, such as recoding missing values to an index number so you can include them in a later comparison, this can be done through Transform → Recode → Into Same Variables. This recoding procedure is also useful if you need to change the values of nominal variables from letters to consecutive integers for purposes of using the nominal variable as a Grouping Variable in a statistical test (see below).

Much of the value of social science data analysis is in the simple reporting of frequencies from the data that you have collected: nothing fancier is needed to provide much food for thought. This is another (and the most important) reason you should first conduct a thorough examination of your data, especially the frequencies of responses for the variables.

Then: Determine Distributions

One of the most important conclusions or judgments you need to make after examining your variables is their distribution: the way in which the data values are distributed among all the possible values of the variables.
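Before moving on, here is a rough sketch of the frequency and recoding steps described in the previous section as SPSS syntax (roughly the kind of commands the Paste button in each dialog box generates). The variable names q1, q2, age, and agegrp are placeholders; substitute names from your own data set.

    * Frequency tables and bar charts for two hypothetical survey items.
    FREQUENCIES VARIABLES=q1 q2
      /BARCHART FREQ.

    * Recode a hypothetical age variable into a new age-group variable.
    RECODE age (LO THRU 29=1) (30 THRU 44=2) (45 THRU 64=3) (65 THRU HI=4) INTO agegrp.
    VARIABLE LABELS agegrp 'Age group'.
    EXECUTE.

The bar charts produced by the FREQUENCIES command are also the simplest way to eyeball the shape of each distribution, which is exactly the judgment discussed next.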
Most of the commonly used statistical tests, such as the t-test of the difference in means of two samples, assume that the variables follow a normal distribution, the familiar bell curve. (A sample in our context is a subset of the data, as defined by the categories of a factor variable such as gender: the two samples are the sample of all males in the data set and the sample of all females in the data set. You might want to test, for example, whether the mean satisfaction on a particular question is significantly higher or lower for females than for males.) If your variables are reasonably, or even rather vaguely, normal, that is, the bar chart display for the variable shows a single hump that tails off to lower frequencies on either side, then it is best to use the statistical tests designed for normal distributions, such as the t-test or the multiple-sample version of the t-test called One-Way ANOVA. This is because, by making the assumption that the distribution is normal, the statistical test can detect smaller real differences between samples than tests that do not make any assumptions about the shape of the distribution. In statistical jargon, the normal distribution tests have higher power to detect differences.

The problem, however, is that very frequently in social science data, particularly for Likert-scale questions, the distributions are far from normal: for instance, the highest frequency is observed for a value at or near one of the ends of the scale, instead of at or near the middle of the scale. Even worse, there can be two humps in the data: higher frequencies at or near both ends of the scale, with lower frequencies at or near the middle. When either of these situations occurs, the tests that assume normal distributions will, in general, give incorrect results. In these cases, it is better to use the nonparametric tests described below: these tests are called nonparametric because they do not assume that the distributions can be simply described by parameters such as the mean and variance. These tests sacrifice some power, i.e., they are less able to detect small differences between samples and declare them statistically significant, in favor of being more robust: they are less likely to give incorrect results than tests that assume normal distributions. There is a wide variety of other tests that have been developed for situations where the distribution is known but not normal (e.g., Poisson, gamma, etc.); there is also a series of tests you can perform to help you decide whether a variable's distribution is normal or something else, but these tests are well beyond the scope of this basic guide. The best simple thing you can do is look at your variable distributions, and if the frequencies are highest anywhere but in or near the middle of the value range for the variable, assume that the distribution is not normal and use the nonparametric tests.

Next: Compare

Once you have examined your variables, assured yourself of their appropriate types, and made judgments about which ones are normally distributed, you're ready to make comparisons. These can be between variables for all the cases in your data set (such as: with which one of a number of services are people the most or least satisfied?), or they can be between subsets of your data set (such as: are males more or less satisfied with a particular service than females?). For simple comparisons without making statistical tests of significance, it does not matter so much how the variables are distributed.
For instance, you can still compare the means of a series of satisfaction items, even though the items are not normally distributed. The mean will still give you a simple, single-number description of the responses to the item. (If the number of variable values or categories is particularly large, or if the variable really is a scale variable, then you will probably also want to examine and report the median as well as the mean. For highly skewed variables, those with a high frequency towards one end of the scale and a long tail out towards the other end, the median is a more realistic single-number description of the most typical value of the distribution.) You can make these basic descriptions and comparisons using the menu choices you've already used to examine your variables: Analyze → Descriptive Statistics → Frequencies/Descriptives/Explore.

Another very useful and common comparison tool, for comparing two or more variables that do not have more than five to ten possible values or categories, is a Crosstab(ulation), accessed through Analyze → Descriptive Statistics → Crosstabs. This procedure tabulates the values of one or more Row(s) variable(s) versus the values of one or more Column(s) variable(s). By choosing options through the Cells button, you can display percentages as well as the raw frequencies and, for instance, determine the percentage of males and females for each race in your data set, or the percentages of males and females that responded "very satisfied" (and all other values) to a satisfaction question.

Lastly, Test for the SIGNIFICANCE of the patterns you think you are seeing.

Once you have made comparisons and observed some differences, it is time to make statistical tests of the observed differences. Differences between variables, and between groups on a single variable, naturally occur simply due to random fluctuations: if we flip a fair coin one hundred times, we do not expect the results to always be exactly fifty heads and fifty tails. The question is, given an observed difference between the number of heads and the number of tails, how likely is it that this difference could have been observed just due to chance? This is a huge topic which is the foundation for all of statistics, and we cannot reasonably expect to learn much about it in this basic guide. There is a very wide variety of different types of patterns that we see in nature (including human nature), and each of these different types of patterns has its own possible mathematical description, characteristic distributions of variables, and more or less appropriate statistical tests. Most of these tests are designed to answer the question: What is the probability or likelihood that the observed pattern or correlation could have appeared simply due to chance?

{Time to reveal a dirty little secret about statistics: although statistics are also used to justify causal and associational arguments (because a correlation between variables was observed, one variable affects the other), these sorts of judgments were controversial at the beginning of statistics, and remain controversial to this day. Statistical methods are the most reliable when they are used to disprove the presence of a meaningful difference. Our minds are designed to make sense of the world, to detect patterns, and we have a tendency to over-interpret, to see patterns where none really exist.
Statistics can usually be relied upon to tell us which observed patterns could have occurred merely due to random chance; the other uses of statistics are still the subject of much research, debate, and controversy.}

So, what tests to use? What follows is a (very) simple guide to the tests that are the most commonly used on social science data. Each of these tests is an example of a bivariate test, a test involving two variables. These bivariate tests are used to investigate the basic question: Is there a significant difference in variable X for different values of variable Y? (There is a wide variety of more sophisticated multivariate tests and models, such as linear regression, logistic regression, survival analysis, etc., that investigate the question: For a number of variables X1, X2, X3, etc., which have a significant effect on the values of variable Y, to what degree, and how do they interact and influence the degree of each other's effect on Y? These multivariate models are the subject of a course or courses in statistics.) It is beyond the scope of this basic guide to describe the origins and mathematical derivations of each of the statistical tests, but all of these tests are based on various assumptions about the shape of the observed patterns, and how much this shape can be expected to vary due to random fluctuations, given the assumptions.

All the tests described below are interpreted in a similar way. Each of the tests will report a probability or significance (labeled P-value, or Sig., or Asymptotic Sig.) that the observed pattern or difference could have occurred due to random chance. This chance is never absolutely zero, but at some threshold the probability becomes so small that most people would accept that the pattern or difference is significant. For most situations, long years of custom have settled on the threshold of p = 0.05: a 5% likelihood that the observed pattern or difference could be due to chance. Therefore, you would report differences as statistically significant only if the probability value from the test was 0.05 or less. This is not a hard and fast rule: probabilities of 0.10 or even 0.15 are often reported as approaching significance, particularly if the observed difference or pattern makes theoretical sense. Social scientists typically get excited if they calculate probabilities of 0.01 or less, and really excited if they see values of 0.001 or less.

Two Scale Variables: IF comparing two scale variables, you're usually asking if there is a correlation between them: if the value for one of the variables is high, is it more or less likely that the value of the other variable will also be high (positive correlation) or low (negative correlation)? Test by: Analyze → Correlate → Bivariate. In the Bivariate Correlations dialog box, move any number of scale variables into the Variables box. If both variables are normally distributed, the default Pearson Correlation Coefficient choice is fine; if the variables have a single hump but one or both is quite skewed, the Spearman or Kendall's tau-b coefficients are more robust to deviations from a normal distribution (more trustworthy); if there is more than one peak in the distribution, Kendall's tau-b should be used. If you don't have any convincing reason why the correlation should be in one direction or another, use the default Two-Tailed Test of Significance.
If you know the possible correlation can only be in one direction, for example, the number of children a woman has can only increase (precluding fatalities) with her increasing age, select the One-Tailed Test of Significance.

One Scale and One Ordinal Variable: IF the ordinal variable has at least five possible values or categories, you can treat it as a scale variable and test a correlation as above, but use only the Kendall's tau-b correlation coefficient test. IF the ordinal variable has fewer than five values, and the scale variable is normally distributed, you can run what's called an analysis of variance, using an F-test of significance, which is the equivalent of a multiple-sample t-test: Analyze → Compare Means → One-Way ANOVA. (More sophisticated ordinal regression methods exist for this situation, but are beyond the scope of this basic guide.) The choices for the One-Way ANOVA method are complex, but just using the defaults is fine in most cases. The ANOVA test is not as sensitive to deviations from normal distributions as other tests, so you usually don't have to worry if the distribution of your scale variable is rather skewed.

One Scale and One Nominal Variable: IF the scale variable is normally distributed, use the t-test for a two-category nominal variable (Analyze → Compare Means → Independent-Samples T Test, where the nominal variable is the Grouping Variable; you Define Groups by indicating which two labels or categories of the nominal variable are to be compared in the analysis), and the One-Way ANOVA for a nominal variable with more than two categories. IF the scale variable is not normally distributed, use Analyze → Nonparametric Tests → 2 Independent Samples for the nonparametric equivalent of a t-test, with the Mann-Whitney U as the Test Type, and Analyze → Nonparametric Tests → K Independent Samples for the nonparametric equivalent of a One-Way ANOVA, with the Kruskal-Wallis H as the Test Type. For both tests, the nominal variable is the Grouping Variable, and you need to Define Groups or Define Range as the number or range of the categories you wish to compare. (You will not even be able to see the nominal variable in the list of variables if you have not coded its categories as consecutive integers; see the Transform → Recode → Into Same Variables procedure described above.)

Two Ordinal Variables: One way to compare them is to run a correlation as IF both variables were really scale variables, provided each has five or more possible values, using Kendall's tau-b as the correlation coefficient test. A better and more descriptive way is to crosstabulate the variables, using one of the ordinal-variable choices for the test statistic: in the Analyze → Descriptive Statistics → Crosstabs dialog box, click Statistics and select Gamma or Kendall's tau-b under the Ordinal test choices.

One Ordinal and One Nominal Variable: IF the number of possible values for the ordinal variable is five or higher, you can get away with treating the ordinal variable as if it were a scale variable, but use only the nonparametric tests described above, not the t-test or One-Way ANOVA. IF the number of possible values is less than five, treat the variable as nominal, but pay attention to the crosstabulation to see if there is indeed a more or less consistent trend: an increase or decrease in the percentage of a particular category of the nominal variable as the values of the ordinal variable increase. Use the Nominal by Interval test in the Crosstabs Statistics dialog box.
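For reference, here is a rough syntax sketch of the tests described in this section, again the sort of commands the Paste button in each dialog produces. The variable names are placeholders (age and income as scale variables, satisf as a satisfaction item, agegrp as defined earlier, gender coded 1 and 2, race coded 1 through 4); substitute your own.

    * Correlation between two scale variables (Pearson is the default).
    CORRELATIONS /VARIABLES=age income /PRINT=TWOTAIL.

    * Rank-based correlations (Spearman and Kendall's tau-b) for skewed or ordinal data.
    NONPAR CORR /VARIABLES=age satisf /PRINT=BOTH TWOTAIL.

    * t-test of a normally distributed variable between two groups.
    T-TEST GROUPS=gender(1 2) /VARIABLES=satisf.

    * One-Way ANOVA of a normally distributed variable across several groups.
    ONEWAY satisf BY agegrp.

    * Nonparametric equivalents: Mann-Whitney U and Kruskal-Wallis H.
    NPAR TESTS /M-W=satisf BY gender(1 2).
    NPAR TESTS /K-W=satisf BY race(1 4).

In every case, the number to look at in the output is the Sig. (or Asymp. Sig.) value, interpreted against the 0.05 threshold discussed earlier.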
Two Nominal Variables: Use Analyze → Descriptive Statistics → Crosstabs, and the Chi-Square statistic. The Chi-Square statistic determines whether the observed frequencies in each cell of the table are reasonably consistent with the frequencies expected if there were no difference. For example, if the entire data set contains 60% females and 40% males, then you would expect, just based on random chance, that about 60% of those saying they were "very satisfied" on a particular satisfaction question would be female and 40% male, and the same roughly 60-40 split for "satisfied," "dissatisfied," etc. (This is the assumption that there is no significant gender difference in satisfaction.) The Chi-Square significance is the probability that the observed variation was due to chance, so again, when that probability is 0.05 or less, we call the results significant.

Warning: the Chi-Square and other nominal tests fall apart and are unreliable unless there is an expected count of 5 or more in most of the cells in the table. If the number of categories is large in one or both of the nominal variables, this condition may be violated and the test results useless: SPSS prints a warning when this happens, so pay attention. To obtain a reliable test, you may have to do some variable collapsing (combining many categories into fewer categories, using Transform → Recode → Into Different Variables to create a new, collapsed variable). You should also be aware that a calculated probability of 0.05 or less only indicates that there is a significant difference somewhere in the crosstabulation. Even if you do not get a warning that the test may be invalid, you may want to collapse categories of nominal variables to zero in on a particular comparison that you wish to test.

While we did not do this, there is another trick that can often be played, particularly if you have a series of questions that ask the same thing about a variety of items: degree of satisfaction with a list of different services, etc. You can create an index variable by simply adding up the responses for each case on each of the satisfaction items. The resulting index is often close enough to a normally distributed variable that the normal-distribution tests can be used. The loss is that any statements you make about significant differences among, e.g., different age groups or races are then limited to statements about overall satisfaction, not satisfaction with the individual items that make up the index. (If you want to play with this, you can create an index variable with the menu choice Transform → Compute. This brings up a dialog box that allows you to define a new Target variable by some mathematical combination or transformation of existing variables, such as a simple addition of existing variables. A rough syntax sketch of both the Crosstabs test and this Compute step appears at the end of this guide.)

IF YOU WANT TO KNOW ABOUT SAMPLE SIZE EFFECTS WHEN GOVERNMENTS AND CORPORATIONS REPORT RESULTS, READ THIS: Time for another dirty little secret about statistical tests: all differences or patterns, no matter how small, are statistically significant if the sample size is large enough. For most social science, with sample sizes of thirty to several hundred, this is rarely an issue you need to worry about. Increasingly, however, large administrative data sets from corporations and governments containing many thousands or millions of records are being data mined, and a large number of silly statements and claims about significance are being made based on these findings.
For this and other reasons, you should never base your conclusions, thinking, or reporting simply on whether or not a statistical test says an observed difference is significant. After the statistical test has given you some assurance that the observed pattern might be real and not due simply to chance, you must always ask the questions: how much of a difference? and does this difference make sense? (Statistics does also include calculations of what is called "effect size" to help answer the "how much" questions.)
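Finally, here is the promised rough syntax sketch of the Crosstabs test and the index-variable trick from the previous section, again with placeholder variable names (satisf and gender as before; sat1 through sat5 stand for a hypothetical block of satisfaction items).

    * Crosstabulate two categorical variables, show row percentages, and request Chi-Square.
    CROSSTABS /TABLES=satisf BY gender
      /CELLS=COUNT ROW
      /STATISTICS=CHISQ.

    * Build a simple additive index from a block of satisfaction items.
    COMPUTE satindex = sat1 + sat2 + sat3 + sat4 + sat5.
    EXECUTE.

As with the other tests, look at the Asymp. Sig. value in the Chi-Square Tests table, and remember that significance alone says nothing about how large or meaningful the difference is.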