Friday, February 26, 2010
Correlation and Regression
Table B3 determines whether or not the observed Pearson r is a "rare event" unlikely to have occurred by chance.
A large sample size usually makes even small r values statistically significant.
The formula and calculation for comparing r's will not be required on test.
Regression
Uses the classic equation for a line, y = mx + b, but the letters are different in stats: y = bx + a, where b is the slope and a is the y-intercept.
Slope = rise over run, or (y1 - y2) / (x1 - x2)
Prediction comes from graphing the line and reading predicted (x, y) coordinates off the line.
Data that can be described exactly by a line is known as a perfectly linear relationship.
Best-fitting line is known as the regression line.
Method of least squares creates the regression line (or best-fitting line): it is the line that minimizes the overall squared vertical distance between the regression line and the data points.
Calculation is not required for test.
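Although the calculation is not required for the test, the least-squares line is easy to sketch in Python (a minimal illustration with made-up data, not course material):

```python
# Least-squares regression: find the line y = b*x + a that minimizes
# the sum of squared vertical distances from the data points.
def regression_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x  # the line always passes through (mean_x, mean_y)
    return b, a

b, a = regression_line([1, 2, 3, 4], [2, 4, 6, 8])
# perfectly linear data: slope b = 2.0, intercept a = 0.0
```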
Wednesday, February 24, 2010
Correlation
Scale data - Pearson Product-Moment Correlation (aka Pearson Correlation)
Ordinal data - Spearman Rank-Order Correlation
Nominal data - Phi Coefficient
Pearson Correlation: Range is from -1 to 1. The closer to -1 or 1, the stronger the relationship; at 0, no linear relationship whatsoever. A scatter-graph that looks like a line indicates a strong relationship.
Correlation coefficient = r; r is a standard index ranging from -1 to 1.
Important caveats about Pearson r:
1. Not all important or interesting relationships are linear. (Yerkes-Dodson Law)
2. Watch out for spurious correlations (counterfeit correlation)
A. Restricted range (see handout): the full range shows a relationship, while a restricted range shows a counterfeit correlation.
B. Combined groups: combining groups may offset or wipe out a correlation that exists when the groups are not combined. Breaking out groups by demographics (e.g., gender) helps avoid this problem.
C. Outliers: outliers throw off calculations. Why is there an outlier? You have to explain the outliers.
Correlation does not equal causation, it equals a degree of covarying.
Correlation does not tell us which of these explains the relationship:
- x -> y
- y -> x
- z -> x and y
- coincidence
Pearson r formula is covariance divided by total variability
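That formula (covariance over total variability) can be sketched directly in Python; the data here are a made-up, perfectly linear example:

```python
import math

# Pearson r: the covariance of x and y divided by the product of their
# total variabilities (square roots of the summed squared deviations).
def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

pearson_r([1, 2, 3], [2, 4, 6])  # perfect positive relationship: 1.0
```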
Monday, February 22, 2010
SPSS introduction
Transform, Recode into different variable - change the values of a variable, e.g., gender coded as 1 and 2 could be changed to 3 and 4, or grades recoded so everything above a C is 1 and everything below is 2.
Transform, Compute variable - take several variables and calculate a new variable.
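The same two Transform operations can be mimicked in plain Python (the variable names and codings here are hypothetical, just to show the idea):

```python
# Recode into different variable: map existing codes onto new ones,
# e.g. letter grades into a pass (1) / fail (2) indicator.
grades = ["A", "C", "D", "B", "F"]
passing = {"A", "B", "C"}
grade_recode = [1 if g in passing else 2 for g in grades]  # [1, 1, 2, 1, 2]

# Compute variable: derive a new variable from several existing ones,
# e.g. a total score from three subscale scores per respondent.
sub1, sub2, sub3 = [10, 12], [8, 9], [5, 4]
total = [a + b + c for a, b, c in zip(sub1, sub2, sub3)]  # [23, 25]
```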
Analyze is where SPSS is powerful.
- Descriptive statistics
- ANOVA
- T-test
- General linear model
- Correlate
- Regression
- Nonparametric tests
- Scale
Friday, February 19, 2010
Hypothesis testing - probability
Directional: words like below or above or more or less are used.
Non-directional: words like difference or change or impact are used.
P value (of the observed score, e.g., z-observed): Alpha is set by you in advance, e.g., 0.05. The p value is the probability of the observed score from your sample given that H0 (the null hypothesis) is true.
"If your P score is less probable than alpha, you have a score to reject H0 (null hypothesis)".
Decision Errors
Type 1 error is a false positive (erroneously rejecting H0); its likelihood is alpha.
Type 2 error is a false negative (erroneously failing to reject H0); its likelihood is called beta. (Beta is not taught in this class.)
As alpha decreases, beta increases, and vice versa.
Power: the probability that the test will lead to rejecting H0 when H0 is actually false. You rejected H0 when you should have rejected H0.
Telescope example: a type 2 error is a telescope that doesn't have enough power to see an asteroid that exists. If it has enough power, then you correctly reject H0.
Tuesday, February 16, 2010
Hypothesis testing - Standard error of the means
Raw data -> Summarized, Organized, Simplified (Descriptive statistics: s, x-bar, s-squared) -> Sample to population inferences (Inferential statistics: p, z, t, F, q)
Hypothesis testing
1. Simple random sampling: used for statistical inference, where populations are inaccessible, and are often more accurate. All units in the population have an equal chance of being selected.
2. Proportional Stratified Random Sample: sample maps exactly onto the population in terms of proportions of sub-groups (e.g. population has 10% seniors and sample has 10% seniors)
3. "Errors" in sampling (sampling error and non-sampling error) must be dealt with. Samples and population don't match-up. Non-sampling errors include question text and framing that creates confusion. Other things like cultural issues can cause non-sampling error.
Sampling Distributions
How do you detect how much error (sampling error) is in the sample? Use a standard-deviation-like calculation (spread of scores with respect to the mean).
By selecting multiple samples, calculating the means of those samples, and then using the means in place of raw scores, you can calculate a standard-deviation-like statistic called the standard error of the means (s-sub-x-bar).
Standard error of the means = sample standard deviation divided by the square root of the number of observations in the sample. (s / sqrt n)
or theoretical (sigma-sub-x-bar = sigma / sqrt N)
A sampling (sample of means) distribution is normally distributed when it is drawn from a normally distributed population or the size of the samples is reasonably large (at least 30).
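The s / sqrt(n) formula above is a one-liner in Python; the scores here are made up for illustration:

```python
import math

# Standard error of the means: sample standard deviation (with n - 1
# in the denominator) divided by the square root of the sample size.
def standard_error(scores):
    n = len(scores)
    mean = sum(scores) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    return s / math.sqrt(n)

standard_error([2, 4, 6, 8])  # s is about 2.582, n = 4, so SEM is about 1.291
```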
Friday, February 12, 2010
Paper writing tips
- Begin with the end in mind (goals of the writing)
- Flow with the end in mind (each sentence and paragraph has "end" purpose)
- Look for gaps or open space between sentences and paragraphs where the connections are weak (the reader's willingness to move on)
- Claims must be supported (claims are supported by logic, both yours and others')
- Abstracts are the lean and mean of here's what we did and here's what we got. No lit review stuff in abstracts.
- Don't forget about the bridge from the literature and your hypotheses.
Writing abstracts:
- Opening (one sentence)
- Purpose of the study (include hypotheses)
- Research design/method description
- Results (brief description)
- Conclusion (one sentence)
* Don't put any sentence in the abstract that could be cut and pasted into another abstract.
Probability
Plotting z scores is important for conceptually understanding z-score relationships.
Wednesday, February 10, 2010
Relative standing
Percentiles (ordinal position or rank): not sensitive to the variability of raw scores, just ranking.
Standard scores (z-scores): Statistical approach to standardizing scores in a standard scale of measurement. The relative standing of one score can be compared to another, even when they are measured on different scales (GMAT, CPA, GRE, etc.). A numerical index of relative standing expressed in standard deviation units. The mean has a z-score of 0 standard deviation units.
Calculated by (score - population mean)/population standard deviation also known as the distance from the mean expressed in standard deviation units.
Formula structure is "observed" minus "expected" divided by "error".
Was Wayne Gretzky a better scorer than Michael Jordan? (Compare z-scores.)
If you don't know the population standard deviation, generally you don't calculate z-scores.
Z-distribution (Appendix B, pg B1-B5)
Table calculates the (a) area between the Z score and Mean and (b) area beyond the Z score in a normal distribution.
The "area beyond the Z score" is the probability score.
Monday, February 8, 2010
Distributions and variability (standard deviation)
Distributions:
Normal distribution is where the mean, median, and mode are identical; it is symmetrical; and its tails never touch the x-axis. Characteristics in nature are thought to be normally distributed. Parametric statistics are appropriate.
A non-normal distribution is not symmetrical; it is either positively skewed or negatively skewed (tailed), and may be peaked (leptokurtic, less variability) or flat (platykurtic, more variability). Non-parametric statistics are appropriate.
Population vs. Sample (subset of population)
Sample statistic vs. population parameter (like "x-bar" and "mu" for the mean). If you know the parameters, you do not need statistics.
Variability:
Variability is the spread or dispersion of scores in a distribution.
1. Range is highest score value minus lowest score value. Range is not sensitive to inside variability. Two sets of data can have the same range, but very different standard deviations.
2. Variance is an index that considers all scores (including inside variability). The sample version is read as "s squared" and the population version as "sigma squared". Variance is the average squared distance of the scores from the mean: sum of (x - mean) squared / (n - 1) = s squared. Variance squares the distances from the mean because without squaring, the sum of deviations would be zero.
3. Standard deviation is the square root of variance. Standard deviation is an index of variability that is expressed in the original counting units (variance is expressed in squared counting units). It is known as the "spread-out-ness". It is read as "s" for samples and "sigma" for population.
(On exam: A calculation of standard deviation will be required by hand)
4. Median Absolute Deviation is used for ordinal (ranked) data and skewed data (see pg. 103).
*If you have scaled (interval) data, then use mean and standard deviation. If you have ranked (ordinal) data, then use median and median absolute deviation. If you have nominal data, then use mode and a frequency comparison. (See table 5.5 page 107).
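Since a by-hand standard deviation calculation will be on the exam, here is the same arithmetic as a Python sketch (the data set is made up), with the median absolute deviation alongside for ranked or skewed data:

```python
import math

def variance(scores):
    # average squared distance from the mean, with n - 1 for a sample
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

def std_dev(scores):
    # square root of variance: back to the original counting units
    return math.sqrt(variance(scores))

def median(scores):
    s = sorted(scores)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def median_abs_deviation(scores):
    # median of the absolute deviations from the median
    m = median(scores)
    return median([abs(x - m) for x in scores])

data = [2, 4, 4, 4, 5, 5, 7, 9]
variance(data)              # 32/7, about 4.571
std_dev(data)               # about 2.138
median_abs_deviation(data)  # 0.5
```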
Friday, February 5, 2010
Centrality and spread
Practice the summation notation.
Order of operations:
- Parentheses
- Exponents/Square roots
- Multiply/Divide
- Add/Subtract
All the statistics for our purposes are based on two concepts: centrality and spread.
Centrality: (aka Central tendency)
- Mode (used for categorical data from nominal data or frequency data) - most frequently occurring value. (bi-modal, multi-modal, amodal - no mode, constant scores)
- Median (used for ordinal data or ranked data) - known as the 50th percentile: half the scores above and half below. For an even number of cases, take the mean of the two middle scores.
- Mean (used for scale data or equal interval data) - sum of scores / number of cases
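The three centrality measures above can be sketched in a few lines of Python (the scores are a made-up example):

```python
from collections import Counter

def mode(scores):
    # most frequently occurring value (nominal / frequency data)
    return Counter(scores).most_common(1)[0][0]

def median(scores):
    # 50th percentile; for an even number of cases, mean of the middle two
    s = sorted(scores)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def mean(scores):
    # sum of scores divided by number of cases (scale data)
    return sum(scores) / len(scores)

scores = [1, 2, 2, 3, 10]
mode(scores)    # 2
median(scores)  # 2
mean(scores)    # 3.6 (pulled up by the extreme score of 10)
```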
Properties of the mean:
- Sum of the deviations about the mean: sum of (x - mean) = 0
- Sum of (x - mean) squared is a minimum. The actual mean will bring this equation to the lowest answer. If you substitute the mean with any other number, the equation will give you a greater answer.
- Grand mean is not a mean of means. It is calculated based on weighting (due to the sample size of each sample mean): Grand mean = (n1*mean1 + n2*mean2)/(n1 + n2)
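The zero-sum property and the weighted grand mean can both be checked in Python (the sample sizes and means below are made up):

```python
# Property 1: deviations about the mean always sum to zero.
scores = [2, 4, 9]
m = sum(scores) / len(scores)         # 5.0
dev_sum = sum(x - m for x in scores)  # 0.0

# Grand mean: weight each sample mean by its sample size,
# rather than taking a simple mean of the means.
def grand_mean(ns, means):
    return sum(n * x for n, x in zip(ns, means)) / sum(ns)

grand_mean([10, 30], [4.0, 8.0])  # (10*4 + 30*8) / 40 = 7.0,
                                  # not the unweighted 6.0
```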