Basics of Statistics 1: Confidence Interval


Hi everyone, today I am starting a series of articles on the basics of Statistics that every statistician should know. I will cover the following topics in this series:

  • Confidence Intervals
  • Hypothesis Testing
  • Simple Linear Regression
  • Multiple Linear Regression

Today, let’s start with Confidence Intervals.

What is a Confidence Interval?

In Statistics, we often try to estimate a parameter from data obtained from a sample that represents the population. One example is the population mean, which we estimate using the mean of a smaller sample. However, reporting only the sample mean is often not adequate, because a single number says nothing about how far the estimate may be from the true value. We usually want an interval that captures the true population parameter with a known probability.

Interpretation of a Confidence Interval: saying that (LB, UB) is a 95 \% Confidence Interval for the true parameter \theta means that if we draw samples repeatedly and construct a confidence interval from each one, then about 95 \% of those intervals will contain the true parameter \theta.
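The repeated-sampling interpretation can be checked with a small simulation. The sketch below is only an illustration: it assumes normally distributed data with a known \sigma, and the function name coverage and all parameter values are made up for this example.

```python
import random
from statistics import NormalDist, mean


def coverage(true_mu=10.0, sigma=2.0, n=25, level=0.95, trials=2000, seed=0):
    """Fraction of simulated z-intervals that contain true_mu."""
    rng = random.Random(seed)
    # z_{alpha/2}, where alpha = 1 - level
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        # Draw a fresh sample and build the interval around its mean
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        xbar = mean(sample)
        if xbar - half_width <= true_mu <= xbar + half_width:
            hits += 1
    return hits / trials


print(coverage())
```

The printed fraction should be close to 0.95. Any single interval either contains \theta or it does not; the 95 \% describes the procedure over many repetitions.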

Some important Cases:

  • Population Mean: Suppose we have data X_1, X_2, \cdots, X_n. We want to find a 100(1-\alpha) \% Confidence Interval for the population mean \mu.
    1. \sigma (Population Standard Deviation) known:
    If X_1, X_2, \cdots, X_n can be approximated by a normal distribution: \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.
    If normality cannot be assumed, the above formula can still be used, but only if n is large enough (say > 30). For small n, different approaches have to be taken.
    2. \sigma (Population Standard Deviation) unknown:
    If X_1, X_2, \cdots, X_n can be approximated by a normal distribution: \bar{x} \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} (where s is the sample standard deviation).
    If normality cannot be assumed, but n is large enough (say > 30), then we can give the Confidence Interval by \bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}} (using the Central Limit Theorem). For small n, different approaches have to be taken.
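The two formulas for the mean can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the function names and the data values are made up, and because the standard library has no t quantile function, the t critical value is assumed to be looked up in a table and passed in.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_interval(data, sd, alpha=0.05):
    """x_bar +/- z_{alpha/2} * sd / sqrt(n).

    sd is sigma when it is known, or the sample standard deviation s
    in the large-n case.
    """
    n = len(data)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    xbar = mean(data)
    half = z * sd / sqrt(n)
    return xbar - half, xbar + half


def t_interval(data, t_crit):
    """x_bar +/- t_crit * s / sqrt(n), with t_crit = t_{alpha/2, n-1}."""
    n = len(data)
    xbar = mean(data)
    s = stdev(data)  # sample standard deviation (n - 1 in the denominator)
    half = t_crit * s / sqrt(n)
    return xbar - half, xbar + half


# Made-up measurements, for illustration only
data = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.1, 4.9]
print(z_interval(data, sd=0.2))        # sigma assumed known to be 0.2
print(t_interval(data, t_crit=2.262))  # t_{0.025, 9} from a standard t-table
```

For small n the t critical value is larger than z_{\alpha/2}, which reflects the extra uncertainty from estimating \sigma by s.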
  • Proportion: 100(1-\alpha) \% Confidence Interval for proportion p, where x is the number of successes in n trials and \hat{p} = x/n.
    1. Wald: \hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n}
    2. Wilson Score: \frac{\hat{p}+z_{\alpha/2}^2/(2n)}{1+z_{\alpha/2}^2/n} \pm z_{\alpha/2}\frac{\sqrt{\hat{p}(1-\hat{p})/n + z_{\alpha/2}^2/(4n^2)}}{1+z_{\alpha/2}^2/n}
    3. Agresti-Coull: In Wald’s equation, replace x with x+z_{\alpha/2}^2/2 and n with n+z_{\alpha/2}^2. For \alpha=0.05, z_{\alpha/2}^2 \approx 4, so replace x with x+2 and n with n+4.
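The three proportion intervals can be compared side by side with a short sketch. The function names are illustrative and x denotes the number of successes out of n trials; the Agresti-Coull version simply calls the Wald function on the adjusted counts, as described in the list.

```python
from math import sqrt
from statistics import NormalDist


def z_crit(alpha):
    """z_{alpha/2} for a two-sided interval."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def wald(x, n, alpha=0.05):
    z = z_crit(alpha)
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson(x, n, alpha=0.05):
    z = z_crit(alpha)
    p = x / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def agresti_coull(x, n, alpha=0.05):
    z = z_crit(alpha)
    # Wald interval on the adjusted counts x + z^2/2 and n + z^2
    return wald(x + z**2 / 2, n + z**2, alpha)


# Made-up example: 7 successes in 20 trials
for method in (wald, wilson, agresti_coull):
    print(method.__name__, method(7, 20))
```

Note that the Wald interval collapses to a single point when x = 0 or x = n, which is one reason the Wilson and Agresti-Coull intervals are often preferred for small samples or extreme proportions.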
  • Difference of Means: Let X_{11}, \cdots, X_{1n_1} and X_{21}, \cdots, X_{2n_2} be two independent samples. Population parameter: \theta=\mu_1-\mu_2, Estimate: \hat{\theta}=\bar{x}_1 -\bar{x}_2. Then we can get the 100(1-\alpha) \% Confidence Interval as follows:
    1. Variances known, normality can be assumed: \hat{\theta} \pm z_{\alpha/2} \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}
    2. Variances unknown, large n_1, n_2 (> 40): \hat{\theta} \pm z_{\alpha/2} \sqrt{S_1^2/n_1 + S_2^2/n_2}, where S_1, S_2 are the sample standard deviations.
    3. Normality can be assumed, variances unknown and not assumed equal: \hat{\theta} \pm t_{\alpha/2,\nu} \sqrt{S_1^2/n_1 + S_2^2/n_2}, where \nu= \frac{(S_1^2/n_1 + S_2^2/n_2)^2}{\frac{(S_1^2/n_1)^2}{n_1-1}+\frac{(S_2^2/n_2)^2}{n_2-1}} (the Welch-Satterthwaite approximation to the degrees of freedom).
    4. Normality can be assumed, variances unknown but assumed equal: \hat{\theta} \pm t_{\alpha/2,n_1+n_2-2} S_p \sqrt{1/n_1 + 1/n_2}, where S_p^2 = \frac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2} is the pooled variance.
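The degrees-of-freedom formula and the pooled variance from cases 3 and 4 are easy to get wrong by hand, so here is a minimal sketch of both. The function names and data are made up for illustration, and the critical value crit (z or t, depending on the case) is assumed to be supplied by the caller, for example from a table.

```python
from math import sqrt
from statistics import mean, variance


def welch_df(s1_sq, n1, s2_sq, n2):
    """Approximate degrees of freedom nu for unequal variances (case 3)."""
    a, b = s1_sq / n1, s2_sq / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))


def pooled_variance(s1_sq, n1, s2_sq, n2):
    """S_p^2 for the equal-variance case (case 4)."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)


def diff_means_interval(x1, x2, crit, pooled=False):
    """theta_hat +/- crit * SE, with SE chosen by the pooled flag."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1), variance(x2)  # sample variances S_1^2, S_2^2
    theta_hat = mean(x1) - mean(x2)
    if pooled:
        se = sqrt(pooled_variance(v1, n1, v2, n2) * (1 / n1 + 1 / n2))
    else:
        se = sqrt(v1 / n1 + v2 / n2)
    return theta_hat - crit * se, theta_hat + crit * se


# Made-up samples, for illustration only
x1 = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
x2 = [11.2, 11.6, 11.0, 11.5, 11.4, 11.1]
nu = welch_df(variance(x1), len(x1), variance(x2), len(x2))
print(nu)  # look up t_{alpha/2, nu} in a table, rounding nu down
print(diff_means_interval(x1, x2, crit=2.228, pooled=True))  # t_{0.025, 10}
```

When the two sample sizes and sample variances are equal, \nu reduces to n_1+n_2-2 and the two standard errors coincide, so cases 3 and 4 give the same interval.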
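  • Difference of Proportions: Population parameter: \theta=p_1-p_2. The 100(1-\alpha) \% Confidence Interval is given by \hat{\theta} \pm z_{\alpha/2}SE(\hat{\theta}), where the estimate \hat{\theta} and its standard error depend on the sampling scheme:
    1. Independent Sampling: \hat{\theta} = \hat{p}_1-\hat{p}_2 and SE(\hat{\theta}) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
    2. Paired Sampling: Let B = # pairs where the first sample is a success and the second is a failure, C = # pairs where the first sample is a failure and the second is a success, and n = total number of pairs. Then \hat{\theta} = \frac{B-C}{n} and SE(\hat{\theta}) \approx \frac{\sqrt{B+C}}{n}, a simplification of \frac{1}{n}\sqrt{B+C-(B-C)^2/n} that is accurate when B and C are close.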
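A short sketch of both difference-of-proportions intervals follows. It assumes the estimate is \hat{p}_1-\hat{p}_2 for independent samples and (B-C)/n for paired samples, with n the number of pairs; the function names and counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def diff_prop_independent(x1, n1, x2, n2, alpha=0.05):
    """Interval for p1 - p2 from two independent samples."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p1, p2 = x1 / n1, x2 / n2
    theta_hat = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return theta_hat - z * se, theta_hat + z * se


def diff_prop_paired(b, c, n, alpha=0.05):
    """Interval for p1 - p2 from n pairs, using only the discordant counts.

    b: pairs with (success, failure); c: pairs with (failure, success).
    Uses the simplified standard error sqrt(b + c) / n.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    theta_hat = (b - c) / n
    se = sqrt(b + c) / n
    return theta_hat - z * se, theta_hat + z * se


# Made-up counts, for illustration only
print(diff_prop_independent(45, 100, 30, 100))
print(diff_prop_paired(b=15, c=5, n=100))
```

In the paired case only the discordant pairs enter the calculation; pairs where both samples agree carry no information about the difference.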

Well, that was a summary of the main types of confidence intervals we need in statistical analysis. Next time, I will talk about Hypothesis Testing and more. Till then, goodbye.

 

