Exploratory Data Analysis
Box Plot




By Rebecca Su     Source Web Statistics
[Method]
A box plot provides important information about the location and dispersion of the data, especially when comparing different populations. Some commonly used statistics for describing the location and dispersion of the data are:
    1. Mean
    2. Median
    3. Percentile
    4. Quartile
    5. Range
    6. Interquartile Range
    7. Variance and Standard Deviation
    8. Box Plot

  1. Mean
    Sample Mean ( sample statistic ):
    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    Population Mean ( population parameter ):
    $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$

    Intuitively, the sample mean is the "center" of the data.
    Example:
    13 observations: 39 32 20 34 40 33 31 29 25 30 31 32 22
    Then, $\bar{x} = 398/13 \approx 30.62$
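
    The sample mean of the 13 observations can be checked with a short Python sketch. The variable names are illustrative and not part of the original notes.

```python
# Observations copied from the example above.
observations = [39, 32, 20, 34, 40, 33, 31, 29, 25, 30, 31, 32, 22]

n = len(observations)        # sample size
total = sum(observations)    # sum of all observations
sample_mean = total / n      # x-bar = (1/n) * (sum of the x_i)

print(f"n = {n}, sum = {total}, mean = {sample_mean:.2f}")
# prints: n = 13, sum = 398, mean = 30.62
```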

  2. Median
    Arrange the data in increasing order:
    1. If n is odd, the median is the ((n+1)/2)'th value.
    2. If n is even, the median is the average of the (n/2)'th and the ((n/2)+1)'th values.
    Note that the sample mean is more sensitive to extreme observations than the sample median.
    For example, for the data 1 3 5 7 9 2 4 6 8 100, the median is 5.5 but the mean is 14.5.

    Example:
    12 observations: 33 30 36 45 34 28 25 32 29 34 35 31
    Arranged in increasing order: 25 28 29 30 31 32 33 34 34 35 36 45
    The median is (32+33)/2 = 32.5
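
    The odd/even rule for the median can be sketched in Python as follows. The function name sample_median is hypothetical, chosen only for this illustration. The sketch is applied to the 12 observations above and to the data set containing the extreme value 100.

```python
def sample_median(values):
    """Median by the odd/even rule, using 1-based positions in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n+1)/2)'th value, which is index mid in 0-based terms
        return ordered[mid]
    # n even: average of the (n/2)'th and the ((n/2)+1)'th values
    return (ordered[mid - 1] + ordered[mid]) / 2


observations = [33, 30, 36, 45, 34, 28, 25, 32, 29, 34, 35, 31]
skewed = [1, 3, 5, 7, 9, 2, 4, 6, 8, 100]

print(sample_median(observations))                      # 32.5
print(sample_median(skewed), sum(skewed) / len(skewed)) # 5.5 14.5
```

    The single value 100 pulls the mean up to 14.5, while the median stays at 5.5.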

  3. Percentile
    The p'th percentile is a value such that p% of the data are smaller than or equal to it.
    Procedure to find the p'th percentile:
    1. Arrange the data in increasing order.
    2. Compute the position $i = \frac{p}{100} \times n$, where n is the number of observations.
    3. If i is not an integer, round i up to the next integer; the value at that position is the p'th percentile.
       If i is an integer, the average of the i'th and the (i+1)'th values is the p'th percentile.

    Note that the 50th percentile is the median.
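
    The procedure can be sketched in Python as below. This is a minimal illustration that assumes the position is i = (p/100) * n, with a non-integer i rounded up and an integer i averaged with the next value, consistent with the quartile example in these notes. The function name percentile and the restriction 0 < p < 100 are assumptions made for the sketch.

```python
import math


def percentile(values, p):
    """p'th percentile by the position rule i = (p/100) * n (assumes 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)          # step 1: arrange in increasing order
    n = len(ordered)
    i = p * n / 100                   # step 2: compute the position
    if i.is_integer():
        k = int(i)
        # step 3, integer case: average the i'th and (i+1)'th values (1-based)
        return (ordered[k - 1] + ordered[k]) / 2
    # step 3, non-integer case: round up to the next position (1-based)
    return ordered[math.ceil(i) - 1]


observations = [33, 30, 36, 45, 34, 28, 25, 32, 29, 34, 35, 31]
print(percentile(observations, 25))   # 29.5
print(percentile(observations, 50))   # 32.5, the median
print(percentile(observations, 75))   # 34.5
```

    Other conventions for percentiles exist, such as interpolating between neighboring values, so library functions may return slightly different numbers for the same data.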

  4. Quartile
    The data are divided into 4 parts, and the division points are the quartiles.
    That is,
      * the first quartile (Q1), or 25th percentile
      * the second quartile (Q2), or 50th percentile
      * the third quartile (Q3), or 75th percentile
    Example:
    12 observations: 33 30 36 45 34 28 25 32 29 34 35 31
    The position of the first quartile is $i = \frac{25}{100} \times 12 = 3$. Since i is an integer, the first quartile is the average of the third and fourth values in the ordered data.
    That is, Q1 = (29+30)/2 = 29.5

  5. Range
    Range = ( the maximum of the data )-( the minimum of the data )
    Example:
    13 observations in population 1: 39 32 20 34 40 33 31 29 25 30 31 32 22, so Range = 40 - 20 = 20
    12 observations in population 2: 33 30 36 45 34 28 25 32 29 34 35 31, so Range = 45 - 25 = 20
    Note that the range is very sensitive to extreme observations.
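
    A small Python sketch of the range for both populations follows. To illustrate the sensitivity to extreme values, a hypothetical observation of 100, which is not part of the original data, is appended to population 2.

```python
def data_range(values):
    """Range = maximum of the data minus minimum of the data."""
    return max(values) - min(values)


population_1 = [39, 32, 20, 34, 40, 33, 31, 29, 25, 30, 31, 32, 22]
population_2 = [33, 30, 36, 45, 34, 28, 25, 32, 29, 34, 35, 31]

print(data_range(population_1))   # 20
print(data_range(population_2))   # 20

# Hypothetical extreme observation, added only to show the effect on the range.
with_extreme = population_2 + [100]
print(data_range(with_extreme))   # 75
```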

  6. Interquartile Range
    The interquartile range is the difference between the third quartile and the first quartile.
    IQR = Q3 - Q1
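
    As a worked example, take the 12 observations from the Quartile section, ordered as 25 28 29 30 31 32 33 34 34 35 36 45, with Q1 = 29.5. Assuming the same position rule as for the first quartile, the position of the third quartile is an integer, so Q3 is the average of the 9th and 10th values:

```latex
i = \frac{75}{100} \times 12 = 9, \qquad
Q_3 = \frac{34 + 35}{2} = 34.5, \qquad
\mathrm{IQR} = Q_3 - Q_1 = 34.5 - 29.5 = 5
```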

  7. Variance and Standard Deviation
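    Population variance is the average of the squared deviations of the data from the population mean.
    $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
    Sample variance is the sum of the squared deviations of the data from the sample mean, divided by n-1.
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
    The standard deviation is the square root of the variance.

    Example:
    13 observations: 39 32 20 34 40 33 31 29 25 30 31 32 22
    With $\bar{x} \approx 30.62$, the sum of squared deviations is about 401.08.
    Then, $s^2 = 401.08/12 \approx 33.42$ and $s \approx 5.78$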
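
    The value of the sample variance can be reproduced with the Python sketch below. It assumes the n-1 divisor, which is the divisor that yields 33.42 for these data, and compares the hand computation with the standard library's statistics.variance. The variable names are illustrative.

```python
from statistics import stdev, variance

observations = [39, 32, 20, 34, 40, 33, 31, 29, 25, 30, 31, 32, 22]

n = len(observations)
x_bar = sum(observations) / n

# Sum of squared deviations from the sample mean.
sum_sq_dev = sum((x - x_bar) ** 2 for x in observations)

sample_variance = sum_sq_dev / (n - 1)   # divisor n-1 (assumed, see lead-in)
sample_sd = sample_variance ** 0.5       # standard deviation = sqrt(variance)

print(f"s^2 = {sample_variance:.2f}, s = {sample_sd:.2f}")
# prints: s^2 = 33.42, s = 5.78

# Cross-check against the standard library (also uses the n-1 divisor).
print(f"{variance(observations):.2f}, {stdev(observations):.2f}")
```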

  8. Box Plot
    A box plot displays several statistics that together provide important information about the location and dispersion of the data:
      * The minimum: Min
      * The first quartile: Q1
      * The median: Md
      * The third quartile: Q3
      * The maximum: Max
      * The mean
    1. Simple Box Plot
    13 observations in population 1: 39 32 20 34 40 33 31 29 25 30 31 32 22
    12 observations in population 2: 33 30 36 45 34 28 25 32 29 34 35 31

    The * in the middle of each plot marks the mean.

    2. Complete Box Plot

    Note: observations outside the lower limit and the upper limit are outliers.
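
    The statistics behind both kinds of box plot can be computed with the Python sketch below. The notes do not say how the lower and upper limits are defined; the sketch ASSUMES the common convention of Q1 - 1.5*IQR and Q3 + 1.5*IQR. The quartiles use the position rule i = (p/100) * n. The names percentile and box_plot_summary are hypothetical.

```python
import math


def percentile(values, p):
    """p'th percentile by the position rule i = (p/100) * n (assumes 0 < p < 100)."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i.is_integer():
        k = int(i)
        return (ordered[k - 1] + ordered[k]) / 2   # average i'th and (i+1)'th
    return ordered[math.ceil(i) - 1]               # round the position up


def box_plot_summary(values, k=1.5):
    """Statistics shown in a box plot; k = 1.5 is an assumed convention."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr      # assumed limit definition
    return {
        "min": min(values), "q1": q1, "median": percentile(values, 50),
        "q3": q3, "max": max(values), "mean": sum(values) / len(values),
        "lower_limit": lower, "upper_limit": upper,
        "outliers": sorted(x for x in values if x < lower or x > upper),
    }


population_1 = [39, 32, 20, 34, 40, 33, 31, 29, 25, 30, 31, 32, 22]
population_2 = [33, 30, 36, 45, 34, 28, 25, 32, 29, 34, 35, 31]

summary_1 = box_plot_summary(population_1)
summary_2 = box_plot_summary(population_2)
print(summary_1["outliers"])   # [20, 22, 40]
print(summary_2["outliers"])   # [45]
```

    Under these assumptions, population 1 has limits 23 and 39 and population 2 has limits 22 and 42. A different limit convention would flag different points.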
