Exploratory Data Analysis
Box Plot
By Rebecca Su Source
Web Statistics
[Method]
A box plot provides important information about the location and
dispersion of the data, especially when comparing different populations. Some
commonly used statistics for describing the location and dispersion of the data are:
1. Mean
2. Median
3. Percentile
4. Quartile
5. Range
6. Interquartile Range
7. Variance and Standard Deviation
8. Box Plot
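- Mean
Sample mean (sample statistic): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Population mean (population parameter): $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Intuitively, the sample mean is the "center" of the data.
Example:
13 observations: 39 32 20 34 40 33 31 29 25 30 31 32 22
Then, $\bar{x} = 398/13 \approx 30.62$.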
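The arithmetic for the sample mean can be checked with a few lines of Python. This is only a minimal sketch; the variable names are chosen here for illustration and are not part of the original page.

```python
# The 13 observations from the example above.
data = [39, 32, 20, 34, 40, 33, 31, 29, 25, 30, 31, 32, 22]

total = sum(data)            # sum of all observations
n = len(data)                # number of observations
sample_mean = total / n      # sample mean = total / n

print(total, n, round(sample_mean, 2))   # 398 13 30.62
```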
- Median
Arrange the data in increasing order:
1. If n is odd, the median is the ((n+1)/2)'th value.
2. If n is even, the median is the average of the (n/2)'th and
the (n/2+1)'th values.
Note that the sample mean is more sensitive to observations with extreme values than the
sample median.
For example, for the data 1 3 5 7 9 2 4 6 8 100, the median is 5.5, but the mean is 14.5.
Example
12 observations: 33 30 36 45 34 28 25 32 29 34 35 31
Arranged in increasing order: 25 28 29 30 31 32 33 34 34 35 36 45
Median is (32+33)/2 = 32.5
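The odd/even rule for the median can be sketched as a small Python function. The function name `median` is chosen here for illustration. Python lists are indexed from 0, so the k'th value in the text is `ordered[k - 1]` in the code.

```python
def median(values):
    """Median by the rule above: middle value if n is odd,
    average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n+1)/2)'th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # (n/2)'th and (n/2+1)'th values

sample = [33, 30, 36, 45, 34, 28, 25, 32, 29, 34, 35, 31]
skewed = [1, 3, 5, 7, 9, 2, 4, 6, 8, 100]

print(median(sample))                              # 32.5
print(median(skewed), sum(skewed) / len(skewed))   # 5.5 14.5
```

The second line of output repeats the comparison in the text: the single extreme value 100 pulls the mean up to 14.5 while the median stays at 5.5.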
- Percentile
The p'th percentile is a value such that p% of the data are smaller than or equal to it.
Procedure to find the p'th percentile:
1. Arrange the data in increasing order.
2. Compute i = (p/100) n, where n is the number of observations.
3. If i is not an integer, round i up to the next integer; the value at that position is the p'th percentile.
   If i is an integer, the average of the i'th and
   the (i+1)'th values is the p'th percentile.
Note that the 50th percentile is the median.
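The percentile procedure can be sketched in Python as follows. This is a minimal sketch, assuming the position is computed as i = (p/100) n, which is the value the quartile example on this page arrives at (position 3 for p = 25 and n = 12). The function name `percentile` is chosen here for illustration. Other textbooks and software packages use different percentile conventions and may give slightly different values.

```python
def percentile(values, p):
    """p'th percentile (0 < p < 100) by the position rule i = (p/100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)                     # step 1: increasing order
    whole, rest = divmod(p * len(ordered), 100)  # step 2: i = whole + rest/100
    if rest:
        # i is not an integer: round up, so the 1-based position is
        # whole + 1, which is index `whole` in a 0-based list.
        return ordered[whole]
    # i is an integer: average the i'th and (i+1)'th values.
    return (ordered[whole - 1] + ordered[whole]) / 2

sample = [33, 30, 36, 45, 34, 28, 25, 32, 29, 34, 35, 31]
print(percentile(sample, 25))   # 29.5
print(percentile(sample, 50))   # 32.5, the same as the median
```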
- Quartile
The data are divided into 4 parts, and the division points are the quartiles.
That is,
Q1 = the first quartile, or 25th percentile
Q2 = the second quartile, or 50th percentile
Q3 = the third quartile, or 75th percentile
Example
12 observations: 33 30 36 45 34 28 25 32 29 34 35 31
The position of the first quartile is i = (25/100)(12) = 3.
Since i is an integer, the first quartile is the average of the third and the fourth values in increasing order.
That is, Q1 = (29+30)/2 = 29.5
- Range
Range = (the maximum of the data) - (the minimum of the data)
Example
13 observations in population 1: 39 32 20 34 40 33 31 29 25 30 31 32 22, then Range = 40 - 20 = 20
12 observations in population 2: 33 30 36 45 34 28 25 32 29 34 35 31, then Range = 45 - 25 = 20
Note that the range is very sensitive to observations with extreme values.
- Interquartile Range
The interquartile range is the difference between the third quartile and the first quartile.
IQR = Q3 - Q1
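As an illustration, the range and IQR of the two example populations can be compared in Python. This is a sketch only: the helper names `quartile` and `iqr` are chosen here, and the quartiles are assumed to follow the position rule i = (p/100) n used in the quartile example.

```python
def quartile(values, k):
    """k'th quartile (k = 1, 2, 3), i.e. the (25k)'th percentile,
    using the position rule i = (25k/100) * n."""
    ordered = sorted(values)
    whole, rest = divmod(25 * k * len(ordered), 100)
    if rest:                       # i not an integer: round the position up
        return ordered[whole]
    return (ordered[whole - 1] + ordered[whole]) / 2

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    return quartile(values, 3) - quartile(values, 1)

pop1 = [39, 32, 20, 34, 40, 33, 31, 29, 25, 30, 31, 32, 22]
pop2 = [33, 30, 36, 45, 34, 28, 25, 32, 29, 34, 35, 31]

for name, data in (("population 1", pop1), ("population 2", pop2)):
    print(name, "range =", max(data) - min(data), "IQR =", iqr(data))
```

Both populations have the same range of 20, but under this rule their IQRs differ (4 and 5), because the IQR ignores the smallest and largest quarter of the data.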
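- Variance and Standard Deviation
Population variance is the average of the squared deviations from the population mean:
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$
Sample variance is the sum of the squared deviations from the sample mean, divided by n-1:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$
The standard deviation is the square root of the variance.
Example
13 observations: 39 32 20 34 40 33 31 29 25 30 31 32 22
Then, $s^2 = 401.08/12 \approx 33.42$ and $s \approx 5.78$.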
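The value 33.42 can be reproduced step by step in Python. This is a minimal sketch with variable names chosen for illustration. It assumes the sample variance divides by n-1, which is the divisor that matches the value 33.42 given in the example.

```python
import math

data = [39, 32, 20, 34, 40, 33, 31, 29, 25, 30, 31, 32, 22]

n = len(data)
xbar = sum(data) / n                              # sample mean
sum_sq_dev = sum((x - xbar) ** 2 for x in data)   # sum of squared deviations

sample_var = sum_sq_dev / (n - 1)   # sample variance: divide by n - 1
sample_sd = math.sqrt(sample_var)   # sample standard deviation

print(round(sample_var, 2), round(sample_sd, 2))   # 33.42 5.78
```

If the same 13 values were treated as a whole population, the divisor would be n instead of n-1 and the variance would be smaller.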
- Box Plot
A box plot consists of several statistics that provide
important information about the location and dispersion of the data:
  The minimum: Min
  The first quartile: Q1
  The median: Md
  The third quartile: Q3
  The maximum: Max
  The mean
1. Simple Box Plot
13 observations in population 1: 39 32 20 34 40 33 31 29 25 30 31 32 22
12 observations in population 2: 33 30 36 45 34 28 25 32 29 34 35 31
The * in the middle marks the mean.
2. Complete Box Plot
Note: observations outside the lower limit and the upper limit are outliers.
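The statistics behind both kinds of box plot can be collected in one Python function. This is a sketch under stated assumptions: the page does not say how the lower and upper limits are computed, so the code assumes the common convention Q1 - 1.5 IQR and Q3 + 1.5 IQR, and it assumes the quartiles follow the position rule i = (p/100) n. The names `quantile`, `box_summary` and the parameter `k` are chosen here for illustration.

```python
def quantile(ordered, p):
    """p'th percentile of already sorted data, position rule i = (p/100) * n."""
    whole, rest = divmod(p * len(ordered), 100)
    if rest:                       # i not an integer: round the position up
        return ordered[whole]
    return (ordered[whole - 1] + ordered[whole]) / 2

def box_summary(values, k=1.5):
    """Statistics drawn in a box plot, plus assumed outlier limits.

    k = 1.5 is an assumed convention, not taken from the original page.
    """
    ordered = sorted(values)
    q1, md, q3 = (quantile(ordered, p) for p in (25, 50, 75))
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {
        "min": ordered[0], "q1": q1, "median": md, "q3": q3,
        "max": ordered[-1], "mean": sum(ordered) / len(ordered),
        "lower_limit": lower, "upper_limit": upper,
        "outliers": [x for x in ordered if x < lower or x > upper],
    }

pop1 = [39, 32, 20, 34, 40, 33, 31, 29, 25, 30, 31, 32, 22]
pop2 = [33, 30, 36, 45, 34, 28, 25, 32, 29, 34, 35, 31]
s1, s2 = box_summary(pop1), box_summary(pop2)
print(s1)
print(s2)
```

Under these assumptions the limits for population 2 are 22 and 42, so the observation 45 would be flagged as an outlier. A different limit convention would flag different points.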