Exercise 1

(Done using R Markdown)

1) Read the data and list the first five lines

x = read.csv("http://www.ic.unicamp.br/%7Ewainer/cursos/1s2014/dados1.csv", 
    sep = "\t")
head(x, 5)
##     A   B   C   D
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
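
As a side note on the call above, `read.csv` with `sep = "\t"` should behave like `read.delim`, whose default separator is a tab. The sketch below checks this on a small in-memory string; the string `txt` is a hypothetical stand-in for the file, not the exercise data.

```r
# Hypothetical in-memory stand-in for the tab-separated file.
txt <- "A\tB\tC\tD\n5.1\t3.5\t1.4\t0.2\n4.9\t3.0\t1.4\t0.2\n4.7\t3.2\t1.3\t0.2"

a <- read.csv(text = txt, sep = "\t")  # same style of call as in the exercise
b <- read.delim(text = txt)            # tab is the default separator here

identical(a, b)  # compare the two data frames
str(a)           # column names and types
```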

2) Which rows have missing values? Remove them

x[apply(is.na(x), 1, any), ]
##      A   B   C   D
## 35 4.9 3.1  NA 0.2
## 49 5.3  NA 1.5 0.2
x = na.omit(x)
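
An equivalent way to find the incomplete rows is `complete.cases`, which avoids the `apply` call. This is only a sketch on a hypothetical toy data frame `d`, not the exercise data.

```r
# Hypothetical toy data with two incomplete rows (not the exercise data).
d <- data.frame(A = c(5.1, 4.9, 5.3),
                B = c(3.5, NA, 3.7),
                C = c(1.4, 1.5, NA))

incomplete <- !complete.cases(d)  # TRUE for rows containing any NA
d[incomplete, ]                   # inspect the rows before dropping them
clean <- d[!incomplete, ]         # keeps the same rows that na.omit(d) keeps

nrow(clean)
```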

3) What are the outliers? I used a boxplot to see the distribution of values in each attribute.

boxplot(x)


So, I would say that the two extreme values, one in A and one in C, are outliers. I also checked D, and there is one negative value while all the others are positive.

hist(x$D)


I would consider that an outlier too. So the outliers are

x[x$A > 10 | x$C > 10 | x$D < 0, ]
##       A   B    C    D
## 58  4.9 2.4 12.4  1.0
## 89 15.6 3.0  4.1  1.3
## 90  5.5 2.5  4.0 -1.3

Remove them:

x = x[!(x$A > 10 | x$C > 10 | x$D < 0), ]
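
The thresholds above were picked by eye from the plots. As a rough cross-check, `boxplot.stats` returns the points that `boxplot` draws individually. The sketch below uses a hypothetical vector `v`, not the exercise data, and a rule like this may not flag every value judged an outlier above (for example, the negative value in D).

```r
# Hypothetical values: a tight cluster plus one extreme point.
v <- c(4.6, 4.7, 4.9, 5.0, 5.1, 5.3, 5.5, 15.6)

# boxplot.stats() reports the points lying more than coef times the
# hinge spread (roughly the IQR) beyond the hinges; coef = 1.5 by default.
flagged <- boxplot.stats(v)$out
flagged

# Keep only the values that were not flagged.
kept <- v[!(v %in% flagged)]
kept
```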

4) Histograms

hist(x$A, 10)


hist(x$A, 30)


The histogram with 10 bins shows that the distribution of A is unimodal, while the histogram with 30 bins is much less informative: its peaks and valleys are just “noise”.
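
Note that a single number passed as `breaks` to `hist` is only a suggestion, so the number of bins actually drawn can differ from the number requested. The sketch below, on a hypothetical random sample `v` rather than the exercise data, shows how to check the bin count and compares it with two standard rules for choosing it.

```r
set.seed(1)                            # reproducible hypothetical sample
v <- rnorm(150, mean = 5.8, sd = 0.8)  # not the exercise data

h10 <- hist(v, breaks = 10, plot = FALSE)
h30 <- hist(v, breaks = 30, plot = FALSE)

# Number of bins actually used; may differ from the number requested.
length(h10$counts)
length(h30$counts)

# Bin counts suggested by two standard rules.
nclass.Sturges(v)
nclass.FD(v)
```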

5) Covariance matrix

print(cov(x), digits = 2)
##        A      B     C     D
## A  0.693 -0.047  1.29  0.52
## B -0.047  0.188 -0.33 -0.12
## C  1.293 -0.331  3.15  1.31
## D  0.523 -0.122  1.31  0.59

There is no need to print numbers with all those digits just to get a sense of the data!
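
For reference, the matrix returned by `cov` is the cross-product of the column-centered data divided by n - 1. The sketch below checks this on a hypothetical toy data frame `d`, not the exercise data.

```r
# Hypothetical toy data (not the exercise data).
d <- data.frame(A = c(5.1, 4.9, 6.3, 5.8),
                C = c(1.4, 1.5, 4.9, 4.0))

m <- as.matrix(d)
centered <- sweep(m, 2, colMeans(m))            # subtract each column mean
by_hand <- crossprod(centered) / (nrow(m) - 1)  # t(centered) %*% centered / (n - 1)

all.equal(by_hand, cov(d))  # compare with the built-in result
print(by_hand, digits = 2)
```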

6) PCA

x.pca = princomp(x)
summary(x.pca)
## Importance of components:
##                        Comp.1  Comp.2  Comp.3   Comp.4
## Standard deviation     2.0607 0.48597 0.28276 0.155937
## Proportion of Variance 0.9258 0.05149 0.01743 0.005301
## Cumulative Proportion  0.9258 0.97727 0.99470 1.000000
plot(x.pca$sdev)


There are several rules of thumb for how many dimensions to keep based on the cumulative explained variance. In this case the first component alone explains about 93% of the variance, so keeping only one dimension seems reasonable. Read more at http://stackoverflow.com/questions/12067446/how-many-principal-components-to-take
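
One such rule keeps the smallest number of components whose cumulative proportion of variance reaches a chosen threshold. The sketch below is one way to apply it; the data (`iris`, built into R) and the 0.90 threshold are assumptions chosen so the example runs without the exercise file.

```r
# Stand-in data: the numeric columns of R's built-in iris data set,
# used here only so the sketch runs without the exercise file.
d <- iris[, 1:4]
p <- princomp(d)

var_explained <- p$sdev^2 / sum(p$sdev^2)  # proportion of variance per component
cum_explained <- cumsum(var_explained)     # cumulative proportion, as in summary()

threshold <- 0.90                          # hypothetical cut-off, not from the exercise
k <- which(cum_explained >= threshold)[1]  # smallest number of components reaching it
k
```

Raising or lowering `threshold` changes `k`, which is why the choice remains a judgment call rather than a fixed answer.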

7) X-Y plot of the two largest dimensions of the PCA

plot(x.pca$scores[, 1:2])


Or better, on the same scale, which shows that most of the variation is in the first dimension, as discussed above:

plot(x.pca$scores[, 1:2], asp = 1)

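
To make the point about scale concrete without a plot, the spread of each component's scores can be compared numerically. The sketch below again uses the built-in `iris` measurements as a stand-in for the exercise data, and also checks that the scores are the centered data projected onto the loadings.

```r
# Stand-in data (assumption): numeric columns of the built-in iris data set.
d <- iris[, 1:4]
p <- princomp(d)

# Scores are the centered data projected onto the loadings.
centered <- sweep(as.matrix(d), 2, p$center)
by_hand <- centered %*% unclass(p$loadings)
all.equal(unname(by_hand), unname(p$scores))

# Range covered by each of the first two components; with asp = 1
# these ranges are drawn at the same physical scale on both axes.
spread <- apply(p$scores[, 1:2], 2, function(s) diff(range(s)))
spread
```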