Exercise 1
========================================================

(Done using R Markdown)

1) Read the data and list the first five lines

```{r}
x = read.csv("http://www.ic.unicamp.br/%7Ewainer/cursos/1s2014/dados1.csv", sep="\t")
head(x, 5)
```

2) Which rows have missing values? Remove them.

```{r}
x[apply(is.na(x), 1, any), ]
x = na.omit(x)
```

3) What are the outliers? I used a boxplot to see the distribution of values in each attribute.

```{r fig.width=7, fig.height=6}
boxplot(x)
```

So I would say that the two extreme values of A and C are outliers. I also checked D: there is one negative value while all the others are positive.

```{r fig.width=7, fig.height=6}
hist(x$D)
```

I would consider that an outlier too. So the outliers are:

```{r}
x[x$A > 10 | x$C > 10 | x$D < 0, ]
```

Remove them:

```{r}
x = x[!(x$A > 10 | x$C > 10 | x$D < 0), ]
```

4) Histograms

```{r fig.width=7, fig.height=6}
hist(x$A, 10)
hist(x$A, 30)
```

The histogram with 10 bins shows that the distribution of A is unimodal, while the histogram with 30 bins is much less informative, with peaks and valleys that are just "noise".

5) Covariance matrix

```{r}
print(cov(x), digits=2)
```

There is no need to print numbers with all those digits just to get a sense of the data!

6) PCA

```{r fig.width=7, fig.height=6}
x.pca = princomp(x)
summary(x.pca)
plot(x.pca$sdev)
```

There are rules of thumb for how many dimensions to keep based on the cumulative explained variance. In this case, keeping only one dimension seems to be the right thing. Read more at http://stackoverflow.com/questions/12067446/how-many-principal-components-to-take

7) X-Y plot of the two largest dimensions of the PCA

```{r fig.width=7, fig.height=6}
plot(x.pca$scores[,1:2])
```

Or better, on the same scale, which shows that much of the variation is in the first dimension, as discussed above:

```{r fig.width=7, fig.height=6}
plot(x.pca$scores[,1:2], asp=1)
```
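A note on step 3: by default `boxplot()` flags as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule, on synthetic data (the vector `v` is illustrative, not from the exercise's CSV):

```{r}
set.seed(1)
v <- c(runif(50), 15)               # 50 ordinary values plus one planted extreme
q <- quantile(v, c(0.25, 0.75))     # first and third quartiles
iqr <- q[2] - q[1]                  # interquartile range
lo <- q[1] - 1.5 * iqr              # boxplot's default lower fence
hi <- q[2] + 1.5 * iqr              # and upper fence
out <- v[v < lo | v > hi]           # points beyond the fences are flagged
out
```

Here only the planted value 15 falls outside the fences, which matches what the whiskers of `boxplot(v)` would show.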
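On step 6's point about cumulative explained variance: a common rule of thumb is to keep the smallest number of components whose cumulative proportion of variance reaches some threshold. A minimal sketch, on synthetic data with one dominant direction (the data frame `d` and the 90% threshold are illustrative assumptions, not from the exercise):

```{r}
set.seed(1)
# Synthetic data where column A dominates the total variance.
d <- data.frame(A = rnorm(100, sd = 5),
                B = rnorm(100, sd = 1),
                C = rnorm(100, sd = 0.5),
                D = rnorm(100, sd = 0.1))
pca <- princomp(d)
prop <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance per component
cum <- cumsum(prop)                   # cumulative explained variance
k <- which(cum >= 0.9)[1]             # smallest k reaching the 90% threshold
k
```

With this data the first component alone passes 90%, so `k` is 1, mirroring the "keep only one dimension" conclusion drawn from `summary(x.pca)` above.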