Data Preprocessing with R

2017-03-21 09:41

http://blog.sina.cn/dpool/blog/u/2933499612

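I. Data Cleaning
1. Tidying data
Data should be arranged in rows and columns, where each row is one observation and each column is one attribute or value of that observation. The tidyr package in R can reshape data into this standard (tidy) form.
For example, take the following table: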
math English
Anna 86 90
John 43 75
Catherine 80 82
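This table holds three attributes: the student name (StudentName), the subject (Subject), and the grade (Grade). It is not in tidy form, though, because the subject is encoded in the column headers instead of in a column of its own, so some reshaping is needed. Start by reading the file: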
> library(readr)
> std <- read_delim('stud.txt',delim = ' ',skip = 1, col_names = c('StudentName','Math','English'))
> std
Then use the tidyr package:
> library(tidyr)
> stdL <- gather(std,Subject,Grade,Math:English)
> stdL
# A tibble: 6 × 3
  StudentName Subject Grade
        <chr>   <chr> <int>
1        Anna    Math    86
2        John    Math    43
3   Catherine    Math    80
4        Anna English    90
5        John English    75
6   Catherine English    82
The result is now in tidy form.
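In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below repeats the same reshaping with that function. It assumes a recent tidyr and builds the table inline (the data frame stands in for 'stud.txt'), so it runs without any input file.

```r
library(tidyr)

# Inline copy of the grade table, standing in for 'stud.txt'
std <- data.frame(
  StudentName = c("Anna", "John", "Catherine"),
  Math        = c(86L, 43L, 80L),
  English     = c(90L, 75L, 82L),
  stringsAsFactors = FALSE
)

# Stack the Math and English columns into Subject/Grade pairs
stdL <- pivot_longer(std, cols = Math:English,
                     names_to = "Subject", values_to = "Grade")
stdL
```

The rows come out grouped by student rather than by subject, so the order differs from the gather() output, but the six StudentName/Subject/Grade combinations are the same.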
Now suppose a table packs several different values into a single column. To bring it into tidy form, that column has to be split apart. Here is an example:
> std2 <- read_delim('stud2.txt',delim = ' ',skip = 1,col_names = c('StudentName','Math','English','Degree_Year'))
Parsed with column specification:
cols(
StudentName = col_character(),
Math = col_integer(),
English = col_integer(),
Degree_Year = col_character()
)
> std2
# A tibble: 3 × 4
  StudentName  Math English Degree_Year
        <chr> <int>   <int>       <chr>
1        Anna    86      90    Bio_2014
2        John    43      75   Math_2013
3   Catherine    80      82    Bio_2012
> std2L <- gather(std2,Subject,Grade,Math:English)
> std2L <- separate(std2L,Degree_Year,c('Degree','Year'))
> std2L
# A tibble: 6 × 5
  StudentName Degree  Year Subject Grade
*       <chr>  <chr> <chr>   <chr> <int>
1        Anna    Bio  2014    Math    86
2        John   Math  2013    Math    43
3   Catherine    Bio  2012    Math    80
4        Anna    Bio  2014 English    90
5        John   Math  2013 English    75
6   Catherine    Bio  2012 English    82
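By default separate() splits on any run of non-alphanumeric characters and leaves the new columns as character, so Year above is text rather than a number. A minimal sketch, using an inline data frame in place of 'stud2.txt', shows how the sep and convert arguments can make the split explicit and give Year an integer type:

```r
library(tidyr)

# Inline stand-in for the relevant columns of 'stud2.txt'
std2 <- data.frame(
  StudentName = c("Anna", "John", "Catherine"),
  Degree_Year = c("Bio_2014", "Math_2013", "Bio_2012"),
  stringsAsFactors = FALSE
)

# sep = "_" splits only at the underscore;
# convert = TRUE lets separate() pick a suitable type for each new column
std2s <- separate(std2, Degree_Year, into = c("Degree", "Year"),
                  sep = "_", convert = TRUE)
str(std2s)
```

Having Year as an integer makes later numeric filtering or sorting by year straightforward.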
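2. Handling dates and times
Very often, values that represent dates or times arrive as strings and have to be converted into a proper date type. More on working with time data in R can be found at https://cran.r-project.org/web/views/TimeSeries.html. The lubridate package covers most of this conversion. Its parsing functions are named after the order of the components in the input, combining the letters y, m, d (year, month, day) and h, m, s (hour, minute, second). Date-only parsers such as ymd() return a Date object, while date-time parsers such as dmy_hms() return a POSIXct object, a very flexible date-time type. For example: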
> library(lubridate)
> ymd('20151021')
[1] '2015-10-21'
> ymd('2015/11-30')
[1] '2015-11-30'
> myd('11.2015.3')
[1] '2015-11-03'
> dmy_hms('2/12/2013 14:05:01')
[1] '2013-12-02 14:05:01 UTC'
> mdy('120112')
[1] '2012-12-01'
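To see which type each parser produces, the classes of the results can be inspected directly. The sketch below also shows what happens with input that cannot be parsed; the string 'not a date' is just an illustrative value.

```r
library(lubridate)

d  <- ymd("20151021")                  # date only
dt <- dmy_hms("2/12/2013 14:05:01")    # date plus time of day

class(d)     # Date
class(dt)    # POSIXct, POSIXt

# Input that matches no known format yields NA (lubridate also warns)
bad <- suppressWarnings(ymd("not a date"))
is.na(bad)
```

Checking for NA after parsing is a cheap way to catch rows whose date strings did not follow the expected order.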
These functions are vectorized, so they can also parse a whole vector whose elements use different formats:
> dates <- c(20120521, '2010-12-12', '2007/01/5', '2015-2-04',
+ 'Measured on 2014-12-6', '2013-7+ 25')
> dates <- ymd(dates)
> dates
[1] '2012-05-21' '2010-12-12' '2007-01-05' '2015-02-04' '2014-12-06' '2013-07-25'
When the day of the week or other components are needed, lubridate provides accessor functions for them:
> data.frame(Dates=dates,WeekDay=wday(dates),nWeekDay=wday(dates,label=TRUE),
+ Year=year(dates),Month=month(dates,label=TRUE))
Dates WeekDay nWeekDay Year Month
1 2012-05-21 2 Mon 2012 May
2 2010-12-12 1 Sun 2010 Dec
3 2007-01-05 6 Fri 2007 Jan
4 2015-02-04 4 Wed 2015 Feb
5 2014-12-06 7 Sat 2014 Dec
6 2013-07-25 5 Thurs 2013 Jul
When time zones are involved, the following code can serve as a reference:
> date <- ymd_hms('20150823 18:00:05',tz='Europe/Berlin')
> date
[1] '2015-08-23 18:00:05 CEST'
> with_tz(date,tz='Pacific/Auckland') # show the same instant as Auckland clock time
[1] '2015-08-24 04:00:05 NZST'
> force_tz(date,tz='Pacific/Auckland') # keep the clock time, force the time zone
[1] '2015-08-23 18:00:05 NZST'
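The difference between the two functions is easiest to see by comparing the results with the original value. In this sketch the variable names are illustrative, and the time zone argument is spelled out as tzone, which is the full name of the argument that tz= abbreviates in the calls above:

```r
library(lubridate)

berlin <- ymd_hms("20150823 18:00:05", tz = "Europe/Berlin")

# with_tz(): same moment in time, displayed on an Auckland clock
same_instant <- with_tz(berlin, tzone = "Pacific/Auckland")

# force_tz(): same clock reading, which is a different moment in time
same_clock <- force_tz(berlin, tzone = "Pacific/Auckland")

same_instant == berlin                          # TRUE
difftime(berlin, same_clock, units = "hours")   # 10 hours
```

with_tz() is therefore the right choice for displaying a timestamp elsewhere, while force_tz() is meant for repairing data that was read in with the wrong time zone attached.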
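3. String processing
This section uses the stringr package and shows a few common string-processing functions. Readers who need more, or more complex, string operations can explore the stringi package. The code block below fetches data from a website. Two files are read: a CSV file that holds the data we will use, and a '.names' file that holds the descriptive text we need. Code: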
> library(stringr)
> library(readr)
> uci.repo <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/'
> dataset <- 'audiology/audiology.standardized'
> dataF <- str_c(uci.repo,dataset,'.data')
> namesF <- str_c(uci.repo,dataset,'.names')
> ## Reading the data file
> data <- read_csv(url(dataF), col_names=FALSE, na='?')
Parsed with column specification:
cols(
.default = col_character()
)
See spec(...) for full column specifications.
> data
# A tibble: 200 × 71
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14

1 f mild f normal normal t f f f f f f
2 f moderate f normal normal t f f f f f f
3 t mild t absent mild t f f f f f f
4 t mild t absent mild f f f f f f f
5 t mild f normal normal mild t f f f f f f
6 t mild f normal normal mild t f f f f f f
7 f mild f normal normal mild t f f f f f f
8 f mild f normal normal mild t f f f f f f
9 f severe f t f f f f f f
10 t mild f elevated absent mild t f f f f f f
# ... with 190 more rows, and 57 more variables: X15 , X16 , X17 , X18 ,
# X19 , X20 , X21 , X22 , X23 , X24 , X25 , X26 ,
# X27 , X28 , X29 , X30 , X31 , X32 , X33 , X34 ,
# X35 , X36 , X37 , X38 , X39 , X40 , X41 , X42 ,
# X43 , X44 , X45 , X46 , X47 , X48 , X49 , X50 ,
# X51 , X52 , X53 , X54 , X55 , X56 , X57 , X58 ,
# X59 , X60 , X61 , X62 , X63 , X64 , X65 , X66 ,
# X67 , X68 , X69 , X70 , X71
> dim(data)
[1] 200 71
> ## Now reading the names file
> text <- read_lines(url(namesF))
> text[1:3]
[1] 'WARNING: This database should be credited to the original owner whenever'
[2] ' used for any publication whatsoever.'
[3] ''
> length(text)
[1] 178
> text[67:70]
[1] ' age_gt_60:\t\t f, t.'
[2] ' air():\t\t mild,moderate,severe,normal,profound.'
[3] ' airBoneGap:\t\t f, t.'
[4] ' ar_c():\t\t normal,elevated,absent.'
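In this code, str_c() works like paste0() in base R: it concatenates strings. read_lines() reads a text file and returns a character vector in which each line of the file is one element.
Next, the following code splits each line of text at the separator ':', keeps the part before it, and removes the other unwanted characters: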
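As a small illustration of the comparison with paste0(), the sketch below builds both URLs in one vectorized call and shows one behavior where the two functions differ, namely missing values. The variable names are illustrative.

```r
library(stringr)

uci.repo <- "https://archive.ics.uci.edu/ml/machine-learning-databases/"
dataset  <- "audiology/audiology.standardized"

# One call builds both file URLs, because str_c() is vectorized
urls <- str_c(uci.repo, dataset, c(".data", ".names"))
urls

# Missing values propagate in str_c() but are pasted as text by paste0()
str_c("a", NA)     # NA
paste0("a", NA)    # "aNA"
```

The NA behavior matters when URLs or labels are assembled from columns that may contain missing values.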

nms <-
str_split_fixed(text[67:135],':',n=2)[,1]
# get the names nms[1:3] nms
<- str_trim(nms) # trim white space
nms[1:3] nms <-
str_replace_all(nms,'\\(|\\)','')
# delete invalid chars.
nms[1:3]
colnames(data)[1:69] <-
nms

data[1:3,1:10]
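The same three steps can be tried without downloading anything. The sketch below applies them to the four lines printed for text[67:70] earlier; the vector name lines is illustrative.

```r
library(stringr)

# Four lines as printed for text[67:70]
lines <- c(" age_gt_60:\t\t f, t.",
           " air():\t\t mild,moderate,severe,normal,profound.",
           " airBoneGap:\t\t f, t.",
           " ar_c():\t\t normal,elevated,absent.")

nms <- str_split_fixed(lines, ":", n = 2)[, 1]   # keep the text before the first colon
nms <- str_trim(nms)                             # drop surrounding white space
nms <- str_replace_all(nms, "\\(|\\)", "")       # remove the parentheses
nms
# [1] "age_gt_60"  "air"        "airBoneGap" "ar_c"
```

str_split_fixed() returns a character matrix with n columns, which is why [, 1] selects the attribute names in a single step.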
II. Transforming Data
1. Data standardization
2.