-
R Programming (5) - Sophisticated data structuresR Programming 2020. 3. 28. 04:35728x90
Sophisticated data structures
Sophisticated data structures
- R provides sophisticated structure for the storage and manipulation of data.
- simplify data representation, manipulation and analysis.
- dataframe
- is like a matrix,
- but extended to allow for different object models in different columns
- list
- is a general data structure that can house many kinds of R object.
- factor
- special variable that represent categorical objects.
Factor
A factor is a vector that can contain only predefined values, and is used to store categorical data
Factors are built on top of integer vectors using two attributes: the
class
, “factor”, which makes them behave differently from regular integer vectors, and thelevels
, which defines the set of allowed values.- Data type for ordinal and categorical vectors.
- The possible values of a factor are referred to as its level.
- To create a factor, apply function
factor
to some vector.
hair <- c("blonde", "black", "brown", "brown", "black", "gray", "none") is.character(hair)
## [1] TRUE
is.factor(hair)
## [1] FALSE
hair <- factor(hair) is.factor(hair)
## [1] TRUE
class(hair)
## [1] "factor"
levels(hair)
## [1] "black" "blonde" "brown" "gray" "none"
table()
calculates the number of times each level of the factor appears.table(hair)
## hair ## black blonde brown gray none ## 2 1 2 1 1
- Specify level using
levels
argument.
hair <- factor(hair, levels=c("black", "blonde", "brown", "gray", "white", "none")) table(hair)
## hair ## black blonde brown gray white none ## 2 1 2 1 0 1
You can’t use values that are not in the levels.
hair[2] <- "green"
## Warning in `[<-.factor`(`*tmp*`, 2, value = "green"): invalid factor level, ## NA generated
# reset hair[2] <- "black"
Indeed, the type of a factor is integer.
typeof(hair)
## [1] "integer"
as.numeric(hair)
## [1] 2 1 3 3 1 4 6
Dataframe
A data frame is the most common way of storing data in R.
- Dataframe is a list of vectors (with equal length).
- Each vector (column) is a variable in experiment.
- Each row is a single observation.
creating data.frame
You create a data.frame using
data.frame()
, which takes named vectors or existing data.frame as input:data.frame(col1=x1, col2=x2, …, df1, df2, …)
-
col1
andcol2
are column names -
x1
andx2
are vectors of equal lengths -
df1
anddf2
are dataframes whose columns must be the same length withx1
,x2
df <- data.frame(x = 2:4, y = c("a", "b", "c")) str(df)
## 'data.frame': 3 obs. of 2 variables: ## $ x: int 2 3 4 ## $ y: Factor w/ 3 levels "a","b","c": 1 2 3
data.frame()
turns strings into factors. UsestringsAsFactors = FALSE
to suppress this behaviour:df <- data.frame( x = 2:4, y = c("a", "b", "c"), stringsAsFactors = FALSE) str(df)
## 'data.frame': 3 obs. of 2 variables: ## $ x: int 2 3 4 ## $ y: chr "a" "b" "c"
typeof(df)
## [1] "list"
class(df)
## [1] "data.frame"
is.data.frame(df)
## [1] TRUE
Combinding dataframes:
cbind(df, data.frame(z = 3:1))
## x y z ## 1 2 a 3 ## 2 3 b 2 ## 3 4 c 1
rbind(df, data.frame(x = 10, y = "z"))
## x y ## 1 2 a ## 2 3 b ## 3 4 c ## 4 10 z
read.table
You can use
read.table
to create data.frame from a file.Plot Tree Species Diameter Height 2 1 DF 39 20.5 2 2 WL 48 33.0 3 2 GF 52 30.0 … … … … … - Large dataframe are usually read into R from a file, using
read.table
.
read.table(file, header=FALSE, sep="")
file
- filename to be read, can be relative or absolute.
- the same number of values in each row.
- the values may be different modes, but the pattern of modes must be the same in each row.
header
- whether or not the first line of the file is the variable names
sep
- gives the character used to separate values in each row.
- default
""
has the special interpretation that a variable amount of white space (spaces, tabs, returns) can separate values.
more about read.table
- Two commonly used variants of
read.table
read.csv(file)
- for comma-separated data
- equivalent to
read.table(file, header=TRUE, sep=",")
read.delim(file)
- for tab-delimitated data
- equivalent to
read.table(file, header=TRUE, sep="\t")
Header
- If a header is present, it is used to name the columns of the dataframe.
- You can assign your own column names after reading the dataframe (using the
names
function). - Or when you read it in, using the
col.names
argument,- which should be assigned a character vector the same lengths of the column number.
- If there is no head and no
col.names
argument, then R uses the names “V1”, “V2”, etc.
read csv file
- ufc.csv is comma separated and there is a header line.
ufc <- read.csv("the path of ufc.csv")
- head and tail to examine the object
head(ufc)
## plot tree species dbh.cm height.m ## 1 2 1 DF 39 20.5 ## 2 2 2 WL 48 33.0 ## 3 3 2 GF 52 30.0 ## 4 3 5 WC 36 20.7 ## 5 3 8 WC 38 22.5 ## 6 4 1 WC 46 18.0
tail(ufc)
## plot tree species dbh.cm height.m ## 331 143 1 GF 28.0 21.0 ## 332 143 2 GF 33.0 20.5 ## 333 143 7 WC 47.8 20.5 ## 334 144 1 GF 10.2 16.0 ## 335 144 2 DF 31.5 22.0 ## 336 144 4 WL 26.5 25.0
- Each column has a unique name and we can extract that variable by means of names using
$
.
x <- ufc$height.m x[1:5]
## [1] 20.5 33.0 30.0 20.7 22.5
accessing element in data.frame
- We can use
[[?]]
to extract columns.ufc$height.m
,ufc[[5]]
andufc[["height.m"]]
are all equivalent.- The result is a vector.
x1 <- ufc[["height.m"]] x1[1:5]
## [1] 20.5 33.0 30.0 20.7 22.5
x2 <- ufc[[5]] x2[1:5]
## [1] 20.5 33.0 30.0 20.7 22.5
- You can extract the elements of a dataframe directly using matrix indexing.
# the result is a vector ufc[1:5, 5]
## [1] 20.5 33.0 30.0 20.7 22.5
- To select more than one of the variables in a dataframe, we use
[?]
.ufc[4:5]
is equivalent toufc[c("dbh.cm", "height.m")]
.[]
preserves the type of dataframe.
#diam.height and z are data.frame diam.height <- ufc[4:5] (z <- diam.height[1:5, ])
## dbh.cm height.m ## 1 39 20.5 ## 2 48 33.0 ## 3 52 30.0 ## 4 36 20.7 ## 5 38 22.5
is.data.frame(diam.height)
## [1] TRUE
# diam.height1 is a data.frame diam.height1 <- ufc[5] is.data.frame(diam.height1)
## [1] TRUE
# z1 is not a data.frame, but a vector, since [,] simplifies. z1 <- diam.height1[1:5, ] is.data.frame(z1)
## [1] FALSE
-
Selecting a column using
[[?]]
preserves the type of the object that is being extracted. -
Whereas,
[?]
keeps the type of the object from which the extraction is being made.
typeof(ufc)
## [1] "list"
typeof(ufc[5])
## [1] "list"
# numeric vector typeof(ufc[[5]])
## [1] "double"
- In addition,
[,]
tends to simplify the data type, if possible.
typeof(ufc[1:5, 5])
## [1] "double"
# In this case, simplification is impossible, and original data.frame type is preserved. typeof(ufc[1:5, 4:5])
## [1] "list"
class(ufc[1:5, 4:5])
## [1] "data.frame"
create new variable
- create a new variable within a dataframe, by naming it and assigning it a value.
ufc$volume.m3 <- pi * (ufc$dbh.cm / 200)^2 * ufc$height / 2 mean(ufc$volume.m3)
## [1] 1.93294
name of column
-
names(df)
return the names of the dataframedf
. -
To change the names of
df
you pass a vector of strings tonames(df)
.
(ufc.names <- names(ufc))
## [1] "plot" "tree" "species" "dbh.cm" "height.m" "volume.m3"
names(ufc) <- c("P", "T", "S", "D", "H", "V") names(ufc)
## [1] "P" "T" "S" "D" "H" "V"
names(ufc) <- ufc.names
subset
subset
is a convenient tool for selecting rows of dataframe.x %in% y
returns a logical vector whose i-th element isTRUE
ifx[i]
is iny
.
fir.height <- subset(ufc, subset=species %in% c("DF", "GF"), select = c(plot, tree, height.m)) head(fir.height)
## plot tree height.m ## 1 2 1 20.5 ## 3 3 2 30.0 ## 7 4 2 17.0 ## 8 5 2 29.3 ## 9 5 4 29.0 ## 10 6 1 26.0
write a dataframe
write.table(x, file="", append=FALSE, sep=" ", row.names=TRUE, col.names=TRUE)
x
is a dataframe to be written.file
is the address of the file to write to.append
indicates whether or not to append.sep
is the character used to separate the values.row.name
indicates whether or not to include the row names as the first column.col.names
indicates whether or not to include the column names as the first row.
Another method to read and write tabular data
read_csv()
andread_tsv()
are special cases of the generalread_delim()
inreadr
package.- They’re useful for reading the most common types of flat file data, comma separated values and tab separated values, respectively.
ufc_tibble
is a data frame providing a nicer printing method, useful when working with large data sets.
library(readr) ufc_tibble <- read_csv("./data/ufc.csv")
## Parsed with column specification: ## cols( ## plot = col_integer(), ## tree = col_integer(), ## species = col_character(), ## dbh.cm = col_double(), ## height.m = col_double() ## )
List
-
We have seen that a vector is an indexed set of objects.
-
All the elements of a vector to be the same type - numeric, character or logical - which is called the mode of the vector.
- List is an indexed set of objects, but the element of a list can be of different type, including other list.
- The mode of a list is list.
-
A list is created using the
list(...)
command instead ofc()
, with comma separated arguments. -
Single square brackets
[ ]
are used to select a sublist. -
Double square brackets
[[ ]]
are used to extract a single element.
Example
(my.list <- list("one", TRUE, 3, c("f", "o", "u", "r")))
## [[1]] ## [1] "one" ## ## [[2]] ## [1] TRUE ## ## [[3]] ## [1] 3 ## ## [[4]] ## [1] "f" "o" "u" "r"
str(my.list)
## List of 4 ## $ : chr "one" ## $ : logi TRUE ## $ : num 3 ## $ : chr [1:4] "f" "o" "u" "r"
my.list[[2]]
## [1] TRUE
mode(my.list[[2]])
## [1] "logical"
my.list[2]
## [[1]] ## [1] TRUE
mode(my.list[2])
## [1] "list"
my.list[[4]][1]
## [1] "f"
my.list[4][1]
## [[1]] ## [1] "f" "o" "u" "r"
-
When displaying a list,
-
R uses double brackets
[[1]], [[2]], etc.
, to indicate list elements, -
then single brackets
[1], [2], etc.
, to indicate vector elements.
# list can contain other list my.list2 <- list("a", c(5,6,7), my.list)
my.list2[3]
## [[1]] ## [[1]][[1]] ## [1] "one" ## ## [[1]][[2]] ## [1] TRUE ## ## [[1]][[3]] ## [1] 3 ## ## [[1]][[4]] ## [1] "f" "o" "u" "r"
#error my.list2[3][[4]]
my.list2[[3]]
## [[1]] ## [1] "one" ## ## [[2]] ## [1] TRUE ## ## [[3]] ## [1] 3 ## ## [[4]] ## [1] "f" "o" "u" "r"
my.list2[[3]][4]
## [[1]] ## [1] "f" "o" "u" "r"
my.list2[[3]][[4]]
## [1] "f" "o" "u" "r"
name of elements of list
The element of a list can be named when the list is created.
my.list <- list(first = "one", second=TRUE, third=3, fourth = c("f", "o", "u", "r")) names(my.list)
## [1] "first" "second" "third" "fourth"
my.list$second
## [1] TRUE
The element of a list can be named by names
attribute
.names(my.list) <- c("Fi", "Se", "Th", "Fo") my.list$Se
## [1] TRUE
my.list$"Se"
## [1] TRUE
my.list[["Se"]]
## [1] TRUE
Simplify vs. Preserving
type Simplifying Preserving vector x[[1]]
x[1]
array(matrix) x[1, ]
,x[, 1]
x[1, , drop = F]
,x[, 1, drop = F]
list x[[1]]
,x$a
,x[["a"]]
x[1]
,x["a"]
data.frame x[, 1]
x[1, ]
,x[, 1, drop = F]
- Note that data.frame is a list, and hence simlifying and preserving rules for list are also applied to data.frame.
list as output of function
-
Many functions produce list object as their ouput.
-
For example, when fit a least squares regression, the regression object itself is a list,
-
and can manipulated using list operations.
lm.xy <- lm(y~x, data=data.frame(x=1:5, y=1:5)) typeof(lm.xy)
## [1] "list"
names(lm.xy)
## [1] "coefficients" "residuals" "effects" "rank" ## [5] "fitted.values" "assign" "qr" "df.residual" ## [9] "xlevels" "call" "terms" "model"
str
can be used to summarize a list or dataframe :str(lm.xy)
apply family
-
R has several functions that allow you to easily apply a function to all or selected elements of a list or dataframe.
-
tapply
allows us to vectorise the application of a function to subsets of data.
tapply(X, INDEX, FUN, …)
-
X
: target vector to be applied. -
INDEX
: a factor, the same length asX
, used to group the elements. (IfINDEX
is not a factor, then it will be automatically converted to a factor.) -
FUN
: a function to be applied. It is applied to subvectors ofX
corresponding to a single level ofINDEX
. -
tapply
works similar with groupby + summary function in SQL. -
INDEX
denotes groupby andFUN
represents a summary function.
https://www.rdocumentation.org/packages/base/versions/3.4.0/topics/tapply
tapply example
Using
tapply
we obtain average height group by species as follows:tapply(ufc$height.m, ufc$species, mean)
## DF GF WC WL ## 25.30000 24.34322 23.48777 25.47273
To round the results:
round(tapply(ufc$height.m, ufc$species, mean), digits=1)
## DF GF WC WL ## 25.3 24.3 23.5 25.5
To find out how many samples we have of each species:
tapply(ufc$species, ufc$species, length)
## DF GF WC WL ## 57 118 139 22
# the same result tapply(ufc$height.m, ufc$species, length)
## DF GF WC WL ## 57 118 139 22
Example : tapply
tapply
can be used to count the number of elements in a vector:x <- c(1,2,3,4,5,4,6,2,5,6,5,3,4,1,4,5,6,7,2,2,6,7,9,3,5) tapply(X = x, INDEX = x, FUN = length)
## 1 2 3 4 5 6 7 9 ## 2 4 3 4 5 4 2 1
the above is almost the same as
table(x)
## x ## 1 2 3 4 5 6 7 9 ## 2 4 3 4 5 4 2 1
y <- x %% 2 tapply(X = x, INDEX = y, FUN = length)
## 0 1 ## 12 13
y <- x %% 2 tapply(X = x, INDEX = y, FUN = max)
## 0 1 ## 6 9
lapply and sapply
-
lapply(X, FUN, …)
applies the functionFUN
to each element of the listX
and returns a list. -
sapply(X, FUN, …)
applies the functionFUN
to each element ofX
, which can be a list or a vector, and by default will try to return the results in a vector or a matrix, if this make sense, otherwise in a list.
To obtain the mean diameter, height, and volume of trees:
lapply(ufc[4:6], mean) # return a list
## $dbh.cm ## [1] 37.41369 ## ## $height.m ## [1] 24.2256 ## ## $volume.m3 ## [1] 1.93294
sapply(ufc[4:6], mean) # return a vector
## dbh.cm height.m volume.m3 ## 37.41369 24.22560 1.93294
Simple example:
sapply(1:3, function(x) x^2)
## [1] 1 4 9
lapply(1:3, function(x) x^2)
## [[1]] ## [1] 1 ## ## [[2]] ## [1] 4 ## ## [[3]] ## [1] 9
lapply example
df <- data.frame(replicate(6, sample(c(1:10, -99), 6, rep = TRUE))) names(df) <- letters[1:6] df
## a b c d e f ## 1 -99 7 1 10 5 3 ## 2 10 -99 5 6 8 7 ## 3 10 -99 5 1 9 7 ## 4 7 10 -99 10 2 10 ## 5 1 8 7 7 -99 3 ## 6 5 8 4 3 -99 7
You want to replace all the -99s with NAs.
fix_missing <- function(x) { x[x == -99] <- NA x } df[] <- lapply(df, fix_missing) #list to data.frame, use df[] df
## a b c d e f ## 1 NA 7 1 10 5 3 ## 2 10 NA 5 6 8 7 ## 3 10 NA 5 1 9 7 ## 4 7 10 NA 10 2 10 ## 5 1 8 7 7 NA 3 ## 6 5 8 4 3 NA 7
More about apply : http://adv-r.had.co.nz/Functional-programming.html
three dots ellipsis … in function argument
The three dots ellipsis
...
in function argument is used to get a variable-length argument list.- One example is
paste
, where you can put a variable numbers of arguments. Check?paste
.
To handle
...
, first convert it to a list inside the function.addemup <- function(x, ...){ args <- list(...) for (a in args) x <- x + a x }
addemup(1, 1)
## [1] 2
addemup(1, 2, 3, 4, 5)
## [1] 15
reverse_paste0 <- function(...){ result <- "" args <- list(...) for (i in length(args):1){ result <- paste0(result, args[i]) } result }
reverse_paste0("a", "b", "c")
## [1] "cba"
reverse_paste0("a", "b", "c", "d", "e")
## [1] "edcba"
728x90'R Programming' 카테고리의 다른 글
R Programming (7) - Discrete random variable (0) 2020.03.28 R Programming (6) - Numerical integration (0) 2020.03.28 R Programming (4) - Function (0) 2020.03.28 R Programming (3) - IO (0) 2020.03.28 R Programming (2) - R basic (0) 2020.03.28 - R provides sophisticated structure for the storage and manipulation of data.