R Programming (5) - Sophisticated data structures

R Programming 2020. 3. 28. 04:35


Sophisticated data structures

R provides sophisticated structures for the storage and manipulation of data.
- They simplify data representation, manipulation and analysis.
dataframe
- is like a matrix,
- but extended to allow columns of different modes (e.g. numeric, character, logical).
list
- is a general data structure that can hold many kinds of R objects.
factor
- is a special kind of variable that represents categorical data.

Factor

A factor is a vector that can contain only predefined values, and is used to store categorical data.

Factors are built on top of integer vectors using two attributes: the class, “factor”, which makes them behave differently from regular integer vectors, and the levels, which define the set of allowed values.

A factor is the data type for ordinal and categorical vectors.
The possible values of a factor are referred to as its levels.
To create a factor, apply the function factor() to a vector.

hair <- c("blonde", "black", "brown", "brown", "black", "gray", "none")
is.character(hair)

## [1] TRUE

is.factor(hair)

## [1] FALSE

hair <- factor(hair)
is.factor(hair)

## [1] TRUE

class(hair)

## [1] "factor"

levels(hair)

## [1] "black"  "blonde" "brown"  "gray"   "none"

table() calculates the number of times each level of the factor appears.

table(hair)

## hair
##  black blonde  brown   gray   none 
##      2      1      2      1      1

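Specify the levels explicitly using the levels argument. Levels that do not occur in the data (here "white") are kept with a count of 0.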

hair <- factor(hair, levels=c("black", "blonde", "brown", "gray", "white", "none"))
table(hair)

## hair
##  black blonde  brown   gray  white   none 
##      2      1      2      1      0      1
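The order given in levels also matters when the data are ordinal. A minimal sketch, assuming a hypothetical sizes vector that is not part of the original example, using the ordered = TRUE argument of factor():

```r
# Hypothetical ordinal data (not from the original post).
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

is.ordered(sizes)   # TRUE
as.integer(sizes)   # 1 3 2 1: the codes follow the order given in levels
sizes < "large"     # TRUE FALSE TRUE TRUE: comparisons use the level order
```

For an unordered factor such as hair, comparisons like < are not meaningful; only an ordered factor supports them.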

You can’t assign values that are not among the levels; the element becomes NA and a warning is issued.

hair[2] <- "green"

## Warning in `[<-.factor`(`*tmp*`, 2, value = "green"): invalid factor level,
## NA generated

# reset
hair[2] <- "black"

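Internally, a factor is stored as an integer vector: typeof() reports integer, and as.numeric() returns the integer codes of the levels rather than the labels.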

typeof(hair)

## [1] "integer"

as.numeric(hair)

## [1] 2 1 3 3 1 4 6
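Because as.numeric() returns the codes, it can be surprising when the labels themselves look like numbers. A small sketch, with a made-up factor f used only for illustration, of how to get the labels back as numbers:

```r
# Hypothetical factor whose labels happen to be numbers.
f <- factor(c("10", "20", "10", "5"))

levels(f)                     # "10" "20" "5" (labels are sorted as strings)
as.numeric(f)                 # 1 2 1 3: the integer codes, not the labels
as.numeric(as.character(f))   # 10 20 10 5: convert to character first
```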

Dataframe

A data frame is the most common way of storing data in R.

A dataframe is a list of vectors of equal length.
Each vector (column) is a variable in the experiment.
Each row is a single observation.

creating data.frame

You create a data.frame using data.frame(), which takes named vectors or existing data.frames as input:

data.frame(col1=x1, col2=x2, …, df1, df2, …)

col1 and col2 are column names.
x1 and x2 are vectors of equal length.
df1 and df2 are dataframes whose number of rows must equal the length of x1 and x2.

df <- data.frame(x = 2:4, y = c("a", "b", "c"))
str(df)

## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  2 3 4
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

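In R versions before 4.0.0, data.frame() turns strings into factors by default, as in the output above. Use stringsAsFactors = FALSE to suppress this behaviour (since R 4.0.0 this is already the default):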

df <- data.frame(
  x = 2:4,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE)
str(df)

## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  2 3 4
##  $ y: chr  "a" "b" "c"

typeof(df)

## [1] "list"

class(df)

## [1] "data.frame"

is.data.frame(df)

## [1] TRUE
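Returning to the general form data.frame(col1=x1, col2=x2, …, df1, df2, …), the following sketch mixes named vectors with an existing data.frame. The names x1, x2 and df1 follow the signature above; their contents and the column name flag are made up for illustration.

```r
x1  <- 1:3
x2  <- c("a", "b", "c")
df1 <- data.frame(flag = c(TRUE, FALSE, TRUE))   # hypothetical existing data.frame

# Named vectors become columns col1 and col2; df1 contributes its own columns.
combined <- data.frame(col1 = x1, col2 = x2, df1)

names(combined)   # "col1" "col2" "flag"
dim(combined)     # 3 3
```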

Combining dataframes: cbind adds columns and rbind adds rows.

cbind(df, data.frame(z = 3:1))

##   x y z
## 1 2 a 3
## 2 3 b 2
## 3 4 c 1

rbind(df, data.frame(x = 10, y = "z"))

##    x y
## 1  2 a
## 2  3 b
## 3  4 c
## 4 10 z

read.table

You can use read.table to create a data.frame from a file with contents such as:

Plot	Tree	Species	Diameter	Height
2	1	DF	39	20.5
2	2	WL	48	33.0
3	2	GF	52	30.0
…	…	…	…	…

Large dataframes are usually read into R from a file, using read.table.

read.table(file, header=FALSE, sep="")

file
- the name of the file to be read; the path can be relative or absolute.
- each row of the file must contain the same number of values.
- the values may have different modes, but the pattern of modes must be the same in each row.
header
- whether or not the first line of the file contains the variable names.
sep
- the character used to separate values in each row.
- the default "" has the special interpretation that any amount of white space (spaces, tabs, newlines or carriage returns) can separate values.
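To try these arguments without the original data file, the sketch below writes the three sample rows shown earlier to a temporary file and reads them back. The temporary file and the variable name trees are assumptions made for the example.

```r
# Write the sample rows to a temporary, whitespace-separated file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Plot Tree Species Diameter Height",
             "2 1 DF 39 20.5",
             "2 2 WL 48 33.0",
             "3 2 GF 52 30.0"), tmp)

# header = TRUE: the first line holds the variable names.
# sep = "": any amount of white space separates the values.
trees <- read.table(tmp, header = TRUE, sep = "")
str(trees)

unlink(tmp)   # remove the temporary file
```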

read csv file

ufc.csv is comma separated and has a header line, so it can be read with read.csv, whose defaults are header = TRUE and sep = ",".

ufc <- read.csv("the path of ufc.csv")
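As a rough check of how read.csv relates to read.table, the sketch below reads the same small comma-separated file both ways. The file contents and the names via_csv and via_table are invented for the example; the real ufc.csv is not needed.

```r
# A tiny stand-in for a comma-separated file with a header line.
tmp <- tempfile(fileext = ".csv")
writeLines(c("plot,tree,species,dbh.cm,height.m",
             "2,1,DF,39,20.5",
             "2,2,WL,48,33.0"), tmp)

via_csv   <- read.csv(tmp)
via_table <- read.table(tmp, header = TRUE, sep = ",")

all.equal(via_csv, via_table)   # TRUE for this simple file

unlink(tmp)
```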

Use head and tail to examine the first and last rows of the object.

head(ufc)

##   plot tree species dbh.cm height.m
## 1    2    1      DF     39     20.5
## 2    2    2      WL     48     33.0
## 3    3    2      GF     52     30.0
## 4    3    5      WC     36     20.7
## 5    3    8      WC     38     22.5
## 6    4    1      WC     46     18.0

tail(ufc)

##     plot tree species dbh.cm height.m
## 331  143    1      GF   28.0     21.0
## 332  143    2      GF   33.0     20.5
## 333  143    7      WC   47.8     20.5
## 334  144    1      GF   10.2     16.0
## 335  144    2      DF   31.5     22.0
## 336  144    4      WL   26.5     25.0

Each column has a unique name, and we can extract a column by name using $.

x <- ufc$height.m
x[1:5]

## [1] 20.5 33.0 30.0 20.7 22.5

accessing element in data.frame

We can use [[?]] to extract a single column, either by position or by name.
- ufc$height.m, ufc[[5]] and ufc[["height.m"]] are all equivalent.
- The result is a vector.

x1 <- ufc[["height.m"]]
x1[1:5]

## [1] 20.5 33.0 30.0 20.7 22.5

x2 <- ufc[[5]]
x2[1:5]

## [1] 20.5 33.0 30.0 20.7 22.5

You can extract the elements of a dataframe directly using matrix indexing.

# the result is a vector
ufc[1:5, 5]

## [1] 20.5 33.0 30.0 20.7 22.5

To select more than one of the variables in a dataframe, we use [?].
- ufc[4:5] is equivalent to ufc[c("dbh.cm", "height.m")].
- [?] preserves the data.frame type: the result is still a data.frame, even when only one column is selected.

# diam.height and z are both data.frames
diam.height <- ufc[4:5]    
(z <- diam.height[1:5, ])

##   dbh.cm height.m
## 1     39     20.5
## 2     48     33.0
## 3     52     30.0
## 4     36     20.7
## 5     38     22.5

is.data.frame(diam.height)

## [1] TRUE

# diam.height1 is a data.frame
diam.height1 <- ufc[5]
is.data.frame(diam.height1)

## [1] TRUE

# z1 is not a data.frame, but a vector, since [,] simplifies.
z1 <- diam.height1[1:5, ]
is.data.frame(z1)

## [1] FALSE

Selecting a column using [[?]] returns the extracted object itself, so the result has the type of the column.
By contrast, [?] keeps the type of the object from which the extraction is made, so the result is still a data.frame.

typeof(ufc)

## [1] "list"

typeof(ufc[5])

## [1] "list"

# numeric vector
typeof(ufc[[5]])

## [1] "double"

In addition, [,] simplifies the result when possible: selecting a single column returns a vector.

typeof(ufc[1:5, 5])

## [1] "double"

# In this case, simplification is impossible, and original data.frame type is preserved.
typeof(ufc[1:5, 4:5])

## [1] "list"

class(ufc[1:5, 4:5])

## [1] "data.frame"
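When the simplification of [,] is not wanted, the drop argument switches it off. A minimal sketch on a small made-up data.frame (trees stands in for ufc, which requires the original file):

```r
# Hypothetical stand-in for the last two columns of ufc.
trees <- data.frame(dbh.cm = c(39, 48, 52), height.m = c(20.5, 33.0, 30.0))

v <- trees[1:2, 2]                 # simplified to a numeric vector
d <- trees[1:2, 2, drop = FALSE]   # kept as a one-column data.frame

is.data.frame(v)   # FALSE
is.data.frame(d)   # TRUE
```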

http://r4ds.had.co.nz/vectors.html

create new variable

Create a new variable within a dataframe by naming it and assigning it a value.

ufc$volume.m3 <- pi * (ufc$dbh.cm / 200)^2 * ufc$height.m / 2
mean(ufc$volume.m3)

## [1] 1.93294
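In the formula above, dbh.cm / 200 turns a diameter in centimetres into a radius in metres (divide by 100 for metres and by 2 for the radius). The sketch below checks the arithmetic on a single invented tree; the data.frame trees and its values are assumptions, not rows of ufc.

```r
# One hypothetical tree: diameter 40 cm, height 20 m.
trees <- data.frame(dbh.cm = 40, height.m = 20)

# Same formula as above, applied to the made-up data.
trees$volume.m3 <- pi * (trees$dbh.cm / 200)^2 * trees$height.m / 2

trees$volume.m3   # pi * 0.2^2 * 20 / 2 = 0.4 * pi, about 1.2566
```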

name of column

names(df) returns the column names of the dataframe df.
To change the names of df, assign a vector of strings to names(df).

(ufc.names <- names(ufc))

## [1] "plot"      "tree"      "species"   "dbh.cm"    "height.m"  "volume.m3"

names(ufc) <- c("P", "T", "S", "D", "H", "V")
names(ufc)

## [1] "P" "T" "S" "D" "H" "V"

names(ufc) <- ufc.names

subset

subset is a convenient tool for selecting rows of a dataframe (with the subset argument) and, optionally, columns (with the select argument).
- x %in% y returns a logical vector whose i-th element is TRUE if x[i] is in y.

fir.height <- subset(ufc, subset=species %in% c("DF", "GF"), select = c(plot, tree, height.m))
head(fir.height)

##    plot tree height.m
## 1     2    1     20.5
## 3     3    2     30.0
## 7     4    2     17.0
## 8     5    2     29.3
## 9     5    4     29.0
## 10    6    1     26.0
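The same selection can be written with logical indexing, which shows what subset is doing. This is a sketch on a small invented data.frame (trees, with a few made-up rows), not the real ufc data:

```r
# Hypothetical rows in the same layout as ufc.
trees <- data.frame(
  plot     = c(2, 2, 3, 3),
  tree     = c(1, 2, 2, 5),
  species  = c("DF", "WL", "GF", "WC"),
  height.m = c(20.5, 33.0, 30.0, 20.7)
)

# With subset(): column names are used without quotes or trees$.
by_subset <- subset(trees, subset = species %in% c("DF", "GF"),
                    select = c(plot, tree, height.m))

# With [ , ]: a logical vector picks rows, a character vector picks columns.
keep     <- trees$species %in% c("DF", "GF")
by_index <- trees[keep, c("plot", "tree", "height.m")]

all.equal(by_subset, by_index)   # TRUE
```

Note that both results keep the original row names (here 1 and 3), just as in the head(fir.height) output above.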

write a dataframe

write.table(x, file="", append=FALSE, sep=" ", row.names=TRUE, col.names=TRUE)

x is the dataframe to be written.
file is the path of the file to write to; the default "" prints to the console.
append indicates whether or not to append to an existing file.
sep is the character used to separate the values.
row.names indicates whether or not to include the row names as the first column.
col.names indicates whether or not to include the column names as the first row.
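A minimal round trip, assuming a temporary file and a small made-up data.frame (small and back are names chosen for the example), shows how these arguments fit together with read.table:

```r
small <- data.frame(x = 2:4, y = c("a", "b", "c"))

tmp <- tempfile(fileext = ".csv")

# Comma separated, column names in the first row, no row names.
write.table(small, file = tmp, sep = ",", row.names = FALSE, col.names = TRUE)

# Read it back with matching header and sep settings.
back <- read.table(tmp, header = TRUE, sep = ",")
back

unlink(tmp)   # remove the temporary file
```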

Another method to read and write tabular data

read_csv() and read_tsv() are special cases of the general read_delim() in the readr package.
They’re useful for reading the most common types of flat file data, comma separated values and tab separated values, respectively.
The result, ufc_tibble, is a tibble: a data frame with a nicer printing method, which is useful when working with large data sets.

library(readr)
ufc_tibble <- read_csv("./data/ufc.csv")

## Parsed with column specification:
## cols(
##   plot = col_integer(),
##   tree = col_integer(),
##   species = col_character(),
##   dbh.cm = col_double(),
##   height.m = col_double()
## )

List

We have seen that a vector is an indexed set of objects.
All the elements of a vector must be of the same type (numeric, character or logical), which is called the mode of the vector.
A list is also an indexed set of objects, but the elements of a list can be of different types, including other lists.
- The mode of a list is list.
A list is created using the list(...) command instead of c(), with comma separated arguments.
Single square brackets [ ] are used to select a sublist.
Double square brackets [[ ]] are used to extract a single element.

Example

(my.list <- list("one", TRUE, 3, c("f", "o", "u", "r")))

## [[1]]
## [1] "one"
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] "f" "o" "u" "r"

str(my.list)

## List of 4
##  $ : chr "one"
##  $ : logi TRUE
##  $ : num 3
##  $ : chr [1:4] "f" "o" "u" "r"

my.list[[2]]

## [1] TRUE

mode(my.list[[2]])

## [1] "logical"

my.list[2]

## [[1]]
## [1] TRUE

mode(my.list[2])

## [1] "list"

my.list[[4]][1]

## [1] "f"

my.list[4][1]

## [[1]]
## [1] "f" "o" "u" "r"

When displaying a list,
R uses double brackets [[1]], [[2]], etc., to indicate list elements,
then single brackets [1], [2], etc., to indicate vector elements.

# a list can contain other lists
my.list2 <- list("a", c(5,6,7), my.list)

my.list2[3]

## [[1]]
## [[1]][[1]]
## [1] "one"
## 
## [[1]][[2]]
## [1] TRUE
## 
## [[1]][[3]]
## [1] 3
## 
## [[1]][[4]]
## [1] "f" "o" "u" "r"

# error (subscript out of bounds): my.list2[3] is a list of length 1, so it has no 4th element
my.list2[3][[4]]

my.list2[[3]]

## [[1]]
## [1] "one"
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] "f" "o" "u" "r"

my.list2[[3]][4]

## [[1]]
## [1] "f" "o" "u" "r"

my.list2[[3]][[4]]

## [1] "f" "o" "u" "r"

name of elements of list

The elements of a list can be named when the list is created.

my.list <- list(first = "one", second=TRUE, third=3, fourth = c("f", "o", "u", "r"))
names(my.list)

## [1] "first"  "second" "third"  "fourth"

my.list$second

## [1] TRUE

The elements of a list can also be renamed through the names attribute.

names(my.list) <- c("Fi", "Se", "Th", "Fo")
my.list$Se

## [1] TRUE

my.list$"Se"

## [1] TRUE

my.list[["Se"]]

## [1] TRUE

Simplify vs. Preserving

type	Simplifying	Preserving
vector	`x[[1]]`	`x[1]`
array(matrix)	`x[1, ]`, `x[, 1]`	`x[1, , drop = F]`, `x[, 1, drop = F]`
list	`x[[1]]`, `x$a`, `x[["a"]]`	`x[1]`, `x["a"]`
data.frame	`x[, 1]`	`x[1, ]`, `x[, 1, drop = F]`

Note that a data.frame is a list, and hence the simplifying and preserving rules for lists also apply to data.frames.
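The array(matrix) row of the table can be checked with a small sketch; the matrix m is made up for illustration.

```r
m <- matrix(1:6, nrow = 2)   # 2 x 3 matrix, filled column by column

row_vec <- m[1, ]                 # simplified: a plain vector 1 3 5
row_mat <- m[1, , drop = FALSE]   # preserved: still a matrix, now 1 x 3

dim(row_vec)   # NULL
dim(row_mat)   # 1 3
```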

http://adv-r.had.co.nz/Subsetting.html

list as output of function

Many functions produce a list object as their output.
For example, when we fit a least squares regression, the regression object itself is a list,
and can be manipulated using list operations.

lm.xy <- lm(y~x, data=data.frame(x=1:5, y=1:5))
typeof(lm.xy)

## [1] "list"

names(lm.xy)

##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

str can be used to summarize a list or dataframe : str(lm.xy)
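Since the fitted object is a list, the usual list operators apply to it. A short sketch reusing the lm.xy fit from above (the fit is repeated so that the block runs on its own):

```r
lm.xy <- lm(y ~ x, data = data.frame(x = 1:5, y = 1:5))

lm.xy$coefficients         # named vector with "(Intercept)" and "x"
lm.xy[["fitted.values"]]   # [[ ]] extracts the element itself
lm.xy["rank"]              # [ ] returns a sublist of length 1

# The extractor function coef() returns the same element.
identical(lm.xy$coefficients, coef(lm.xy))   # TRUE
```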

apply family

R has several functions that allow you to easily apply a function to all or selected elements of a list or dataframe.
tapply allows us to vectorise the application of a function to subsets of data.

tapply(X, INDEX, FUN, …)

X : the vector to which the function is applied.
INDEX : a factor of the same length as X, used to group the elements of X. (If INDEX is not a factor, it is converted to one automatically.)
FUN : the function to be applied. It is applied to each subvector of X that corresponds to a single level of INDEX.
tapply works similarly to GROUP BY combined with an aggregate function in SQL:
INDEX plays the role of GROUP BY and FUN is the aggregate (summary) function.

https://www.rdocumentation.org/packages/base/versions/3.4.0/topics/tapply

tapply example

Using tapply we obtain the average height grouped by species as follows:

tapply(ufc$height.m, ufc$species, mean)

##       DF       GF       WC       WL 
## 25.30000 24.34322 23.48777 25.47273

To round the results:

round(tapply(ufc$height.m, ufc$species, mean), digits=1)

##   DF   GF   WC   WL 
## 25.3 24.3 23.5 25.5

To find out how many samples we have of each species:

tapply(ufc$species, ufc$species, length)

##  DF  GF  WC  WL 
##  57 118 139  22

# the same result
tapply(ufc$height.m, ufc$species, length)

##  DF  GF  WC  WL 
##  57 118 139  22

Example : tapply

tapply can be used to count the number of elements in a vector:

x <- c(1,2,3,4,5,4,6,2,5,6,5,3,4,1,4,5,6,7,2,2,6,7,9,3,5)
tapply(X = x, INDEX = x, FUN = length)

## 1 2 3 4 5 6 7 9 
## 2 4 3 4 5 4 2 1

The above is almost the same as the following (table returns an object of class table rather than a plain array):

table(x)

## x
## 1 2 3 4 5 6 7 9 
## 2 4 3 4 5 4 2 1

y <- x %% 2
tapply(X = x, INDEX = y, FUN = length)

##  0  1 
## 12 13

y <- x %% 2
tapply(X = x, INDEX = y, FUN = max)

## 0 1 
## 6 9
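One way to picture what tapply does is to split the vector into groups and then apply the function to each group. The sketch below uses split and sapply for that; it illustrates the idea and is not how tapply is implemented internally.

```r
x <- c(1,2,3,4,5,4,6,2,5,6,5,3,4,1,4,5,6,7,2,2,6,7,9,3,5)
y <- x %% 2   # 0 for even values, 1 for odd values

by_tapply <- tapply(X = x, INDEX = y, FUN = max)

# Step 1: split x into one subvector per level of y.
groups <- split(x, y)
# Step 2: apply the function to each subvector.
by_split <- sapply(groups, max)

by_tapply   # 6 for group "0", 9 for group "1"
by_split    # the same values, as a named vector
```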

lapply and sapply

lapply(X, FUN, …) applies the function FUN to each element of X, which can be a list or a vector, and always returns a list.
sapply(X, FUN, …) applies the function FUN to each element of X, which can be a list or a vector, and by default tries to return the results as a vector or a matrix if this makes sense; otherwise it returns a list.

To obtain the mean diameter, height, and volume of trees:

lapply(ufc[4:6], mean)   # returns a list

## $dbh.cm
## [1] 37.41369
## 
## $height.m
## [1] 24.2256
## 
## $volume.m3
## [1] 1.93294

sapply(ufc[4:6], mean)   # returns a named vector

##    dbh.cm  height.m volume.m3 
##  37.41369  24.22560   1.93294

Simple example:

sapply(1:3, function(x) x^2)

## [1] 1 4 9

lapply(1:3, function(x) x^2)

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
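What sapply returns depends on the shape of the individual results. A small sketch of the three cases, followed by vapply, a base R alternative in which the expected result type is stated explicitly:

```r
# Each result has length 1: sapply returns a vector.
v <- sapply(1:3, function(x) x^2)            # 1 4 9

# Each result has length 2: sapply returns a 2 x 3 matrix (one column per input).
m <- sapply(1:3, function(x) c(x, x^2))

# Results of different lengths cannot be simplified: sapply returns a list.
l <- sapply(1:3, function(x) seq_len(x))

# vapply checks that every result is a single number.
w <- vapply(1:3, function(x) x^2, numeric(1))   # 1 4 9
```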

lapply example

df <- data.frame(replicate(6, sample(c(1:10, -99), 6, replace = TRUE)))
names(df) <- letters[1:6]
df

##     a   b   c  d   e  f
## 1 -99   7   1 10   5  3
## 2  10 -99   5  6   8  7
## 3  10 -99   5  1   9  7
## 4   7  10 -99 10   2 10
## 5   1   8   7  7 -99  3
## 6   5   8   4  3 -99  7

You want to replace all the -99s with NAs.

fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}
df[] <- lapply(df, fix_missing)  # lapply returns a list; assigning into df[] keeps df a data.frame
df

##    a  b  c  d  e  f
## 1 NA  7  1 10  5  3
## 2 10 NA  5  6  8  7
## 3 10 NA  5  1  9  7
## 4  7 10 NA 10  2 10
## 5  1  8  7  7 NA  3
## 6  5  8  4  3 NA  7
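The role of df[] in the assignment can be seen on a deterministic example. The data.frame scores and its values are invented here so that the result does not depend on random sampling.

```r
fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}

# Hypothetical data with -99 marking missing values.
scores <- data.frame(a = c(1, -99), b = c(-99, 2))

# Without [], the result of lapply is a plain list.
as_list <- lapply(scores, fix_missing)
is.data.frame(as_list)   # FALSE

# With [], the columns are replaced in place and scores stays a data.frame.
scores[] <- lapply(scores, fix_missing)
is.data.frame(scores)    # TRUE
```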

More about apply : http://adv-r.had.co.nz/Functional-programming.html

three dots ellipsis … in function argument

The three-dot ellipsis ... in a function's argument list lets the function accept a variable number of arguments.

One example is paste, which accepts a variable number of arguments. Check ?paste.

To handle ..., first convert it to a list inside the function.

addemup <- function(x, ...){
  args <- list(...)
  for (a in args) x <- x + a
  x
}

addemup(1, 1)

## [1] 2

addemup(1, 2, 3, 4, 5)

## [1] 15

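reverse_paste0 <- function(...){
  result <- ""
  args <- list(...)
  
  # walk the arguments from last to first; rev(seq_along()) is also safe when no arguments are given
  for (i in rev(seq_along(args))){
    result <- paste0(result, args[[i]])  # [[i]] extracts the element itself
  }
  result
}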

reverse_paste0("a", "b", "c")

## [1] "cba"

reverse_paste0("a", "b", "c", "d", "e")

## [1] "edcba"
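Besides converting ... to a list, a function can pass ... straight on to another function. A minimal sketch, where mean_of is a hypothetical wrapper written for this example:

```r
# Hypothetical wrapper: any extra arguments are forwarded to mean().
mean_of <- function(x, ...) {
  mean(x, ...)
}

mean_of(c(1, NA, 3))                 # NA, because of the missing value
mean_of(c(1, NA, 3), na.rm = TRUE)   # 2: na.rm travels through ... to mean()
```

This is convenient when a wrapper should expose all the options of the function it calls without listing them one by one.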
