In part one, I provided an initial walk through of some nice features that are available within the data.table package. In particular, we saw how to filter data and get a count of rows by the date.

dat = fread("rows.csv") names(dat) <- gsub(" ", "_", names(dat)) dat[1:3]

Let us now add a few columns to our dataset on reported crimes in the city of Chicago. There are many ways to do do this but they involve the use of the := operator. Since data.table updates values by reference, we do not need to save the results as another variable. This is a very desirable feature.

dat[, c("value1", "value2", "value3") := sample(1:50, nrow(dat), replace=TRUE)] dat[, `:=`(value1 = sample(1:50, nrow(dat), replace=TRUE), value2 = sample(1:50, nrow(dat), replace=TRUE), value3 = sample(1:50, nrow(dat), replace=TRUE))]

You can also just use the traditional base R solution for adding new columns as data.tables are also data frames.

dat$value1 <- sample(1:50, nrow(dat), replace=TRUE) dat$value2 <- sample(1:50, nrow(dat), replace=TRUE) dat$value3 <- sample(1:50, nrow(dat), replace=TRUE)

In any case, we now have three new columns with randomly selected values between 1 and 50. We can now look to summarize these values and see how they differ across the primary arrest type and other categorical variables.

dat[, .(mean = mean(value1, na.rm = TRUE), median = median(value1, na.rm = TRUE), min = min(value1, na.rm = TRUE), max = max(value1, na.rm = TRUE))] dat[Primary_Type=="PROSTITUTION", .(mean = mean(value1, na.rm = TRUE), median = median(value1, na.rm = TRUE), min = min(value1, na.rm = TRUE), max = max(value1, na.rm = TRUE))]

The above code allows us to get the mean, median, min, and max values from the value1 column for all rows and for when the primary type is prostitution. To get these summaries across multiple columns, we can use lapply, .SD, and .SDcols. Notice that in the second line below, we passed primary type to the by operator so that we get mean value for each of the arrest type categories.

dat[, lapply(.SD, mean, na.rm=TRUE), .SDcols=c("value1", "value2", "value3")] dat[, lapply(.SD, mean, na.rm=TRUE), by=Primary_Type, .SDcols=c("value1", "value2", "value3")][1:4]

But wait, that only provides me with the mean of a single value column. If you want to apply multiple functions to columns, the user is required to write a function that can then be used within lapply.

my.summary <- function(x){ c(mean = mean(x), median = median(x), min = min(x), max = max(x)) } dat[, lapply(.SD, my.summary), .SDcols=c("value1", "value2", "value3")] dat[, lapply(.SD, my.summary), by=.(Primary_Type), .SDcols=c("value1", "value2", "value3")]

Perfect! As you can see, the syntax is concise and is very easy to work with.

In the next part of this series, I’ll cover a few advanced features like get() and {}.

**If you have any questions or comments, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn **

Pingback: Data.Table by Example – Part 2 – Mubashir Qasim

Pingback: Data.Table by Example – Part 2 | A bunch of data

One brief comment:

sample (1:50, nrow(dat), replace=TRUE)

is more naturally written as

sample (1:50, .N, replace=TRUE)

The .N readonly vsriable always contains the number of rows available. And it saves a bunch of keystrokes too

LikeLike

Thanks for this. I always get confused by the difference between := and `:=`. This helped to clarify that for me.

LikeLike

What is so “perfect” about getting unnamed rows, where you have to memorize that, say, “min” is the 3rd row? It may be better to modify

dat[, lapply(.SD, my.summary), .SDcols=c(“value1”, “value2”, “value3”)][, what:=c(“mean”,”median”,”min”,”max”)][]

LikeLike