data.table in R - multiple filters using multiple keys - binary search
I don't understand how I can filter based on multiple keys in data.table. Take the built-in mtcars dataset.
library(data.table)
dt <- data.table(mtcars)
setkey(dt, am, gear, carb)
Following the vignette, I know that if I want the filtering that corresponds to am == 1 & gear == 4 & carb == 4, I can say
> dt[.(1, 4, 4)]
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1:  21   6  160 110  3.9 2.620 16.46  0  1    4    4
2:  21   6  160 110  3.9 2.875 17.02  0  1    4    4
and it gives the correct result. Furthermore, if I want am == 1 & gear == 4 & (carb == 4 | carb == 2), this works
> dt[.(1, 4, c(4, 2))]
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
4: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
However, when I want am == 1 & (gear == 3 | gear == 4) & (carb == 4 | carb == 2), the plausible-looking
> dt[.(1, c(3, 4), c(4, 2))]
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1:   NA  NA    NA  NA   NA    NA    NA NA  1    3    4
2: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
3: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
fails. Could you please explain the right approach here?
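The reason your query didn't raise an error is that data.table will reuse (recycle) values when their lengths are multiples of one another. In other words, the single 1 for am can be used 2 times, and data.table does this without telling you. As a result, the query asks only for the pairs (gear 3, carb 4) and (gear 4, carb 2), not for every combination. If the numbers of allowable values in your query weren't multiples of each other, it would give you a warning. For example,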
dt[.(c(1,0),c(5,4,3),c(8,6,4))]
will give you a warning complaining about a remainder of 1 item, the same error you would see when typing data.table(c(1,0),c(5,4,3),c(8,6,4)). Whenever you are merging with x[y], both x and y should be thought of as data.tables.
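To see the recycling concretely, it can help to build the i argument as its own data.table before joining. The following is a minimal sketch using the values from the question; the name `lookup` is only illustrative and does not come from the original post.

```r
library(data.table)

dt <- data.table(mtcars)
setkey(dt, am, gear, carb)

# .() is an alias for list(); in a keyed join the list is treated as a data.table.
# The single value 1 is recycled to length 2, so only two key combinations exist.
lookup <- data.table(am = 1, gear = c(3, 4), carb = c(4, 2))
print(lookup)

# Joining on `lookup` asks for (1, 3, 4) and (1, 4, 2) only.
# The combinations (1, 3, 2) and (1, 4, 4) are never requested.
res <- dt[lookup]
print(res)
```

Printing `lookup` before the join makes it clear which key combinations are actually being searched for.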
If you instead use CJ,
dt[CJ(c(1,0),c(5,4,3),c(8,6,4))]
then it will make every combination of the values for you, and data.table will give the results you expect.
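Applied to the query from the question, the cross join might look like the sketch below. The variable names `combos` and `res` are illustrative, and the use of `nomatch = NULL` to drop combinations that have no matching row is an addition that the original answer does not mention.

```r
library(data.table)

dt <- data.table(mtcars)
setkey(dt, am, gear, carb)

# CJ() builds the cross join: every combination of the supplied values.
combos <- CJ(1, c(3, 4), c(4, 2))
print(combos)  # (1,3,2), (1,3,4), (1,4,2), (1,4,4)

# nomatch = NULL drops combinations with no matching row
# instead of returning them as rows filled with NA.
res <- dt[combos, nomatch = NULL]
print(res)
```

Without `nomatch = NULL`, the two gear == 3 combinations come back as NA rows, because mtcars has no manual-transmission cars with three gears.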
From the vignette (bolding mine):
What’s happening here? Read this again. The value provided for the second key column “MIA” has to find the matching values in dest key column on the matching rows provided by the first key column origin. We can not skip the values of key columns before. Therefore we provide all unique values from key column origin. “MIA” is automatically recycled to fit the length of unique(origin) which is 3.
Just for completeness, the vector scan syntax will work without using CJ:
dt[am == 1 & gear == 4 & carb == 4]
or
dt[am == 1 & (gear == 3 | gear == 4) & (carb == 4 | carb == 2)]
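The chained `|` conditions can also be written with `%in%`, which may read more easily as the lists of allowed values grow. This is a sketch of an equivalent vector scan, not something from the original answer; `scan_or` and `scan_in` are illustrative names.

```r
library(data.table)

dt <- data.table(mtcars)

# Same filter written two ways; no key is needed for either.
scan_or <- dt[am == 1 & (gear == 3 | gear == 4) & (carb == 4 | carb == 2)]
scan_in <- dt[am == 1 & gear %in% c(3, 4) & carb %in% c(4, 2)]

print(scan_in)
```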
How do you know if you need binary search? If the speed of subsetting is unbearable, then you need binary search. For example, I've got a 48M-row data.table I'm playing with, and the difference between binary search and vector scan is staggering. Vector scan takes 1.490 seconds of elapsed time, while binary search takes 0.001 seconds. That, of course, assumes I've already keyed the data.table. If I include the time it takes to set the key, then the combination of setting the key and performing the subset takes 1.628 seconds. So you have to pick your poison.
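If you want to measure this trade-off on your own machine, something along these lines could serve as a starting point. It is only a sketch on synthetic data: the table `big`, its columns, and the size `n` are all hypothetical, and the timings will differ from the figures quoted above.

```r
library(data.table)

set.seed(1)
n <- 1e6  # hypothetical size, far smaller than the 48M rows mentioned above
big <- data.table(a = sample(0:1, n, TRUE),
                  b = sample(3:5, n, TRUE),
                  v = runif(n))

# Vector scan: build a full-length logical vector, then subset with it.
t_scan <- system.time({
  keep <- big$a == 1L & big$b == 4L
  r1 <- big[keep]
})

# One-off cost of sorting the table and setting the key.
t_key <- system.time(setkey(big, a, b))

# Binary search on the keyed table.
t_bin <- system.time(r2 <- big[.(1L, 4L)])

print(c(scan   = t_scan[["elapsed"]],
        setkey = t_key[["elapsed"]],
        binary = t_bin[["elapsed"]]))
```

Keying pays off when the same table is subset many times, since the sort is paid once and each later lookup is cheap.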