regex - rm_between with multiple markers in an observation
There are helpful answers on here about using rm_between when each observation has one instance of the markers. I have a dataset from which I want to extract the things in double quotes, and some observations have multiple instances of that. For example:
fresh or chilled atlantic salmon "salmo salar" , danube salmon "hucho hucho"
When I use this code,
library(qdapRegex)
rf <- data.frame(rm_between_multiple(h2$se_desc_en, c("\"", "\""), c("\"", "\"")))
it creates a data frame, and for the same line shown earlier,
"fresh or chilled atlantic salmon , danube salmon"
is returned, which is perfect. But I also need the removed data. To try to retain it, I changed the code to:
h3 <- rm_between_multiple(h2$se_desc_en, c("\"", "\""), c("\"", "\""), extract=TRUE)
to create a list of the data in quotation marks. For the same line, what is returned is:
c("salmo salar", " , danube salmon ", "hucho hucho", "salmo salar", " , danube salmon ", "hucho hucho")
which has the data in quotation marks, but it also has the text in between the quoted strings, and everything is repeated. I'm new at programming and am wondering if there is a way to write the code so that the text between the quoted strings is not included.
I think you don't need rm_between_multiple, just rm_between. There appears to be a regex issue when using the same left and right marker; I'm not sure if it's a bug yet. You can use the following to extract:
x <- 'fresh or chilled atlantic salmon "salmo salar" , danube salmon "hucho hucho"'
rm_default(x, pattern = S("@rm_between", '"', '"'), extract=TRUE)
## [[1]]
## [1] "\"salmo salar\"" "\"hucho hucho\""
Edit: I think this is because the default regex of rm_between does not include the left/right bounds. It uses the following regex: "(?<=\").*?(?=\")". The lookbehind and lookahead cause the left/right bounds not to be consumed, which leaves the quotation marks available to match " , danube salmon " as well. This is (IMO) a bug that I'll address, but I'm unsure how yet.
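To see the effect without qdapRegex, here is a minimal base R sketch. It assumes the lookaround regex quoted above and compares it with a pattern that consumes the quotation marks; the variable names are just for illustration, and this is not how the package itself is implemented.

```r
x <- 'fresh or chilled atlantic salmon "salmo salar" , danube salmon "hucho hucho"'

# Lookaround version: the quotes are checked but never consumed, so the
# closing quote of one pair can also act as the opening quote of the next match.
lookaround <- regmatches(x, gregexpr('(?<=").*?(?=")', x, perl = TRUE))[[1]]

# Consuming version: each match swallows its own pair of quotes,
# so the text between two quoted strings can no longer match.
consumed <- regmatches(x, gregexpr('"[^"]*"', x, perl = TRUE))[[1]]

# Strip the surrounding quotes afterwards.
inner <- gsub('"', "", consumed, fixed = TRUE)

print(lookaround)
print(inner)
```

The first result contains the unwanted " , danube salmon " piece, while the second keeps only the two quoted names.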
Edit 2: I've incorporated @hwnd's response into rm_between. This is in the dev version of qdapRegex. You can install the dev version via:
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_install_gh("trinker/qdapRegex"); p_load(qdapRegex)
and ...
rm_between(x, '"', '"', extract = TRUE)
## [[1]]
## [1] "salmo salar" "hucho hucho"
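If installing the dev version isn't an option, something along these lines in base R might keep both the cleaned description and the quoted names for a whole column. This is only a sketch: the h2 data frame below is a made-up stand-in for the real data (its second row is invented), and latin_names and desc_clean are hypothetical column names.

```r
# Hypothetical stand-in for the asker's h2 data frame
h2 <- data.frame(
  se_desc_en = c(
    'fresh or chilled atlantic salmon "salmo salar" , danube salmon "hucho hucho"',
    'frozen trout "salmo trutta"'
  ),
  stringsAsFactors = FALSE
)

# Pattern that consumes both quotation marks of each pair
quoted_pattern <- '"[^"]*"'

# Extract every quoted string per row, then drop the quotes themselves
quoted <- regmatches(h2$se_desc_en, gregexpr(quoted_pattern, h2$se_desc_en))
h2$latin_names <- lapply(quoted, function(m) gsub('"', "", m, fixed = TRUE))

# Remove the quoted strings from the description and tidy the whitespace
no_quotes <- gsub(quoted_pattern, "", h2$se_desc_en)
h2$desc_clean <- trimws(gsub("[[:space:]]+", " ", no_quotes))

print(h2$latin_names)
print(h2$desc_clean)
```

latin_names is a list column, so each row can hold any number of extracted names.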