regex - rm_between with multiple markers in an observation
There are helpful answers on here about using rm_between when each observation has one instance of the markers. I have a dataset from which I want to extract the things in double quotes, and some of the observations have multiple instances of that. For example:
fresh or chilled atlantic salmon "salmo salar" , danube salmon "hucho hucho"
When I use the code
library(qdapRegex)
rf <- data.frame(rm_between_multiple(h2$se_desc_en, c("\"", "\""), c("\"", "\"")))
it creates a data frame, and for the same line as earlier
"fresh or chilled atlantic salmon , danube salmon" is returned, which is perfect. However, I also need the data that was removed. To try to retain it, I changed the code to:
h3 <- rm_between_multiple(h2$se_desc_en, c("\"", "\""), c("\"", "\""), extract = TRUE)
to create a list of the data in quotations. For the same line, what is returned is:
c("salmo salar", " , danube salmon ", "hucho hucho", "salmo salar", " , danube salmon ", "hucho hucho")
which has the data in quotations, but also has the text in between the quotations, and everything is repeated. I'm new to programming and am wondering if there is a way to write the code so that the text between these quotations is not included.
I think you don't need rm_between_multiple, just rm_between. There appears to be a regex issue when the same left and right marker is used; I'm not sure yet whether it is a bug. You can use the following to extract:
x <- 'fresh or chilled atlantic salmon "salmo salar" , danube salmon "hucho hucho"'
rm_default(x, pattern = S("@rm_between", '"'), extract = TRUE)
## [[1]]
## [1] "\"salmo salar\"" "\"hucho hucho\""
Edit: I think this happens because the default regex of rm_between does not include the left/right bounds. It uses the following regex: "(?<=\").*?(?=\")". The use of lookarounds means the left/right bounds are not consumed, which leaves each closing quotation mark available to act as the opening mark of the next match, hence: " , danube salmon ". This is (IMO) a bug that I should address, but I'm unsure how yet.
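The lookaround behaviour described above can be reproduced without qdapRegex. The following is only a base-R sketch, not the package's internals: the variable names are made up for illustration, and the second pattern ('"[^"]*"') is one possible way to consume the quotation marks, not necessarily the fix that ended up in the package.

```r
x <- 'fresh or chilled atlantic salmon "salmo salar" , danube salmon "hucho hucho"'

# Lookaround version: the quotes are only asserted, never consumed, so a
# closing quote can serve as the opening quote of the next match.
lookaround_hits <- regmatches(x, gregexpr('(?<=").*?(?=")', x, perl = TRUE))[[1]]

# Consuming version: each match swallows both of its quotes, so the text
# between two quoted chunks can no longer match.
with_quotes <- regmatches(x, gregexpr('"[^"]*"', x))[[1]]

# Drop the surrounding quotes afterwards.
consumed_hits <- gsub('^"|"$', "", with_quotes)

print(lookaround_hits)
print(consumed_hits)
```

The first result should include the unwanted " , danube salmon " piece, while the second should contain only the two quoted names.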
Edit 2: I have incorporated @hwnd's response into rm_between in the dev version of qdapRegex. You can install the dev version via:
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_install_gh("trinker/qdapRegex"); p_load(qdapRegex)
and then:
rm_between(x, '"', '"', extract = TRUE)
## [[1]]
## [1] "salmo salar" "hucho hucho"
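If installing the dev version is not an option, roughly the same two results (the quoted pieces per observation, and the description with the quoted pieces removed) can be sketched in base R for a whole column. Everything named here is hypothetical: h2_demo stands in for the asker's h2 data frame, its second and third rows are invented test values, and extract_quoted is a helper written for this example, not a qdapRegex function.

```r
# Hypothetical stand-in for the asker's h2 data frame.
h2_demo <- data.frame(
  se_desc_en = c(
    'fresh or chilled atlantic salmon "salmo salar" , danube salmon "hucho hucho"',
    'trout "salmo trutta"',
    'no quoted name here'
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one character vector of quoted pieces per observation.
extract_quoted <- function(text) {
  hits <- regmatches(text, gregexpr('"[^"]*"', text))
  lapply(hits, function(h) gsub('^"|"$', "", h))
}

quoted <- extract_quoted(h2_demo$se_desc_en)

# Remove the quoted pieces, then collapse the leftover whitespace.
stripped <- gsub('"[^"]*"', "", h2_demo$se_desc_en)
stripped <- trimws(gsub("\\s+", " ", stripped))

print(quoted)
print(stripped)
```

Observations without any quoted text give an empty character vector in the list, so the list keeps the same length as the column.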