mapreduce - Check if element is in documents of rdd

- June 15, 2012

I have an RDD like this, rdd1, in PySpark (please excuse minor syntax errors):

[(id1,(1,2,3)), (id2,(3,4,5))]

I have rdd2 holding the following: (2,3,4).

Now I want to see, for each element of rdd2, in which rdd1 sublists it occurs. For example, the expected output RDD (or collected list, I don't care) would be:

(2, [id1]),(3,[id1,id2]),(4,[id2])

This is what I have so far (note that rdd2 must be the first item in the line/algorithm):

rdd2.map(lambda x: (x, x in rdd.map(lambda y:y[1])))

Even though this would give me only true/false as the second item of the pair tuple, which I could live with, it does not work. It fails when trying to perform a map on one RDD inside the anonymous function passed to the other RDD's map.

Any idea how to get going in the right direction?

If rdd2 is relatively small (fits in memory):

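# Invert each (id, values) record into (value, id) pairs.
pairs1 = rdd1.flatMap(lambda kv: ((v, kv[0]) for v in kv[1]))
# Ship the small rdd2 to every executor as a set.
vals_set = sc.broadcast(set(rdd2.collect()))
(pairs1
    .filter(lambda kv: kv[0] in vals_set.value)
    .groupByKey())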
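To make the data flow of the broadcast approach easier to follow, here is a minimal sketch that mirrors the same three steps on plain Python lists. It is not Spark code: `rdd1_data` and `rdd2_data` are hypothetical local stand-ins for the two RDDs, and the list comprehensions and `defaultdict` only imitate flatMap, filter, and groupByKey.

```python
from collections import defaultdict

# Sample data from the question, held in plain lists instead of RDDs
# (assumption: ids are written as strings for illustration).
rdd1_data = [("id1", (1, 2, 3)), ("id2", (3, 4, 5))]
rdd2_data = [2, 3, 4]

# Step 1 (flatMap): invert each (id, values) record into (value, id) pairs.
pairs1 = [(v, k) for k, vals in rdd1_data for v in vals]

# Step 2 (broadcast + filter): keep only pairs whose value occurs in rdd2.
vals_set = set(rdd2_data)
filtered = [(v, k) for v, k in pairs1 if v in vals_set]

# Step 3 (groupByKey): collect the ids seen for each value.
grouped = defaultdict(list)
for v, k in filtered:
    grouped[v].append(k)

result = sorted(grouped.items())
print(result)  # [(2, ['id1']), (3, ['id1', 'id2']), (4, ['id2'])]
```

The result matches the expected output given in the question. Note that a value of rdd2 that appears in no rdd1 sublist simply drops out under this approach.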

If not, you can take pairs1 from the previous part and use a join:

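# Turn rdd2 into key-value pairs so it can be joined.
pairs2 = rdd2.map(lambda x: (x, None))
(pairs2
    .leftOuterJoin(pairs1)
    # Each joined record is (k, (None, v)); keep only (k, v).
    .map(lambda kv: (kv[0], kv[1][1]))
    .groupByKey())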
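One difference from the broadcast version is worth seeing: a left outer join keeps every key from the left side, so an element of rdd2 with no match still appears, paired with None. The sketch below imitates that on plain lists. `left_outer_join` is a hypothetical local helper, not a Spark API, and the value 9 is added purely for illustration.

```python
rdd1_data = [("id1", (1, 2, 3)), ("id2", (3, 4, 5))]
rdd2_data = [2, 3, 4, 9]  # 9 is a hypothetical extra element with no match

pairs1 = [(v, k) for k, vals in rdd1_data for v in vals]
pairs2 = [(x, None) for x in rdd2_data]


def left_outer_join(left, right):
    """Local stand-in for leftOuterJoin: every left key is kept."""
    right_by_key = {}
    for k, w in right:
        right_by_key.setdefault(k, []).append(w)
    joined = []
    for k, v in left:
        # Unmatched left keys are paired with None on the right side.
        for w in right_by_key.get(k, [None]):
            joined.append((k, (v, w)))
    return joined


joined = left_outer_join(pairs2, pairs1)

# Drop the placeholder from pairs2 and group the ids per value.
grouped = {}
for k, (_, doc_id) in joined:
    grouped.setdefault(k, []).append(doc_id)

result = sorted(grouped.items())
print(result)
```

If unmatched elements should be dropped rather than reported with None, a plain inner join is the natural alternative.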

As always, if this is only an intermediate structure, you should consider reduceByKey, aggregateByKey, or combineByKey instead of groupByKey. If it is the final structure, you can add .mapValues(list).
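The reason for preferring the aggregating variants is that they combine values inside each partition before merging across partitions, so less data has to be moved. A rough local imitation of that two-phase pattern, assuming the goal is a per-value count of distinct ids, could look like this. `aggregate_by_key` and the two-partition layout are hypothetical; this is not how Spark is implemented.

```python
# (value, id) pairs split across two hypothetical partitions.
partitions = [
    [(2, "id1"), (3, "id1")],
    [(3, "id2"), (4, "id2")],
]


def seq_op(acc, doc_id):
    # Fold one id into a partition-local accumulator.
    return acc | {doc_id}


def comb_op(a, b):
    # Merge two accumulators that come from different partitions.
    return a | b


def aggregate_by_key(parts, zero, seq_op, comb_op):
    # Phase 1: aggregate within each partition.
    partials = []
    for part in parts:
        local = {}
        for k, v in part:
            local[k] = seq_op(local.get(k, zero), v)
        partials.append(local)
    # Phase 2: merge the per-partition results.
    merged = {}
    for local in partials:
        for k, acc in local.items():
            merged[k] = comb_op(merged[k], acc) if k in merged else acc
    return merged


ids_by_value = aggregate_by_key(partitions, frozenset(), seq_op, comb_op)
counts = {k: len(ids) for k, ids in ids_by_value.items()}
print(counts)  # {2: 1, 3: 2, 4: 1}
```

In PySpark the same pair of functions would be passed as aggregateByKey(zeroValue, seqFunc, combFunc), with a mutable or immutable set as the zero value.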

Finally, you can try to use Spark DataFrames:

df1 = sqlContext.createDataFrame(
    rdd1.flatMap(lambda kv: ({'k': k, 'v': kv[0]} for k in kv[1])))
df2 = sqlContext.createDataFrame(rdd2.map(lambda k: {'k': k}))
(df1
    .join(df2, df1.k == df2.k, 'leftsemi')
    # .rdd is required on Spark 2.0+, where DataFrame has no map method.
    .rdd.map(lambda r: (r.k, r.v))
    .groupByKey())
