Hadoop Informatica Log Processing
I am working on a project that involves creating a queryable data set from large Informatica log files. The files are imported into a Hadoop cluster using Flume, which a coworker configured before I began the project. My job is to create a table from the data contained in the logs so that queries can be performed easily. The issue I'm encountering has to do with the log file formatting. The logs are in the format:
timestamp : severity : (pid | thread) : (servicetype | servicename) : clientnode : messagecode : message
The issue is that the message field can contain additional colon-delimited comments, for example a message like [ x : y : z ]. When I use HCatalog to create the table, I cannot account for this behavior, and it instead results in additional columns.
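To make the problem concrete, here is a minimal sketch with a hypothetical log line (the line content and field values are invented for illustration, not taken from the original logs): splitting naively on the delimiter produces more than the seven expected fields, because the bracketed comment inside the message is split as well.

```python
# Hypothetical Informatica log line; the message field itself contains
# the " : " delimiter inside its bracketed comment.
line = ("2016-01-15 10:22:01 : INFO : (1234|WRITER_1) : "
        "(REP|rep_svc) : node01 : WRT_8167 : Start loading [ x : y : z ]")

# A naive split on the delimiter breaks the message apart,
# which is what produces the extra columns in HCatalog.
fields = line.split(" : ")
print(len(fields))  # 9 fields instead of the expected 7
```

This is exactly the behavior a plain field-delimited table definition reproduces: the delimiter has no way of knowing that the colons after the sixth separator belong to a single field.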
Any suggestions? I could use Ruby to separate the fields or to replace the delimiter so that field integrity is kept when importing with HCatalog. Is there any pre-processing I can do on the cluster side that would allow me to do this? The files are too large to handle locally.
The answer was to use a Pig script and a Python UDF. The Pig script loads the file and calls the Python script line by line to break the fields apart properly. The result can then be written out as a friendlier CSV and/or stored in a table.
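A minimal sketch of what such a UDF might look like, assuming the first six fields never contain the delimiter (the function name and schema are illustrative, not from the original post): splitting at most six times keeps everything after the sixth separator together as the message field.

```python
# Sketch of a Python UDF in the spirit of the answer above.
# Assumption: only the final message field can contain the delimiter.

DELIM = " : "

def parse_log_line(line):
    """Return the seven log fields as a tuple; the message keeps its colons."""
    parts = line.strip().split(DELIM, 6)  # split at most 6 times -> 7 pieces
    if len(parts) != 7:
        return None  # malformed line; the Pig script can filter these out
    return tuple(parts)
```

In Pig this kind of function could be registered as a Jython UDF (e.g. `REGISTER 'udf.py' USING jython AS logparse;`) and applied to each line before storing the result; that wiring is one possible setup, not necessarily the exact one used here.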