[Solved-1 Solution] Writing one file per group in Pig Latin?
Writing a file
- To write to a file, an output stream is used. An output stream is an object which can be used to write bytes, strings and other values to a file.
- There are numerous files that contain Apache web server log entries. The entries are not in date-time order and are scattered across the files. The goal is to group and order the log entries by date and time, then write them to files named for the day and hour of the entries each file contains.
- Once the files are loaded, we use a regex to extract the date field, then truncate it to the hour. This produces a relation that has the original record in one field and the date truncated to the hour in another. From here we group on the date-hour field.
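The extract, truncate, and group steps described above can be illustrated outside Pig with a minimal plain-Java sketch. It assumes the common Apache log timestamp layout (for example `[10/Oct/2000:13:55:36 -0700]`); the class name `HourBuckets`, its method names, and the `yyyy-MM-dd-HH` key format are hypothetical choices for illustration, not something taken from the original script.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HourBuckets {
    // Assumed layout: a bracketed timestamp such as [10/Oct/2000:13:55:36 -0700].
    private static final Pattern TIMESTAMP = Pattern.compile(
        "\\[(\\d{2}/[A-Za-z]{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2} [+-]\\d{4})\\]");
    private static final DateTimeFormatter LOG_FORMAT =
        DateTimeFormatter.ofPattern("dd/MMM/uuuu:HH:mm:ss Z", Locale.ENGLISH);
    // Hypothetical key format, chosen so it can double as a file name.
    private static final DateTimeFormatter KEY_FORMAT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd-HH");

    // Pulls the timestamp out of one log line with the regex above.
    public static OffsetDateTime timestampOf(String line) {
        Matcher m = TIMESTAMP.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("no timestamp in: " + line);
        }
        return OffsetDateTime.parse(m.group(1), LOG_FORMAT);
    }

    // Truncates the timestamp to the hour and renders it as the group key.
    public static String hourKey(String line) {
        return timestampOf(line).truncatedTo(ChronoUnit.HOURS).format(KEY_FORMAT);
    }

    // Groups lines by hour key, then orders each group by full timestamp.
    public static Map<String, List<String>> groupByHour(List<String> lines) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : lines) {
            groups.computeIfAbsent(hourKey(line), k -> new ArrayList<>()).add(line);
        }
        for (List<String> group : groups.values()) {
            group.sort(Comparator.comparing(HourBuckets::timestampOf));
        }
        return groups;
    }
}
```

Each map entry pairs the date-hour key with the original records for that hour, which mirrors the two-field relation described above. The key uses the hour as written in the log entry's own UTC offset.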
Our first thought was to use the STORE command while iterating through the groups with a FOREACH, but we quickly found out that Pig does not allow this.
- Our second try was to use MultiStorage() from the piggybank, which worked great until we looked at the output files.
- The problem is that MultiStorage wants to write all fields to the file, including the field we used to group on. What we really want is just the original record written to the file.
- Is there a better way to approach this problem using Pig?
- Out of the box, Pig doesn't have a lot of functionality. It does the basic stuff, but more often than not we find ourselves having to write custom UDFs or load/store funcs to get from 95% of the way there to 100%.
- We usually find it worth it, since writing a small store function takes a lot less Java than a whole MapReduce program.
- Your second attempt is really close to what we would do. Either copy/paste the source code for MultiStorage or use inheritance as a starting point. Then modify the putNext method to strip out the group value, but still write the record to the file for that group.
Tuple doesn't have a delete method, so we'll have to rewrite the entire tuple. Or, if all we have is the original string, just pull that out and output that wrapped in a
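
The "rewrite the entire tuple" step can be sketched in plain Java as follows. This is not the MultiStorage source: a `List` stands in for Pig's `Tuple` so the example runs without Pig on the classpath, and the class name `GroupFieldStripper`, the method `withoutField`, and the parameter `splitFieldIndex` are hypothetical names. In a real store function the kept fields would presumably be wrapped back into a `Tuple`, for example through Pig's `TupleFactory`, before being written.

```java
import java.util.ArrayList;
import java.util.List;

class GroupFieldStripper {
    // Copies every field except the one used as the group key.
    // The input is left untouched because a new list is built.
    public static List<Object> withoutField(List<?> fields, int splitFieldIndex) {
        if (splitFieldIndex < 0 || splitFieldIndex >= fields.size()) {
            throw new IndexOutOfBoundsException("no field at index " + splitFieldIndex);
        }
        List<Object> kept = new ArrayList<>();
        for (int i = 0; i < fields.size(); i++) {
            if (i != splitFieldIndex) {
                kept.add(fields.get(i));
            }
        }
        return kept;
    }
}
```

A modified putNext would still read the group value first, to choose the output file, and only then write the stripped copy, so the file name keeps the date-hour while the file contents hold just the original record.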