[Solved-2 Solutions] Encoding in Pig ?



What is encoding ?

  • Characters that cannot be represented in ASCII cannot be used directly in a URL, because there is no way to encode them reliably (the encoding scheme for URLs is based on octets).
  • Despite this, some servers support various ways of encoding multi-byte characters in URLs. The most common technique seems to be to encode the text as UTF-8 and percent-encode each octet separately, even when several octets together represent one character. This is not specified by the standard and is highly error-prone, so it is recommended that URLs be restricted to the ASCII range.
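
As a rough illustration of the "UTF-8, one octet at a time" technique described above, here is a minimal Java sketch using the standard java.net.URLEncoder and java.net.URLDecoder classes. The class name PercentEncodeDemo is made up for this example. Note that URLEncoder implements the HTML form flavour of URL encoding (a space becomes "+"), so treat it as an approximation rather than a general URL encoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; the name is illustrative only.
final class PercentEncodeDemo {

    // Percent-encode a string using the UTF-8 octets of each character.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    // Reverse the encoding, interpreting the escaped octets as UTF-8.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    public static void main(String[] args) {
        String original = "\u00C0"; // the character À
        String encoded = encode(original);
        System.out.println(encoded);                           // %C3%80
        System.out.println(decode(encoded).equals(original));  // true
    }
}
```

The single character À becomes two escaped octets (%C3 and %80), which is exactly the case where a receiver that assumes a different charset will decode it incorrectly.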

Problem :

  • When we load data containing certain characters (for example À, ° and others) with Pig Latin and store the result in a .txt file, those symbols show up in the file as � and ï characters. This happens because of the UTF-8 substitution (replacement) character.
  • Is it possible to avoid this somehow, perhaps with some Pig commands, so that the resulting .txt file contains, for example, À instead of �?
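
The symptom can be reproduced outside Pig. The sketch below assumes the input file was written in ISO-8859-1 (the question does not say which encoding was used) and shows what happens when those bytes are decoded as UTF-8. It also shows one plausible origin of the stray ï: the replacement character written out as UTF-8 and then viewed as ISO-8859-1. The class name ReplacementCharDemo is made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is illustrative only.
final class ReplacementCharDemo {

    // Decode raw bytes under the assumption that they are UTF-8.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decode raw bytes under the assumption that they are ISO-8859-1.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Assumption: the source file stores À as the single ISO-8859-1 byte 0xC0.
        byte[] latin1Bytes = "\u00C0".getBytes(StandardCharsets.ISO_8859_1);

        // 0xC0 on its own is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println(decodeAsUtf8(latin1Bytes).equals("\uFFFD"));   // true

        // Decoding with the charset the bytes were written in recovers À.
        System.out.println(decodeAsLatin1(latin1Bytes).equals("\u00C0")); // true

        // U+FFFD stored as UTF-8 (EF BF BD) and viewed as ISO-8859-1 starts with ï.
        byte[] replacementUtf8 = "\uFFFD".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeAsLatin1(replacementUtf8).equals("\u00EF\u00BF\u00BD")); // true
    }
}
```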

Solution 1:

  • Pig has built-in dynamic invokers that allow a Pig programmer to call Java functions without having to wrap them in custom Pig UDFs. So we can load the data as UTF-8 encoded strings, decode them, perform all our operations on them, and then store the result back as UTF-8.
  • The code below decodes the encoded strings:
    DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
    encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
    decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
  • The equivalent Java code, written as a custom UDF, is:
    import java.io.IOException;
    import java.net.URLDecoder;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

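    public class UrlDecode extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {
            // Return null when the input is missing or incomplete
            if (input == null || input.size() < 2 || input.get(0) == null) {
                return null;
            }
            String encoded = (String) input.get(0);   // URL-encoded text
            String encoding = (String) input.get(1);  // charset name, e.g. "UTF-8"
            return URLDecoder.decode(encoded, encoding);
        }
    }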
  • Now modify this code to return UTF-8 encoded strings from normal strings (for example with java.net.URLEncoder.encode), and store the result in your text file.
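
For the "store it back as UTF-8" step, the important point is that the charset is named explicitly on both the write and the read. The following is a plain-Java sketch of that idea, not Pig code: it writes a string to a temporary file as UTF-8 and reads it back. The class name Utf8StoreDemo and the temporary file prefix are made up for this example; in a Pig script the actual writing would be done by a STORE statement.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; the name is illustrative only.
final class Utf8StoreDemo {

    // Write the text as UTF-8 bytes, then read the file back as UTF-8.
    static String writeAndReadBack(String text) {
        try {
            Path file = Files.createTempFile("decoded_strings", ".txt");
            try {
                // Name the charset explicitly instead of relying on the platform default.
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00C0 and \u00B0"; // "À and °"
        System.out.println(writeAndReadBack(text).equals(text)); // true
    }
}
```

If the file is later opened in a viewer that assumes a different charset, the characters can still look wrong even though the stored bytes are correct.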

Solution 2:

  • Use the bytearray type instead of chararray, so that the bytes are loaded without any character conversion:
    no_conversion = LOAD 'strangeEncodingdata' USING TextLoader() AS (line:bytearray);
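
Why this helps can be sketched in plain Java. The example below is an analogy, not Pig internals: one method decodes the bytes to a String as UTF-8 and encodes them again (roughly what treating the field as text implies), the other passes the bytes through untouched. The class and method names are made up, and the sample bytes assume ISO-8859-1 input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the name is illustrative only.
final class ByteArrayPassThroughDemo {

    // Text-style handling: decode as UTF-8, then encode as UTF-8 again.
    static byte[] throughString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    // Byte-style handling: the bytes are copied without interpretation.
    static byte[] throughBytes(byte[] raw) {
        return raw.clone();
    }

    public static void main(String[] args) {
        // Assumption: À and ° stored as ISO-8859-1 bytes 0xC0 and 0xB0.
        byte[] raw = {(byte) 0xC0, (byte) 0xB0};

        // Invalid UTF-8 is replaced during decoding, so the original bytes are lost.
        System.out.println(Arrays.equals(throughString(raw), raw)); // false

        // Untouched bytes can still be decoded correctly later on.
        System.out.println(Arrays.equals(throughBytes(raw), raw));  // true
    }
}
```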
