[Solved-2 Solutions] Regex matching in Pig?



REGEX_EXTRACT

  • It performs regular expression matching and extracts the matched group selected by the index parameter.

Syntax

REGEX_EXTRACT (string, regex, index)

Terms

string - The string in which to perform the match.

regex - The regular expression.

index - The index of the matched group to return.
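
Because Pig hands the pattern to Java's regex engine, the role of the index parameter can be illustrated in plain Java. The following is a minimal sketch, not Pig's actual implementation: the class name RegexExtractSketch and the helper regexExtract are hypothetical, and the sketch assumes the index is 1-based and that a missing match yields null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtractSketch {
    // Hypothetical helper (an illustration, not Pig source code): returns the
    // capture group at the given 1-based index, or null when the regex is not
    // found or the index exceeds the number of groups.
    static String regexExtract(String input, String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find() && index >= 1 && index <= m.groupCount()) {
            return m.group(index);
        }
        return null;
    }

    public static void main(String[] args) {
        String hostPort = "192.168.1.5:8020";
        System.out.println(regexExtract(hostPort, "(.*):(.*)", 1)); // 192.168.1.5
        System.out.println(regexExtract(hostPort, "(.*):(.*)", 2)); // 8020
    }
}
```

Here index 1 selects the text captured by the first pair of parentheses and index 2 the second.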

Usage

  • Use REGEX_EXTRACT to extract a single matched group; use the related REGEX_EXTRACT_ALL function to perform regular expression matching and extract all matched groups. Both functions use Java regular expression form.
  • REGEX_EXTRACT_ALL returns a tuple in which each field is one matched group. If there is no match, an empty tuple is returned.

Example

This example uses REGEX_EXTRACT_ALL and returns the tuple (192.168.1.5,8020).

REGEX_EXTRACT_ALL('192.168.1.5:8020', '(.*)\:(.*)');
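
The same extraction can be reproduced with java.util.regex directly, which may help when testing a pattern outside Pig. This is only a sketch under assumptions: the class RegexExtractAllSketch and the helper extractAll are hypothetical names, a Java List stands in for the Pig tuple, and the sketch assumes the pattern must match the whole input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtractAllSketch {
    // Hypothetical helper (not Pig source code): collects every capture group
    // into a list. Returns an empty list when the whole input does not match.
    static List<String> extractAll(String input, String regex) {
        List<String> groups = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // In Java source the backslash before ':' has to be doubled.
        System.out.println(extractAll("192.168.1.5:8020", "(.*)\\:(.*)"));
        // prints [192.168.1.5, 8020]
    }
}
```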

Problem:

  • Using Apache Pig, consider the text:
hahahah. my brother just didnt do anything wrong. He cheated on a test? no way!
  • The goal is to match "my brother just didnt do anything wrong."
  • More generally, the match should begin with "my brother just" and end at either a punctuation mark (end of sentence) or EOL.

After looking at the Pig docs and following the link to java.util.regex.Pattern, the first attempt was:

extrctd = FOREACH fltr GENERATE FLATTEN(EXTRACT(txt,'(my brother just .*\\p{Punct})')) as (txt:chararray);

Solution 1:

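  • In this case we want to match only up to the first punctuation mark, but .* is greedy and runs on to the last punctuation mark it can find.
  • To solve this, we can use the reluctant quantifier .*?, which matches as little as possible:
my brother just .*?\\p{Punct}
  • Note that the use of ? here is different from its use as a quantifier on its own, where it means 'match zero or one'.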
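
The difference between the two quantifiers can be checked in plain Java against the sample text from the problem. This is a small sketch, assuming Pig passes the pattern unchanged to java.util.regex; the class GreedyVsReluctant and the helper firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    static final String TEXT =
        "hahahah. my brother just didnt do anything wrong. He cheated on a test? no way!";

    // Hypothetical helper: returns the first substring matched by the regex,
    // or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy .* extends to the LAST punctuation mark in the line.
        System.out.println(firstMatch("my brother just .*\\p{Punct}", TEXT));
        // my brother just didnt do anything wrong. He cheated on a test? no way!

        // Reluctant .*? stops at the FIRST punctuation mark.
        System.out.println(firstMatch("my brother just .*?\\p{Punct}", TEXT));
        // my brother just didnt do anything wrong.
    }
}
```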

Solution 2:

Try the expression below:

.*(my brother just .*\\p{Punct})
  • It looks like the expression required the "my brother" part to be at the beginning of the string, but in the example it is in the middle of the string, so we have to account for everything before "my brother".
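
The effect of the leading .* can be reproduced in Java with Matcher.matches(), which requires the pattern to cover the entire input. Whether the EXTRACT function used in the question applies whole-string matching is an assumption inferred from the explanation above; the class WholeStringMatch and the helper groupOnFullMatch are hypothetical names. The third pattern, which combines the leading .* with the reluctant .*? and a trailing .*, is an illustrative addition and not part of the original answers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeStringMatch {
    static final String TEXT =
        "hahahah. my brother just didnt do anything wrong. He cheated on a test? no way!";

    // Hypothetical helper: returns group 1 only if the regex matches the
    // ENTIRE input, otherwise null.
    static String groupOnFullMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // No leading .*: the input does not start with "my brother", so no match.
        System.out.println(groupOnFullMatch("(my brother just .*\\p{Punct})", TEXT));
        // null

        // Leading .* absorbs the text before "my brother"; the group is still greedy.
        System.out.println(groupOnFullMatch(".*(my brother just .*\\p{Punct})", TEXT));
        // my brother just didnt do anything wrong. He cheated on a test? no way!

        // Illustrative combination: reluctant group plus trailing .* for the rest.
        System.out.println(groupOnFullMatch(".*(my brother just .*?\\p{Punct}).*", TEXT));
        // my brother just didnt do anything wrong.
    }
}
```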
