[Solved-1 Solution] Find if a string is present inside another string in Pig?



What is a string?

  • A string is a data type used in programming, like an integer or a floating-point number, but it represents text rather than numbers. It is a sequence of characters, which can include letters, spaces and digits.

We have the following String functions in Apache Pig.

S.N. Functions & Description
1 ENDSWITH(string, testAgainst) - To verify whether a given string ends with a particular substring.
2 STARTSWITH(string, substring) - Accepts two string parameters and verifies whether the first string starts with the second.
3 SUBSTRING(string, startIndex, stopIndex) - Returns a substring from a given string.
4 EqualsIgnoreCase(string1, string2) - To compare two strings ignoring the case.
5 INDEXOF(string, 'character', startIndex) - Returns the index of the first occurrence of a character in a string, searching forward from a start index.
6 LAST_INDEX_OF(expression) - Returns the index of the last occurrence of a character in a string, searching backward from a start index.
7 LCFIRST(expression) - Converts the first character in a string to lower case.
8 UCFIRST(expression) - Returns a string with the first character converted to upper case.
9 UPPER(expression) - Returns a string converted to upper case.
10 LOWER(expression) - Converts all characters in a string to lower case.
11 REPLACE(string, 'oldChar', 'newChar') - To replace existing characters in a string with new characters.
12 STRSPLIT(string, regex, limit) - To split a string around matches of a given regular expression.
13 STRSPLITTOBAG(string, regex, limit) - Similar to the STRSPLIT() function, it splits the string by a given delimiter and returns the result in a bag.
14 TRIM(expression) - Returns a copy of a string with leading and trailing whitespace removed.
15 LTRIM(expression) - Returns a copy of a string with leading whitespace removed.
16 RTRIM(expression) - Returns a copy of a string with trailing whitespace removed.
  • Suppose we have a string stored in a PHP variable called $aString, and we want to find out whether it contains a given substring.
  • Say the value of $aString is “Where is Waldo?” and we want to know whether it contains “Waldo”. PHP provides a function called strpos that lets us check whether one string occurs inside another.

Here is an example of how to use the strpos function

Example of how to find out if one string contains another in PHP

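$aString = 'Where is Waldo?';

// strpos returns the position of the match, or false when there is no match
if (strpos($aString, 'Waldo') !== false) {
    echo 'I found Waldo!';
}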
  • The strpos function returns the position of “Waldo” within the string, which tells us that “Waldo” was found. Note that strpos matches a substring anywhere in the string, including inside a longer word. That is a problem if we only want to find “Waldo” as a separate word and not as part of another word.
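One possible way to handle the whole-word case mentioned above is a regular expression with word boundaries. The sketch below is only an illustration: containsWord is a hypothetical helper name, not a PHP built-in, and it assumes that "word" means what the regex \b boundary treats as a word.

```php
<?php
// Hypothetical helper (not part of PHP): true only when $word occurs as a whole word.
function containsWord(string $haystack, string $word): bool
{
    // preg_quote escapes regex metacharacters in $word;
    // \b requires a word boundary on each side of the match.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    return preg_match($pattern, $haystack) === 1;
}

var_dump(containsWord('Where is Waldo?', 'Waldo'));        // bool(true)
var_dump(containsWord('Where is Waldorf?', 'Waldo'));      // bool(false)
var_dump(strpos('Where is Waldorf?', 'Waldo') !== false);  // bool(true): substring match
```

The last line shows the difference: strpos still reports a match inside “Waldorf”, while the word-boundary check does not.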

Strpos never returns true

  • One thing about the strpos function that we should remember is that it never returns the boolean value of true.
  • The strpos function returns a value indicating the position of the first occurrence of the substring being searched for.
  • If the substring is not found, the boolean false is returned instead, which is why the code above checks for false rather than true.
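The three possible outcomes can be seen side by side in a short sketch. The variable names and the “Where is Carmen?” string are made up for illustration.

```php
<?php
$found   = strpos('Where is Waldo?', 'Waldo');   // int(9): "Waldo" starts at index 9
$atStart = strpos('Waldo is here', 'Waldo');     // int(0): match at the very start
$missing = strpos('Where is Carmen?', 'Waldo');  // bool(false): no match

var_dump($found, $atStart, $missing);
```

In no case is the result the boolean true: it is either an integer position or false.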

!== vs != in PHP

  • One thing worth noting in the code above is that we used the !== operator instead of the != operator (which has one fewer “=”). What is the difference between the two operators?
  • We can think of the !== operator as being more ‘strict’ than the != operator. The !== operator reports two operands as not identical if they differ in value or in type, whereas the != operator first converts the operands to a common type and then compares only the resulting values.
  • This strictness is desirable because the strpos function returns 0 if the string being searched starts with the substring. The 0 is the 0th index of the larger string, meaning the first position in that string. So, if $aString is “Waldo is here” and we are searching for “Waldo”, then strpos returns 0.
  • The check being performed is therefore whether 0 is not equal to false. The problem is that PHP treats 0 as the integer equivalent of the boolean false, so the expression “0 != false” evaluates to false, because 0 is loosely equal to false in PHP.
  • If we evaluate “0 !== false” instead, the result is true, because the strict operator also checks whether 0 and false are of the same type. Since 0 is an integer and false is a boolean, they are not identical, so the strict inequality check returns true, unlike the “0 != false” check, which returns false.
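A minimal sketch of the two comparisons, using the “Waldo is here” string from the explanation above ($pos is just an illustrative variable name):

```php
<?php
$aString = 'Waldo is here';
$pos = strpos($aString, 'Waldo'); // int(0): the match is at the start

var_dump($pos != false);   // bool(false): loose comparison treats 0 as equal to false
var_dump($pos !== false);  // bool(true): int 0 and bool false differ in type
```

With the loose operator the match at position 0 is silently missed. The strict operator reports it correctly.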

Problematic code to find a substring inside a larger string

if (strpos($aString,'Waldo') != false) {
    echo 'I found Waldo!';
}
  • The code above can result in problems for the reasons discussed above. It’s always better to use !== instead of !=.

Problem:

  • We want to find out whether a string contains another string in Pig. There is a built-in INDEXOF function, but it is described as searching for a character, not a string. Is there an alternative?

Solution 1:

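One way to find a string inside another string is to filter with a regular expression using matches. Because matches must match the entire field, the search term is wrapped in .* on both sides: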

X = FILTER A BY (f1 matches '.*the_word_you_are_looking_for.*');
