[Solved-2 Solutions] Pig & Cassandra & DataStax Splits Control ?
- Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
- The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
- Apache Cassandra is a NoSQL database ideal for high-speed, online transactional data, while Hadoop is a big data analytics system that focuses on data warehousing and data lake use cases.
We have a small sandbox cluster (2 nodes) where we are putting this system through some tests. We have a CQL table with ~53M rows (about 350 bytes each), and we notice that the Mapper phase takes a very long time to grind through these 53M rows. Digging through the logs, we can see that the map is spilling repeatedly (we counted 177 spills from the mapper), and this is part of the problem.
The combination of CassandraInputFormat and JobConfig creates only a single mapper, so this mapper has to read 100% of the rows from the table.
Now, there are a lot of gears at work in this picture, including:
- 2 physical nodes
- The Hadoop node is in an "Analytics" DC (default config), but physically in the same rack.
- We can see the job using LOCAL_QUORUM consistency.
How to get Pig to create more Input Splits so we can run more mappers?
- If we set cassandra.input.split.size to something less than 65,536 (the default split size), we get more splits.
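As a sketch, the property can be set from a Pig script via Pig's SET command, which forwards properties to the Hadoop job configuration; the property name comes from the answer above, while the value is illustrative and should be tuned against your row count and node count:

```pig
-- Lower the rows-per-split from the ~64k default so more input splits,
-- and therefore more mappers, are created. 10000 is an illustrative value.
SET cassandra.input.split.size 10000;
```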
How many rows per node for the Cql table?
- Add split_size to the URL parameters:
- For CassandraStorage, append split_size to the cassandra:// location URL.
- For CqlStorage, append split_size to the cql:// location URL.
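A minimal sketch of both load URLs, assuming a hypothetical keyspace my_ks and table my_table; the split_size parameter is the one named above, and the storage class names are the usual Cassandra-bundled ones:

```pig
-- CassandraStorage: split_size passed as a URL query parameter
legacy_rows = LOAD 'cassandra://my_ks/my_table?split_size=10000'
              USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- CqlStorage: the same parameter on the cql:// URL
cql_rows = LOAD 'cql://my_ks/my_table?split_size=10000'
           USING org.apache.cassandra.hadoop.pig.CqlStorage();
```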
- Setting pig.noSplitCombination = true takes things to the other extreme: with this flag we ended up with 769 map tasks.
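That task count is roughly what the arithmetic predicts: ~53M rows at the default split size of 65,536 rows per split gives on the order of 800 splits, broadly consistent with the 769 tasks observed once split combination is off. A sketch of the flag, set before the LOAD:

```pig
-- Give every input split its own map task instead of letting Pig
-- combine small splits into fewer, larger map tasks.
SET pig.noSplitCombination true;
```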