Mahout Interview Question And Answers - Wikitechy

What is the difference between Cloudera Oryx and Apache Mahout?

Editor — Tue, 20 Jul 2021 03:20:00 +0000

Differences between Cloudera Oryx and Apache Mahout

There are 3 broad things an operational ML system eventually needs to do:
- Build models at scale, offline
- Update models in near real time
- Query models in real time
Most tools, such as Mahout or MLlib, only build models at scale.
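
As a rough illustration of these three responsibilities, the sketch below uses a toy popularity model in Python. The class and method names (`PopularityModel`, `build_offline`, `update`, `query`) are hypothetical and are not part of Oryx, Mahout, or MLlib; a real system would split these phases across separate batch, streaming, and serving components.

```python
from collections import Counter


class PopularityModel:
    """Toy model: recommends the most frequently seen items."""

    def __init__(self):
        self.counts = Counter()

    def build_offline(self, events):
        # 1. Batch build: recompute the model from the full event history.
        self.counts = Counter(events)

    def update(self, event):
        # 2. Near-real-time update: fold in one new event without a rebuild.
        self.counts[event] += 1

    def query(self, n=1):
        # 3. Real-time query: answer directly from the in-memory model.
        return [item for item, _ in self.counts.most_common(n)]


model = PopularityModel()
model.build_offline(["a", "b", "a", "c"])  # offline build
model.update("b")                          # incremental updates
model.update("b")
top = model.query(1)                       # online query
```

A library that covers only the first step would correspond to `build_offline` alone; the other two methods stand in for the update and serving parts.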

Oryx tries to cover all 3 rather than focusing on model building alone.
It is therefore really intended as a complement to any Hadoop-based model build system.
Its own model building is MapReduce-based, and it implements its algorithms itself instead of using Mahout's, in order to improve on perceived problems.
The project, which is open source, is designed more as 3 complete apps than as a platform for extension.
It only implements
- ALS for recommendation
- k-means for clustering
- Random decision forests for classification and regression
The major difference is that Oryx offers fewer algorithms but complete apps, including incremental update and serving. The algorithms themselves are not really the difference, since Oryx is not a new algorithm library.
The next version is built on Spark and Kafka and becomes more of a generic lambda architecture for ML that happens to include complete apps too.
It is a kind of Summingbird for ML on Spark. For now it has no algorithm implementations of its own at all, which makes it even more different from Mahout or MLlib.

What is the difference between GraphLab and Mahout?

Editor — Tue, 20 Jul 2021 03:13:48 +0000

Differences between GraphLab and Mahout:

Mahout	GraphLab
Mahout is a machine learning framework and an Apache Software Foundation project.	GraphLab takes a quite different approach to parallel collaborative filtering (and, more broadly, machine learning) and is used primarily by academic institutions.
Mahout has inherent fault tolerance.	GraphLab does not have inherent fault tolerance.
Mahout looks like a more polished product, especially as it relies on Hadoop for scalability and distribution.	GraphLab excels because it is built from the ground up for iterative algorithms such as those used in collaborative filtering.
The Mahout framework offers two approaches: online, where recommendations are computed on demand, typically on smaller datasets; and offline, which uses Apache Hadoop to achieve scalability.	GraphLab lacks a production-ready distribution framework.
For 50000 items, each of the N machines (where N is the number of Hadoop nodes) needs at least 28 GiB of memory, which becomes an issue.	Performance penalties are costly, since the runtime of each phase is determined by the slowest machine.

How is Mahout used with Python?

Editor — Tue, 20 Jul 2021 02:48:20 +0000

Using Mahout with Python:

First, download and install the JPype package for Python. The initial step in setting up JPype is to determine the path to the JVM's dynamic library; on Linux this is a .so file and on Windows it is a .dll.
In your Python script, create a global variable holding the path to this library file.
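
One possible way to fill in that global variable is sketched below. It assumes the `JAVA_HOME` environment variable points at a Java installation; the helper `find_jvm_library` is hypothetical, and the macOS library name is an addition not covered in the steps above. JPype also provides `getDefaultJVMPath()`, which may make a manual search unnecessary.

```python
import os

# Library file names to look for: Linux, Windows, and (as an assumption) macOS.
JVM_LIBRARY_NAMES = ("libjvm.so", "jvm.dll", "libjvm.dylib")


def find_jvm_library(java_home):
    """Return the first JVM shared library found under java_home, or None."""
    for root, _dirs, files in os.walk(java_home):
        for name in JVM_LIBRARY_NAMES:
            if name in files:
                return os.path.join(root, name)
    return None


# JAVA_HOME is assumed to be set; jvmlib stays None if it is unset
# or if no library file is found beneath it.
java_home = os.environ.get("JAVA_HOME")
jvmlib = find_jvm_library(java_home) if java_home else None
```

If `jvmlib` ends up as `None`, set the path by hand instead.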
Next, work out the classpath that Mahout needs. The simplest way is to edit the “bin/mahout” script so that it prints the classpath: add the line “echo $CLASSPATH” to the script, somewhere after the comment “run it”.
Finally, run the script to print the classpath, then copy the output and paste it into a variable in your Python script.
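
A small sketch of how the pasted classpath might be stored and sanity-checked before it is handed to the JVM. The paths shown are placeholders, not real Mahout locations, and the helpers `classpath_option` and `missing_entries` are hypothetical names introduced only for this example.

```python
import os


def classpath_option(classpath):
    """Build the JVM option string that sets the Java classpath."""
    return "-Djava.class.path={cp}".format(cp=classpath)


def missing_entries(classpath, sep=os.pathsep):
    """Return the classpath entries that do not exist on disk."""
    entries = [entry for entry in classpath.split(sep) if entry]
    return [entry for entry in entries if not os.path.exists(entry)]


# Placeholder value: replace it with the text printed by the edited bin/mahout script.
classpath = os.pathsep.join([
    "/hypothetical/mahout/conf",
    "/hypothetical/mahout/lib/example.jar",
])
cpopt = classpath_option(classpath)
not_found = missing_entries(classpath)
```

Checking `not_found` before starting the JVM can catch a truncated or mistyped paste early, since a wrong classpath otherwise tends to surface only later as a class-loading error.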
Now we can write a function that starts the JVM from Python using JPype.

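from jpype import *

# jvmlib and classpath are the global variables set up in the steps above.
jvm = None

def start_jpype():
    global jvm
    # Start the JVM only once per process.
    if jvm is None:
        cpopt = "-Djava.class.path={cp}".format(cp=classpath)
        startJVM(jvmlib, "-ea", cpopt)
        jvm = "started"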

Likewise, call this function before any reading or writing:

start_jpype()