Differences between Cloudera Oryx and Apache Mahout

  • There are 3 broad things an operational ML system eventually needs to do:
    • Build models at scale, offline
    • Update models in near real time
    • Query models in real time
  • Most tools, such as Mahout or MLlib, only build models at scale.
  • Oryx tries to do all 3, not just model building.
  • Therefore it is really intended as a complement to any Hadoop-based model build system.
  • Oryx is MapReduce-based for model building, and it implements its own algorithms instead of using Mahout, in order to improve on perceived problems.
  • The project, which is open source, is designed more as 3 complete apps than as a platform for extension.
  • It implements only:
    • ALS for recommendation
    • k-means for clustering
    • Random decision forests for classification and regression
  • The major difference is fewer algorithms but complete apps, including incremental update and serving. The algorithms themselves are not really the difference, since Oryx is not a new library.
  • The next version is built on Spark and Kafka, and becomes more of a generic lambda architecture for ML that happens to include complete apps too.
  • It is a kind of Summingbird for ML on Spark. It has no algorithm implementations of its own at all, at least not for now, so it is even more different from Mahout or MLlib.
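
To make the three responsibilities listed above concrete (build offline, update in near real time, query in real time), here is a minimal sketch in plain Python. It is not Oryx's API: the class `MeanRatingModel` and the functions `build_model`, `update_model`, and `query_model` are hypothetical names, and a toy per-item mean rating stands in for a real algorithm such as ALS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event shape for this sketch: (item_id, rating)
Rating = Tuple[str, float]


class MeanRatingModel:
    """Toy model (an assumption for illustration): per-item mean rating."""

    def __init__(self) -> None:
        self.count: Dict[str, int] = defaultdict(int)
        self.total: Dict[str, float] = defaultdict(float)

    def add(self, item: str, rating: float) -> None:
        self.count[item] += 1
        self.total[item] += rating

    def mean(self, item: str) -> float:
        n = self.count.get(item, 0)
        return self.total[item] / n if n else 0.0


def build_model(history: Iterable[Rating]) -> MeanRatingModel:
    """Batch step: build a model offline from all historical data."""
    model = MeanRatingModel()
    for item, rating in history:
        model.add(item, rating)
    return model


def update_model(model: MeanRatingModel, event: Rating) -> None:
    """Update step: fold one new event into the existing model."""
    item, rating = event
    model.add(item, rating)


def query_model(model: MeanRatingModel, item: str) -> float:
    """Serving step: answer a query from the current model state."""
    return model.mean(item)


model = build_model([("a", 4.0), ("a", 2.0), ("b", 5.0)])
print(query_model(model, "a"))   # 3.0
update_model(model, ("a", 6.0))  # new event arrives after the batch build
print(query_model(model, "a"))   # 4.0
```

A library that only builds models covers `build_model` alone; the point made in the list is that a complete app also needs the other two steps.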
