Native Apps At The Client & Cloud

Srinivasan Sundara Rajan




Big Data: Enterprise Class Machine Learning with Spark and MLbase

New Infrastructure Aims at Economical, In-memory, Large Scale Machine Learning

Machine Learning is a critical part of extracting value from Big Data. Choosing a proper model, preparing the data, and getting usable results on large-scale data is a non-trivial exercise. Typically, the process consists of prototyping a model in a higher-level, (mostly) single-machine tool such as R, Matlab, or Weka, then recoding it in Java or some other language for large-scale deployment. This process is fairly involved, error-prone, slow, and inefficient.

Existing tools that aim to automate and improve this process are still somewhat immature, and wide-scale enterprise adoption of Machine Learning is still low. Efforts are under way to address this gap, i.e. to make enterprise-class Machine Learning more accessible and easier.

Spark is a new, purpose-built, distributed, in-memory engine that makes it possible to run compute-intensive jobs on clusters of commodity hardware. One of the applications Spark targets, and is especially suitable for, is Machine Learning, a key part of getting actionable insights from Big Data.

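Machine Learning is a compute-intensive application, characterized by many iterative passes through the data until an optimal solution is found. Spark is a natural fit for such workloads because it can keep the working dataset in memory between passes instead of rereading it from disk.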
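To make the "many iterative passes" point concrete, here is a minimal single-machine sketch in plain Python. It is not Spark or MLbase code; the function name `gradient_descent`, the toy dataset, and the learning rate are all assumptions chosen for illustration. It only shows the access pattern: each iteration rescans the entire dataset, which is why an engine that keeps the data in memory helps.

```python
# Single-machine sketch of an iterative ML workload: batch gradient descent
# for 1-D linear regression. Names and data are illustrative assumptions,
# not part of Spark or MLbase.

def gradient_descent(points, learning_rate=0.05, iterations=2000):
    """Fit y = w * x + b by repeatedly scanning all (x, y) points."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        # Each iteration makes a full pass over the data to compute the
        # gradient of the mean squared error.
        grad_w = sum((w * x + b - y) * x for x, y in points) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in points) * 2 / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (made up for this example).
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = gradient_descent(data)
```

With 2000 iterations, this tiny example reads the same five points 2000 times. On a large dataset, the cost of those repeated reads dominates, so where the data lives between iterations matters a great deal.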

MLbase, an open source project, is an ML platform implemented on top of Spark that aims to make implementing ML algorithms easier and more productive.

Arguably the most interesting part of MLbase will be the ML Optimizer (not yet released), which will automate the task of choosing models.

Choosing a proper model is a difficult task, and there have been quite a few attempts to automate it. One of the most interesting available products is the Google Prediction API, a cloud service that automatically evaluates, picks, and executes a model on submitted data.
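As a rough illustration of what automating model choice can look like, the sketch below scores two candidate models with k-fold cross-validation and keeps the one with the lower error. It is a generic technique written in plain Python, not the MLbase ML Optimizer (which is unreleased) and not the Google Prediction API; every function name and the toy data are hypothetical.

```python
# Hypothetical sketch of automated model selection via k-fold
# cross-validation. Not MLbase's optimizer or the Google Prediction API.

def fit_mean(train):
    """Baseline candidate: always predict the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Candidate: ordinary least squares for y = w * x + b in one dimension."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in train)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    w = cov_xy / var_x if var_x else 0.0
    b = mean_y - w * mean_x
    return lambda x: w * x + b

def cross_val_error(fit, data, folds=3):
    """Squared error on held-out points, pooled over all folds."""
    total, count = 0.0, 0
    for i in range(folds):
        # Every folds-th point starting at i is held out; the rest train.
        test = data[i::folds]
        train = [p for j, p in enumerate(data) if j % folds != i]
        model = fit(train)
        total += sum((model(x) - y) ** 2 for x, y in test)
        count += len(test)
    return total / count

def choose_model(candidates, data):
    """Return the name of the candidate with the lowest held-out error."""
    return min(candidates, key=lambda name: cross_val_error(candidates[name], data))

candidates = {"mean": fit_mean, "line": fit_line}
# Toy data generated from y = 3x - 2 (made up for this example).
data = [(float(x), 3.0 * x - 2.0) for x in range(9)]
best = choose_model(candidates, data)
```

Because the toy data is exactly linear, the selection picks the `"line"` candidate. A real system would search over far more model families and hyperparameters, which is where a distributed engine becomes useful.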

More Stories By Ranko Mosic

Ranko Mosic, BScEng, specializes in Big Data/Data Architecture consulting services (database/data architecture, machine learning). His clients are in the finance, retail, and telecommunications industries. Ranko welcomes inquiries about his availability for consulting engagements and can be reached at 408-757-0053 or ranko.mosic@gmail.com.