Deploying Hadoop on EC2 with Whirr
Apache Whirr is a set of libraries for deploying cloud services. It can be used on Amazon Elastic Compute Cloud (EC2), Rackspace Cloud, and many other cloud providers. Requirements: you need to have an account on Amazon...
Playing with Hadoop Pig
Hadoop Pig is a tool for manipulating data from various sources (CSV files, MySQL, MongoDB, …) using a procedural language (Pig Latin). It can run standalone or distributed with Hadoop. Unlike Hive, it can...
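As a flavor of the Pig Latin scripts the post covers, here is a minimal sketch; the file name and field schema are hypothetical:

```pig
-- Load a hypothetical CSV of page visits (file name and schema are assumptions)
visits = LOAD 'visits.csv' USING PigStorage(',')
         AS (user:chararray, url:chararray, time:long);
-- Group the records by user
by_user = GROUP visits BY user;
-- Count visits per user and write the result out
counts = FOREACH by_user GENERATE group AS user, COUNT(visits) AS n;
STORE counts INTO 'visit_counts';
```

Run standalone with `pig -x local script.pig`, or against a cluster by dropping the `-x local` flag.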
Playing with the Mahout recommendation engine on a Hadoop cluster
Apache Mahout is an open-source library that implements several scalable machine learning algorithms. They can be used, among other things, to categorize data, to group items into clusters, and to implement a...
Playing with Apache Hive, MongoDB and the MTA
Apache Hive is a popular data warehouse system for Hadoop that allows users to run SQL queries on top of Hadoop by translating them into Map/Reduce jobs. Due to the high latency incurred by Hadoop to...
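To illustrate the kind of SQL that Hive compiles into Map/Reduce jobs, here is a sketch of an aggregation query; the table and column names are hypothetical, loosely modeled on MTA-style transit data:

```sql
-- Hypothetical table of subway turnstile logs;
-- Hive compiles this into one or more Map/Reduce jobs
SELECT station, COUNT(*) AS entries
FROM turnstile_logs
WHERE log_date = '2013-06-01'
GROUP BY station
ORDER BY entries DESC
LIMIT 10;
```

The GROUP BY maps naturally onto a Map/Reduce shuffle, which is why even a simple query like this incurs full job-launch latency.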
Playing with Apache Hive and SOLR
As described in a previous post, Apache SOLR performs very well for low-latency analytics. Data logs can be pre-aggregated using Hive and then synced to SOLR. To this end, we developed a...
Installing and comparing MySQL/MariaDB, MongoDB, Vertica, Hive and Impala...
A common task in a data analyst's day-to-day job is to run aggregations, typically summing and averaging columns under different filters. When tables start to grow to hundreds of...
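The aggregations being compared across these engines have this general shape; the table and column names below are hypothetical, but each system accepts an equivalent query:

```sql
-- Sum and average a measure column under a filter, grouped by a dimension
SELECT region,
       SUM(amount) AS total_amount,
       AVG(amount) AS avg_amount
FROM sales
WHERE sale_year = 2013
GROUP BY region;
```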
Using the Mahout Naive Bayes Classifier to automatically classify Twitter...
In this post, we are going to categorize the tweets by distributing the classification across the Hadoop cluster. This can make classification faster when there is a huge number of tweets to classify. To...