Getting Started With Hadoop Using Hortonworks Sandbox

Getting started with a distributed system like Hadoop can be a daunting task for developers. From installing and configuring Hadoop to learning the basics of MapReduce and other add-on tools, the learning curve is pretty high.

Hortonworks recently released the Hortonworks Sandbox for anyone interested in learning and evaluating enterprise Hadoop.

The Hortonworks Sandbox provides:

  1. A virtual machine with Hadoop preconfigured.
  2. A set of hands-on tutorials to get you started with Hadoop.
  3. An environment to help you explore related projects in the Hadoop ecosystem like Apache Pig, Apache Hive, Apache HCatalog and Apache HBase.

You can download the Sandbox from the Hortonworks website:

http://hortonworks.com/products/hortonworks-sandbox/

The Sandbox download is available for both VirtualBox and VMware Fusion/Player environments. Just follow the instructions to import the Sandbox into your environment.

The download is an OVA (open virtual appliance), which is really a TAR file.

tar -xvf Hortonworks+Sandbox+1.2+1-21-2012-1+vmware.ova

Once untarred, the archive consists of an OVF (Open Virtualization Format) descriptor file, a manifest file and a disk image in VMDK format.
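The archive layout can also be checked without extracting anything. The following is a minimal Python sketch using the standard tarfile module. The function names and role labels are illustrative, and the extension-to-role mapping (.ovf, .mf, .vmdk) is an assumption based on the usual OVA layout rather than something taken from the Sandbox documentation.

```python
import tarfile
from pathlib import PurePosixPath

# Assumed mapping from file extension to the role the file plays in an OVA.
ROLES = {".ovf": "descriptor", ".mf": "manifest", ".vmdk": "disk image"}


def classify_members(names):
    """Map each archive member name to a role, based on its extension."""
    return {
        name: ROLES.get(PurePosixPath(name).suffix.lower(), "other")
        for name in names
    }


def inspect_ova(path):
    """List and classify the members of an OVA without extracting it."""
    with tarfile.open(path) as ova:
        return classify_members(ova.getnames())
```

Calling inspect_ova() with the path to the downloaded .ova file should return one entry per member, which makes it easy to spot the VMDK file needed in the next step.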

Rackspace Cloud doesn’t let you upload your own images, but if you have an OpenStack-based cloud, you can boot a virtual machine from the image provided.

First, you can convert the vmdk image to a more familiar format like qcow2.

qemu-img convert -c -O qcow2 Hortonworks_Sandbox_1.2_1-21-2012-1_vmware-disk1.vmdk hadoop-sandbox.qcow2

file hadoop-sandbox.qcow2
hadoop-sandbox.qcow2: QEMU QCOW Image (v2), 17179869184 bytes
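The byte count in that output is easier to reason about in GiB. A quick Python sketch of the conversion follows; it assumes the figure is the image's virtual disk size (what the guest sees) rather than the size of the compressed file on disk, and the helper name is made up for illustration.

```python
def bytes_to_gib(size_bytes):
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return size_bytes / 2**30


# Size reported by `file` for hadoop-sandbox.qcow2.
print(bytes_to_gib(17179869184))  # 16.0
```

Under that assumption, the instance needs a flavor whose root disk is at least 16 GiB.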

Now, let’s upload the image to Glance.

glance add name="hadoop-sandbox" is_public=true container_format=bare disk_format=qcow2 < /path/to/hadoop-sandbox.qcow2

Now let’s boot a virtual server from the new image. Choose a flavor with at least 4 GB of RAM.

nova boot --flavor $flavor_id --image $image_id hadoop-sandbox
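Picking $flavor_id comes down to finding the smallest flavor that meets the 4 GB RAM requirement. Here is a small Python sketch of that selection. The flavor list is entirely hypothetical (real IDs, names and sizes depend on your cloud), and the function is an illustration, not part of any OpenStack client.

```python
def pick_flavor(flavors, min_ram_mb=4096):
    """Return the smallest flavor with at least min_ram_mb of RAM, or None."""
    candidates = [f for f in flavors if f["ram_mb"] >= min_ram_mb]
    return min(candidates, key=lambda f: f["ram_mb"], default=None)


# Hypothetical flavor list; replace with the flavors your cloud offers.
flavors = [
    {"id": "1", "name": "tiny", "ram_mb": 512},
    {"id": "2", "name": "small", "ram_mb": 2048},
    {"id": "3", "name": "medium", "ram_mb": 4096},
    {"id": "4", "name": "large", "ram_mb": 8192},
]

print(pick_flavor(flavors)["id"])  # 3
```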

Once the instance reaches ACTIVE status and responds to ping, you can SSH into it with:

  • Username: root
  • Password: hadoop

Watch /var/log/boot.log as the services are coming up, and it will let you know when the installation is complete. This can take about 10 minutes.

At the end, you should have these Java processes running:

jps
2912 TaskTracker
2336 DataNode
2475 SecondaryNameNode
3343 HRegionServer
2813 JobHistoryServer
2142 NameNode
3012 QuorumPeerMain
4215 RunJar
4591 Jps
3568 RunJar
3589 RunJar
1559 Bootstrap
2603 JobTracker
3857 RunJar
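Comparing that listing by eye gets tedious, so the check can be scripted. The Python sketch below parses jps output and reports which daemons are missing. The EXPECTED set is copied from the named daemons in the listing; the generic RunJar and Bootstrap entries are left out on the assumption that they are launcher processes, and the function names are illustrative.

```python
# Named daemons taken from the jps listing for the Sandbox.
EXPECTED = {
    "NameNode", "DataNode", "SecondaryNameNode", "JobTracker",
    "TaskTracker", "JobHistoryServer", "HRegionServer", "QuorumPeerMain",
}


def running_processes(jps_output):
    """Return the set of process names found in `jps` output."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # A jps line looks like "<pid> <name>"; skip anything else.
        if len(parts) == 2 and parts[0].isdigit():
            names.add(parts[1])
    return names


def missing_daemons(jps_output, expected=EXPECTED):
    """Return a sorted list of expected daemons absent from the output."""
    return sorted(expected - running_processes(jps_output))


SAMPLE = """\
2912 TaskTracker
2336 DataNode
2475 SecondaryNameNode
3343 HRegionServer
2813 JobHistoryServer
2142 NameNode
3012 QuorumPeerMain
4215 RunJar
4591 Jps
1559 Bootstrap
2603 JobTracker
"""

print(missing_daemons(SAMPLE))  # []
```

An empty list means every expected daemon is up; anything else names the services still starting or failed.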

Point your browser at http://instance_ip and your single-node Hadoop cluster should be running. Just follow the UI; it has demos, videos and step-by-step hands-on tutorials on Hadoop, Pig, Hive and HCatalog.

Make your web site faster

Google’s mod_pagespeed speeds up your site and reduces page load time. This open-source Apache HTTP server module automatically applies web performance best practices to pages, and associated assets (CSS, JavaScript, images) without requiring that you modify your existing content or workflow.

Features
  • Automatic website and asset optimization
  • Latest web optimization techniques
  • 40+ configurable optimization filters
  • Free, open-source, and frequently updated
  • Deployed by individual sites, hosting providers, CDNs

How does mod_pagespeed speed up websites?

mod_pagespeed improves web page latency and bandwidth usage by changing the resources on that web page to implement web performance best practices. Each optimization is implemented as a custom filter in mod_pagespeed, and the filters run when the Apache HTTP server serves the website's assets. Some filters simply alter the HTML content, and other filters change references to CSS, JavaScript, or images to point to more optimized versions.
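To make the filter idea concrete, here is a toy Python sketch of a chain of HTML-rewriting filters. It is not how mod_pagespeed is implemented (the real filters run inside Apache and parse HTML properly); the two functions only imitate, with naive regular expressions, the kind of work done by comment-removal and whitespace-collapsing filters, and all names here are illustrative.

```python
import re


def remove_comments(html):
    """Drop HTML comments (naive: ignores conditional comments and scripts)."""
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)


def collapse_whitespace(html):
    """Collapse runs of whitespace into one space (naive: ignores <pre>)."""
    return re.sub(r"\s+", " ", html).strip()


def apply_filters(html, filters):
    """Run each filter in order, feeding the output of one into the next."""
    for rewrite in filters:
        html = rewrite(html)
    return html


page = "<p>  hi </p> <!-- note -->\n<p>x</p>"
print(apply_filters(page, [remove_comments, collapse_whitespace]))
# <p> hi </p> <p>x</p>
```

Because each filter is an independent rewrite, filters can be switched on or off individually, which mirrors the configurable-filter design the article describes.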

mod_pagespeed implements custom optimization strategies for each type of asset referenced by the website, to make them smaller, reduce the loading time, and extend the cache lifetime of each asset. These optimizations include combining and minifying JavaScript and CSS files, inlining small resources, and others. mod_pagespeed also dynamically optimizes images by removing unused meta-data from each file, resizing the images to specified dimensions, and re-encoding images to be served in the most efficient format available to the user.

mod_pagespeed ships with a set of core filters designed to safely optimize the content of your site without affecting the look or behavior of your site. In addition, it provides a number of more advanced filters which can be turned on by the site owner to gain higher performance improvements.

mod_pagespeed can be deployed and customized for individual web sites, and it is also used by large hosting providers and CDNs to help their users improve the performance of their sites, lower the latency of their pages, and decrease bandwidth usage.

Installing mod_pagespeed

Supported platforms

  • CentOS/Fedora (32-bit and 64-bit)
  • Debian/Ubuntu (32-bit and 64-bit)

To install the packages on Debian/Ubuntu, run the following commands as root:

dpkg -i mod-pagespeed-*.deb
apt-get -f install

For CentOS/Fedora, please execute (also as root):

yum install at  # if you do not already have 'at' installed
rpm -U mod-pagespeed-*.rpm

Installing mod_pagespeed will add the Google repository so your system will automatically keep mod_pagespeed up to date. If you don’t want Google’s repository, run sudo touch /etc/default/mod-pagespeed before installing the package.

You can also download a number of system tests. These are the same tests available on ModPageSpeed.com.

What is installed

  • The mod_pagespeed packages install two versions of the mod_pagespeed code itself, mod_pagespeed.so for Apache 2.2 and mod_pagespeed_ap24.so for Apache 2.4.
  • Configuration files: pagespeed.conf, pagespeed_libraries.conf, and (on Debian) pagespeed.load. If you modify one of these configuration files, that file will not be upgraded automatically in the future.
  • A standalone JavaScript minifier pagespeed_js_minify based on the one used in mod_pagespeed, that can both minify JavaScript and generate metadata for library canonicalization.

Facebook Events Join the Contextual-Computing Party

Facebook made a tweak to its Events system this week, adding a little embedded forecast that shows projected weather on the day of the event. It’s a small change, but part of a big shift in computing.

Facebook CEO Mark Zuckerberg at a product launch earlier this month. Photo: Alex Washburn/Wired

The new feature, described by Facebook in briefings with individual reporters, pulls forecasts for the location of the event from monitoring company Weather Underground and attaches them to the Facebook pages of events happening within the next 10 days. The data is also shown while the event is being created, helping organizers avoid rained-out picnics and the like.

The change makes Facebook more sensitive to contextual information, data like location and time of day that the user doesn’t even have to enter. Facebook rival Google has drawn big praise for its own context-sensitive application Google Now, which, depending on your habits, might show you weather and the day’s appointments when you wake up, traffic information when you get in your car, and your boarding pass when you arrive at the airport. Google Now was so successful on Android smartphones that Google is reportedly porting the app to Apple’s iOS.

Apple’s own stab at contextual computing, the Siri digital assistant, has been less successful, but that seems to have more to do with implementation issues – overloaded servers, bad maps, and tricky voice-recognition problems – than with the idea of selecting information based on location and other situational data.

Hungry as Facebook is to sell ever-more-targeted ads at ever-higher premiums, expect the social network to add more context-sensitive features. One natural step is putting the Graph Search search engine on mobile phones and tailoring results more closely to location. Another is to upgrade Facebook’s rapidly evolving News Feed, which already filters some information based on your past check-ins, along the same lines. Done right, pushing information to Facebook users based on context could multiply the social network’s utility. Done wrong, it could be creepy on a whole new level.

Awesome MediaWiki theme

For anyone who saw the launch of the new oVirt website a while back and wondered how they could make such an attractive theme and layout for a MediaWiki wiki, wonder no more. In fact, you don’t even have to be jealous! The theme, called Strapping because it’s based on the Bootstrap web framework, has just been published by Garrett on GitHub.

Kudos to Garrett, who did amazing work on this theme to make it as beautiful and reusable as possible. I’m looking forward to using it for other websites in the near future. And so can you!

LinkedIn has just announced the release of Camus

Kafka is a high-throughput, persistent, distributed messaging system that was originally developed at LinkedIn. It forms the backbone of Wikimedia’s new data analytics pipeline.

Kafka is both performant and durable. To make high throughput easier to achieve on a single node, it also leaves out many features that message brokers ordinarily provide, which makes it a simpler distributed messaging system.

LinkedIn has just announced the release of Camus: their Kafka to HDFS pipeline.

 

Are You a Force Multiplier?

On most days, my To Do List seems longer than the Nile River. It contains everything from the quotidian (remember the milk!) to the critical: tasks that trigger serious consequences. On days when it seems like I add two tasks for every one I complete, it can be tempting to focus on the noisiest ones. What are noisy tasks? The tasks with the most pressing deadline or the most vocal sponsor. And so it goes, racing from one due date to another, with barely enough time for a breath, much less a moment to consider the true results of what I am doing.

Writers on productivity, time management and strategy have told us for a long time that we should focus on the IMPORTANT not the URGENT. That’s excellent advice.  However, I’ve recently started thinking about another lens through which to view and prioritize tasks:  Will the completion of the task (or project) act as a force multiplier?

To understand this better, let’s spend a moment on force multiplication. The military calls a factor a “force multiplier” when that factor enables a force to work much more effectively. The example in Wikipedia relates to GPS: “if a certain technology like GPS enables a force to accomplish the same results of a force five times as large but without GPS, then the multiplier is 5.” Interestingly, while technology can be an enormous advantage, force multipliers are not limited to technology. Some of the force multipliers listed in that Wikipedia article have nothing at all to do with technology.

Now come back to that growing To Do List and take another look at those tasks.  How many of them are basically chores — things that simply need to get done in order to get people off your back or to move things forward (perhaps towards an unclear goal)? How many of them are (or are part of) force multipliers — things that will allow you or your organization to work in a dramatically more effective fashion?  Viewed through this lens, the chores seem much less relevant, akin to rearranging the deck chairs on the Titanic, while the force multipliers are clearly much more deserving of your time and attention.

The challenge, of course, is that the noisy tasks grab your attention because others insist on it. They want something when they want it because they want it. They may not have a single strategic thought in their head, but they are demanding and persistent.

So how do you limit the encroachment of purveyors of noisy tasks? One answer is to limit the amount of time available for chores. To do this credibly, you’ll need to know where you and your activities fit within the strategy of your organization. If the task does not advance strategy, don’t do it. Or decide upfront to allow a fixed percentage of your time for chores that may be of minimal use to you, but may be important to keep the people around you happy.

Another approach is to get a better understanding of the task and its context. If your job is to copy documents, one page looks much like another. However, it matters whether the document you are copying contains the cafeteria menu or the firm’s emergency response guidelines.

Finally, you need to educate the folks around you. With your subordinates, do your decision making aloud, explaining how you determine if a particular task or project is a force multiplier. With your superiors, ask them to help you understand better the force multiplication attributes they see in the tasks they assign. (This will either provide you with more useful contextual information or smoke out a chore that is masquerading as an important task.) With everyone else, engage them in conversation. When you cannot see your way clear to handle their chore, explain your reasoning. They won’t always be happy about it, but they will start learning when to call on you and when to dump their requests on someone else.

Of course, the concept of force multiplication goes far beyond your To Do List. Do your projects have a force multiplying effect on your department? Does your department have a force multiplying effect on your firm? These are important questions for everyone, but especially for people engaged in the sometimes amorphous field of knowledge management. Sure, most of what we do helps. But do we make a dramatic difference? If not, why not?

[Photo Credit: Leo Reynolds]

Written By: V Mary Abraham

Anonymous members speak out about WikiLeaks’ fundraising tactics

In the past, Anonymous has been among the most supportive backers of WikiLeaks and the mission behind it. That is still halfway true, but since everything seems to have turned into the ‘one man Julian Assange show,’ the majority of the hacktivist group no longer embraces the site.

AnonymousIRC released a statement on Pastebin yesterday, shortly after announcing their withdrawal of support for WikiLeaks via Twitter:

The end of an era. We unfollowed @wikileaks and withdraw our support. It was an awesome idea, ruined by Egos. Good Bye.

WikiLeaks is funded entirely through donations, which is fine, according to Anonymous. The problem is that it began requiring users to donate money in order to access any content at all.

Since yesterday visitors of the Wikileaks site are presented a red overlay banner that asks them to donate money. This banner cannot be closed and unless a donation is made, the content like GIFiles and the Syria emails are not displayed.

That’s a great way for any donation-driven service to pull in a ton of donations in a short amount of time, but as Anonymous has already said, it clearly demonstrates that WikiLeaks’ primary focus has changed from releasing information and serving its users to just another money-making scheme.

“The idea behind WikiLeaks was to provide the public with information that would otherwise be kept secret by industries and governments. Information we strongly believe the public has a right to know,” the statement said.

“But this has been pushed more and more into the background, instead we only hear about Julian Assange, like he had dinner last night with Lady Gaga. That’s great for him but not much of our interest. We are more interested in transparent governments and bringing out documents and information they want to hide from the public.”

I think I’ll have to agree with the group’s Pastebin statement. I’m all for establishing an online business or service and monetizing it to no end, but certainly not if you’re a not-for-profit organization whose mission statement is to “bring important news and information to the public.”

Any organization – especially non-profit groups – needs funding to survive, but in the case of WikiLeaks, a fee shouldn’t be charged in order to access content — not if it wants to keep its credibility and supporters, anyway.

The banner has since been taken down, and Anonymous has already made it clear that it still supports the original idea and completely opposes any legal action being taken against Assange:

It goes without saying that we oppose any plans of extraditing Julian to the USA. He is a content provider and publisher, not a criminal.

This whole ordeal could definitely cause some turbulence for WikiLeaks – a fair amount of content is believed to have been submitted by Anonymous in the past (including the recent Stratfor email cache).

So if Anonymous is cutting ties with the organization, that could mean fewer leaks and, thus, less content for WikiLeaks.

T-Mobile Merging With MetroPCS

Last year T-Mobile tried to merge with AT&T but the deal was blocked by the FCC. Now T-Mobile and MetroPCS have agreed to merge in a $1.5 billion deal. There doesn’t seem to be much concern that the FCC will disagree with this deal, perhaps because the two companies combined will have a user base of 42.5 million, which will still be smaller than the #3 player Sprint’s 56.4 million. Because the two companies have similar spectrum holdings, T-Mobile claims the merger will allow them to offer better coverage. They also say they will continue to offer a range of both on- and off-contract plans.

r2d2b2g: an experimental prototype Firefox OS test environment

Developers building apps for Firefox OS should be able to test them without having to deploy them to actual devices.  Myk Melez looked into the state of the art recently and found that the existing desktop test environments, like B2G Desktop, the B2G Emulators, and Firefox’s Responsive Design View, are either difficult to configure or significantly different from Firefox OS on a phone.

Firefox add-ons provide one of the simplest software installation and update experiences. And B2G Desktop is a lot like a phone. So, Myk Melez decided to experiment with distributing B2G Desktop via an add-on. And the result is r2d2b2g, an experimental prototype test environment for Firefox OS.

How It Works

r2d2b2g bundles B2G Desktop with Firefox menu items for accessing that test environment and installing an app into it. With r2d2b2g, starting B2G Desktop is as simple as selecting Tools > B2G Desktop:

To install an app into B2G Desktop, navigate to it in Firefox, then select Tools > Install Page as App:

r2d2b2g will install the app and start B2G Desktop so you can see the app the way it’ll appear to Firefox OS users:

Try It Out!

Note that r2d2b2g is an experiment, not a product! It is neither stable nor complete, and its features may change or be removed over time. Or Mozilla might end the project after learning what they can from it. But if you’re the adventurous sort, and you’d like to provide feedback on this investigation into a potential future product direction, then they’d love to hear from you!

Install r2d2b2g via these platform-specific XPIs: Mac, Linux (32-bit), or Windows (caveat: the Windows version of B2G Desktop currently crashes on startup due to bug 794662 795484), or fork it on GitHub, and let us know what you think!

Also, try out the Wikipedia Mobile for Firefox OS application available on GitHub. You can see it in action here.

Google Glass, Augmented Reality Spells Data Headaches

Google seems determined to press forward with Google Glass technology, filing a patent for a Google Glass wristwatch. As pointed out by CNET, the timepiece includes a camera and a touch screen that, once flipped up, acts as a secondary display. In the patent, Google refers to the device as a ‘smart-watch.’ Whether or not a Google Glass wristwatch ever appears on the marketplace (just because a tech titan patents a particular invention doesn’t mean it’s bound for store shelves anytime soon), the appearance of augmented-reality accessories brings up a handful of interesting issues for everyone from app developers to those tasked with handling massive amounts of corporate data.

For app developers, augmented-reality devices raise the prospect of broader ecosystems and spiraling complexity. It’s one thing to build an app for smartphones and tablets, but what if that app also needs to handle streams of data ported from a pair of tricked-out sunglasses or a wristwatch, or send information in a concise and timely way to a tiny screen an inch in front of someone’s left eye?