What Does a Technology Strategist Do?

A technology strategist is responsible for developing, implementing and maintaining strategies as they relate to a company’s information technology structure. This is critical for a business because the use of technology typically reduces costs and results in greater efficiency and profit yield. As a technology strategist works, he must coordinate not only with members of management, but also with the company’s clients.

One of the initial tasks a technology strategist has is to evaluate the company’s current technology status. This may include speaking with managers of the information technology team, as well as physically visiting the company’s various centers to assess what is currently in use. The technology strategist also researches what the company’s competitors are doing with technology in order to determine whether his company is being equally innovative. As part of this process, the technology strategist may request formal inventory, technology and technology-related production reports.

During the research process, a technology strategist also tries to determine what the customers of the company need or want that could be supplied or supported with technology. For instance, the strategist might discover that the company’s clients have a strong desire to track shipments. He could take this information to managers of the technology department to develop an order tracking system clients could access online. The strategist may gather information about the client’s needs and wants through techniques such as interviewing, feedback forms, surveys and reviews of the type and number of items sold.

What a technology strategist is able to do for a company is determined to a great extent by the company’s budget and operational philosophies. For instance, if the company has suffered a profit loss in the previous year, the company might not be able to allocate as many resources to technology and technology strategy development. Thus, the strategist has to get information about the company’s current financial status and why the company plans to allocate funds in a specific way before he can design an acceptable strategy. This information usually comes from managers of the finance department, as well as documents such as the company mission statement or project proposals.

Once a technology strategist has all the data he needs about the company, its clients and the company’s competitors, he uses his knowledge about the company’s products, services, market position and current budget and technology status to brainstorm ideas about how the company could approach technology. Each strategy that is developed has to show the risks, benefits, resources and opportunities available for the company clearly, so the strategist spends time getting his ideas in a presentable form, such as a formal report or presentation.

The next step for a technology strategist is to present his ideas to the managers of the company. During this presentation, the strategist’s job is to make sure the managers understand the pros and cons of each strategy proposed. The strategist ultimately does not decide which strategy to follow, but because his insights have a huge amount of sway in the direction the managers take, the strategist has a powerful position within the company.

After the company has selected an appropriate technology strategy, the technology strategist moves on to the implementation and maintenance phases of strategy development. At this point, the strategist concentrates on purchasing and setting up the company’s technology as discussed. If something doesn’t work, the strategist has to troubleshoot and come up with a solution. He has to liaise with clients and suppliers to make this work and monitor results. The strategist may request reports from various departments within the company to evaluate the effectiveness of the plan.

Technology is constantly evolving, so a major challenge a technology strategist has is that his strategies have to be easily adaptable to future technological advances. This is a conundrum because it is so difficult to predict what technology will arise or to determine by what point any new technology will be truly functional. Consequently, even after the strategist has developed and implemented a technology strategy, he has to be on the lookout for more efficient options constantly. The strategist plays a key role in helping a company decide when and how to switch from one technology to another if necessary.

http://www.linkedin.com/in/preilly

Get Ready For a Streaming Music Die-Off

Streaming services are ailing. Pandora, the giant of its class and the survivor at 13 years old, is waging an ugly war to pay artists and labels less in order to stay afloat. Spotify, in spite of 6 million paid users and 18 million subscribers who humor some ads in their stream, has yet to turn a profit. Rhapsody axed 15% of its workforce right as Apple’s iTunes Radio hit the scene. On-demand competitor Rdio just opted for layoffs too, in order to move into a ‘scalable business model.’ Did no one wonder about that business-model bit in the beginning? Meanwhile, Turntable.fm, a comparatively tiny competitor with what should have been viral DNA, just pulled the plug on its virtual jam sessions this week—and it just might be the canary in the coal mine.

How the Bitcoin protocol actually works

Many thousands of articles have been written purporting to explain Bitcoin, the online, peer-to-peer currency. Most of those articles give a hand-wavy account of the underlying cryptographic protocol, omitting many details. Even those articles which delve deeper often gloss over crucial points. My aim in this post is to explain the major ideas behind the Bitcoin protocol in a clear, easily comprehensible way. We’ll start from first principles, build up to a broad theoretical understanding of how the protocol works, and then dig down into the nitty-gritty, examining the raw data in a Bitcoin transaction.

Understanding the protocol in this detailed way is hard work. It is tempting instead to take Bitcoin as given, and to engage in speculation about how to get rich with Bitcoin, whether Bitcoin is a bubble, whether Bitcoin might one day mean the end of taxation, and so on. That’s fun, but severely limits your understanding. Understanding the details of the Bitcoin protocol opens up otherwise inaccessible vistas. In particular, it’s the basis for understanding Bitcoin’s built-in scripting language, which makes it possible to use Bitcoin to create new types of financial instruments, such as smart contracts. New financial instruments can, in turn, be used to create new markets and to enable new forms of collective human behaviour. Talk about fun!

I’ll describe Bitcoin scripting and concepts such as smart contracts in future posts. This post concentrates on explaining the nuts-and-bolts of the Bitcoin protocol. To understand the post, you need to be comfortable with public key cryptography, and with the closely related idea of digital signatures. I’ll also assume you’re familiar with cryptographic hashing. None of this is especially difficult. The basic ideas can be taught in freshman university mathematics or computer science classes. The ideas are beautiful, so if you’re not familiar with them, I recommend taking a few hours to get familiar.

It may seem surprising that Bitcoin’s basis is cryptography. Isn’t Bitcoin a currency, not a way of sending secret messages? In fact, the problems Bitcoin needs to solve are largely about securing transactions — making sure people can’t steal from one another, or impersonate one another, and so on. In the world of atoms we achieve security with devices such as locks, safes, signatures, and bank vaults. In the world of bits we achieve this kind of security with cryptography. And that’s why Bitcoin is at heart a cryptographic protocol.

My strategy in the post is to build Bitcoin up in stages. I’ll begin by explaining a very simple digital currency, based on ideas that are almost obvious. We’ll call that currency Infocoin, to distinguish it from Bitcoin. Of course, our first version of Infocoin will have many deficiencies, and so we’ll go through several iterations of Infocoin, with each iteration introducing just one or two simple new ideas. After several such iterations, we’ll arrive at the full Bitcoin protocol. We will have reinvented Bitcoin!

This strategy is slower than if I explained the entire Bitcoin protocol in one shot. But while you can understand the mechanics of Bitcoin through such a one-shot explanation, it would be difficult to understand why Bitcoin is designed the way it is. The advantage of the slower iterative explanation is that it gives us a much sharper understanding of each element of Bitcoin.

Finally, I should mention that I’m a relative newcomer to Bitcoin. I’ve been following it loosely since 2011 (and cryptocurrencies since the late 1990s), but only got seriously into the details of the Bitcoin protocol earlier this year. So I’d certainly appreciate corrections of any misapprehensions on my part. Also in the post I’ve included a number of “problems for the author” – notes to myself about questions that came up during the writing. You may find these interesting, but you can also skip them entirely without losing track of the main text.

First steps: a signed letter of intent

So how can we design a digital currency?

On the face of it, a digital currency sounds impossible. Suppose some person – let’s call her Alice – has some digital money which she wants to spend. If Alice can use a string of bits as money, how can we prevent her from using the same bit string over and over, thus minting an infinite supply of money? Or, if we can somehow solve that problem, how can we prevent someone else forging such a string of bits, and using that to steal from Alice?

These are just two of the many problems that must be overcome in order to use information as money.

As a first version of Infocoin, let’s find a way that Alice can use a string of bits as a (very primitive and incomplete) form of money, in a way that gives her at least some protection against forgery. Suppose Alice wants to give another person, Bob, an infocoin. To do this, Alice writes down the message “I, Alice, am giving Bob one infocoin”. She then digitally signs the message using a private cryptographic key, and announces the signed string of bits to the entire world.

(By the way, I’m using capitalized “Infocoin” to refer to the protocol and general concept, and lowercase “infocoin” to refer to specific denominations of the currency. A similar usage is common, though not universal, in the Bitcoin world.)

This isn’t terribly impressive as a prototype digital currency! But it does have some virtues. Anyone in the world (including Bob) can use Alice’s public key to verify that Alice really was the person who signed the message “I, Alice, am giving Bob one infocoin”. No-one else could have created that bit string, and so Alice can’t turn around and say “No, I didn’t mean to give Bob an infocoin”. So the protocol establishes that Alice truly intends to give Bob one infocoin. The same fact – no-one else could compose such a signed message – also gives Alice some limited protection from forgery. Of course, after Alice has published her message it’s possible for other people to duplicate the message, so in that sense forgery is possible. But it’s not possible from scratch. These two properties – establishment of intent on Alice’s part, and the limited protection from forgery – are genuinely notable features of this protocol.
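
To make the signing step concrete, here is a minimal Python sketch using the third-party ecdsa package (the key and variable names are illustrative, not part of any protocol; Bitcoin happens to use the same secp256k1 curve):

import ecdsa  # third-party package: pip install ecdsa

# Alice generates a key pair and publishes the public (verifying) key.
alice_private_key = ecdsa.SigningKey.generate(curve=ecdsa.SECP256k1)
alice_public_key = alice_private_key.get_verifying_key()

# Alice signs her message and announces (message, signature) to the world.
message = b"I, Alice, am giving Bob one infocoin"
signature = alice_private_key.sign(message)

# Anyone can now check the signature with Alice's public key;
# verify() raises ecdsa.BadSignatureError if the message or signature is altered.
assert alice_public_key.verify(signature, message)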

I haven’t (quite) said exactly what digital money is in this protocol. To make this explicit: it’s just the message itself, i.e., the string of bits representing the digitally signed message “I, Alice, am giving Bob one infocoin”. Later protocols will be similar, in that all our forms of digital money will be just more and more elaborate messages [1].

Using serial numbers to make coins uniquely identifiable

A problem with the first version of Infocoin is that Alice could keep sending Bob the same signed message over and over. Suppose Bob receives ten copies of the signed message “I, Alice, am giving Bob one infocoin”. Does that mean Alice sent Bob ten different infocoins? Was her message accidentally duplicated? Perhaps she was trying to trick Bob into believing that she had given him ten different infocoins, when the message only proves to the world that she intends to transfer one infocoin.

What we’d like is a way of making infocoins unique. They need a label or serial number. Alice would sign the message “I, Alice, am giving Bob one infocoin, with serial number 8740348”. Then, later, Alice could sign the message “I, Alice, am giving Bob one infocoin, with serial number 8770431”, and Bob (and everyone else) would know that a different infocoin was being transferred.

To make this scheme work we need a trusted source of serial numbers for the infocoins. One way to create such a source is to introduce a bank. This bank would provide serial numbers for infocoins, keep track of who has which infocoins, and verify that transactions really are legitimate.

In more detail, let’s suppose Alice goes into the bank, and says “I want to withdraw one infocoin from my account”. The bank reduces her account balance by one infocoin, and assigns her a new, never-before-used serial number, let’s say 1234567. Then, when Alice wants to transfer her infocoin to Bob, she signs the message “I, Alice, am giving Bob one infocoin, with serial number 1234567”. But Bob doesn’t just accept the infocoin. Instead, he contacts the bank, and verifies that: (a) the infocoin with that serial number belongs to Alice; and (b) Alice hasn’t already spent the infocoin. If both those things are true, then Bob tells the bank he wants to accept the infocoin, and the bank updates their records to show that the infocoin with that serial number is now in Bob’s possession, and no longer belongs to Alice.
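
As a sketch of that bookkeeping, here is a hypothetical toy bank in Python (the class and method names are made up for illustration; no real system works exactly this way):

class InfocoinBank:
    """Toy central bank: issues serial numbers and tracks coin ownership."""

    def __init__(self):
        self.next_serial = 1234567
        self.owner_of = {}   # serial number -> current owner
        self.balances = {}   # account holder -> infocoin balance

    def withdraw(self, account):
        """Debit an account and issue a coin with a fresh serial number."""
        assert self.balances.get(account, 0) >= 1, "insufficient funds"
        self.balances[account] -= 1
        serial = self.next_serial
        self.next_serial += 1
        self.owner_of[serial] = account
        return serial

    def transfer(self, serial, sender, recipient):
        """Checks (a) and (b) above, then records the new owner."""
        assert self.owner_of.get(serial) == sender, "not owned by sender (or already spent)"
        self.owner_of[serial] = recipient

bank = InfocoinBank()
bank.balances["Alice"] = 1
serial = bank.withdraw("Alice")          # e.g. 1234567
bank.transfer(serial, "Alice", "Bob")    # Bob asks the bank to record the transfer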

Making everyone collectively the bank

This last solution looks pretty promising. However, it turns out that we can do something much more ambitious. We can eliminate the bank entirely from the protocol. This changes the nature of the currency considerably. It means that there is no longer any single organization in charge of the currency. And when you think about the enormous power a central bank has – control over the money supply – that’s a pretty huge change.

The idea is to make it so everyone (collectively) is the bank. In particular, we’ll assume that everyone using Infocoin keeps a complete record of which infocoins belong to which person. You can think of this as a shared public ledger showing all Infocoin transactions. We’ll call this ledger the block chain, since that’s what the complete record will be called in Bitcoin, once we get to it.

Now, suppose Alice wants to transfer an infocoin to Bob. She signs the message “I, Alice, am giving Bob one infocoin, with serial number 1234567”, and gives the signed message to Bob. Bob can use his copy of the block chain to check that, indeed, the infocoin is Alice’s to give. If that checks out then he broadcasts both Alice’s message and his acceptance of the transaction to the entire network, and everyone updates their copy of the block chain.

We still have the “where do serial numbers come from” problem, but that turns out to be pretty easy to solve, and so I will defer it to later, in the discussion of Bitcoin. A more challenging problem is that this protocol allows Alice to cheat by double spending her infocoin. She sends the signed message “I, Alice, am giving Bob one infocoin, with serial number 1234567” to Bob, and the message “I, Alice, am giving Charlie one infocoin, with [the same] serial number 1234567” to Charlie. Both Bob and Charlie use their copy of the block chain to verify that the infocoin is Alice’s to spend. Provided they do this verification at nearly the same time (before they’ve had a chance to hear from one another), both will find that, yes, the block chain shows the coin belongs to Alice. And so they will both accept the transaction, and also broadcast their acceptance of the transaction. Now there’s a problem. How should other people update their block chains? There may be no easy way to achieve a consistent shared ledger of transactions. And even if everyone can agree on a consistent way to update their block chains, there is still the problem that either Bob or Charlie will be cheated.

At first glance double spending seems difficult for Alice to pull off. After all, if Alice sends the message first to Bob, then Bob can verify the message, and tell everyone else in the network (including Charlie) to update their block chain. Once that has happened, Charlie would no longer be fooled by Alice. So there is most likely only a brief period of time in which Alice can double spend. However, it’s obviously undesirable to have any such period of time. Worse, there are techniques Alice could use to make that period longer. She could, for example, use network traffic analysis to find times when Bob and Charlie are likely to have a lot of latency in communication. Or perhaps she could do something to deliberately disrupt their communications. If she can slow communication even a little that makes her task of double spending much easier.

How can we address the problem of double spending? The obvious solution is that when Alice sends Bob an infocoin, Bob shouldn’t try to verify the transaction alone. Rather, he should broadcast the possible transaction to the entire network of Infocoin users, and ask them to help determine whether the transaction is legitimate. If they collectively decide that the transaction is okay, then Bob can accept the infocoin, and everyone will update their block chain. This type of protocol can help prevent double spending, since if Alice tries to spend her infocoin with both Bob and Charlie, other people on the network will notice, and network users will tell both Bob and Charlie that there is a problem with the transaction, and the transaction shouldn’t go through.

In more detail, let’s suppose Alice wants to give Bob an infocoin. As before, she signs the message “I, Alice, am giving Bob one infocoin, with serial number 1234567”, and gives the signed message to Bob. Also as before, Bob does a sanity check, using his copy of the block chain to check that, indeed, the coin currently belongs to Alice. But at that point the protocol is modified. Bob doesn’t just go ahead and accept the transaction. Instead, he broadcasts Alice’s message to the entire network. Other members of the network check to see whether Alice owns that infocoin. If so, they broadcast the message “Yes, Alice owns infocoin 1234567, it can now be transferred to Bob.” Once enough people have broadcast that message, everyone updates their block chain to show that infocoin 1234567 now belongs to Bob, and the transaction is complete.

This protocol has many imprecise elements at present. For instance, what does it mean to say “once enough people have broadcast that message”? What exactly does “enough” mean here? It can’t mean everyone in the network, since we don’t a priori know who is on the Infocoin network. For the same reason, it can’t mean some fixed fraction of users in the network. We won’t try to make these ideas precise right now. Instead, in the next section I’ll point out a serious problem with the approach as described. Fixing that problem will at the same time have the pleasant side effect of making the ideas above much more precise.

Proof-of-work

Suppose Alice wants to double spend in the network-based protocol I just described. She could do this by taking over the Infocoin network. Let’s suppose she uses an automated system to set up a large number of separate identities, let’s say a billion, on the Infocoin network. As before, she tries to double spend the same infocoin with both Bob and Charlie. But when Bob and Charlie ask the network to validate their respective transactions, Alice’s sock puppet identities swamp the network, announcing to Bob that they’ve validated his transaction, and to Charlie that they’ve validated his transaction, possibly fooling one or both into accepting the transaction.

There’s a clever way of avoiding this problem, using an idea known as proof-of-work. The idea is counterintuitive and involves a combination of two ideas: (1) to (artificially) make it computationally costly for network users to validate transactions; and (2) to reward them for trying to help validate transactions. The reward is used so that people on the network will try to help validate transactions, even though that’s now been made a computationally costly process. The benefit of making it costly to validate transactions is that validation can no longer be influenced by the number of network identities someone controls, but only by the total computational power they can bring to bear on validation. As we’ll see, with some clever design we can make it so a cheater would need enormous computational resources to cheat, making it impractical.

That’s the gist of proof-of-work. But to really understand proof-of-work, we need to go through the details.

Suppose Alice broadcasts to the network the news that “I, Alice, am giving Bob one infocoin, with serial number 1234567”.

As other people on the network hear that message, each adds it to a queue of pending transactions that they’ve been told about, but which haven’t yet been approved by the network. For instance, another network user named David might have the following queue of pending transactions:

I, Tom, am giving Sue one infocoin, with serial number 1201174.

I, Sydney, am giving Cynthia one infocoin, with serial number 1295618.

I, Alice, am giving Bob one infocoin, with serial number 1234567.

David checks his copy of the block chain, and can see that each transaction is valid. He would like to help out by broadcasting news of that validity to the entire network.

However, before doing that, as part of the validation protocol David is required to solve a hard computational puzzle – the proof-of-work. Without the solution to that puzzle, the rest of the network won’t accept his validation of the transaction.

What puzzle does David need to solve? To explain that, let h be a fixed hash function known by everyone in the network – it’s built into the protocol. Bitcoin uses the well-known SHA-256 hash function, but any cryptographically secure hash function will do. Let’s give David’s queue of pending transactions a label, l, just so it’s got a name we can refer to. Suppose David appends a number x (called the nonce) to l and hashes the combination. For example, if we use l = “Hello, world!” (obviously this is not a list of transactions, just a string used for illustrative purposes) and the nonce x = 0 then (output is in hexadecimal)

h("Hello, world!0") = 
  1312af178c253f84028d480a6adc1e25e81caa44c749ec81976192e2ec934c64

The puzzle David has to solve – the proof-of-work – is to find a nonce x such that when we append x to l and hash the combination the output hash begins with a long run of zeroes. The puzzle can be made more or less difficult by varying the number of zeroes required to solve the puzzle. A relatively simple proof-of-work puzzle might require just three or four zeroes at the start of the hash, while a more difficult proof-of-work puzzle might require a much longer run of zeros, say 15 consecutive zeroes. In either case, the above attempt to find a suitable nonce, with x = 0, is a failure, since the output doesn’t begin with any zeroes at all. Trying x = 1 doesn’t work either:

h("Hello, world!1") = 
  e9afc424b79e4f6ab42d99c81156d3a17228d6e1eef4139be78e948a9332a7d8

We can keep trying different values for the nonce, x = 2, 3,\ldots. Finally, at x = 4250 we obtain:

h("Hello, world!4250") = 
  0000c3af42fc31103f1fdc0151fa747ff87349a4714df7cc52ea464e12dcd4e9

This nonce gives us a string of four zeroes at the beginning of the output of the hash. This will be enough to solve a simple proof-of-work puzzle, but not enough to solve a more difficult proof-of-work puzzle.
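
You can reproduce these numbers with a few lines of Python; the brute-force search below is only a sketch of the idea, not Bitcoin's actual block format:

import hashlib

def h(s):
    """SHA-256 of a string, returned as a hexadecimal string."""
    return hashlib.sha256(s.encode()).hexdigest()

def find_nonce(l, zeroes):
    """Find the smallest nonce x such that h(l + str(x)) starts with `zeroes` zeroes."""
    x = 0
    while not h(l + str(x)).startswith("0" * zeroes):
        x += 1
    return x

print(h("Hello, world!0"))             # 1312af17... (no leading zeroes, so x = 0 fails)
print(find_nonce("Hello, world!", 4))  # 4250
print(h("Hello, world!4250"))          # 0000c3af... (four leading zeroes)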

What makes this puzzle hard to solve is the fact that the output from a cryptographic hash function behaves like a random number: change the input even a tiny bit and the output from the hash function changes completely, in a way that’s hard to predict. So if we want the output hash value to begin with 10 zeroes, say, then David will need, on average, to try 16^{10} \approx 10^{12} different values for x before he finds a suitable nonce. That’s a pretty challenging task, requiring lots of computational power.

Obviously, it’s possible to make this puzzle more or less difficult to solve by requiring more or fewer zeroes in the output from the hash function. In fact, the Bitcoin protocol gets quite a fine level of control over the difficulty of the puzzle, by using a slight variation on the proof-of-work puzzle described above. Instead of requiring leading zeroes, the Bitcoin proof-of-work puzzle requires the hash of a block’s header to be lower than or equal to a number known as the target. This target is automatically adjusted to ensure that a Bitcoin block takes, on average, about ten minutes to validate.
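
A rough sketch of that variant in Python: treat the hash as a 256-bit integer and compare it to the target. The target value below is made up purely for illustration, and real Bitcoin hashes the block header twice with SHA-256:

TARGET = int("0001" + "f" * 60, 16)   # hypothetical target, not a real Bitcoin value

def meets_target(header_hash_hex, target=TARGET):
    """A hash 'wins' if, read as an integer, it is at most the target."""
    return int(header_hash_hex, 16) <= target

# Lowering the target makes wins rarer, so the difficulty can be tuned very finely.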

(In practice there is a sizeable randomness in how long it takes to validate a block – sometimes a new block is validated in just a minute or two, other times it may take 20 minutes or even longer. It’s straightforward to modify the Bitcoin protocol so that the time to validation is much more sharply peaked around ten minutes. Instead of solving a single puzzle, we can require that multiple puzzles be solved; with some careful design it is possible to considerably reduce the variance in the time to mine a block of transactions.)

Alright, let’s suppose David is lucky and finds a suitable nonce, x. Celebration! (He’ll be rewarded for finding the nonce, as described below). He broadcasts the block of transactions he’s approving to the network, together with the value for x. Other participants in the Infocoin network can verify that x is a valid solution to the proof-of-work puzzle. And they then update their block chains to include the new block of transactions.

For the proof-of-work idea to have any chance of succeeding, network users need an incentive to help validate transactions. Without such an incentive, they have no reason to expend valuable computational power, merely to help validate other people’s transactions. And if network users are not willing to expend that power, then the whole system won’t work. The solution to this problem is to reward people who help validate transactions. In particular, suppose we reward whoever successfully validates a block of transactions by crediting them with some infocoins. Provided the infocoin reward is large enough, that will give them an incentive to participate in validation.

In the Bitcoin protocol, this validation process is called mining. For each block of transactions validated, the successful miner receives a bitcoin reward. Initially, this was set to be a 50 bitcoin reward. But for every 210,000 validated blocks (roughly, once every four years) the reward halves. This has happened just once, to date, and so the current reward for mining a block is 25 bitcoins. This halving in the rate will continue every four years until the year 2140 CE. At that point, the reward for mining will drop below 10^{-8} bitcoins per block. 10^{-8} bitcoins is actually the minimal unit of Bitcoin, and is known as a satoshi. So in 2140 CE the total supply of bitcoins will cease to increase. However, that won’t eliminate the incentive to help validate transactions. Bitcoin also makes it possible to set aside some currency in a transaction as a transaction fee, which goes to the miner who helps validate it. In the early days of Bitcoin transaction fees were mostly set to zero, but as Bitcoin has gained in popularity, transaction fees have gradually risen, and are now a substantial additional incentive on top of the 25 bitcoin reward for mining a block.

You can think of proof-of-work as a competition to approve transactions. Each entry in the competition costs a little bit of computing power. A miner’s chance of winning the competition is (roughly, and with some caveats) equal to the proportion of the total computing power that they control. So, for instance, if a miner controls one percent of the computing power being used to validate Bitcoin transactions, then they have roughly a one percent chance of winning the competition. So provided a lot of computing power is being brought to bear on the competition, a dishonest miner is likely to have only a relatively small chance to corrupt the validation process, unless they expend a huge amount of computing resources.

Of course, while it’s encouraging that a dishonest party has only a relatively small chance to corrupt the block chain, that’s not enough to give us confidence in the currency. In particular, we haven’t yet conclusively addressed the issue of double spending.

I’ll analyse double spending shortly. Before doing that, I want to fill in an important detail in the description of Infocoin. We’d ideally like the Infocoin network to agree upon the order in which transactions have occurred. If we don’t have such an ordering then at any given moment it may not be clear who owns which infocoins. To help do this we’ll require that new blocks always include a pointer to the last block validated in the chain, in addition to the list of transactions in the block. (The pointer is actually just a hash of the previous block). So typically the block chain is just a linear chain of blocks of transactions, one after the other, with later blocks each containing a pointer to the immediately prior block:
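
A minimal sketch of that linking in Python (the field names are illustrative, not Bitcoin's real block format):

import hashlib, json

def block_hash(block):
    """Hash of a block's contents -- a toy stand-in for Bitcoin's header hash."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

genesis = {"prev_hash": None, "transactions": ["initial money supply"]}
block1 = {"prev_hash": block_hash(genesis),
          "transactions": ["I, Alice, am giving Bob one infocoin, with serial number 1234567"]}
block2 = {"prev_hash": block_hash(block1),
          "transactions": ["I, Tom, am giving Sue one infocoin, with serial number 1201174"]}
# Altering anything in an earlier block changes its hash and breaks every later link.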

Occasionally, a fork will appear in the block chain. This can happen, for instance, if by chance two miners happen to validate a block of transactions near-simultaneously – both broadcast their newly-validated block out to the network, and some people update their block chain one way, and others update their block chain the other way:

This causes exactly the problem we’re trying to avoid – it’s no longer clear in what order transactions have occurred, and it may not be clear who owns which infocoins. Fortunately, there’s a simple idea that can be used to remove any forks. The rule is this: if a fork occurs, people on the network keep track of both forks. But at any given time, miners only work to extend whichever fork is longest in their copy of the block chain.

Suppose, for example, that we have a fork in which some miners receive block A first, and some miners receive block B first. Those miners who receive block A first will continue mining along that fork, while the others will mine along fork B. Let’s suppose that the miners working on fork B are the next to successfully mine a block:

After they receive news that this has happened, the miners working on fork A will notice that fork B is now longer, and will switch to working on that fork. Presto, in short order work on fork A will cease, and everyone will be working on the same linear chain, and block A can be ignored. Of course, any still-pending transactions in A will still be pending in the queues of the miners working on fork B, and so all transactions will eventually be validated.

Likewise, it may be that the miners working on fork A are the first to extend their fork. In that case work on fork B will quickly cease, and again we have a single linear chain.

No matter what the outcome, this process ensures that the block chain has an agreed-upon time ordering of the blocks. In Bitcoin proper, a transaction is not considered confirmed until: (1) it is part of a block in the longest fork, and (2) at least 5 blocks follow it in the longest fork. In this case we say that the transaction has “6 confirmations”. This gives the network time to come to an agreed-upon ordering of the blocks. We’ll also use this strategy for Infocoin.

With the time-ordering now understood, let’s return to think about what happens if a dishonest party tries to double spend. Suppose Alice tries to double spend with Bob and Charlie. One possible approach is for her to try to validate a block that includes both transactions. Assuming she has one percent of the computing power, she will occasionally get lucky and validate the block by solving the proof-of-work. Unfortunately for Alice, the double spending will be immediately spotted by other people in the Infocoin network and rejected, despite solving the proof-of-work problem. So that’s not something we need to worry about.

A more serious problem occurs if she broadcasts two separate transactions in which she spends the same infocoin with Bob and Charlie, respectively. She might, for example, broadcast one transaction to a subset of the miners, and the other transaction to another set of miners, hoping to get both transactions validated in this way. Fortunately, in this case, as we’ve seen, the network will eventually confirm one of these transactions, but not both. So, for instance, Bob’s transaction might ultimately be confirmed, in which case Bob can go ahead confidently. Meanwhile, Charlie will see that his transaction has not been confirmed, and so will decline Alice’s offer. So this isn’t a problem either. In fact, knowing that this will be the case, there is little reason for Alice to try this in the first place.

An important variant on double spending is if Alice = Bob, i.e., Alice tries to spend a coin with Charlie which she is also “spending” with herself (i.e., giving back to herself). This sounds like it ought to be easy to detect and deal with, but, of course, it’s easy on a network to set up multiple identities associated with the same person or organization, so this possibility needs to be considered. In this case, Alice’s strategy is to wait until Charlie accepts the infocoin, which happens after the transaction has been confirmed 6 times in the longest chain. She will then attempt to fork the chain before the transaction with Charlie, adding a block which includes a transaction in which she pays herself:

Unfortunately for Alice, it’s now very difficult for her to catch up with the longer fork. Other miners won’t want to help her out, since they’ll be working on the longer fork. And unless Alice is able to solve the proof-of-work at least as fast as everyone else in the network combined – roughly, that means controlling more than fifty percent of the computing power – then she will just keep falling further and further behind. Of course, she might get lucky. We can, for example, imagine a scenario in which Alice controls one percent of the computing power, but happens to get lucky and finds six extra blocks in a row, before the rest of the network has found any extra blocks. In this case, she might be able to get ahead, and get control of the block chain. But this particular event will occur with probability 1/100^6 = 10^{-12}. A more general analysis along these lines shows that Alice’s probability of ever catching up is infinitesimal, unless she is able to solve proof-of-work puzzles at a rate approaching all other miners combined.

Of course, this is not a rigorous security analysis showing that Alice cannot double spend. It’s merely an informal plausibility argument. The original paper introducing Bitcoin did not, in fact, contain a rigorous security analysis, only informal arguments along the lines I’ve presented here. The security community is still analysing Bitcoin, and trying to understand possible vulnerabilities. You can see some of this research listed here, and I mention a few related problems in the “Problems for the author” below. At this point I think it’s fair to say that the jury is still out on how secure Bitcoin is.

The proof-of-work and mining ideas give rise to many questions. How much reward is enough to persuade people to mine? How does the change in supply of infocoins affect the Infocoin economy? Will Infocoin mining end up concentrated in the hands of a few, or many? If it’s just a few, doesn’t that endanger the security of the system? Presumably transaction fees will eventually equilibrate – won’t this introduce an unwanted source of friction, and make small transactions less desirable? These are all great questions, but beyond the scope of this post. I may come back to the questions (in the context of Bitcoin) in a future post. For now, we’ll stick to our focus on understanding how the Bitcoin protocol works.

Problems for the author

  • I don’t understand why double spending can’t be prevented in a simpler manner using two-phase commit. Suppose Alice tries to double spend an infocoin with both Bob and Charlie. The idea is that Bob and Charlie would each broadcast their respective messages to the Infocoin network, along with a request: “Should I accept this?” They’d then wait some period – perhaps ten minutes – to hear any naysayers who could prove that Alice was trying to double spend. If no such nays are heard (and provided there are no signs of attempts to disrupt the network), they’d then accept the transaction. This protocol needs to be hardened against network attacks, but it seems to me to be the core of a good alternate idea. How well does this work? What drawbacks and advantages does it have compared to the full Bitcoin protocol?
  • Early in the section I mentioned that there is a natural way of reducing the variance in time required to validate a block of transactions. If that variance is reduced too much, then it creates an interesting attack possibility. Suppose Alice tries to fork the chain in such a way that: (a) one fork starts with a block in which Alice pays herself, while the other fork starts with a block in which Alice pays Bob; (b) both blocks are announced nearly simultaneously, so roughly half the miners will attempt to mine each fork; (c) Alice uses her mining power to try to keep the forks of roughly equal length, mining whichever fork is shorter – this is ordinarily hard to pull off, but becomes significantly easier if the standard deviation of the time-to-validation is much shorter than the network latency; (d) after 5 blocks have been mined on both forks, Alice throws her mining power into making it more likely that Bob’s transaction is confirmed; and (e) after confirmation of Bob’s transaction, she then throws her computational power into the other fork, and attempts to regain the lead. This balancing strategy will have only a small chance of success. But while the probability is small, it will certainly be much larger than in the standard protocol, with high variance in the time to validate a block. Is there a way of avoiding this problem?
  • Suppose Bitcoin mining software always explored nonces starting with x = 0, then x = 1, x = 2,\ldots. If this is done by all (or even just a substantial fraction) of Bitcoin miners then it creates a vulnerability. Namely, it’s possible for someone to improve their odds of solving the proof-of-work merely by starting with some other (much larger) nonce. More generally, it may be possible for attackers to exploit any systematic patterns in the way miners explore the space of nonces. More generally still, in the analysis of this section I have implicitly assumed a kind of symmetry between different miners. In practice, there will be asymmetries and a thorough security analysis will need to take account of those asymmetries.

Bitcoin

Let’s move away from Infocoin, and describe the actual Bitcoin protocol. There are a few new ideas here, but with one exception (discussed below) they’re mostly obvious modifications to Infocoin.

To use Bitcoin in practice, you first install a wallet program on your computer. To give you a sense of what that means, here’s a screenshot of a wallet called MultiBit. You can see the Bitcoin balance on the left — 0.06555555 Bitcoins, or about 70 dollars at the exchange rate on the day I took this screenshot — and on the right two recent transactions, which deposited those 0.06555555 Bitcoins:

Suppose you’re a merchant who has set up an online store, and you’ve decided to allow people to pay using Bitcoin. What you do is tell your wallet program to generate a Bitcoin address. In response, it will generate a public / private key pair, and then hash the public key to form your Bitcoin address:

You then send your Bitcoin address to the person who wants to buy from you. You could do this in email, or even put the address up publicly on a webpage. This is safe, since the address is merely a hash of your public key, which can safely be known by the world anyway. (I’ll return later to the question of why the Bitcoin address is a hash, and not just the public key.)
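
A rough sketch of that hashing step in Python. A real Bitcoin address is the RIPEMD-160 hash of the SHA-256 hash of the public key, wrapped in a Base58Check encoding with a version byte and checksum; the snippet below shows only the hash-of-public-key core, and assumes your hashlib build exposes RIPEMD-160 (most OpenSSL-backed builds do):

import hashlib

def hash160(public_key_bytes):
    """SHA-256 followed by RIPEMD-160 -- the core of a Bitcoin address."""
    sha = hashlib.sha256(public_key_bytes).digest()
    return hashlib.new("ripemd160", sha).hexdigest()

# public_key_bytes would be the serialized public key produced by your wallet program.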

The person who is going to pay you then generates a transaction. Let’s take a look at the data from an actual transaction transferring 0.31900000 bitcoins. What’s shown below is very nearly the raw data. It’s changed in three ways: (1) the data has been deserialized; (2) line numbers have been added, for ease of reference; and (3) I’ve abbreviated various hashes and public keys, just putting in the first six hexadecimal digits of each, when in reality they are much longer. Here’s the data:

1.  {"hash":"7c4025...",
2.  "ver":1,
3.  "vin_sz":1,
4.  "vout_sz":1,
5.  "lock_time":0,
6.  "size":224,
7.  "in":[
8.    {"prev_out":
9.      {"hash":"2007ae...",
10.      "n":0},
11.    "scriptSig":"304502... 042b2d..."}],
12. "out":[
13.   {"value":"0.31900000",
14.    "scriptPubKey":"OP_DUP OP_HASH160 a7db6f OP_EQUALVERIFY OP_CHECKSIG"}]}

Let’s go through this, line by line.

Line 1 contains the hash of the remainder of the transaction, 7c4025..., expressed in hexadecimal. This is used as an identifier for the transaction.

Line 2 tells us that this is a transaction in version 1 of the Bitcoin protocol.

Lines 3 and 4 tell us that the transaction has one input and one output, respectively. I’ll talk below about transactions with more inputs and outputs, and why that’s useful.

Line 5 contains the value for lock_time, which can be used to control when a transaction is finalized. For most Bitcoin transactions being carried out today the lock_time is set to 0, which means the transaction is finalized immediately.

Line 6 tells us the size (in bytes) of the transaction. Note that it’s not the monetary amount being transferred! That comes later.

Lines 7 through 11 define the input to the transaction. In particular, lines 8 through 10 tell us that the input is to be taken from the output from an earlier transaction, with the given hash, which is expressed in hexadecimal as 2007ae.... The n=0 tells us it’s to be the first output from that transaction; we’ll see soon how multiple outputs (and inputs) from a transaction work, so don’t worry too much about this for now. Line 11 contains the signature of the person sending the money, 304502..., followed by a space, and then the corresponding public key, 042b2d.... Again, these are both in hexadecimal.

One thing to note about the input is that there’s nothing explicitly specifying how many bitcoins from the previous transaction should be spent in this transaction. In fact, all the bitcoins from the n=0th output of the previous transaction are spent. So, for example, if the n=0th output of the earlier transaction was 2 bitcoins, then 2 bitcoins will be spent in this transaction. This seems like an inconvenient restriction – like trying to buy bread with a 20 dollar note, and not being able to break the note down. The solution, of course, is to have a mechanism for providing change. This can be done using transactions with multiple inputs and outputs, which we’ll discuss in the next section.

Lines 12 through 14 define the output from the transaction. In particular, line 13 tells us the value of the output, 0.319 bitcoins. Line 14 is somewhat complicated. The main thing to note is that the string a7db6f... is the Bitcoin address of the intended recipient of the funds (written in hexadecimal). In fact, Line 14 is actually an expression in Bitcoin’s scripting language. I’m not going to describe that language in detail in this post; the important thing to take away now is just that a7db6f... is the Bitcoin address.

You can now see, by the way, how Bitcoin addresses the question I swept under the rug in the last section: where do Bitcoin serial numbers come from? In fact, the role of the serial number is played by transaction hashes. In the transaction above, for example, the recipient is receiving 0.319 Bitcoins, which come out of the first output of an earlier transaction with hash 2007ae... (line 9). If you go and look in the block chain for that transaction, you’d see that its output comes from a still earlier transaction. And so on.

There are two clever things about using transaction hashes instead of serial numbers. First, in Bitcoin there aren’t really any separate, persistent “coins” at all, just a long series of transactions in the block chain. It’s a clever idea to realize that you don’t need persistent coins, and can just get by with a ledger of transactions. Second, by operating in this way we remove the need for any central authority issuing serial numbers. Instead, the serial numbers can be self-generated, merely by hashing the transaction.
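
As a sketch, that self-generated identifier is just a hash of the transaction bytes. Bitcoin in fact applies SHA-256 twice to the serialized transaction (and conventionally displays the result in reversed byte order, a detail ignored here):

import hashlib

def txid(serialized_transaction_bytes):
    """Double SHA-256 of a serialized transaction, the basis of a Bitcoin txid."""
    first = hashlib.sha256(serialized_transaction_bytes).digest()
    return hashlib.sha256(first).hexdigest()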

In fact, it’s possible to keep following the chain of transactions further back in history. Ultimately, this process must terminate. This can happen in one of two ways. The first possibility is that you’ll arrive at the very first Bitcoin transaction, contained in the so-called Genesis block. This is a special transaction, having no inputs, but a 50 Bitcoin output. In other words, this transaction establishes an initial money supply. The Genesis block is treated separately by Bitcoin clients, and I won’t get into the details here, although it’s along similar lines to the transaction above. You can see the deserialized raw data here, and read about the Genesis block here.

The second possibility when you follow a chain of transactions back in time is that eventually you’ll arrive at a so-called coinbase transaction. With the exception of the Genesis block, every block of transactions in the block chain starts with a special coinbase transaction. This is the transaction rewarding the miner who validated that block of transactions. It uses a similar but not identical format to the transaction above. I won’t go through the format in detail, but if you want to see an example, see here. You can read a little more about coinbase transactions here.

Something I haven’t been precise about above is what exactly is being signed by the digital signature in line 11. The obvious thing to do is for the payer to sign the whole transaction (apart from the transaction hash, which, of course, must be generated later). Currently, this is not what is done – some pieces of the transaction are omitted. This makes some pieces of the transaction malleable, i.e., they can be changed later. However, this malleability does not include the amounts being paid out, senders and recipients, which can’t be changed later. I must admit I haven’t dug down into the details here. I gather that this malleability is under discussion in the Bitcoin developer community, and there are efforts afoot to reduce or eliminate this malleability.

Transactions with multiple inputs and outputs

In the last section I described how a transaction with a single input and a single output works. In practice, it’s often extremely convenient to create Bitcoin transactions with multiple inputs or multiple outputs. I’ll talk below about why this can be useful. But first let’s take a look at the data from an actual transaction:

1. {"hash":"993830...",
2. "ver":1,
3. "vin_sz":3,
4.  "vout_sz":2,
5.  "lock_time":0,
6.  "size":552,
7.  "in":[
8.    {"prev_out":{
9.      "hash":"3beabc...",
10.        "n":0},
11.     "scriptSig":"304402... 04c7d2..."},
12.    {"prev_out":{
13.        "hash":"fdae9b...",
14.        "n":0},
15.      "scriptSig":"304502... 026e15..."},
16.    {"prev_out":{
17.        "hash":"20c86b...",
18.        "n":1},
19.      "scriptSig":"304402... 038a52..."}],
20.  "out":[
21.    {"value":"0.01068000",
22.      "scriptPubKey":"OP_DUP OP_HASH160 e8c306... OP_EQUALVERIFY OP_CHECKSIG"},
23.    {"value":"4.00000000",
24.      "scriptPubKey":"OP_DUP OP_HASH160 d644e3... OP_EQUALVERIFY OP_CHECKSIG"}]}

Let’s go through the data, line by line. It’s very similar to the single-input-single-output transaction, so I’ll do this pretty quickly.

Line 1 contains the hash of the remainder of the transaction. This is used as an identifier for the transaction.

Line 2 tells us that this is a transaction in version 1 of the Bitcoin protocol.

Lines 3 and 4 tell us that the transaction has three inputs and two outputs, respectively.

Line 5 contains the lock_time. As in the single-input-single-output case this is set to 0, which means the transaction is finalized immediately.

Line 6 tells us the size of the transaction in bytes.

Lines 7 through 19 define a list of the inputs to the transaction. Each corresponds to an output from a previous Bitcoin transaction.

The first input is defined in lines 8 through 11.

In particular, lines 8 through 10 tell us that the input is to be taken from the n=0th output from the transaction with hash 3beabc.... Line 11 contains the signature, followed by a space, and then the public key of the person sending the bitcoins.

Lines 12 through 15 define the second input, with a similar format to lines 8 through 11. And lines 16 through 19 define the third input.

Lines 20 through 24 define a list containing the two outputs from the transaction.

The first output is defined in lines 21 and 22. Line 21 tells us the value of the output, 0.01068000 bitcoins. As before, line 22 is an expression in Bitcoin’s scripting language. The main thing to take away here is that the string e8c306... is the Bitcoin address of the intended recipient of the funds.

The second output is defined in lines 23 and 24, with a similar format to the first output.

One apparent oddity in this description is that although each output has a Bitcoin value associated to it, the inputs do not. Of course, the values of the respective inputs can be found by consulting the corresponding outputs in earlier transactions. In a standard Bitcoin transaction, the sum of all the inputs in the transaction must be at least as much as the sum of all the outputs. (The only exceptions to this principle are the Genesis block and coinbase transactions, both of which add to the overall Bitcoin supply.) If the inputs sum up to more than the outputs, then the excess is used as a transaction fee. This is paid to whichever miner successfully validates the block which the current transaction is a part of.
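
In other words, the fee is implicit rather than stated in the transaction. A small sketch of that bookkeeping in Python, using satoshis (1 bitcoin = 100,000,000 satoshis) to avoid floating-point issues; the input amounts below are hypothetical lookups, not data from the transaction above:

input_values = [200_000_000, 150_000_000, 52_000_000]   # hypothetical amounts of the three prev_out entries
output_values = [1_068_000, 400_000_000]                # 0.01068000 and 4.00000000 BTC, from the "out" list

fee = sum(input_values) - sum(output_values)
assert fee >= 0, "outputs must not exceed inputs"
print(fee / 100_000_000)   # 0.00932 BTC would go to the miner of the block containing this transaction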

That’s all there is to multiple-input-multiple-output transactions! They’re a pretty simple variation on single-input-single-output transactions.

One nice application of multiple-input-multiple-output transactions is the idea of change. Suppose, for example, that I want to send you 0.15 bitcoins. I can do so by spending money from a previous transaction in which I received 0.2 bitcoins. Of course, I don’t want to send you the entire 0.2 bitcoins. The solution is to send you 0.15 bitcoins, and to send 0.05 bitcoins to a Bitcoin address which I own. Those 0.05 bitcoins are the change. Of course, it differs a little from the change you might receive in a store, since change in this case is what you pay yourself. But the broad idea is similar.
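
A sketch of that arithmetic, again in satoshis and with placeholder addresses (and ignoring any transaction fee, which in practice would come out of the change):

previous_output = 20_000_000         # the 0.2 BTC output being spent
payment = 15_000_000                 # 0.15 BTC to you
change = previous_output - payment   # 5_000_000 satoshis (0.05 BTC) back to me

outputs = [
    {"value": payment, "address": "your_bitcoin_address"},  # placeholder address
    {"value": change, "address": "my_change_address"},      # placeholder address
]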

Conclusion

That completes a basic description of the main ideas behind Bitcoin. Of course, I’ve omitted many details – this isn’t a formal specification. But I have described the main ideas behind the most common use cases for Bitcoin.

While the rules of Bitcoin are simple and easy to understand, that doesn’t mean that it’s easy to understand all the consequences of the rules. There is vastly more that could be said about Bitcoin, and I’ll investigate some of these issues in future posts.

For now, though, I’ll wrap up by addressing a few loose ends.

How anonymous is Bitcoin? Many people claim that Bitcoin can be used anonymously. This claim has led to the formation of marketplaces such as Silk Road (and various successors), which specialize in illegal goods. However, the claim that Bitcoin is anonymous is a myth. The block chain is public, meaning that it’s possible for anyone to see every Bitcoin transaction ever. Although Bitcoin addresses aren’t immediately associated to real-world identities, computer scientists have done a great deal of work figuring out how to de-anonymize “anonymous” social networks. The block chain is a marvellous target for these techniques. I will be extremely surprised if the great majority of Bitcoin users are not identified with relatively high confidence and ease in the near future. The confidence won’t be high enough to achieve convictions, but will be high enough to identify likely targets. Furthermore, identification will be retrospective, meaning that someone who bought drugs on Silk Road in 2011 will still be identifiable on the basis of the block chain in, say, 2020. These de-anonymization techniques are well known to computer scientists, and, one presumes, therefore to the NSA. I would not be at all surprised if the NSA and other agencies have already de-anonymized many users. It is, in fact, ironic that Bitcoin is often touted as anonymous. It’s not. Bitcoin is, instead, perhaps the most open and transparent financial instrument the world has ever seen.

Can you get rich with Bitcoin? Well, maybe. Tim O’Reilly once said: “Money is like gas in the car – you need to pay attention or you’ll end up on the side of the road – but a well-lived life is not a tour of gas stations!” Much of the interest in Bitcoin comes from people whose life mission seems to be to find a really big gas station. I must admit I find this perplexing. What is, I believe, much more interesting and enjoyable is to think of Bitcoin and other cryptocurrencies as a way of enabling new forms of collective behaviour. That’s intellectually fascinating, offers marvellous creative possibilities, is socially valuable, and may just also put some money in the bank. But if money in the bank is your primary concern, then I believe that other strategies are much more likely to succeed.

Details I’ve omitted: Although this post has described the main ideas behind Bitcoin, there are many details I haven’t mentioned. One is a nice space-saving trick used by the protocol, based on a data structure known as a Merkle tree. It’s a detail, but a splendid detail, and worth checking out if fun data structures are your thing. You can get an overview in the original Bitcoin paper. Second, I’ve said little about the Bitcoin network – questions like how the network deals with denial of service attacks, how nodes join and leave the network, and so on. This is a fascinating topic, but it’s also something of a mess of details, and so I’ve omitted it. You can read more about it at some of the links above.

Bitcoin scripting: In this post I’ve explained Bitcoin as a form of digital, online money. But this is only a small part of a much bigger and more interesting story. As we’ve seen, every Bitcoin transaction is associated to a script in the Bitcoin programming language. The scripts we’ve seen in this post describe simple transactions like “Alice gave Bob 10 bitcoins”. But the scripting language can also be used to express far more complicated transactions. To put it another way, Bitcoin is programmable money. In later posts I will explain the scripting system, and how it is possible to use Bitcoin scripting as a platform to experiment with all sorts of amazing financial instruments.

Footnote

[1] In the United States the question “Is money a form of speech?” is an important legal question, because of the protection afforded speech under the US Constitution. In my (legally uninformed) opinion digital money may make this issue more complicated. As we’ll see, the Bitcoin protocol is really a way of standing up before the rest of the world (or at least the rest of the Bitcoin network) and avowing “I’m going to give such-and-such a number of bitcoins to so-and-so a person” in a way that’s extremely difficult to repudiate. At least naively, it looks more like speech than exchanging copper coins, say.

Splitting a subpath out into a new repository with Git

From time to time you may find that you want to make a new repository from a subpath of an existing repository. Perhaps you’re moving some code out into a library or just want to have a common submodule across projects. Thanks to git, it’s easy to do this without losing the history of that subpath in the process.

The Good Stuff

Splitting a subpath into a repository is a fairly straightforward process, even if the command is hard to remember. For this example, we’ll split lib/ out of the GitHub gem repository, removing empty commits but retaining the path’s history.

git clone git://github.com/defunkt/github-gem.git
# Clone the repository we're going to work with
# Initialized empty Git repository in /Users/tekkub/tmp/github-gem/.git/
# remote: Counting objects: 1301, done.
# remote: Compressing objects: 100% (769/769), done.
# remote: Total 1301 (delta 724), reused 910 (delta 522)
# Receiving objects: 100% (1301/1301), 164.39 KiB | 274 KiB/s, done.
# Resolving deltas: 100% (724/724), done.

cd github-gem/
# Change directory into the repository

git filter-branch --prune-empty --subdirectory-filter lib master
# Filter the master branch to the lib path and remove empty commits
# Rewrite 48dc599c80e20527ed902928085e7861e6b3cbe6 (89/89)
# Ref 'refs/heads/master' was rewritten

Now we have a re-written master branch that contains the files that were in lib/. We can add a remote to the new repository and push, or do whatever we want with the repository.
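
If the goal is a brand-new repository, the rewritten branch can be pushed straight to it. A minimal sketch, assuming a hypothetical empty repository at git@github.com:youruser/github-gem-lib.git:

git remote set-url origin git@github.com:youruser/github-gem-lib.git
# Point origin at the new, empty repository (hypothetical URL)
git push -u origin master
# Publish the rewritten history, which now starts at what used to be lib/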

Fix eth0 network interface when cloning RedHat, CentOS or Scientific virtual machines using Oracle VirtualBox or VMWare

For years we’ve used bash scripting to ensure that all of our server configurations are standardized, so that we can expect servers with the same role to have the exact same configuration profile and exhibit the exact same behavior. We’re in the process of moving to Puppet but we’re not quite there yet. Recently we decided to redesign our network architecture with a higher level of focus around High Availability. This new design introduces Percona XtraDB Cluster and we’re writing our installation and configuration scripts to ensure that our new cluster boxes are both tuned and standardized.

We have base template Linux VMs in both VMWare and Oracle VirtualBox, and also templates that are specific to server roles such as web servers, HAproxy servers or MySQL Cluster servers. If we need to test a new configuration, or add a new node to the cluster, we just clone the appropriate template, add some minor configuration and we’re good.

However, if you clone a VMWare or Oracle VirtualBox VM, you’ll notice that it kills your network interfaces, throwing errors like the one listed below:

#ifup eth0
Device eth0 does not seem to be present, delaying initialisation

What’s happening here is that when you clone your VM, VirtualBox and VMWare apply a new MAC address to your network interfaces, but they don’t update the Linux configuration files to mirror this change. As a result, the kernel can’t find or start the interface that matches its configuration (with the old MAC address), and it instead finds a new interface (with the new MAC address) for which it has no configuration information. The result is that your networking service can only start the loopback interface and eth0 is dead.

So here’s how we fix it:

  1. Remove the kernel’s networking interface rules file so that it can be regenerated 
    # rm -f /etc/udev/rules.d/70-persistent-net.rules

     

  2. Restart the VM 
    # reboot

     

  3. UPDATE your interface configuration file 
    # vim /etc/sysconfig/networking/devices/ifcfg-eth0

    Remove the MACADDR entry or update it to the new MACADDR for the interface (listed in this file: /etc/udev/rules.d/70-persistent-net.rules). A scripted sketch of this edit appears after these steps.

    Remove the UUID entry

    Save and exit the file

  4. Restart the networking service 
    # service network restart
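
For reference, here is a small scripted sketch of steps 3 and 4, assuming you simply want to drop the stale MACADDR and UUID entries rather than update them, and that your interface file lives at the path used above:

# remove the stale MACADDR and UUID lines from the interface config
sed -i '/^MACADDR=/d;/^UUID=/d' /etc/sysconfig/networking/devices/ifcfg-eth0
# restart networking so eth0 comes back up with the cleaned configuration
service network restart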

NFC vs Bluetooth Low Energy

2011 was meant to be “the year of NFC”; so were 2012 and 2013.
However, it seems that earlier this year NFC ran out of steam as far as “mind share” is concerned.

When Google Wallet decided to drop support for loyalty cards — one of the key components of any m-wallet (and one of the core functions of ISIS wallet, via SmartTap) — it became clear that Google had lost interest in NFC and started looking at the alternatives: Google bought Bump, the technology that enables P2P interaction via accelerometers and the cloud, and Google finally opened up Android to BLE (Bluetooth Low Energy, aka Bluetooth Smart).

Apple didn’t include NFC in its latest iPhone models. Instead, Apple is betting heavily on BLE technology.

PayPal introduced BLE-based Beacon, specifically for retail payments (and marketing/advertising).

Tons of smartwatches are coming out soon, all with BLE and, potentially, payment applications. Only a few smartwatches are expected to include passive contactless interfaces.

Projections for NFC-enabled phones vary widely (and are being revised downwards by most analysts); at the same time, almost every new smartphone model out there has BLE.

BLE, BLE, BLE… Does all that mean the end of NFC (in retail)? Well, that depends…

Before we get too excited about BLE, let’s consider the following. One of the key problems of NFC is the lack of infrastructure in retail. Guess what: BLE infrastructure is… zero. Moreover, such forthcoming solutions as PayPal Beacon are not simple “plug and play”: you still need some “add-on” to identify the customer who wants to make the payment, among the dozens of others in BLE range. That’s where NFC actually seems to shine.

NFC is great for instant ad hoc proximity-based “peer to peer” communication.
Other alternatives come nowhere near in that respect. (For now…) There are many areas where NFC is a star: transit, access control, authentication, device pairing, device configuration.

So, what could make it or break it for NFC? The EMV push in the US will lead to further gradual introduction of NFC payment terminals. NFC will come of age, but don’t expect miracles; it could be too little, too late.

On the other hand, if several major players unite to push (a) open, (b) low-cost (think $10-20), and (c) true “plug-n-play” alternative to NFC, based on BLE, that could seriously tip the scales and much faster than anyone can imagine.

Maven — Making both a War and Jar at the same time

Maven nicely automates building WAR files, but it places your compiled classes in WEB-INF/classes instead of making a new jar in /WEB-INF/lib.

If you want your code to be compiled as a .jar as well as a .war, you can do this by specifying the jar goal in the command line:

mvn clean jar install

Note this will make myproject.jar in target/, not in target/myproject/WEB-INF/lib, so you will need to use the Maven Ant plugin (maven-antrun-plugin) to move it where you need it.

But this is not always an option: for deep, modular builds using the reactor, you may want to build your whole thing using one “mvn install”. To do this, do the following:

  1. Specify “war” packaging at the top of your pom.xml.
  2. Then add the following to your build section.
<build>
   <plugins>
      <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-jar-plugin</artifactId>
         <executions>
            <execution>
               <id>make-a-jar</id>
               <phase>compile</phase>
               <goals>
                  <goal>jar</goal>
               </goals>
            </execution>
         </executions>
      </plugin>
   </plugins>
</build>

And that’s it. Run “mvn install” and you get both a jar as well as a war.
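
As a quick sanity check, you can confirm that both artifacts were produced; a sketch, with myproject standing in for whatever your artifactId actually is:

mvn clean install
# Build the module with the extra jar execution configured above
ls target/myproject*.jar target/myproject*.war
# Both the jar and the war should now sit in target/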

What are Namecoins and .bit domains?

One digital currency that you might not have heard of is Namecoin. It is based on exactly the same code as Bitcoin; in fact, the two currencies are almost identical. However, in the same way that Bitcoin is a decentralised currency that cannot be shut down, Namecoin is the basis for a decentralised domain name system (DNS), i.e. web URLs, which could put a stop to Internet censorship.

What is the DNS system?

While we’re all used to typing text addresses into our browser and email programs, such as coindesk.com, the Internet doesn’t run on text. The Internet actually works on numerical addresses called IP addresses, much as we dial telephone numbers. The problem is that numbers are not easy to remember. Therefore, an Internet-wide address book, called the Domain Name System (DNS), was created to make navigation much easier.

Every time you type an address into your browser, your computer or mobile device is actually querying a DNS server. It has to ask for the IP address of the destination server before it can retrieve any data for you. For example, typing “google.com” into your browser will trigger your computer to check its DNS server for Google’s IP address. The DNS server will return a number like 173.194.70.113.

The very last part of a domain, e.g. .com, is called a top-level domain (TLD). TLDs are controlled by central authorities. For example, the .com TLD is controlled by ICANN in the United States. These central authorities allow third party companies, known as registrars, to deal with accepting domain name orders and customer service.

Whenever anyone has a complaint with a website, the central authority for its TLD has the ultimate say on what happens to it. In most real world cases, lawyers, copyright holders, etc., will simply contact the domain’s registrar. However, the potential for commands from a central authority should be of concern to groups who will suffer due to censorship.

How does decentralisation help?

A decentralised DNS system means that TLDs can exist which are not owned by anyone, and the DNS lookup tables are shared on a peer-to-peer system. As long as there are volunteers running the customised DNS server software for the rest of us, then we can always access any alternative domains. Short of seizing the physical servers, authorities cannot impose rules to affect the operation of a peer-to-peer top level domain.

What does this have to do with crypto-currency?

The model of Bitcoin involves a peer-to-peer system where participants are continuously validating a series of transactions without any central control. That model was directly applied to the domain name system by modifying the bitcoin protocol, and the result was called Namecoin (NMC). In particular, a new genesis block was created, so that a whole new block chain would be started. This ensures that Namecoin and Bitcoin do not interact or interfere with each other. Secondly, the developers of Namecoin created several transaction types to reflect the needs of a new domain name system. Because of the shared heritage, there will only ever be 21 million Namecoins created, and 50 coins are generated for each solved block of crypto problems.

How to use Namecoins to register .bit domains

.bit is the first and only TLD of the so-called Domain 2.0 namespace. The actions necessary to register a new domain or to update an existing one are built into the Namecoin protocol by means of the new transaction types mentioned above.

There are three types of Namecoin transaction (source):

  • name_new – Registration cost 0.01 NMC. This constitutes a fixed cost pre-order of a domain.
  • name_firstupdate – Registration cost 0 NMC. Registers a domain, making it publicly visible, subject to variable costs (price calculator).
  • name_update – Registration cost 0 NMC. This is used for updating, renewing or transferring a domain.

All NMC transactions are subject to a 0.005 NMC fee.
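
If you run the reference client yourself, the registration flow maps directly onto those transaction types. A minimal sketch using the namecoind command-line client, assuming a running wallet with sufficient NMC; d/example is a hypothetical name (.bit names are registered under the d/ namespace without the .bit suffix), and <tx> and <rand> stand for the two values returned by name_new:

namecoind name_new d/example
# Pre-order the name; returns a transaction id and a random value needed below
namecoind name_firstupdate d/example <rand> <tx> '{"ip": "203.0.113.10"}'
# Once the pre-order has matured, make the registration public and attach DNS data
namecoind name_update d/example '{"ip": "203.0.113.10"}'
# Renew, update or transfer the name later on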

Even though the Namecoin system effectively makes you your own domain registrar, there are some registration services out there that offer to handle the registration for you and take payment in BTC. Additionally, they offer services such as an (easier) interface to modify domain details and automatic renewal.

How to view .bit websites

Namecoin.com claims to have registered at least 450 domains. According to the Bitcoin Contact website, there is a grand total of 77,000 registered .bit domains (full list here). That’s all well and good, but because they are not part of the standard domain name system, you can’t just type, e.g., wikileaks.bit into your browser and expect to see a website.

Fortunately, there are .bit web proxy servers that will correctly handle your DNS requests in a browser. To make the process even easier, there are extensions, via Namecoin.com, for Firefox and Chrome.

How can Namecoin and Bitcoin complement each other?

While the two digital currencies do not interact, they do rely on exactly the same set of mathematical problems. Therefore, the same hardware used to mine bitcoins can be used to mine Namecoins. Furthermore, there is a process called merged mining, in which a mining machine is configured to query both block chains whenever it comes up with a possible solution to the cryptographic problems. The Dot-bit wiki describes this as entering two lotteries with the same ticket to increase the odds of winning.

How does this affect you?

The chances are that 99% of the people reading this do not need to create a .bit website or service. However, information is power, as the saying goes, and so it is important that you have the capability to access websites and email addresses on the .bit namespace.

Yes, this technology can be abused just like anything else, and so it’s even more imperative that we all have the capability to view the .bit namespace so that we’re aware of the good and the bad.

More important than anything else, however, is that the ability to view .bit websites means attempts to silence those with a legitimate message will have less of a chance of succeeding.

Using git over proxy

I was trying to clone the thrift repository at a client’s office when this problem occurred.
Upon some Googling, I found this really useful link (thanks to Emil Sit),
which explained how git can be used over an HTTP proxy for those git servers
that don’t offer the http method as an alternative/bypass to the git protocol.

Okay, so here’s what needs to be done:

  • Typed the lines below into a shell script called git-proxy and put it in the $HOME/bin directory.
    Of course, its executable bit has to be set with: chmod a+x $HOME/bin/git-proxy.

    #!/bin/sh
    # Use socat to proxy git through an HTTP CONNECT firewall.
    # Useful if you are trying to clone git:// from inside a company.
    # Requires that the proxy allows CONNECT to port 9418.
    #
    # Save this file as git-proxy somewhere in your path
    # (e.g., ~/bin) and then run
    # chmod +x git-proxy
    # git config --global core.gitproxy git-proxy
    #
    #
    # Configuration. Common proxy ports are 3128, 8123, 8000.
    _proxy=yourproxyhost
    _proxyport=yourproxyport
    exec socat STDIO PROXY:$_proxy:$1:$2,proxyport=$_proxyport

    Note: I replaced yourproxyhost with the appropriate proxy-server details
    for my organization and the value for _proxyport with the proxy port
    used in my organization (instead of the default 3128).

  • Installed the package socat with the command: sudo apt-get install socat
  • Ran the command git config --global core.gitproxy git-proxy to configure ‘git’ to use the git-proxy.
  • With the proxy environment variable already set using:
    export https_proxy=https://10.15.11.132:443/

    I could now use git and get the repositories without any problems.
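
For repositories that are reachable over http(s) rather than the git protocol, git can also be pointed at a proxy directly; a small sketch, assuming the same proxy address as above:

git config --global http.proxy http://10.15.11.132:443
# Route http(s) clones and fetches through the corporate proxy
git config --global --unset http.proxy
# Remove the setting again once outside the proxied network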

Redhat developer toolset 1.1

Tru Huynh of centos.org has built the Red Hat Developer Toolset 1.1 for CentOS, and it contains GCC 4.7.2.

So you could simply use his repo and install just gcc, instantly.

cd /etc/yum.repos.d
wget http://people.centos.org/tru/devtools-1.1/devtools-1.1.repo 
yum --enablerepo=testing-1.1-devtools-6 install devtoolset-1.1-gcc devtoolset-1.1-gcc-c++

This will install it most likely into: /opt/centos/devtoolset-1.1/root/usr/bin/

Then you can tell your compile process to use GCC 4.7 instead of 4.4 by setting the CC, CPP and CXX variables:

export CC=/opt/centos/devtoolset-1.1/root/usr/bin/gcc  
export CPP=/opt/centos/devtoolset-1.1/root/usr/bin/cpp
export CXX=/opt/centos/devtoolset-1.1/root/usr/bin/c++

Also worth noting: instead of setting individual variables you can run
 scl enable devtoolset-1.1 bash
(it just starts a new shell with all the appropriate variables already set).
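
You can also wrap a single command instead of opening a whole shell; a quick sketch to verify which compiler gets picked up:

scl enable devtoolset-1.1 'gcc --version'
# Runs only this command inside the devtoolset-1.1 environment, so it should report GCC 4.7.x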

How to install Maven on CentOS

Apache Maven is project management software that manages the building, reporting and documentation of a Java development project. In order to install and configure Apache Maven on CentOS, follow these steps.

First of all, you need to install Java 1.7 JDK. Make sure to install Java JDK, not JRE.

Then go ahead and download the latest Maven binary from its official site. For example, for version 3.0.5:

$ wget http://mirror.cc.columbia.edu/pub/software/apache/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz
$ sudo tar xzf apache-maven-3.0.5-bin.tar.gz -C /usr/local
$ cd /usr/local
$ sudo ln -s apache-maven-3.0.5 maven

 

Next, set up Maven path system-wide:

$ sudo vi /etc/profile.d/maven.sh
export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}

Finally, log out and log in again to activate the above environment variables.
To verify a successful installation of Maven, check its version:

$ mvn -version
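
Alternatively, to pick up the new variables in the current shell without logging out, a small sketch:

$ source /etc/profile.d/maven.sh
# Load M2_HOME and the updated PATH into the current shell
$ mvn -version
# Should now report Apache Maven 3.0.5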

 

Optionally, if you are using Maven behind a proxy, do the following.

$ vi ~/.m2/settings.xml
<settings>
  <proxies>
    <proxy>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.host.com</host>
      <port>port_number</port>
      <username>proxy_user</username>
      <password>proxy_user_password</password>
      <nonProxyHosts>www.google.com</nonProxyHosts>
    </proxy>
  </proxies>
</settings>

Getting Started With Hadoop Using Hortonworks Sandbox

Getting started with a distributed system like Hadoop can be a daunting task for developers. From installing and configuring Hadoop to learning the basics of MapReduce and other add-on tools, the learning curve is pretty high.

Hortonworks recently released the Hortonworks Sandbox for anyone interested in learning and evaluating enterprise Hadoop.

The Hortonworks Sandbox provides:

  1. A virtual machine with Hadoop preconfigured.
  2. A set of hands-on tutorials to get you started with Hadoop.
  3. An environment to help you explore related projects in the Hadoop ecosystem like Apache Pig, Apache Hive, Apache HCatalog and Apache HBase.

You can download the Sandbox from Hortonworks website:

http://hortonworks.com/products/hortonworks-sandbox/

The Sandbox download is available for both VirtualBox and VMware Fusion/Player environments. Just follow the instructions to import the Sandbox into your environment.

The download is an OVA (open virtual appliance), which is really a TAR file.

tar -xvf Hortonworks+Sandbox+1.2+1-21-2012-1+vmware.ova

Untar it; the archive consists of an OVF (Open Virtualization Format) descriptor file, a manifest file and a disk image in VMDK format.

Rackspace Cloud doesn’t let you upload your own images, but if you have an OpenStack based cloud, you can boot a virtual machine with the image provided.

First, you can convert the vmdk image to a more familiar format like qcow2.

qemu-img convert -c -O qcow2 Hortonworks_Sandbox_1.2_1-21-2012-1_vmware-disk1.vmdk hadoop-sandbox.qcow2

file hadoop-sandbox.qcow2
hadoop-sandbox.qcow2: QEMU QCOW Image (v2), 17179869184 bytes

Now, let’s upload the image to Glance.

glance add name="hadoop-sandbox" is_public=true container_format=bare disk_format=qcow2 < /path/to/hadoop-sandbox.qcow2

Now let’s create a virtual server off of the new image – give at least 4GB of RAM.

nova boot --flavor $flavor_id --image $image_id hadoop-sandbox

Once the instance reaches ACTIVE status and responds to ping, you can ssh into it using

  • Username: root
  • Password: hadoop

Watch /var/log/boot.log as the services are coming up, and it will let you know when the installation is complete. This can take about 10 minutes.
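
For example, you could follow that log remotely; a small sketch, with instance_ip standing in for your instance’s address:

ssh root@instance_ip tail -f /var/log/boot.log
# Streams the boot log until the Sandbox services report that setup is complete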

At the end, you should have these java processes running:

jps
2912 TaskTracker
2336 DataNode
2475 SecondaryNameNode
3343 HRegionServer
2813 JobHistoryServer
2142 NameNode
3012 QuorumPeerMain
4215 RunJar
4591 Jps
3568 RunJar
3589 RunJar
1559 Bootstrap
2603 JobTracker
3857 RunJar

Go to the browser at http://instance_ip and your single-node Hadoop cluster should be running. Just follow through the UI; it has demos, videos and step-by-step hands-on tutorials on Hadoop, Pig, Hive and HCatalog.

Make your web site faster

Google’s mod_pagespeed speeds up your site and reduces page load time. This open-source Apache HTTP server module automatically applies web performance best practices to pages, and associated assets (CSS, JavaScript, images) without requiring that you modify your existing content or workflow.

Features
  • Automatic website and asset optimization
  • Latest web optimization techniques
  • 40+ configurable optimization filters
  • Free, open-source, and frequently updated
  • Deployed by individual sites, hosting providers, CDNs

How does mod_pagespeed speed up web-sites?

mod_pagespeed improves web page latency and bandwidth usage by changing the resources on that web page to implement web performance best practices. Each optimization is implemented as a custom filter in mod_pagespeed; these filters are executed when the Apache HTTP server serves the website assets. Some filters simply alter the HTML content, and other filters change references to CSS, JavaScript, or images to point to more optimized versions.

mod_pagespeed implements custom optimization strategies for each type of asset referenced by the website, to make them smaller, reduce the loading time, and extend the cache lifetime of each asset. These optimizations include combining and minifying JavaScript and CSS files, inlining small resources, and others. mod_pagespeed also dynamically optimizes images by removing unused meta-data from each file, resizing the images to specified dimensions, and re-encoding images to be served in the most efficient format available to the user.

mod_pagespeed ships with a set of core filters designed to safely optimize the content of your site without affecting the look or behavior of your site. In addition, it provides a number of more advanced filters which can be turned on by the site owner to gain higher performance improvements.

mod_pagespeed can be deployed and customized for individual web sites, as well as being used by large hosting providers and CDNs to help their users improve the performance of their sites, lower the latency of their pages, and decrease bandwidth usage.

Installing mod_pagespeed

Supported platforms

  • CentOS/Fedora (32-bit and 64-bit)
  • Debian/Ubuntu (32-bit and 64-bit)

To install the packages on Debian/Ubuntu, please run (as root) the following commands:

dpkg -i mod-pagespeed-*.deb
apt-get -f install

For CentOS/Fedora, please execute (also as root):

yum install at  # if you do not already have 'at' installed
rpm -U mod-pagespeed-*.rpm

Installing mod_pagespeed will add the Google repository so your system will automatically keep mod_pagespeed up to date. If you don’t want Google’s repository, do sudo touch /etc/default/mod-pagespeed before installing the package.
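
After restarting Apache with the module loaded, you can check that responses are actually being rewritten; a minimal sketch against a locally served page, since with the default settings the module adds an X-Mod-Pagespeed header to rewritten responses:

curl -s -D - -o /dev/null http://localhost/
# An "X-Mod-Pagespeed: ..." line in the dumped headers means the module is active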

You can also download a number of system tests. These are the same tests available on ModPageSpeed.com.

What is installed

  • The mod_pagespeed packages install two versions of the mod_pagespeed code itself, mod_pagespeed.so for Apache 2.2 and mod_pagespeed_ap24.so for Apache 2.4.
  • Configuration files: pagespeed.conf, pagespeed_libraries.conf, and (on Debian) pagespeed.load. If you modify one of these configuration files, that file will not be upgraded automatically in the future.
  • A standalone JavaScript minifier pagespeed_js_minify based on the one used in mod_pagespeed, that can both minify JavaScript and generate metadata for library canonicalization.

Facebook Events Join the Contextual-Computing Party

Facebook made a tweak to its Events system this week, adding a little embedded forecast that shows projected weather on the day of the event. It’s a small change, but part of a big shift in computing.


Facebook CEO Mark Zuckerberg at a product launch earlier this month. Photo: Alex Washburn/Wired

The new feature, described by Facebook in briefings with individual reporters, pulls forecasts for the location of the event from monitoring company Weather Underground and attaches them to the Facebook pages of events happening within the next 10 days. The data is also shown while the event is being created, helping organizers avoid rained-out picnics and the like.

The change makes Facebook more sensitive to contextual information, data like location and time of day that the user doesn’t even have to enter. Facebook rival Google has drawn big praise for its own context-sensitive application Google Now, which, depending on your habits, might show you weather and the day’s appointments when you wake up, traffic information when you get in your car, and your boarding pass when you arrive at the airport. Google Now was so successful on Android smartphones that Google is reportedly porting the app to Apple’s iOS.

Apple’s own stab at contextual computing, the Siri digital assistant, has been less successful, but that seems to have more to do with implementation issues – overloaded servers, bad maps, and tricky voice-recognition problems – than with the idea of selecting information based on location and other situational data.

Hungry as Facebook is to sell ever-more-targeted ads at ever-higher premiums, expect the social network to add more context-sensitive features. One natural step is putting the Graph Search search engine on mobile phones and tailoring results more closely to location. Another is to upgrade Facebook’s rapidly evolving News Feed, which already filters some information based on your past check-ins, along the same lines. Done right, pushing information to Facebook users based on context could multiply the social network’s utility. Done wrong, it could be creepy on a whole new level.

Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node

In this article I describe how to install, configure and run a multi-broker Apache Kafka 0.8 (trunk) cluster on a single machine. The final setup consists of one local ZooKeeper instance and three local Kafka brokers. We will test-drive the setup by sending messages to the cluster via a console producer and receiving those messages via a console consumer. I will also describe how to build Kafka for Scala 2.9.2, which makes it much easier to integrate Kafka with other Scala-based frameworks and tools that require Scala 2.9 instead of Kafka’s default Scala 2.8.

What we want to do

Here is an overview of what we want to do:

  • Build Kafka 0.8-trunk for Scala 2.9.2.
    • I also provide instructions for the default 2.8.0, just in case.
  • Use a single machine for this Kafka setup.
  • Run 1 ZooKeeper instance on that machine.
  • Run 3 Kafka brokers on that machine.
  • Create a Kafka topic called “zerg.hydra” and send/receive messages for that topic via the console. The topic will be configured to use 3 partitions and 2 replicas per partition.

The purpose of this article is not to present a production-ready configuration of a Kafka cluster. However it should get you started with using Kafka as a distributed messaging system in your own infrastructure.

Installing Kafka

Background: Why Kafka and Scala 2.9?

Personally I’d like to use Scala 2.9.2 for Kafka – which is still built for Scala 2.8.0 by default as of today – because many related software packages that are of interest to me (such as Finagle, Kestrel) are based on Scala 2.9. Also, the current versions of many development and build tools (e.g. IDEs, sbt) for Scala require at least version 2.9. If you are working in a similar environment you may want to build Kafka for Scala 2.9 just like Michael G. Noll did – otherwise you can expect to run into issues such as Scala version conflicts.

Option 1 (preferred): Kafka 0.8-trunk with Scala 2.9.2

Unfortunately the current trunk of Kafka has problems building against Scala 2.9.2 out of the box. Michael G. Noll created a fork of Kafka 0.8-trunk that includes the required fix (a change to one file) in the branch “scala-2.9.2”. The fix ties the Scala version used by Kafka’s shell scripts to 2.9.2 instead of 2.8.0.

The following instructions will use this fork to download, build and install Kafka for Scala 2.9.2:

$ cd $HOME
$ git clone git@github.com:miguno/kafka.git
$ cd kafka
# this branch includes a patched bin/kafka-run-class.sh for Scala 2.9.2
$ git checkout -b scala-2.9.2 remotes/origin/scala-2.9.2
$ ./sbt update
$ ./sbt "++2.9.2 package"

Option 2: Kafka 0.8-trunk with Scala 2.8.0

If you are fine with Scala 2.8 you need to build and install Kafka as follows.

$ cd $HOME
$ git clone git@github.com:apache/kafka.git
$ cd kafka
$ ./sbt update
$ ./sbt package

Configuring and running Kafka

Unless noted otherwise all commands below assume that you are in the top level directory of your Kafka installation. If you followed the instructions above, this directory is $HOME/kafka/.

Configure your OS

For Kafka 0.8 it is recommended to increase the maximum number of open file handles because, due to changes in 0.8, Kafka will keep more file handles open than 0.7 did. The exact number depends on your usage patterns, of course, but on the Kafka mailing list the ballpark figure “tens of thousands” was shared:

In Kafka 0.8, we keep the file handles for all segment files open until they are garbage collected. Depending on the size of your cluster, this number can be pretty big. Few 10 K or so.

For instance, to increase the maximum number of open file handles for the user kafka to 98,304 (change kafka to whatever user you are running the Kafka daemons with – this can be your own user account, of course) you must add the following line to /etc/security/limits.conf:

/etc/security/limits.conf
kafka    -    nofile    98304
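
You can confirm the new limit has taken effect (it applies at the next login); a quick sketch, assuming the kafka user has a login shell:

$ su - kafka -c 'ulimit -n'
# Should print 98304 once the limits.conf change is active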

Start ZooKeeper

Kafka ships with a reasonable default ZooKeeper configuration for our simple use case. The following command launches a local ZooKeeper instance.

Start ZooKeeper
$ bin/zookeeper-server-start.sh config/zookeeper.properties

By default the ZooKeeper server will listen on *:2181/tcp.

Configure and start the Kafka brokers

We will create 3 Kafka brokers, whose configurations are based on the default config/server.properties. Apart from the settings below the configurations of the brokers are identical.

The first broker:

Create the config file for broker 1
$ cp config/server.properties config/server1.properties

Edit config/server1.properties and replace the existing config values as follows:

broker.id=1
port=9092
log.dir=/tmp/kafka-logs-1

The second broker:

Create the config file for broker 2
$ cp config/server.properties config/server2.properties

Edit config/server2.properties and replace the existing config values as follows:

broker.id=2
port=9093
log.dir=/tmp/kafka-logs-2

The third broker:

Create the config file for broker 3
$ cp config/server.properties config/server3.properties

Edit config/server3.properties and replace the existing config values as follows:

broker.id=3
port=9094
log.dir=/tmp/kafka-logs-3

Now you can start each Kafka broker in a separate console:

Start the first broker in its own terminal session
$ env JMX_PORT=9999  bin/kafka-server-start.sh config/server1.properties
Start the second broker in its own terminal session
$ env JMX_PORT=10000 bin/kafka-server-start.sh config/server2.properties
Start the third broker in its own terminal session
$ env JMX_PORT=10001 bin/kafka-server-start.sh config/server3.properties

Here is a summary of the configured network interfaces and ports that the brokers will listen on:

        Broker 1     Broker 2      Broker 3
----------------------------------------------
Kafka   *:9092/tcp   *:9093/tcp    *:9094/tcp
JMX     *:9999/tcp   *:10000/tcp   *:10001/tcp
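
Once all three brokers are up, a quick way to confirm they are listening on the expected ports (a sketch; ss -tlnp works just as well):

$ netstat -tlnp | grep -E '9092|9093|9094'
# One java process per broker should show up, listening on 9092, 9093 and 9094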

Excursus: Topics, partitions and replication in Kafka

In a nutshell Kafka partitions incoming messages for a topic, and assigns those partitions to the available Kafka brokers. The number of partitions is configurable and can be set per-topic and per-broker.

First the stream [of messages] is partitioned on the brokers into a set of distinct partitions. The semantic meaning of these partitions is left up to the producer and the producer specifies which partition a message belongs to. Within a partition messages are stored in the order in which they arrive at the broker, and will be given out to consumers in that same order.

A new feature of Kafka 0.8 is that those partitions will now be replicated across Kafka brokers to make the cluster more resilient against host failures:

Partitions are now replicated. Previously the topic would remain available in the case of server failure, but individual partitions within that topic could disappear when the server hosting them stopped. If a broker failed permanently any unconsumed data it hosted would be lost. Starting with 0.8 all partitions have a replication factor and we get the prior behavior as the special case where replication factor = 1. Replicas have a notion of committed messages and guarantee that committed messages won’t be lost as long as at least one replica survives. Replicas are byte-for-byte identical across replicas.

Producer and consumer are replication aware. When running in sync mode, by default, the producer send() request blocks until the messages sent is committed to the active replicas. As a result the sender can depend on the guarantee that a message sent will not be lost. Latency sensitive producers have the option to tune this to block only on the write to the leader broker or to run completely async if they are willing to forsake this guarantee. The consumer will only see messages that have been committed.

The following diagram illustrates the relationship between topics, partitions and replicas.

The relationship between topics, partitions and replicas in Kafka.

Logically this relationship is very similar to how Hadoop manages blocks and replication in HDFS.

When a topic is created in Kafka 0.8, Kafka determines how each replica of a partition is mapped to a broker. In general Kafka tries to spread the replicas across all brokers (source). Messages are first sent to the first replica of a partition (i.e. to the current “leader” broker of that partition) before they are replicated to the remaining brokers. Message producers may choose from different strategies for sending messages (e.g. synchronous mode, asynchronous mode). Producers discover the available brokers in a cluster and the number of partitions on each, by registering watchers in ZooKeeper.

If you wonder how to configure the number of partitions per topic/broker, here’s feedback from LinkedIn developers:

At LinkedIn, some of the high volume topics are configured with more than 1 partition per broker. Having more partitions increases I/O parallelism for writes and also increases the degree of parallelism for consumers (since partition is the unit for distributing data to consumers). On the other hand, more partitions adds some overhead: (a) there will be more files and thus more open file handlers; (b) there are more offsets to be checkpointed by consumers which can increase the load of ZooKeeper. So, you want to balance these tradeoffs.

Create a Kafka topic

In Kafka 0.8, there are 2 ways of creating a new topic:

  1. Turn on the auto.create.topics.enable option on the broker. When the broker receives the first message for a new topic, it creates that topic with num.partitions and default.replication.factor.
  2. Use the admin command bin/kafka-topics.sh.

We will use the latter approach. The following command creates a new topic “zerg.hydra”. The topic is configured to use 3 partitions and a replication factor of 2. Note that in a production setting we’d rather set the replication factor to 3, but a value of 2 is better for illustrative purposes (i.e. we intentionally use different values for the number of partitions and replications to better see the effects of each setting).

Create the “zerg.hydra” topic
$ bin/kafka-topics.sh --zookeeper localhost:2181 \
    --create --topic zerg.hydra --partitions 3 --replication-factor 2

This has the following effects:

  • Kafka will create 3 logical partitions for the topic.
  • Kafka will create a total of two replicas (copies) per partition. For each partition it will pick two brokers that will host those replicas. For each partition Kafka will elect a “leader” broker.

Ask Kafka for a list of available topics. The list should include the new zerg.hydra topic:

List the available topics in the Kafka cluster
$ bin/kafka-topics.sh --zookeeper localhost:2181 --list
<snipp>
zerg.hydra
</snipp>

You can also inspect the configuration of the topic as well as the currently assigned brokers per partition and replica. Because a broker can only host a single replica per partition, Kafka has opted to use a broker’s ID also as the corresponding replica’s ID.

Describe the zerg.hydra topic
$ bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic zerg.hydra
<snipp>
zerg.hydra
    configs:
    partitions: 3
        partition 0
        leader: 1 (192.168.0.153:9092)
        replicas: 1 (192.168.0.153:9092), 2 (192.168.0.153:9093)
        isr: 1 (192.168.0.153:9092), 2 (192.168.0.153:9093)
        partition 1
        leader: 2 (192.168.0.153:9093)
        replicas: 2 (192.168.0.153:9093), 3 (192.168.0.153:9094)
        isr: 2 (192.168.0.153:9093), 3 (192.168.0.153:9094)
        partition 2
        leader: 3 (192.168.0.153:9094)
        replicas: 3 (192.168.0.153:9094), 1 (192.168.0.153:9092)
        isr: 3 (192.168.0.153:9094), 1 (192.168.0.153:9092)
<snipp>

In this example output the first broker (with broker.id = 1) happens to be the designated leader for partition 0 at the moment. Similarly, the second and third brokers are the leaders for partitions 1 and 2, respectively.

The following diagram illustrates the setup (and also includes the producer and consumer that we will run shortly).

Overview of our Kafka setup including the current state of the partitions and replicas. The colored boxes represent replicas of partitions. “P0 R1” denotes the replica with ID 1 for partition 0. A bold box frame means that the corresponding broker is the leader for the given partition.

You can also inspect the local filesystem to see how the --describe output above matches actual files. By default Kafka persists topics as “log files” (Kafka terminology) in the log.dir directory.

Local files that back up the partitions of Kafka topics
$ tree /tmp/kafka-logs-{1,2,3}
/tmp/kafka-logs-1                   # first broker (broker.id = 1)
├── zerg.hydra-0                    # replica of partition 0 of topic "zerg.hydra" (this broker is leader)
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
├── zerg.hydra-2                    # replica of partition 2 of topic "zerg.hydra"
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
└── replication-offset-checkpoint

/tmp/kafka-logs-2                   # second broker (broker.id = 2)
├── zerg.hydra-0                    # replica of partition 0 of topic "zerg.hydra"
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
├── zerg.hydra-1                    # replica of partition 1 of topic "zerg.hydra" (this broker is leader)
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
└── replication-offset-checkpoint

/tmp/kafka-logs-3                   # third broker (broker.id = 3)
├── zerg.hydra-1                    # replica of partition 1 of topic "zerg.hydra"
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
├── zerg.hydra-2                    # replica of partition 2 of topic "zerg.hydra" (this broker is leader)
│   ├── 00000000000000000000.index
│   └── 00000000000000000000.log
└── replication-offset-checkpoint

6 directories, 15 files

Caveat: Deleting a topic via bin/kafka-topics.sh --delete will apparently not delete the corresponding local files for that topic. I am not sure whether this behavior is expected or not.

Start a producer

Start a console producer in sync mode:

Start a console producer in sync mode
$ bin/kafka-console-producer.sh --broker-list localhost:9092,localhost:9093,localhost:9094 --sync \
    --topic zerg.hydra

Example producer output:

[...] INFO Verifying properties (kafka.utils.VerifiableProperties)
[...] INFO Property broker.list is overridden to localhost:9092,localhost:9093,localhost:9094 (...)
[...] INFO Property compression.codec is overridden to 0 (kafka.utils.VerifiableProperties)
[...] INFO Property key.serializer.class is overridden to kafka.serializer.StringEncoder (...)
[...] INFO Property producer.type is overridden to sync (kafka.utils.VerifiableProperties)
[...] INFO Property queue.buffering.max.messages is overridden to 10000 (...)
[...] INFO Property queue.buffering.max.ms is overridden to 1000 (kafka.utils.VerifiableProperties)
[...] INFO Property queue.enqueue.timeout.ms is overridden to 0 (kafka.utils.VerifiableProperties)
[...] INFO Property request.required.acks is overridden to 0 (kafka.utils.VerifiableProperties)
[...] INFO Property request.timeout.ms is overridden to 1500 (kafka.utils.VerifiableProperties)
[...] INFO Property send.buffer.bytes is overridden to 102400 (kafka.utils.VerifiableProperties)
[...] INFO Property serializer.class is overridden to kafka.serializer.StringEncoder (...)

You can now enter new messages, one per line. Here we enter two messages “Hello, world!” and “Rock: Nerf Paper. Scissors is fine.”:

Hello, world!
Rock: Nerf Paper. Scissors is fine.

After the messages are produced, you should see the data being replicated to the three log directories for each of the broker instances, i.e. /tmp/kafka-logs-{1,2,3}/zerg.hydra-*/.
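
A quick way to see this on disk (a sketch; the exact segment file sizes will differ):

$ ls -l /tmp/kafka-logs-{1,2,3}/zerg.hydra-*/
# The .log segment files under each replica directory grow beyond 0 bytes once messages arrive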

Start a consumer

Start a console consumer that reads messages in zerg.hydra from the beginning (in a production setting you would usually NOT want to add the --from-beginning option):

Start a console consumer
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic zerg.hydra --from-beginning

The consumer will see a new message whenever you enter a message in the producer above.

Example consumer output:

<snipp>
[...] INFO [console-consumer-28434_panama.local-1363174829799-954ed29e], Connecting to zookeeper instance at localhost:2181 ...
[...] INFO Starting ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
[...] INFO Client environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT ...
[...] INFO Client environment:host.name=192.168.0.153 (org.apache.zookeeper.ZooKeeper)
<snipp>
[...] INFO Fetching metadata with correlation id 0 for 1 topic(s) Set(zerg.hydra) (kafka.client.ClientUtils$)
[...] INFO Connected to 192.168.0.153:9092 for producing (kafka.producer.SyncProducer)
[...] INFO Disconnecting from 192.168.0.153:9092 (kafka.producer.SyncProducer)
[...] INFO [ConsumerFetcherThread-console-consumer-28434_panama.local-1363174829799-954ed29e-0-3], Starting ...
[...] INFO [ConsumerFetcherManager-1363174829916] adding fetcher on topic zerg.hydra, partion 2, initOffset -1 to broker 3 with fetcherId 0 ...
[...] INFO [ConsumerFetcherThread-console-consumer-28434_panama.local-1363174829799-954ed29e-0-2], Starting ...
[...] INFO [ConsumerFetcherManager-1363174829916] adding fetcher on topic zerg.hydra, partion 1, initOffset -1 to broker 2 with fetcherId 0 ...
[...] INFO [ConsumerFetcherThread-console-consumer-28434_panama.local-1363174829799-954ed29e-0-1], Starting ...
[...] INFO [ConsumerFetcherManager-1363174829916] adding fetcher on topic zerg.hydra, partion 0, initOffset -1 to broker 1 with fetcherId 0 ...

And at the end of the output you will see the following messages:

Hello, world!
Rock: Nerf Paper. Scissors is fine.

That’s it!

A note when using Kafka with Storm

The maximum parallelism you can have on a KafkaSpout is the number of partitions of the corresponding Kafka topic. The following question-answer thread (Michael G. Noll slightly modified the original text for clarification purposes) is from the Storm user mailing list, but supposedly refers to Kafka pre-0.8 and thus before the replication feature was added:

Question: Suppose the number of Kafka partitions per broker is configured as 1 and the number of hosts is 2. If we set the spout parallelism as 10, then how does Storm handle the difference between the number of Kafka partitions and the number of spout tasks? Since there are only 2 partitions, does every other spout task (greater than first 2) not read the data or do they read the same data?

Answer (by Nathan Marz): The remaining 8 (= 10 – 2) spout tasks wouldn’t read any data from the Kafka topic.

My current understanding is that the number of partitions (i.e. regardless of replicas) is still the limiting factor for the parallelism of a KafkaSpout. Why? Because Kafka does not allow consumers to read from replicas other than the (replica of the) leader of a partition, in order to simplify concurrent access to data in Kafka.

A note when using Kafka with Hadoop

LinkedIn has published their Kafka->HDFS pipeline named Camus. It is a MapReduce job that does distributed data loads out of Kafka.

Where to go from here

The following documents provide plenty of information about Kafka that goes way beyond what I covered in this article:

Awesome MediaWiki theme

For anyone who saw the recent launch of the new oVirt website a while back and was wondering how they could make such an attractive theme and layout for a MediaWiki wiki, wonder no more. In fact, you don’t even have to be jealous! The theme, called Strapping because it’s based on the Bootstrap web framework, has just been published by Garrett on GitHub.

Kudos to Garrett, who did amazing work on this theme to make it as beautiful and reusable as possible. I’m looking forward to using it for other websites in the near future. And so can you!

LinkedIn has just announced the release of Camus

Kafka is a high-throughput, persistent, distributed messaging system that was originally developed at LinkedIn. It forms the backbone of Wikimedia’s new data analytics pipeline.

Kafka is both performant and durable. To make it easier to achieve high throughput on a single node it also does away with lots of stuff message brokers ordinarily provide (making it a simpler distributed messaging system).

LinkedIn has just announced the release of Camus: their Kafka to HDFS pipeline.

 

Connecting to HBase from Erlang using Thrift

The key was to piece together steps from the following two pages:

The Thrift API and the Hbase.thrift file can be found here:
http://wiki.apache.org/hadoop/Hbase/ThriftApi

Download the latest thrift*.tar.gz from http://thrift.apache.org/download/

sudo apt-get install libboost-dev
tar -zxvf thrift*.tar.gz
cd thrift*
./configure
make
cd compiler/cpp
# Hbase.thrift (downloaded from the ThriftApi page above) must be in the current directory
./thrift -gen erl Hbase.thrift

Take all the files in the gen-erl directory and copy them to your application’s /src.
Copy the Thrift Erlang client files from thrift*/lib/erl to your application, or copy/symlink them to $ERL_LIB.
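
Before connecting, the HBase Thrift server has to be running; a minimal sketch, assuming a local HBase installation with HBASE_HOME set and the default Thrift port of 9090:

$HBASE_HOME/bin/hbase-daemon.sh start thrift
# Starts the HBase Thrift gateway, which listens on port 9090 by default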

You can connect using either approach:

{ok, TFactory} = thrift_socket_transport:new_transport_factory("localhost", 9090, []).
{ok, PFactory} = thrift_binary_protocol:new_protocol_factory(TFactory, []).
{ok, Protocol} = PFactory().
{ok, C0} = thrift_client:new(Protocol, hbase_thrift).

Or by using the utility module (the difference between the two approaches still needs investigating):

{ok, C0} = thrift_client_util:new("localhost", 9090, hbase_thrift, []).

Basic CRUD commands

% Load records into the shell
rr(hbase_types).
% Get a list of tables
{C1, Tables} = thrift_client:call(C0, getTableNames, []).
% Create a table
{C2, _Result} = thrift_client:call(C1, createTable, ["test", [#columnDescriptor{name="test_col:"}]]).
% Insert a column value
% TODO: Investigate the attributes dictionary's purpose
{C3, _Result} = thrift_client:call(C2, mutateRow, ["test", "key1", [#mutation{isDelete=false,column="test_col:", value="wooo"}], dict:new()]).
% Delete
{C4, _Result} = thrift_client:call(C3, mutateRow, ["test", "key1", [#mutation{isDelete=true}], dict:new()]).
% Get data
% TODO: Investigate the attributes dictionary's purpose
thrift_client:call(C4, getRow, ["test", "key1", dict:new()]).

 

Are You a Force Multiplier?


On most days, my To Do List seems longer than the Nile River. It contains everything from the quotidian (remember the milk!) to the critical — tasks that trigger serious consequences. On days when it seems like I add two tasks for every one I complete, it can be tempting to focus on the noisiest ones. What are noisy tasks? The tasks with the most pressing deadline or the most vocal sponsor. And so it goes, racing from one due date to another, with barely enough time for a breath, much less a moment to consider the true results of what I am doing.

Writers on productivity, time management and strategy have told us for a long time that we should focus on the IMPORTANT not the URGENT. That’s excellent advice.  However, I’ve recently started thinking about another lens through which to view and prioritize tasks:  Will the completion of the task (or project) act as a force multiplier?

To understand this better, let’s spend a moment on force multiplication.  The military calls a factor a “force multiplier” when that factor enables a force to work much more effectively.  The example in Wikipedia relates to GPS:  ”if a certain technology like GPS enables a force to accomplish the same results of a force five times as large but without GPS, then the multiplier is 5.”  Interestingly, while technology can be an enormous advantage, force multipliers are not limited to technology.  Some of the force multipliers listed in that Wikipedia article have nothing at all to do with technology:

Now come back to that growing To Do List and take another look at those tasks.  How many of them are basically chores — things that simply need to get done in order to get people off your back or to move things forward (perhaps towards an unclear goal)? How many of them are (or are part of) force multipliers — things that will allow you or your organization to work in a dramatically more effective fashion?  Viewed through this lens, the chores seem much less relevant, akin to rearranging the deck chairs on the Titanic, while the force multipliers are clearly much more deserving of your time and attention.

The challenge of course is that the noisy tasks grab your attention because others insist on it.  They want something when they want it because they want it.  They may not have a single strategic thought in their head, but they are demanding and persistent.  So how do you limit the encroachment of purveyors of noisy tasks?  One answer is to limit the amount of time available for chores.  To do this credibly, you’ll need to know where you and your activities fit within the strategy of your organization.  If the task does not advance strategy, don’t do it.  Or decide upfront to allow a fixed percentage of your time for chores that may be of minimal use to you, but may be important to keep the people around you happy.  Another approach is to get a better understanding of the task and its context.  If your job is to copy documents, one page looks much like another.  However, it matters if the document you are copying contains the cafeteria menu or the firm’s emergency response guidelines. Finally, you need to educate the folks around you.  With your subordinates, do your decision making aloud — explaining how you determine if a particular task or project is a force multiplier. With your superiors, ask them to help you understand better the force multiplication attributes they see in the tasks they assign.  (This will either provide you with more useful contextual information or smoke out a chore that is masquerading as an important task.) Finally, with the others, engage them in conversation. When you cannot see your way clear to handle their chore, explain your reasoning.  They won’t always be happy about it, but they will start learning when to call on you and when to dump their requests on someone else.

Of course, the concept of force multiplication goes far beyond your To Do List.  Do your projects have a force multiplying effect on your department?  Does your department have a force multiplying effect on your firm? These are important questions for everyone, but especially for people engaged in the sometime amorphous field of knowledge management. Sure, most of what we do helps.  But do we make a dramatic difference?  If not, why not?

[Photo Credit: Leo Reynolds]

Written By: V Mary Abraham