Is Big Data the Next Billion-Dollar Technology Industry?

Fri, Aug 3, 2012 - 8:33am

It is not news that our capacity to gather and store immense amounts of data has grown by leaps and bounds. A few years ago, it was unthinkable for a free email account to offer more than 10 or 20 megabytes of storage. Today, a free account holds thousands of times that amount. But that's barely scratching the surface compared to the truly massive data-collection projects now under way.

The Large Synoptic Survey Telescope is slated to come online in 2016. Once it's operational, estimates are that it will gather data about our universe at a rate of 140 terabytes every five days, or better than 10 petabytes a year - that's 10,000,000,000,000,000 bytes per year, the equivalent of every book ever written accruing about every two days. And who knows how much info the Large Hadron Collider will be spewing out by then? In 2010 alone, the LHC gathered 13 petabytes' worth. And then there's Google, processing in the neighborhood of 24 petabytes. Per day.

Only a few years ago, a gigabyte (one billion bytes) was thought to be a lot of data. Now it's nothing. Even home hard drives can store a terabyte (one trillion) these days. The commercial and governmental sectors regularly handle petabytes (quadrillion), while researchers routinely chat about the looming frontiers: exabytes (quintillion), zettabytes (sextillion), and yottabytes (septillion). It has not been necessary to name the next one after that. Yet.

But it's not just the Googles and NASAs of the world that are dealing with that kind of data. Virtually every Fortune 500 company has a massive data warehouse where it has accumulated millions of documents and billions of data records from inventory systems, ecommerce systems, and marketing-analytics software.

You bump up against this kind of massive data collection every time you swipe your credit card at Walmart. The retail giant processes more than a million transactions just like yours every hour and dumps the results into a database that currently contains more than 2.5 petabytes of data. That's equivalent to all the information contained in all the books in the Library of Congress about 170 times over.

These increasingly large mounds of data have begun to befuddle even the geekiest members of those organizations.

Our ability to collect massive amounts of data continues to grow at an exponential rate. But the more we collect, the harder it becomes to derive anything meaningful from it. After all, what on earth do you do with all this stuff? How do you sort it? How do you search it? How do you analyze it so that something useful comes out the other end? That's the problem facing developers, whose traditional database-management tools are powerless in the face of such an onslaught. Data stores have far outgrown our ability to keep the data neat, clean, and tidy, and hence easy to analyze. What we have now is a mess of varying types of data - with shifting definitions, inconsistent implementations, even the equivalent of digital freeform - that needs to be analyzed at massive scale. It's a problem of both size and complexity.

Which brings us face to face with the hottest tech buzzwords of 2012: Big Data.

Supersized Us

The idea that data can be supersized is, of course, not new. What is new is a convergence of technologies that deal with it in efficient, innovative, and highly creative ways. Though Big Data is a market still in its infancy, it is consuming increasingly large chunks of the nation's overall IT budget. How much is actually being spent depends on how you define the term; hard numbers are impossible to come by. Conservative estimates put the market somewhere between $20 billion and $55 billion by 2015. Out at the high end, Pat Gelsinger, COO of data-storage giant EMC, claims that it is already a $70-billion market - and growing at 15-20% per year.

Take your pick. But regardless, it's small wonder that venture capitalists are falling all over themselves to throw money at this tech. Accel Partners launched a $100 million Big Data fund last November, and IA Ventures initiated its $105-million IAVS Fund II in February. Even American Express has ponied up $100 million to create a fund to invest in the sector.

Down in Washington, DC, the White House has predictably jumped into the fray, with an announcement on March 29 that it was committing $200 million to develop new technologies to manipulate and manage Big Data in the areas of science, national security, health, energy, and education.

John Holdren, director of the White House's Office of Science and Technology Policy, paid lip service to the private sector, saying that while it "will take the lead on big data, we believe that the government can play an important role, funding big data research, launching a big data workforce, and using big data approaches to make progress on key national challenges."

At the same time, the National Institute of Standards and Technology (NIST) will be placing a new focus on Big Data. According to IT Lab Director Chuck Romine, NIST will be increasing its work on the standards, interoperability, reliability, and usability of Big Data technologies; he predicts that the agency will "have a lot of impact on the big data question."

No shocker, the Department of Defense is also already hip-deep in the sector, planning to spend about $250 million annually - including $60 million committed to new research projects - on Big Data. And of course you know that DARPA (the Defense Advanced Research Projects Agency) has to have its finger in the pie. It's hard at work on the XDATA program, a $100-million effort over four years to "develop computational techniques and software tools for sifting through large structured and unstructured data sets."

If much of this seems a bit fuzzy, here's an easy way of thinking about it: Suppose you own the mineral rights to a square mile of land. In this particular spot, there were gold nuggets lying on the surface and a good deal more easily accessible gold just below ground, and you've already mined all of it. Your operation thus far is analogous to stripping chunks of useful information from the available data using traditional methods.

But suppose there is a lot more gold buried deeper down. You can get it out, and do so cost-effectively, but to accomplish that you have to sink mine shafts deep into the earth and then drive tunnels off at various angles to track the veins of precious-metal-bearing rock (the deepest mine on earth, in South Africa, plunges more than two miles down). That's a much more complex operation, and extracting gold under those conditions is very much like pulling one small but exceedingly useful bit of information out of a mountain-sized conglomeration of otherwise-useless Big Data.

So how do you do it?

You do it with an array of new, exciting, and rapidly evolving tools. But in order to understand the process, you'll first have to learn the meaning of some acronyms and terms you may not yet be familiar with. Sorry about that.

CRM = customer relationship management

ERP = enterprise resource planning

ETL = extract, transform, and load

HDFS = Hadoop Distributed File System, Big Data's default storage layer

SQL = Structured Query Language, the standard language for managing relational databases

NoSQL = "not only SQL," a class of databases that dispense with the relational model

NGDW = next-generation data warehouse

With these in mind, we can now interpret this diagram, courtesy of Wikibon, which lays out the traditional flow of information within a commercial enterprise:

Here you can see that data generated by three different departments - customer relationship management, enterprise resource planning, and finance - are funneled into a processor that extracts the relevant material, transforms it into a useful format (like a spreadsheet), and loads it into a central storage area, the relational data warehouse. From there, it can be made available to whichever end user wants or needs it, whether someone in-house or an external customer.
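
To make the ETL step concrete, here is a minimal sketch of that traditional flow in code - plain JDBC in Java, with hypothetical orders and sales_fact tables standing in for the CRM system and the warehouse, and placeholder host names and credentials. A real enterprise would use a dedicated ETL tool, but the extract-transform-load pattern is the same.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SimpleEtlJob {
        public static void main(String[] args) throws Exception {
            // Extract: pull raw order records from a (hypothetical) CRM database.
            Connection crm = DriverManager.getConnection(
                    "jdbc:postgresql://crm-host/crm", "etl_user", "secret");
            // Load target: the (hypothetical) relational data warehouse.
            Connection warehouse = DriverManager.getConnection(
                    "jdbc:postgresql://dw-host/warehouse", "etl_user", "secret");

            Statement extract = crm.createStatement();
            ResultSet rows = extract.executeQuery(
                    "SELECT customer_id, amount_cents, order_date FROM orders");

            PreparedStatement load = warehouse.prepareStatement(
                    "INSERT INTO sales_fact (customer_id, amount_usd, order_date) VALUES (?, ?, ?)");

            while (rows.next()) {
                // Transform: convert cents to dollars before loading into the warehouse schema.
                load.setLong(1, rows.getLong("customer_id"));
                load.setDouble(2, rows.getLong("amount_cents") / 100.0);
                load.setDate(3, rows.getDate("order_date"));
                load.addBatch();
            }
            load.executeBatch();

            crm.close();
            warehouse.close();
        }
    }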

Enter the Elephant

The old system works fine within certain parameters: the raw amount of input must not be too large; it must be structured in a way that is easy to process (traditionally, in rows and columns); and the desired output must not be too complex. Heretofore, as businesses were interested mainly in such things as generating accurate financial statements and tracking customer accounts, this was all that was needed. But in many ways, it's becoming Stone-Age stuff.

However, the potential input that could be of value to a company has increased exponentially in volume and variety, as well as in the speed at which it is created. Social media, as we all know, have exploded. Some 700 million Facebook denizens, a quarter-billion Twitter users, 150 million public bloggers - all of them and more are churning out content that is being captured and stored. Meanwhile, 5 billion mobile-phone owners are having their calls, texts, IMs, and locations logged. Online transactions of all kinds are conducted by the billions every day. And there are networked devices and sensors all over the place, streaming information.

This amounts to a gargantuan haystack. What's more, much of it consists of material that is only semi-structured, if not completely unstructured, making it impossible for traditional processing systems to handle. So if you're combing the hay, looking for the golden needle - say, two widely separated but marginally related data points that can be combined into a meaningful whole - you won't find it without a faster and more practical method of getting to the object of your search. You must be able to maneuver through Big Data.

Some IT pros could see this coming, and so they invented - ta dah - a little elephant: Hadoop, whose mascot (courtesy of Apache.org) is a cheerful yellow toy elephant.

Hadoop was originally created by Doug Cutting (along with Mike Cafarella) and further developed at Yahoo!; it was inspired by MapReduce, the framework Google built to process the enormous data sets behind its Web index. The basic concept is simple: Instead of poking at the haystack with a single big computer, Hadoop relies on a series of nodes running massively parallel processing (MPP) techniques. In other words, it employs clusters of the smaller, less-expensive machines known as "commodity hardware" - whose components are common and unspecialized - and uses them to break up Big Data into numerous parts that can be analyzed simultaneously.

That takes care of the volume problem and eliminates the data-ingesting choke point caused by reliance on a single, large-box processor. Hadoop clusters can scale up to the petabyte and even exabyte level.

But there's also that other obstacle - namely, that Big Data comes in semi- or unstructured forms that resist traditional analytical tools. Hadoop addresses this with its default storage layer, the Hadoop Distributed File System (HDFS). HDFS is specially tailored to store data that aren't amenable to organization into the neatly structured rows and columns of relational databases; files are simply split into blocks and spread across the cluster.
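
For a taste of what that looks like in practice, here is a minimal sketch - with made-up path names - of loading a local log file into HDFS using Hadoop's Java file-system API. (The same thing can be done from the command line with hadoop fs -put.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadIntoHdfs {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster's settings (core-site.xml, etc.) from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into the distributed file system, where it is
            // split into blocks and replicated across the cluster's nodes.
            Path local = new Path("/var/log/clickstream/2012-08-03.log");   // hypothetical source file
            Path remote = new Path("/data/raw/clickstream/2012-08-03.log"); // hypothetical HDFS target
            fs.copyFromLocalFile(local, remote);

            System.out.println("Stored " + fs.getFileStatus(remote).getLen() + " bytes in HDFS");
            fs.close();
        }
    }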

After the node clusters have been loaded, queries can be submitted to the system, usually written in Java. Instead of returning the relevant data to be worked on in some central processor, Hadoop pushes the analysis out to each node so that it happens in parallel. There is also built-in redundancy: data blocks are replicated across nodes, so if one node fails, another still holds a copy.

The MapReduce part of Hadoop then goes to work with its two functions. "Map" divides the job into parts and processes them in parallel at the node level; "Reduce" aggregates the intermediate results and delivers the answer to the inquirer.
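
To make map and reduce concrete, here is the classic "word count" job - the "hello world" of Hadoop - written against the Java MapReduce API. It's a sketch of the pattern: the map step tokenizes whatever slice of text its node holds and emits (word, 1) pairs, and the reduce step sums the counts for each word. The input and output paths are whatever HDFS locations you pass on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // "Map": each node scans its own slice of the input and emits (word, 1) pairs.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // "Reduce": the framework groups all the 1s emitted for a given word; this step sums them.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // pre-aggregates counts on each node
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g., an HDFS directory of raw text
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // results land back in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The important point is that the programmer writes only these two small functions; the framework takes care of spreading the work across however many nodes the cluster happens to have, and of retrying it elsewhere if a node dies.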

After processing is completed, the resulting information can be transferred into existing relational databases, data warehouses, or other traditional IT systems, where analysts can refine it further. Queries can also be written in SQL - a language with which far more programmers are familiar - and automatically converted into MapReduce jobs by tools layered on top of Hadoop.
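
Apache Hive, one of the add-on projects mentioned a bit further on, is a widely used example of that approach: you write something that looks like ordinary SQL, and Hive compiles it into MapReduce jobs behind the scenes. Here is a hedged sketch of what that might look like from a Java program, assuming Hive's JDBC driver is on the classpath, a Hive server is listening on its default port, and a page_views table has already been defined over files in HDFS - the host, table, and column names are ours.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Register Hive's JDBC driver and connect to the (assumed) Hive server.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");

            // An ordinary-looking SQL query over a hypothetical page_views table;
            // Hive translates it into MapReduce work over the underlying HDFS files.
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                    "SELECT referrer, COUNT(*) AS hits " +
                    "FROM page_views GROUP BY referrer ORDER BY hits DESC LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
            conn.close();
        }
    }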

One of the beauties of Hadoop - now a project of the Apache Software Foundation - is that it is open source. Thus, it's always unfinished. It evolves, with hundreds of contributors continuously working to improve the core technology.

Now trust us, the above explanation is pared down to just the barest of bones of this transformational tech. If you're of a seriously geeky bent (want to play in your very own Hadoop sandbox? - you can: the download is free) or are simply masochistic, you can pursue the subject down a labyrinth that'll force you to learn about a bewildering array of Hadoop subtools with such colorful names as Hive, Pig, Flume, Oozie, Avro, Mahout, Sqoop, and Bigtop. Help yourself.

Numerous small startups have, well, started up in order to vend their own Hadoop distributions, along with different levels of proprietary customization. Cloudera is the leader at the moment, as its big-name personnel lineup includes Hadoop creator Cutting and data scientist Jeff Hammerbacher from Facebook. Alternatively, there is Hortonworks, which also emerged from Yahoo! and went commercial last November. MapR is another name to watch. Unfortunately, the innovators remain private, and there are no pure-investment plays as yet in this space.

It isn't simply about finding that golden needle in the haystack, either. The rise of Hadoop has enabled users to answer questions no one previously would have thought to ask. Author Jeff Kelly, writing on Wikibon, offers this outstanding example (emphasis ours):

"[S]ocial networking data [can be] mined to determine which customers pose the most influence over others inside social networks. This helps enterprises determine which are their 'most important' customers, who are not always those that buy the most products or spend the most but those that tend to influence the buying behavior of others the most."

Brilliant - and now possible.

Hadoop is, as noted, not the be-all and end-all of Big-Data manipulation. Another technology, the "next generation data warehouse" (NGDW), has emerged. NGDWs are themselves massively parallel systems that can work at the tera- and sometimes petabyte level, but they add the ability to return near-real-time results to complex SQL queries. That's a feature lacking in Hadoop, which achieves its efficiencies by operating in batch-processing mode.

The two are somewhat more complementary than competitive, and results produced by Hadoop can be ported to NGDWs, where they can be integrated with more structured data for further analysis. Unsurprisingly, some vendors have appeared that offer bundled versions of the different technologies.

For their part, rest assured that the major players aren't idling their engines on the sidelines while all of this races past. Some examples: IBM has entered the space in a big way, offering its own Hadoop platform; Big Blue also recently acquired a leading NGDW, as did HP; Oracle has a Big-Data appliance that pairs Cloudera's Hadoop distribution with its own NoSQL database; EMC scooped up data-warehouse vendor Greenplum, which now offers its own Hadoop distribution; Amazon employs Hadoop in its Elastic MapReduce cloud; and Microsoft will support Hadoop on its Azure cloud.

And then there's government. In addition to the executive-branch projects mentioned earlier, there is also the rather creepy new $2-billion NSA facility being built in Utah. Though its purpose is top secret, what is known is that it's being designed with the capability of storing and analyzing the electronic footprint - voice, email, Web searches, financial transactions, and more - of every citizen in the US. Big Data indeed.

The New Big World

From retail to finance to government to health care - where an estimated $300 billion a year could be saved by the judicious use of Big Data - this technology is game-changing. Not necessarily for the better, as the superspy facility may portend.

And even outside the NSA, there are any number of serious implications to deal with. Issues related to privacy, security, intellectual property, liability, and much more will need to be addressed in a Big-Data world.

We'd better get down to it, because this tech is coming right at us - and it is not stoppable.

In fact, the only thing slowing it at all is a shortage of expertise. It has all happened so fast that data scientists with the proper skill sets are in extremely short supply - a situation projected to get worse before it gets better. Management consulting firm McKinsey & Co. predicts that by 2018, "the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as [a further shortage of] 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions."

If you know any bright young kids with the right turn of mind, this is definitely one direction in which to steer them.

The opportunity exists not just for aspiring information-miners. Just as the relational database - which started as a set of theoretical papers by an IBM researcher fed up with the status quo in the field - has grown from academic experiments and open-source projects into a multibillion-dollar-per-year industry with players like Microsoft, Oracle, and IBM, so too is Big Data at the beginning of a rapid growth curve. From today's small companies and hobby projects will come major enterprises. Stories like MySQL - an open-source project acquired by Sun Microsystems for $1 billion in 2008 - are coming in Big Data.

While there's no pure way to invest in the innovators working to manage Big Data, there are opportunities in technology that - so far - are under most investors' radar screens. One involves directly investing in peer-to-peer lending, which Alex Daley - our chief technology investment strategist and editor of Casey Extraordinary Technology - will detail at the Casey Research/Sprott, Inc. Navigating the Politicized Economy Summit in Carlsbad, California from September 7-9.

Alex will be joined by a host of wildly successful institutional investors that includes our own Doug Casey... Eric Sprott of Sprott, Inc... resource investing legend Rick Rule... as well as many other financial luminaries, some of whom you've probably never heard of (but only because they have no interest in the limelight). Together, they'll show you how governmental meddling in markets has created a politicized economy that works against most investors. More important, they'll provide you with investment strategies and specific stock recommendations that they're using right now to leverage this new economy to protect and grow their own wealth.

Navigating the Politicized Economy is a rare opportunity to not only discover the action plans of some of the most successful investors in the world, but to also ask them specific questions about your portfolio and stocks you've been following. Just as important, you'll also have the opportunity to network with like-minded individuals, including the executives of some of our favorite companies and the research team of Casey Research.

Right now, you can save on registration, but this opportunity is only good through 11:59 p.m. August 3. Get the details and reserve your seat today.

About the Authors

Editor: Casey's Gold & Resource Report
dhornig [at] caseyresearch [dot] com

Senior Editor
info [at] caseyresearch [dot] com