Three Ways to Build a Big Data System


This is an excerpt from Chapter 10, “Doing Business in a Big Data World,” of Dale Neef’s book Digital Exhaust: What Everyone Should Know About Big Data, Digitization and Digitally Driven Innovation. Neef is a technology consultant, speaker and author who focuses on big data management, electronic monitoring and reporting.

In the chapter, Neef explores the architectural, organizational and security issues that organizations must take into account when planning a big data system and integrating Hadoop clusters, NoSQL databases and other big data technologies with their current systems.

Many organizations wonder whether the benefits of big data research and analysis justify disrupting their infrastructure, and which approach to combining these two different frameworks is best for their particular organization. Three main configuration choices are available to them.

1. Do it yourself, building on a company’s current IT structure

While not entirely happy with their current level of data capture and analysis, most companies considering adopting big data technologies already have a well-endowed, relatively modern IT framework based on relational database (RDB) management systems and conventional data storage.

Any business already managing a large amount of structured data with enterprise systems and data warehouses is therefore quite familiar with the day-to-day issues of large-scale data management. It would seem natural for these companies to assume that, since big data is the next step in the evolution of information technology, it would make sense for them simply to build a NoSQL/Hadoop-like infrastructure on their own, embedded directly within their current conventional framework. Indeed, ESG, the IT consulting and market research firm, estimated that by early 2014 more than half of large organizations would have started this type of do-it-yourself approach. As we have seen, as open source software, the price of a Hadoop-like framework (free) is attractive, and getting started is relatively easy, provided the company has employees with the skills required to begin working on Hadoop applications using either in-house data or data stored in the cloud.
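As a rough illustration of what “beginning to work on Hadoop applications” can look like in practice, the sketch below is the classic entry-level MapReduce job written against Hadoop’s Java API: it counts token frequencies across a set of unstructured text files (logs, product reviews, and so on). It is a generic starter example under my own assumptions, not code from Neef’s book; class names and the input/output paths passed on the command line are arbitrary.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The canonical "word count" starter job: map each line of unstructured text
// to (token, 1) pairs, then sum the counts per token in the reducer.
public class TokenCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);     // emit (token, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);     // emit (token, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "token count");
    job.setJarByClass(TokenCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same job runs unchanged whether the input directory lives on an on-premises HDFS cluster or in cloud storage, which is part of what makes this kind of low-cost experimentation attractive.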

There are also various ways of experimenting with Hadoop-like technologies using data outside of normal business operations, through pilot programs, or through what Paul Barth and Randy Bean, on the Harvard Business Review blog network, describe as an “analytical sandbox,” in which companies can try out big data analytics on structured and unstructured data to see what kinds of patterns, correlations or insights they can uncover.

But experimenting with a few Hadoop/NoSQL applications for the marketing department is a far cry from developing a fully integrated big data system capable of capturing, storing, and analyzing large, multistructured data sets. In fact, successful implementations of enterprise-wide Hadoop frameworks are still relatively rare and remain primarily the domain of very large, data-intensive companies in the financial services or pharmaceutical industries. As we’ve seen, many of these big data projects still primarily involve structured data and rely on SQL and relational data models. Large-scale analysis of totally unstructured data, for the most part, still remains in the rarefied domain of powerful internet technology companies like Google, Yahoo, Facebook, and Amazon, or large retailers like Wal-Mart.


Since so many big data projects are still largely based on structured or semi-structured data and relational data models that complement today’s data management operations, many companies are turning to their primary support vendors – like Oracle or SAP – to help them bridge the old and the new and integrate Hadoop-like technologies directly into their existing data management approach. Oracle, for example, claims that its preconfigured Big Data Appliance – once the various costs are factored in – is almost 40% cheaper than an equivalent self-built system and can be up and running in a third less time.

And, of course, the more big data technologies are integrated directly into a company’s IT framework, the greater the complexity and the potential for data sprawl. Depending on the configuration, full integration into a single, massive data pool (as big data purists advocate) means pulling unstructured and dirty data into a company’s central data pool (even if that data is distributed) and potentially sharing it for analysis, copying and possibly modification by various users across the enterprise, often using different configurations of Hadoop or NoSQL written by different programmers for different reasons. Add to that the need to hire expensive Hadoop programmers and data scientists. For traditional RDB managers, this kind of approach raises the specter of untold additional data disasters, costs and salvage work for already overwhelmed IT staff.

2. Let someone else do it in the cloud

The obvious alternative to the build-it-yourself approach is to effectively rent the major big data computing and storage applications through a Hadoop-like cloud solution, pulling your own organization’s data into a common repository held in the cloud and accessible to (or potentially even fully administered by) your own data engineers. In this scenario, the cloud-based repository can include both structured and unstructured data and can be kept completely separate from the company’s day-to-day structured operational, financial and transactional data, which would remain confined within the company’s enterprise and relational database management systems. This approach requires a bit of thought and data management up front, but once the cloud repository of structured and unstructured data is available, companies can experiment with large datasets and big data analytics technologies in the cloud, regardless of the underlying framework.
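One reason this works is that Hadoop’s file system API abstracts over where the repository actually lives. The short sketch below shows the general idea, assuming an S3-compatible cloud store accessed through Hadoop’s s3a connector; the bucket, path and credential settings are hypothetical placeholders, not anything from the book.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Peek at raw records in a cloud-hosted repository using the same
// Hadoop FileSystem API that would be used against on-premises HDFS.
public class CloudRepositoryPeek {
  public static void main(String[] args) throws Exception {
    // Hypothetical bucket and object path: swapping the scheme to hdfs://
    // would point the same code at an in-house cluster instead.
    Path path = new Path("s3a://example-analytics-bucket/raw/social/feed.json");

    Configuration conf = new Configuration();
    // Credentials normally come from the environment or a credentials
    // provider; the property names below are shown only to indicate
    // where they would plug in.
    // conf.set("fs.s3a.access.key", "...");
    // conf.set("fs.s3a.secret.key", "...");

    FileSystem fs = path.getFileSystem(conf);
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      int shown = 0;
      while ((line = reader.readLine()) != null && shown < 10) {
        System.out.println(line);   // print the first few raw records
        shown++;
      }
    }
  }
}
```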

The best thing about this approach – besides the fact that businesses don’t have to buy and maintain hardware and software infrastructure – is that it’s scalable. Businesses can experiment with different types of data from different sources without a huge initial capital investment. Projects can be as small (analyzing a handful of products, customers, or social media sites) or as complex as a business wants. And, more importantly, a business does not need to modify its current systems or run a parallel internal system itself.

It seems like the perfect solution, but, as always, there are downsides. First, even if these rented technologies really are capable of handling extremely variable data, that doesn’t mean the resulting patterns or correlations will be meaningful unless a thorough process of cleaning and sorting the data is carried out first. While cloud-based tools have obvious advantages, every business has different data and analytical requirements, and as we’ve seen in the past, universal tools are rarely as productive or as easy to use as advertised. And, of course, when the reports come back with skewed results (and after a futile effort to fix the technical issues themselves), marketing or sales users will most likely turn to IT for help anyway. Essentially, this means that a good chunk of the IT staff still needs to be engaged in big data management and trained in the tools and data schema preparation that will allow this approach to work. And, as stated earlier, at the end of the day, using small subsets of data, even when that data comes from various sources and is analyzed with Hadoop or NoSQL technologies, is really more conventional business intelligence (with bells and whistles) than big data.
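To make the “cleaning and sorting” point concrete, here is a deliberately tiny, illustrative sketch of the kind of normalization pass that has to happen before any analysis is trusted; real pipelines would also deal with encodings, schema mismatches, timestamps and source-specific quirks, and the class and method names here are my own.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// A minimal cleaning pass over messy text records: trim, collapse
// whitespace, lowercase, drop blanks and exact duplicates.
public class RecordCleaner {

  // Normalize one raw record, or return null if it is unusable.
  static String clean(String raw) {
    if (raw == null) {
      return null;
    }
    String s = raw.trim()
                  .replaceAll("\\s+", " ")   // collapse runs of whitespace
                  .toLowerCase();
    return s.isEmpty() ? null : s;
  }

  // Clean a batch of records, preserving the original order.
  static List<String> cleanAll(List<String> rawRecords) {
    Set<String> seen = new LinkedHashSet<>();
    for (String raw : rawRecords) {
      String cleaned = clean(raw);
      if (cleaned != null) {
        seen.add(cleaned);
      }
    }
    return new ArrayList<>(seen);
  }

  public static void main(String[] args) {
    List<String> raw = List.of("  Acme Widget ", "ACME   widget", "", "  ", "Beta Gadget");
    System.out.println(cleanAll(raw));  // [acme widget, beta gadget]
  }
}
```

Skipping a step like this is exactly how “skewed results” end up back on IT’s desk.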

Cloud-based providers are obviously aware of these issues. They know that for this model to work, they need to make their offerings as simple, flexible, and powerful as possible. A good example is the strategic alliance between Hortonworks and Red Hat (Hortonworks provides the Hadoop distribution and Red Hat provides the cloud-based storage), which, they say, includes preconfigured, user-friendly and reusable data models and puts the focus on collaborative customer support.

3. Running parallel database frameworks

A third configuration involves building a big data system separately from, and in parallel with (rather than integrated into), the company’s existing production and enterprise systems. In this model, most businesses still leverage the cloud for data storage but develop and experiment with company-owned big data applications themselves. This two-track approach allows the business to build the big data framework of the future while creating valuable resources and proprietary knowledge within the business. It provides full internal control in exchange for duplicating much of the functionality of the current system, and it allows for a future migration to a full-fledged big data platform in which the two systems (conventional and big data) eventually merge.

The problem with this approach is that, in many ways, the very nature of a big data framework is different from conventional computing. Traditional computing still revolves around applications, operating systems, software interfaces, hardware, and database management, while big data involves some database work but is mostly about complex analysis and the structuring of meaningful relationships – something that requires a different skill set than what is found in most IT departments today. Although this side-by-side configuration assumes a certain economy of scale (sharing existing computing power, using existing staff, and so on), the reality is that these savings come at the expense of complicated interfaces between the old and new systems that need to be designed and managed.

