The Emperor’s New Clothes… Resisting the Hadoop Imperative

Did You Jump on the Hadoop Bandwagon? Did You Do it for the Right Reasons?

I recently saw a statistic that only 15% of Hadoop projects are in production. So that would mean that 85% of the Hadoop projects are not yet in production, right? Hmmm…could that be because Hadoop is not the right choice for all data problems?

When the only tool you have is a hammer, everything looks like a nail.

As a builder of data warehouses, I have been confounded by the proliferation of Hadoop and its ancillary “big data” products into the data warehouse ecosystem. Don’t get me wrong: Hadoop has its place for certain use cases. And there are environments with such high talent density and capacity that they can handle the complexity without blinking an eye, Workday or Netflix for example.

Unfortunately, many people don’t get that Hadoop comes with a price. A fairly steep price. Yes, the software may be all or mostly open source, so your licensing costs will be low, and it can run on commodity machines. But the complexity, and the number of tools and skills your people will have to learn and integrate into your existing environment, or that you will have to pay someone else to provide (and integrate), can be EXPENSIVE. Especially if your business’s core focus does not require a lot of Java engineers.

In 20 years of database and data warehouse experience I have never needed to know Java. Not once. In 2006, I did a tutorial and learned the basics of Java. Then last year, I took a Hadoop class. Guess what: the primary programming language used with Hadoop is Java. So I did another whole set of online Java courses. There is nothing wrong with Java; tons of systems and applications are written in it. It is just that not many data warehouse professionals have had to learn it to do their jobs.

Hadoop Responsiveness and Complexity:

But wait, you might say, there are constructs, indeed a whole ecosystem, built on top of Hadoop that let you use SQL. Sure, but at what cost and what complexity? And what about the time lag? The fundamental architecture of Hadoop and the Hadoop Distributed File System (HDFS) is divide and conquer. If you have petabytes of data to analyze, that can be a good strategy, but the user should not expect instant results. They submit their query to Hive, which translates it into Java MapReduce jobs and submits them to the master machine, which sends the work out to all of the slave machines, each processing a portion of the data; the partial results are then aggregated and sent back to the master, back to Hive, and finally back to the user. Hadoop was created to handle big data for less money; it is not trying to be responsive, nor is it designed for responsiveness with small data sets.
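
To make the latency point concrete, here is a minimal, hypothetical HiveQL sketch (the table and column names are made up): the query reads like ordinary SQL, but Hive still compiles it into batch MapReduce jobs that fan out across the cluster.

    -- Hypothetical Hive table backed by files in HDFS; names are illustrative only.
    SELECT region, SUM(sale_amount) AS total_sales
    FROM sales_events
    WHERE sale_date >= '2015-01-01'
    GROUP BY region;
    -- Behind the scenes Hive compiles this into MapReduce jobs, distributes the work
    -- to the data nodes, and aggregates the partial results. Expect minutes, not
    -- milliseconds, even when the answer is only a handful of rows.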

So how is it that so many projects are trying to move to Hadoop?

I have two theories:

1. The Bandwagon effect or simply “peer pressure”:

Back when Hadoop and HDFS first started getting (justifiably) positive publicity, someone senior, perhaps the CIO, read an article extolling the virtues of Hadoop. He was curious, so he asked his direct reports about it, say at the director level. The directors may or may not have heard of Hadoop yet, but not wanting to look bad, and perhaps interpreting curiosity as interest and intention, they started researching it and discussing it with their direct reports. Before long the scuttlebutt around the water cooler was that the CIO wants to do Hadoop, and if you want to look good, get promoted and get a bonus, you should be doing Hadoop. Or at least putting it in your budget and planning for it.

Then word got out that company ABC was doing Hadoop projects, and company XYZ’s CIO didn’t want to fall behind the competition, so he started talking to his direct reports and the whole thing repeated itself.

At both companies there may have been people who questioned this wisdom, who knew or suspected that Hadoop was not the right tool, but as in the story of the emperor’s clothes, nobody wants to be the only one who doesn’t see it. We don’t want to look stupid.

And there is a corollary to the story of the emperor’s clothes: the kid who pointed out that the emperor was naked did not win the Emperor’s royal scholarship that year, or any year thereafter.

2. Hot new tool (buzz word) effect:

I did hear one other explanation for why some companies might be doing Hadoop projects even though their use case doesn’t fit the “big data” profile: they do Hadoop to keep their good people. The theory is that if you don’t let your people work with the hot new tools and add the buzzwords to their resumes, they will go somewhere else that will. I’ve been ruminating on that one for a while. At first I accepted it as perhaps a necessary inefficiency for keeping good people. After thinking about it and discussing it with a variety of friends and acquaintances in the industry, I’ve come to the conclusion that there needs to be some other way to engage your people in designing, architecting and implementing an appropriate, efficient solution with a minimal amount of waste.

An example from the field: working with a client recently, I was the data architect on an outsourced project converting an end-user-created manual legacy process to an engineered Informatica and Oracle implementation that could be supported by the IT department. About halfway through the project, one of the corporate architects came by and asked, “Why aren’t you doing this in Hadoop?” The senior ETL architect and I looked at each other, then looked at him, a little dumbfounded… Um, because we are only processing 20 gigabytes of data once a month?

An Alternative to Hadoop (for many use cases):

Until recently, I did not have a good alternative to this complexity-inflating, budget-killing, risky tendency to try to put everything in Hadoop. Then I attended the Denver Agile Analytics meetup, where the presenter that night was Kent Graziano, the senior evangelist for Snowflake Computing. His presentation was about his experience and some of the techniques used to do agile data warehouse development.

After his agile BI presentation he did a separate presentation on Snowflake Computing.

It rocked my world!

Snowflake is a new elastic data warehouse engine built from the ground up to run data warehouses on Amazon Web Services (AWS).

Kent referenced his blog post, 10 Cool Things I Like About Snowflake, and went through some of the top 10 things he likes about Snowflake.

In my next blog post, titled “Snowflake – Will it give Hadoop a run for the money?”, I will tell you why I am so excited about this product and its many useful features. In a nutshell, it reduces the complexity by at least an order of magnitude and allows for the delivery of data warehouses at a whole new pace. At a recent Cloud Analytics City Tour, a Snowflake customer presented how they had deployed a fully functional data warehouse from scratch in 90 days. With traditional data warehouse tools and vendors it can take more time than that just to negotiate the vendor contracts.


Bernie’s List of Frequently Observed DB Performance Issues

I have put this list together a couple of times over the years. This one is adapted from a list I did for a client several years ago. That particular client had a lot of .NET programmers who had been pressed into double duty as database programmers. I have seen these types of things in many different environments.

This was written primarily from an Oracle perspective, but many of the concepts hold for Microsoft SQL Server and most other databases I have worked with.

My plan is to edit this to include other items observed since I put this list together.

1. Unnecessary use of DISTINCT – Forces Oracle to sort and de-duplicate, which is expensive on large record sets. If you know your data and have unique indexes, don’t use DISTINCT.
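
A minimal sketch of what I mean, with hypothetical table and column names (order_id here is assumed to be the primary key, so it is already unique):

    -- The DISTINCT below adds a sort and de-duplication pass for nothing,
    -- because order_id is already unique.
    SELECT DISTINCT order_id, order_date
    FROM   orders;

    -- Same result set, no extra work:
    SELECT order_id, order_date
    FROM   orders;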

2. UNION instead of UNION ALL – Very similar to the DISTINCT issue; it forces Oracle to sort and de-duplicate. If your result set is already unique, don’t make Oracle do the extra work.
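
Again a hypothetical sketch: the two branches below can never return the same row, so UNION’s implicit sort and de-duplication is wasted effort.

    -- UNION sorts and de-duplicates the combined set:
    SELECT order_id, order_total FROM orders_east
    UNION
    SELECT order_id, order_total FROM orders_west;

    -- UNION ALL returns the same rows without the extra sort:
    SELECT order_id, order_total FROM orders_east
    UNION ALL
    SELECT order_id, order_total FROM orders_west;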

3. LOGGING instead of NOLOGGING for Operational Data Store (ODS), Data Mart (DM), and Materialized View (MV) objects that hold no system-of-record information. – This causes more disk I/O than necessary, takes up space in the archive logs, and takes more resources to back up this redundant information.

4. Logging indexes – Very similar to #3. There is no data in an index that can’t be recreated, and logging takes extra I/O and backup space.
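
A quick sketch covering items 3 and 4, using made-up ODS object names. One caveat: NOLOGGING mainly reduces redo for direct-path loads and index builds; conventional inserts and updates are still logged.

    -- ODS table that can always be rebuilt from the source system:
    ALTER TABLE ods_customer_dim NOLOGGING;

    -- Index data can always be recreated, so don't log it either:
    CREATE INDEX ods_customer_dim_ix1
        ON ods_customer_dim (customer_key)
        NOLOGGING;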

5. Excessive or complex use of db_links.
a. Referencing a table through a db link takes network I/O.
b. Complex SQL involving links is more difficult to tune and sometimes causes Oracle ORA-00600 errors. Where possible, architect intense queries to reference tables on the database you are running on. Where possible, reserve db links for small, simple selects and materialized views.
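
For illustration (the link and table names are hypothetical), compare pulling a table across a db link inside a heavy join with keeping a small, periodically refreshed local copy:

    -- Every execution drags customer rows across the network:
    SELECT o.order_id, c.customer_name
    FROM   orders o
    JOIN   customers@crm_link c ON c.customer_id = o.customer_id;

    -- A simple materialized view keeps a local copy for the intense queries:
    CREATE MATERIALIZED VIEW customers_local
    REFRESH COMPLETE ON DEMAND
    AS SELECT customer_id, customer_name FROM customers@crm_link;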

6. Failure to use direct path when viable.
a. Look at using the /*+ append */ hint in big INSERT statements.
b. When loading external data with SQL*Loader, use direct=true to get a direct path load.
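
A hedged sketch with hypothetical table names; the /*+ append */ hint asks Oracle for a direct-path insert, writing blocks above the high-water mark instead of going through the buffer cache.

    INSERT /*+ append */ INTO sales_fact
    SELECT * FROM stg_sales_fact;
    COMMIT;  -- a direct-path insert must be committed before the session can query the table again

    -- For SQL*Loader, direct path is requested on the command line
    -- (direct=true in the sqlldr invocation) rather than in the SQL itself.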

7. No use or mis-use of Oracle Parallel Query.
a. If CPU is underutilized, parallel query can provide a big improvement on queries that do full table scans of large tables, or when inserting a lot of data into tables.
b. Tune the query before using parallel query. Don’t add parallelism to a badly tuned query.
c. Don’t overdo it. A parallel degree of 2-6 is generally plenty. Test your query with parallelism of 2, 4, 6, etc.; generally you will see very little improvement in execution time between 4 and 6 or higher. Every rule has exceptions.
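
A small, hypothetical example of 7a-7c: an already tuned, full-scan aggregation over a large fact table, given a modest degree of parallelism.

    -- Degree 4 is a reasonable starting point to test; raise it only if it clearly helps.
    SELECT /*+ parallel(f, 4) */
           product_key, SUM(sale_amount) AS total_sales
    FROM   sales_fact f
    GROUP  BY product_key;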

8. Missing indexes or over-indexing. Self-explanatory?

9. Bitmap indexes on OLTP tables. – Oracle’s locking mechanism locks too many records with bitmap indexes, which can cause issues when used in an OLTP environment.

10. Indexes missing on FK columns. Many times we look up records based on the value of a FK column; if so, an index will probably help.
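
A minimal example with made-up names: order_items.order_id is a foreign key back to orders and is used constantly in joins and lookups.

    -- Index the FK column used for lookups and joins; in Oracle an unindexed
    -- foreign key can also cause extra locking when parent rows are deleted.
    CREATE INDEX order_items_order_fk_ix
        ON order_items (order_id);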

11. Failure to use incremental/fast refresh technology – Use change data capture to propagate just the changes. This can be much faster than full refreshes if the data changes are small compared to the full volume. It is more complex to set up, but frequently worth it.
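
A rough sketch of the Oracle flavor of this, with hypothetical names: a materialized view log captures just the changed rows so the MV can be fast-refreshed instead of fully rebuilt. (Fast-refreshable aggregate MVs have a number of additional requirements, so treat this as the shape of the solution, not a complete recipe.)

    CREATE MATERIALIZED VIEW LOG ON sales_fact
        WITH SEQUENCE, ROWID (product_key, sale_amount)
        INCLUDING NEW VALUES;

    CREATE MATERIALIZED VIEW sales_by_product_mv
    REFRESH FAST ON DEMAND
    AS
    SELECT product_key,
           SUM(sale_amount)   AS total_sales,
           COUNT(sale_amount) AS cnt_sales,
           COUNT(*)           AS cnt_rows
    FROM   sales_fact
    GROUP  BY product_key;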

12. Materializing a view instead of tuning it or going to the ODS/DW for the information. – For a variety of reasons some queries were difficult or impossible to tune, so MVs were used as a band-aid. Sometimes they were placed in a less than optimal location.

13. Misplacement of data objects. Place objects so that dependencies are minimized, or at least flow in one direction.

14. If you must truncate and reload a large table with many indexes, most times it will be significantly faster to drop the indexes, load the data, and re-apply the indexes. Remember to use the /*+ append */ hint.
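
A sketch of that reload pattern, with hypothetical table and index names:

    DROP INDEX sales_fact_prod_ix;
    TRUNCATE TABLE sales_fact;

    INSERT /*+ append */ INTO sales_fact
    SELECT * FROM stg_sales_fact;
    COMMIT;

    -- Rebuild the index after the load; NOLOGGING keeps the rebuild cheap (see item 4).
    CREATE INDEX sales_fact_prod_ix
        ON sales_fact (product_key) NOLOGGING;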

15. Failure to log metadata. You need to capture good metadata about your jobs and any errors. It makes troubleshooting easier, and you can’t manage what you don’t measure.

16. Failure to gather statistics, or failure to gather statistics correctly. This one is so basic it probably belongs at the top of the list. In order for Oracle’s cost-based optimizer to make smart decisions about how to access the data needed to fulfill a query, it needs to know some information about the size and distribution of the data in question. This is particularly critical as the complexity of the query increases. This subject could consume its own blog post, and I assume there are many articles, if not whole books, already written on it, so I will just touch on some high points.
Statistics should be gathered on tables and indexes “anytime there is a significant change” to the data. In an OLTP system, the data changes are a function of end users using applications and doing business. It might take several days, weeks or even months for the changes to a table to become “significant”, so many DBAs will set up a weekly batch job to update statistics on any tables that have significant changes (Oracle keeps track of the changes, so it knows). However, on reporting systems that are loaded in batch mode, best practice dictates that the ETL architect/lead developer design in a mechanism to track the quantity of the changes and, if necessary, gather statistics immediately after updating a particular table. Some common mistakes I’ve seen over the years:
a. No stats gathered at all.
b. Stats gathered at inappropriate and intermittently catastrophic times. The DBA sets up a weekly script that works fine most of the time, except one week an off-cycle batch job truncates tables right before the stats job runs, then loads the tables right after the stats job has gathered stats on empty tables. Monday morning nothing works because Oracle thinks the tables are empty.
c. Over-collecting: either too often, or gathering stats on too much of the data. Using the auto sample size gets the right amount of data sampled most of the time.
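
As a minimal, hypothetical example of doing this right after a batch load (the schema and table names are made up), let DBMS_STATS pick the sample size and cascade to the indexes:

    BEGIN
       DBMS_STATS.gather_table_stats(
          ownname          => 'DW',
          tabname          => 'SALES_FACT',
          estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
          cascade          => TRUE);   -- gather index stats as well
    END;
    /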


Basic Configuration Management Tools for (almost) free

Starting in September 2006, I took a hiatus of about a year from my career as a data warehouse focused Data Architect, Data Modeler and ETL Architect.

As I prepared to re-enter the market as a BI consultant, I started reading articles about all things IT, especially anything database and BI related. One article was about the lack of version control systems in many IT shops. The author said that, to his amazement, many companies still go without a version control system of any kind, which seemed crazy given that Subversion (aka SVN) and other great tools are open source and hence “free”.

Sure enough, when I got to my next client, a $4 billion a year manufacturing company, the global BI group was not using any source code control for their database code or their data models. My first role with this team was as the Data Architect/Data Modeler responsible for generating reporting data models and the DDL required to create the tables, views, partitions, indexes, etc. for the Datastage ETL developers to load and the Business Objects developers to build reports from.

Finding no Version Control System (VCS) and a very manual process for configuration management, and recalling the article mentioned above, I downloaded my first copy of TortoiseSVN (a Windows-integrated SVN client) and started reading the online manual. Initially I used a file-based repository on my laptop’s hard drive; then, as I learned a little more, I put the file-based repo on a network shared drive that was being backed up.

The other issue we had was that change management (new requirements, issues, defects) was being handled in email. When the database needed to change due to a request from an analyst or ETL developer, or from a refactoring after a design review, there was very little documentation of who, what and why. I was first asked to generate a “paper” change request system: a form to be filled out whenever someone wanted a change. Trying to get all the right things on the form and then route and process the form via email was just making things worse.

I recalled that while at another very large financial client, we had managed a truly amazing amount of change with a tool called Mantis. Mantis is an open source (free) tool that was designed as a bug tracker. It allowed us to create issues, defects, requests, etc. that were managed in a central database. Each item could be assigned to specific team members and then re-assigned to flow the work item through a process, allowing us to track who requested what and when, and then who changed it, approved it, deployed it, loaded it, and developed the report for it. Then, using the Mantis item number in the Subversion commit message let us link code changes back to requests and defects, and we had the basis of a poor man’s Application Lifecycle Management system.

We conducted a proof of concept with Mantis and Subversion running on someone’s desktop. Eventually we made the case to management that they should be installed on a server that would get backed up and supported. At first we asked to share resources with existing servers, but none of the admins knew enough about SVN or Mantis to want them on their servers. No thanks, too risky, not on my box.

Part of the problem in trying to find a host for our SVN and Mantis was that each required several parts, a LAMP stack (Linux, Apache web server, MySQL database and PHP), which, prior to virtual appliances, had to be installed, configured and maintained separately for each application. Enter virtual appliances and Jumpbox.com.

Then we discovered that we had a nascent but growing virtual machine infrastructure and “virtual appliances”. According to Wikipedia:

“A virtual appliance is a pre-configured virtual machine image, ready to run on a hypervisor; virtual appliances are a subset of the broader class of software appliances. Installation of a software appliance on a virtual machine and packaging that into an image creates a virtual appliance. Like software appliances, virtual appliances are intended to eliminate the installation, configuration and maintenance costs associated with running complex stacks of software.”

Jumpbox.com offered virtual appliances for both Mantis and SVN, each in its own appliance, preconfigured and plug-and-play, for less than the price of taking the team out for a sushi lunch.

Additionally, Jumpbox pricing was for an unlimited number of instances, so we were able to create dev, test, and prod instances of our appliances, letting us follow our process with the tools as well as with the applications we were supporting.


Very slow Heartbeat

It’s been another year, and as I look back, my major excuse for not posting is that I don’t feel like I know how to make pretty posts in a timely manner. My story: I don’t have enough experience with WordPress. I’m giving that up; I will just write and publish posts, and I will get better as I go.

It is now July 2014 and I am consulting with Datasource Consulting at a health care client.  We are building a data mart to analyse external encounter claims.

I’ve been able to get to the point where we are doing Continuous Integration for the database side of things, something I have been building towards for the last 4 or 5 years.

 


Heartbeat

I have not updated this blog in quite a while, so I thought I would post a quick status. I am back in Denver consulting at Comcast, via Apex Systems, and working with an amazing group of people. The current area of interest that excites me the most is Application Life-cycle Management (ALM), and more specifically the whole Continuous Integration -> Continuous Deployment/Delivery process. For the last several years I’ve been focused on being able to quickly script an end-to-end database build, and while I continue to perfect those techniques, I now find myself developing the skills needed to glue it all together. To that end I’m developing my expertise with the tools needed to make that work productively with teams:

  • Rally – Agile Project Management tool
  • Jenkins – Open source continuous integration server
  • Maven – Apache Build Manager for Java projects
  • Scripted Database Deployments – I created an open source GitHub project for this.
  • Git – Fast-growing distributed version control tool
  • GitHub – Online project hosting
  • bash shell scripting (a side effect of learning Git, but very useful for lots of things)
  • Markdown (MD) formatting – Plain text to HTML
  • Wiki formatting – Confluence, WordPress, and GitHub. Why can’t there be just one?

More to come.


Who I Am (Careerwise)

I was recently asked to write a few sentences about myself for submission with my CV to a potential client; it seems appropriate to repost that info here:

I love technology and I love working with people. My entire career has been driven by the desire to have technology make people’s lives better: technology should do the drudge work so people can do the thinking. Prior to getting my Bachelor of Science in Electrical Engineering and Computer Science, the Navy taught me to operate a nuclear reactor. One of the most important aspects of operating a nuclear reactor is being able to troubleshoot and repair the system in an emergency. To do that effectively you have to have instrumentation, that is, measurements and data, and you have to understand what all those measurements mean. This early experience has guided my designs throughout my career. On any engagement I do what the Navy taught me to do: understand the system end to end. Then, when given an opportunity to design or alter an existing design, I make sure the system captures the data necessary for us to monitor and troubleshoot it. Wherever feasible I simplify (see the principles behind the Agile Manifesto), but always from the end-to-end perspective. Many times a well-meaning “simplification” in one area or subsystem kills the total effectiveness of the system. Without a macroscopic view, competing and conflicting good deeds can drag an initiative to its knees.

 The other major factor in any endeavor that takes more than one or two people is the interaction and cooperation of people in teams.  Since 1994 I have spent a significant amount of resources developing my capacity to interact powerfully with people, one on one and as a presenter at the front of the room.  I continue to develop my skills by attending seminars as well as reading and being inspired by some of the best leaders in the field. At the moment I’m very excited about Tribal Leadership – Leveraging Natural Groups to Build a Thriving Organization by Dave Logan, John King & Halee Fischer-Wright.
