Integration testing Hive jobs

Tags: java, testing, hadoop, mapreduce, hive

Java Problem Overview


I'm trying to write a non-trivial Hive job using the Hive Thrift and JDBC interfaces, and I'm having trouble setting up a decent JUnit test. By non-trivial, I mean that the job results in at least one MapReduce stage, as opposed to only dealing with the metastore.

The test should fire up a Hive server, load some data into a table, run some non-trivial query on that table, and check the results.

I've wired up a Spring context according to the [Spring reference][1]. However, the job fails on the MapReduce phase, complaining that no Hadoop binary exists:

> java.io.IOException: Cannot run program "/usr/bin/hadoop" (in directory "/Users/yoni/opower/workspace/intellij_project_root"): error=2, No such file or directory

The problem is that the Hive Server is running in-memory, but relies upon a local installation of Hive in order to run. For my project to be self-contained, I need the Hive services to be embedded, including the HDFS and MapReduce clusters. I've tried starting up a Hive server using the same Spring method and pointing it at [MiniDFSCluster][2] and [MiniMRCluster][3], similar to the pattern used in the Hive [QTestUtil][4] source and in [HBaseTestUtility][5]. However, I've not been able to get that to work.

After three days of trying to wrangle Hive integration testing, I thought I'd ask the community:

  1. How do you recommend I integration test Hive jobs?
  2. Do you have a working JUnit example for integration testing Hive jobs using in-memory HDFS, MR, and Hive instances?

Additional resources I've looked at:

Edit: I am fully aware that working against a Hadoop cluster - whether local or remote - makes it possible to run integration tests against a full-stack Hive instance. The problem, as stated, is that this is not a viable solution for effectively testing Hive workflows.

Java Solutions


Solution 1 - Java

Ideally one would be able to test Hive queries with LocalJobRunner rather than resorting to mini-cluster testing. However, due to HIVE-3816, running Hive with mapred.job.tracker=local results in a call to the Hive CLI executable installed on the system (as described in your question).

Until HIVE-3816 is resolved, mini-cluster testing is the only option. Below is a minimal mini-cluster setup for Hive tests that I have tested against CDH 4.4.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.MiniDFSCluster;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MiniMRCluster;

    Configuration conf = new Configuration();

    /* Build MiniDFSCluster */
    MiniDFSCluster miniDFS = new MiniDFSCluster.Builder(conf).build();

    /* Build MiniMRCluster */
    System.setProperty("hadoop.log.dir", "/path/to/hadoop/log/dir"); // MAPREDUCE-2785
    int numTaskTrackers = 1;
    int numTaskTrackerDirectories = 1;
    String[] racks = null;
    String[] hosts = null;
    MiniMRCluster miniMR = new MiniMRCluster(numTaskTrackers, miniDFS.getFileSystem().getUri().toString(),
                                             numTaskTrackerDirectories, racks, hosts, new JobConf(conf));

    /* Point MapReduce jobs at the mini cluster's JobTracker */
    System.setProperty("mapred.job.tracker", miniMR.createJobConf(new JobConf(conf)).get("mapred.job.tracker"));

There is no need to run a separate HiveServer or HiveServer2 process for testing. You can test with an embedded HiveServer2 process by setting your JDBC connection URL to jdbc:hive2:///.
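For instance, a minimal sketch of opening such an embedded connection (the table name and query are illustrative placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EmbeddedHiveServer2Sketch {
        public static void main(String[] args) throws Exception {
            // Loads the HiveServer2 JDBC driver; a jdbc:hive2:/// URL with no
            // host or port starts HiveServer2 embedded in the current JVM.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            try (Connection conn = DriverManager.getConnection("jdbc:hive2:///");
                 Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS t (a STRING)");
                try (ResultSet rs = stmt.executeQuery("SELECT count(*) FROM t")) {
                    while (rs.next()) {
                        System.out.println(rs.getLong(1));
                    }
                }
            }
        }
    }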

Solution 2 - Java

I have come across one pretty good tool: HiveRunner. It is a framework on top of JUnit for testing Hive scripts. Under the hood, it starts a standalone HiveServer with an in-memory HSQL database as the metastore.
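A minimal sketch of a HiveRunner test, with class and annotation names as given in the HiveRunner README (treat the exact API as version-dependent):

    import static org.junit.Assert.assertEquals;

    import java.util.Arrays;
    import java.util.List;

    import org.junit.Test;
    import org.junit.runner.RunWith;

    import com.klarna.hiverunner.HiveShell;
    import com.klarna.hiverunner.StandaloneHiveRunner;
    import com.klarna.hiverunner.annotations.HiveSQL;

    @RunWith(StandaloneHiveRunner.class)
    public class HiveQueryTest {

        // HiveRunner injects a shell backed by its embedded HiveServer.
        @HiveSQL(files = {})
        private HiveShell shell;

        @Test
        public void countsRows() {
            shell.execute("CREATE TABLE t (a STRING)");
            shell.execute("INSERT INTO TABLE t VALUES ('x'), ('y')"); // Hive 0.14+
            List<String> result = shell.executeQuery("SELECT count(*) FROM t");
            assertEquals(Arrays.asList("2"), result);
        }
    }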

Solution 3 - Java

I have implemented HiveRunner.

https://github.com/klarna/HiveRunner

We tested it on Mac and had some trouble on Windows; however, with the few changes listed below, the utility served us well.

For Windows, here are the changes that were needed to get HiveRunner working. After these changes, unit testing is possible for all Hive queries.

1. Clone the project at https://github.com/steveloughran/winutils to anywhere on your computer, and add a new environment variable, HADOOP_HOME, pointing to the /bin directory of that folder. No forward slashes or spaces allowed.
2. Clone the project at https://github.com/sakserv/hadoop-mini-clusters to anywhere on your computer, and add a new environment variable, HADOOP_WINDOWS_LIBS, pointing to the /lib directory of that folder. Again, no forward slashes or spaces allowed.
3. I also installed Cygwin, assuming several Linux utilities might be available through it.

This pull request on GitHub helped with making it work on Windows: https://github.com/klarna/HiveRunner/pull/63
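As an alternative to machine-wide environment variables, Hadoop's Shell utility checks the hadoop.home.dir system property before falling back to HADOOP_HOME, so the winutils location can also be set from inside the test JVM. A sketch (the path is a placeholder for the winutils location from step 1 above):

    // Sketch: point Hadoop at a local winutils checkout from within the test JVM.
    // The path is a placeholder for the winutils location from step 1 above.
    static {
        System.setProperty("hadoop.home.dir", "C:\\tools\\winutils\\hadoop-2.7.1");
    }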

Solution 4 - Java

Hive supports embedded mode only in the sense that the RDBMS which stores the meta information for the Hive tables can run locally or on a standalone server (see https://cwiki.apache.org/confluence/display/Hive/HiveClient for details). Furthermore, Hive with its accompanying database is merely an orchestrator for a string of MapReduce jobs, which requires the Hadoop framework to be running as well.

I recommend using this virtual machine with a pre-configured Hadoop stack: http://hortonworks.com/products/hortonworks-sandbox/. Hortonworks is one of the two leading Hadoop distribution providers, so it is well-supported.

Solution 5 - Java

I'm uncertain what has changed since the accepted answer in Feb. 2014, but as of Hive 1.2.0, the following works around the issue described by the OP:

    System.setProperty(HiveConf.ConfVars.SUBMITLOCALTASKVIACHILD.varname, "false");

Do be aware of the warning given in the config documentation:

> Determines whether local tasks (typically mapjoin hashtable generation phase) runs in separate JVM (true recommended) or not. Avoids the overhead of spawning new JVM, but can lead to out-of-memory issues.

This works around the issue because in MapredLocalTask.java:

  @Override
  public int execute(DriverContext driverContext) {
    if (conf.getBoolVar(HiveConf.ConfVars.SUBMITLOCALTASKVIACHILD)) {
      // send task off to another jvm
      return executeInChildVM(driverContext);
    } else {
      // execute in process
      return executeInProcess(driverContext);
    }
  }

The default config value causes the executeInChildVM() method to be called, which literally calls hadoop jar. The other code path has so far worked out in my testing. Potential memory issues can likely be resolved by tweaking Java heap settings (-Xmx, -Xms, etc.).
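For example, a JUnit sketch combining this workaround with the embedded connection described in Solution 1 (assuming JUnit 4 and HiveConf on the classpath; the class and method names are illustrative):

    import java.sql.Connection;
    import java.sql.DriverManager;

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.junit.BeforeClass;

    public abstract class InProcessLocalTaskTestBase {

        @BeforeClass
        public static void forceInProcessLocalTasks() {
            // Routes local tasks through executeInProcess() so Hive never
            // shells out to the hadoop binary.
            System.setProperty(HiveConf.ConfVars.SUBMITLOCALTASKVIACHILD.varname, "false");
        }

        protected Connection openEmbeddedConnection() throws Exception {
            return DriverManager.getConnection("jdbc:hive2:///");
        }
    }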

Solution 6 - Java

Another Hive JUnit runner is at https://github.com/edwardcapriolo/hive_test

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | yoni | View Question on Stackoverflow |
| Solution 1 - Java | oby1 | View Answer on Stackoverflow |
| Solution 2 - Java | Luís Bianchin | View Answer on Stackoverflow |
| Solution 3 - Java | Prachi Sharma | View Answer on Stackoverflow |
| Solution 4 - Java | Dmitriusan | View Answer on Stackoverflow |
| Solution 5 - Java | Andrey | View Answer on Stackoverflow |
| Solution 6 - Java | gliptak | View Answer on Stackoverflow |