Spark unable to see tables/dbs in the metastore when using the Cloudera (CDH) Quickstart Docker image


If you've used the Cloudera CDH Quickstart Docker container, you might have found that databases and tables you create in Hive aren't visible from Spark. The reason is that the Spark configuration directory doesn't contain hive-site.xml, but there is a workaround that doesn't require restarting Spark.


This is explained at https://www.tutorialspoint.com/spark_sql/spark_sql_hive_tables.htm:

 "Hive comes bundled with the Spark library as HiveContext, which inherits from SQLContext. Using HiveContext, you can create and find tables in the HiveMetaStore and write queries on it       using HiveQL. Users who do not have an existing Hive deployment can still create a HiveContext. When not configured by the hive-site.xml, the context automatically creates a metastore called  metastore_db and a folder called warehouse in the current directory."

If you have already started Spark, you will need to delete the Derby-based metastore that it created, otherwise it will ignore our workaround. That appears to be what happens after running pyspark or spark-shell: perhaps you have to pass the --files parameter the first time around, or Spark never switches over to the metastore configured in hive-site.xml once the local one exists.
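
The local metastore_db folder is created in whatever directory the shell was launched from, so assuming you started pyspark from your home directory, removing it (and the derby.log file that usually accompanies it) would look something like this:

rm -rf ~/metastore_db
rm -f ~/derby.log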

The workaround involves passing the hive-site.xml via the --files parameter:
pyspark --master yarn --files /etc/hive/conf/hive-site.xml
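
The same workaround should work for the Scala shell as well:

spark-shell --master yarn --files /etc/hive/conf/hive-site.xml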

You should then be able to see the databases/tables.


>>> for db in sqlContext.sql("SHOW DATABASES").collect():
...     print db
...
16/12/21 13:55:43 INFO scheduler.DAGScheduler: Job 0 finished: collect at <stdin>:1, took 1.424182 s
Row(result=u'default')
Row(result=u'trainingdata')
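
From there you can query the Hive tables as usual, for example by listing the tables in the trainingdata database shown above (the table names will of course depend on what you created in Hive):

>>> for t in sqlContext.sql("SHOW TABLES IN trainingdata").collect():
...     print t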

This post has been cross-posted to the GI Architects Blog.
