How to improve the performance of Spark bulk reads from HBase

// Assumes sc is an existing JavaSparkContext.
Configuration conf = HBaseConfiguration.create();
String tableName = "testTable";
Scan scan = new Scan();
scan.setCaching(10000);      // rows requested per scanner RPC
scan.setCacheBlocks(false);  // do not fill the block cache during a full scan
conf.set(TableInputFormat.INPUT_TABLE, tableName);
// Serialize the Scan so TableInputFormat can read it back from the configuration
ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
String scanToString = Base64.encodeBytes(proto.toByteArray());
conf.set(TableInputFormat.SCAN, scanToString);
JavaPairRDD<ImmutableBytesWritable, Result> myRDD = sc
    .newAPIHadoopRDD(conf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);
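I use the standard Hadoop interface shown above to read an HBase table from Spark (a full table scan). Reading about 500 million rows takes more than 20 minutes, while reading the same data stored in Hive takes under 1 minute. The performance gap is huge.
Our project has more or less settled on HBase as its big data store, yet Spark reads from HBase this slowly. I have already tried reading the HFiles directly, but reading and parsing even a single HFile is slow too (about 90 s for 400 MB of data). Are there any other solutions? Can Spark really not use HBase as its underlying storage?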
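
To put the figures above in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it treats "more than 20 minutes" and "under 1 minute" as exactly 20 minutes and 1 minute, and it assumes every scanner RPC really returns the full setCaching(10000) rows, which server-side result size limits may prevent. The class and method names are made up for this illustration.

```java
// Rough throughput estimates based on the numbers quoted in the question.
class ScanThroughput {
    // Rows processed per second (integer division, rounds down).
    static long rowsPerSecond(long rows, long seconds) {
        return rows / seconds;
    }

    // Scanner round trips, ASSUMING each RPC returns exactly `caching` rows.
    static long scannerRpcs(long rows, long caching) {
        return (rows + caching - 1) / caching; // ceiling division
    }

    // Megabytes read per second.
    static double mbPerSecond(double megabytes, double seconds) {
        return megabytes / seconds;
    }

    public static void main(String[] args) {
        long rows = 500_000_000L;      // "about 500 million rows"
        long hbaseSeconds = 20 * 60;   // 20 minutes, taken as a lower bound
        long hiveSeconds = 60;         // 1 minute, taken as an upper bound

        System.out.println("HBase rows/s: " + rowsPerSecond(rows, hbaseSeconds));
        System.out.println("Hive rows/s:  " + rowsPerSecond(rows, hiveSeconds));
        System.out.println("Scanner RPCs: " + scannerRpcs(rows, 10_000));
        System.out.println("HFile MB/s:   " + mbPerSecond(400.0, 90.0));
    }
}
```

Under these assumptions the HBase scan runs at roughly 417 thousand rows per second against at least 8.3 million for Hive, a gap of about 20x, and the direct HFile read works out to only about 4.4 MB/s.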

Solutions »

The latest HBase already provides a Spark interface.
Take a look at this HBase Doc