hadoop - How to bulk load data into HBase in Python
I wrote a MapReduce job in Python that runs with the Hadoop Streaming jar. I want to know how to use bulk loading to put the data into HBase.
As far as I know, there are two ways to bulk load data into HBase:
- Generate HFiles in the MR job, then use completebulkload to load the data into HBase.
- Use the importtsv utility and then use completebulkload to load the data (a streaming-mapper sketch for this approach follows the list).
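For the second approach, a common pattern with Hadoop Streaming is to have the Python mapper emit tab-separated rows, then run ImportTsv with the importtsv.bulk.output option to generate HFiles, and finally load them with completebulkload. Below is a minimal sketch of such a mapper; the input layout (comma-separated, row key in the first field) and the column names are assumptions chosen for illustration, not part of the original question.

    #!/usr/bin/env python
    # Minimal Hadoop Streaming mapper: turns CSV lines from stdin into
    # tab-separated lines that ImportTsv can consume later.
    # Assumed column mapping (hypothetical):
    #   -Dimporttsv.columns=HBASE_ROW_KEY,cf1:col1,cf1:col2
    import sys

    for line in sys.stdin:
        fields = line.rstrip('\n').split(',')
        if len(fields) < 3:
            continue  # skip malformed records
        rowkey, col1, col2 = fields[0], fields[1], fields[2]
        print('\t'.join([rowkey, col1, col2]))

The TSV output of this job can then be handed to ImportTsv (with importtsv.bulk.output pointing at an HDFS directory) to produce HFiles, which completebulkload moves into the table.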
I don't know how to use Python to generate HFiles that HBase can load. I also tried the importtsv utility, but it failed. I followed the instructions in this [example](http://hbase.apache.org/book.html#importtsv), but got an exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/Filter...
Now I want to ask three questions:
- Can Python be used to generate HFiles with the Streaming jar, or not?
- How do I use importtsv correctly?
- Can bulk loading be used to update a table in HBase? I get a big file, larger than 10 GB, every day. Can bulk loading be used to push that file into HBase?
The Hadoop version is Hadoop 2.8.0.
The HBase version is HBase 1.2.6.
Both are running in standalone mode.
Thanks for any answers.
--- update ---
importtsv works correctly now.
But I still want to know how to generate HFiles in an MR job with the Streaming jar in Python.
You could try happybase.
    import happybase

    # Connect through the HBase Thrift server (assumed to be on localhost).
    connection = happybase.Connection('localhost')
    table = connection.table("mytable")

    # Collect puts in memory and send them in batches of 1000 mutations.
    with table.batch(batch_size=1000) as b:
        for i in range(1200):
            b.put(b'row-%04d' % i, {
                b'cf1:col1': b'v1',
                b'cf1:col2': b'v2',
            })
As you may have imagined already, a Batch keeps all mutations in memory until the batch is sent, either by calling Batch.send() explicitly or when the with block ends. This doesn't work for applications that need to store huge amounts of data, since it may result in batches that are too big to send in one round-trip, or in batches that use too much memory. For these cases, the batch_size argument can be specified. It acts as a threshold: a Batch instance automatically sends all pending mutations when there are more than batch_size pending operations.
This requires a Thrift server running in front of HBase. Just a suggestion.
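For the daily 10 GB file mentioned in the question, the same batching mechanism can stream the file through the Thrift server in fixed-size chunks instead of building one huge batch in memory. A rough sketch, assuming a tab-separated input file, a table named "mytable", a column family "cf1", and a Thrift server on localhost (all placeholder assumptions):

    import happybase

    BATCH_SIZE = 5000  # pending mutations before an automatic flush

    # Assumed connection details; adjust host/port for your Thrift server.
    connection = happybase.Connection('localhost', port=9090)
    table = connection.table('mytable')

    with open('daily_dump.tsv') as f, table.batch(batch_size=BATCH_SIZE) as b:
        for line in f:
            rowkey, col1, col2 = line.rstrip('\n').split('\t')
            b.put(rowkey.encode(), {
                b'cf1:col1': col1.encode(),
                b'cf1:col2': col2.encode(),
            })
    # Leaving the "with" block sends any remaining pending mutations.

Note that this writes through the regular client API rather than bulk loading HFiles, which is the trade-off of the happybase suggestion above.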