Create Hadoop HBase + Python HappyBase wrapper client application

2014-03-24

2014-03-26

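Compared with traditional relational databases, there is very little material online about Hadoop HBase (NoSQL) and Python HappyBase (a Python HBase wrapper), and Traditional Chinese resources on using HBase from Python are almost nonexistent.

Even the official sites do not spell things out clearly; they assume the developer already knows a lot. So I put together these notes on getting Python working with Hadoop HBase from scratch.

HBase is arguably one of the best choices for storing "huge" amounts of data, although I find MongoDB easier to use and better supported by third parties.

HappyBase is a third-party Python wrapper for HBase + Thrift that simplifies wiring up the complicated Thrift and HBase stack.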

Thrift creator, users and brief introduction: http://xahxy.blog.hexun.com.tw/83023794_d.html

You can see that Facebook, Twitter, Evernote, LINE, ... all use it.

NoSQL Database: http://nosql-database.org/

HBase experience sharing from NHN LINE:

You need to know the underlying architecture of HBase so that you can design your own service according to best practices.

Slides: http://2013.nosql-matters.org/cgn/wp-content/uploads/2013/05/HBase-Schema-Design-NoSQL-Matters-April-2013.pdf
Video by Trend Micro: http://www.youtube.com/watch?v=8DMzNmVrXEI

[Hadoop HBase] (See reference: https://hbase.apache.org/book/quickstart.html)

HBase runs in standalone mode by default.

HBase standalone does not require HDFS.

This is a standalone example. If you want a distributed setup, please check the docs: https://hbase.apache.org/book/standalone_dist.html

useradd -m hadoop (I suggest creating a separate hadoop user and running HBase under that account)

passwd hadoop

su - hadoop

cd ~

wget [The Stable version of HBase (http://apache.cdpa.nsysu.edu.tw/hbase/)]

tar xvf hbase-xxx.tar.gz

ln -s hbase-0.9x.xx hbase (make a symbolic link so there is less to type. XD)

cd hbase

If Java is already installed, use "which java" to check the installation location.

=> /usr/bin/java

It might be a symbolic link, so check the real installation location, because it might not be the same on different computers:

cd /usr/bin

ls -al | grep java

=> Output: java -> /etc/alternatives/java

cd /etc/alternatives

ls -al | grep java

=> Output: java -> /usr/lib/jvm/jdkxxxxxxxx/bin/

vi conf/hbase-env.sh :

uncomment the line and set your Java home: "export JAVA_HOME=/usr/lib/jvm/jdkxxxxxxxx"
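
If you would rather not chase the symbolic links by hand, the lookup above can be sketched in Python. This is only a rough sketch: it assumes the real binary sits at <home>/bin/java, and the function names resolve_java_home and find_java_home are made up for illustration.

```python
import os
import shutil


def resolve_java_home(java_path):
    """Follow every symlink from java_path and return the Java home.

    Assumes the real binary lives at <home>/bin/java, so the home is
    two directory levels above the resolved file.
    """
    real_binary = os.path.realpath(java_path)
    return os.path.dirname(os.path.dirname(real_binary))


def find_java_home():
    """Locate `java` on PATH (like `which java`) and resolve its home."""
    java_path = shutil.which("java")
    if java_path is None:
        return None  # Java is not installed or not on PATH
    return resolve_java_home(java_path)


if __name__ == "__main__":
    print(find_java_home())
```

Whatever it prints is a candidate for JAVA_HOME in conf/hbase-env.sh; check that the directory really contains bin/java before using it.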

vi conf/hbase-site.xml (set the locations where you want the DB to store its data, for example):

<configuration>

<property>

<name>hbase.rootdir</name>

<value>file:///home/hadoop/HBASE_DATASTORE/hbase</value>

</property>

<property>

<name>hbase.zookeeper.property.dataDir</name>

<value>/home/hadoop/HBASE_DATASTORE/zookeeper</value>

</property>

</configuration>
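
If you set up several machines, typing this file by hand gets tedious. The following is a small sketch of generating the same two properties with Python's standard library. The helper name build_hbase_site is hypothetical, and the output is written on a single line, which is still valid XML.

```python
import xml.etree.ElementTree as ET


def build_hbase_site(properties):
    """Return hbase-site.xml content for a dict of property name -> value."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Same example directory as in the config above; change it to your own.
DATA_DIR = "/home/hadoop/HBASE_DATASTORE"

site_xml = build_hbase_site({
    "hbase.rootdir": "file://" + DATA_DIR + "/hbase",
    "hbase.zookeeper.property.dataDir": DATA_DIR + "/zookeeper",
})
print(site_xml)
```

Redirect the output into conf/hbase-site.xml once you are happy with it.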

Start HBase (in standalone mode no need to start hadoop first):

./bin/start-hbase.sh (it succeeded if it prints "starting master, ..." with no error message)

Use HBase shell to add some example data:

./bin/hbase shell

create 'test', 'cf' (it succeeded if you see a message like: "0 row(s) in 1.6040 seconds")

create 'test2', 'cf'

list

exit

Stop HBase (in standalone mode there is no Hadoop to stop):

./bin/stop-hbase.sh

[HBase Web Interface (close the related ports if you do not want to attract attackers)]

master=>http://localhost:60010

region server=>http://localhost:60030

--------------------------------------------------------------------------------------------------------

[Python+HappyBase]

Reference: http://happybase.readthedocs.org/en/latest/installation.html

sudo apt-get install python-virtualenv

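virtualenv hadoop_hbase (no sudo here, otherwise the environment belongs to root and "pip install" below fails with a permission error)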

source hadoop_hbase/bin/activate

=> Output: (hadoop_hbase)

"virtualenv" creates an isolated working environment for Python.

pip install happybase

python -c 'import happybase'

=> On success it prints nothing (it failed if you see the "ImportError: No module named happybase" error message)

Merely starting HBase is not enough; you also need to start the Thrift server, because HappyBase talks to HBase through Thrift:

./bin/hbase thrift start & (&: put in the background, default port: 9090, default: threadpool on)

(Notice: in my VMware guest the Thrift server crashes with an out-of-memory error, but on my dedicated server it works normally. So make sure you have enough memory for it, or adjust the HBase or Thrift configuration.)

Python test script (hbase_test.py):

import happybase

connection = happybase.Connection('127.0.0.1')

# The Connection class is the main entry point for application developers. It connects to the HBase Thrift server and provides methods for table management.

connection.open()

print(connection.tables())

connection.close()

------------------------------------------------

python hbase_test.py

=>['test', 'test2']
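
Listing tables only proves the connection works. As a next step, here is a minimal sketch of writing one cell into the 'test' table (column family 'cf') created in the HBase shell earlier and reading it back. The row key row1, the qualifier greeting and the helper name write_and_read are made up for illustration; table(), put(), row() and close() are HappyBase calls.

```python
def write_and_read(connection, table_name="test"):
    """Store one cell in column family 'cf' and read the row back.

    `connection` is expected to behave like happybase.Connection.
    """
    table = connection.table(table_name)
    # Row keys, column names ("family:qualifier") and values are byte strings.
    table.put(b"row1", {b"cf:greeting": b"hello"})
    return table.row(b"row1")


def main():
    # Imported here so the helper above can be tried without HappyBase installed.
    import happybase

    # Talks to the Thrift server started earlier (default port 9090).
    connection = happybase.Connection("127.0.0.1")
    try:
        print(write_and_read(connection))
    finally:
        connection.close()
```

Call main() with HBase and the Thrift server running. The helper takes the connection as a parameter so you can reuse one connection for several operations instead of reconnecting each time.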

You can see some ports open to the public now:

ZooKeeper: 2181
HBase Master Web interface: 60010
Thrift web interface: 9095
Thrift communication: 9090
HBase region server: 60030

Usually, though, DB servers sit on an internal network.

You should really take care of it if you're building a standalone server with a standalone HBase DB.
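
To see which of the ports listed above actually accept connections, a quick check can be sketched with Python's socket module. This is a rough sketch, not a security scanner; the names HBASE_PORTS, port_is_open and audit_ports are hypothetical.

```python
import socket

# Port numbers are the ones listed above.
HBASE_PORTS = {
    "ZooKeeper": 2181,
    "HBase Master web interface": 60010,
    "Thrift web interface": 9095,
    "Thrift communication": 9090,
    "HBase region server": 60030,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def audit_ports(host, ports=HBASE_PORTS):
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Run audit_ports() from another machine against the server's public address; any service that comes back True is reachable from there and should probably be firewalled.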

--------------------------------------------------------------------------------------------------------

[HBase Management Tool Recommendation 1: HBase Manager]

http://sourceforge.net/projects/hbasemanagergui/

java -jar HBase-Manager-xxx.jar

hbase.zookeeper.quorum: 127.0.0.1

hbase.zookeeper.property.clientPort: 2181

hbase.master: 127.0.0.1

--------------------------------------------------------------------------------------------------------

[HBase Management Tool Recommendation 2: HBaseExplorer]

http://sourceforge.net/projects/hbaseexplorer/

I like Microsoft because of the great and easy-to-use tools they provide. But as for HBase I would suggest HBaseExplorer.

It's deployed as a Java "WAR" application, so you need to set up a Java web server first.

Take Tomcat7 for example:

sudo apt-get install tomcat7 (if it cannot be installed, run "apt-get remove tomcat6-common" first)
sudo apt-get install tomcat7-admin
sudo apt-get install tomcat7-common

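/usr/share/tomcat7: the CATALINA_HOME directory; under bin you can set which JVM Tomcat runs on, how much memory it uses, etc.

/var/lib/tomcat7: the CATALINA_BASE directory, containing log, conf and webapps; this is the main directory

/var/log/tomcat7/: the log directory (access URLs, errors, etc.)

/var/log/tomcat7/catalina.out: the main log file

Web UI Manager installation directory: /usr/share/tomcat7-admin/manager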

Because the HBaseExplorer war exceeds the 50 MB upload limit, go to: "/usr/share/tomcat7-admin/manager/WEB-INF/web.xml"

Modify the <multipart-config> tag to raise the maximum to 100 MB

sudo service tomcat7 restart

Also, set up a manager user so you can log in to the web UI at: http://localhost:8080/manager
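
The <multipart-config> change mentioned above can also be scripted. This is a hedged sketch that assumes the block holds <max-file-size> and <max-request-size> values in bytes, as in the stock Tomcat 7 manager web.xml; the helper name raise_upload_limit is made up. It edits the text with a regular expression so that comments and the rest of the file stay untouched.

```python
import re

LIMIT_BYTES = 100 * 1024 * 1024  # 100 MB

# Matches <max-file-size>N</max-file-size> and <max-request-size>N</max-request-size>.
_LIMIT_PATTERN = re.compile(
    r"(<(max-file-size|max-request-size)>)\s*\d+\s*(</\2>)"
)


def raise_upload_limit(web_xml_text, limit=LIMIT_BYTES):
    """Return web.xml text with both multipart size limits set to `limit`."""
    return _LIMIT_PATTERN.sub(
        lambda match: match.group(1) + str(limit) + match.group(3),
        web_xml_text,
    )
```

Read web.xml into a string, pass it through raise_upload_limit(), and write the result back with sudo rights. Keep a backup copy of the original file, and restart Tomcat again if you change it after the restart above.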

Deploy the "HBaseExplorer" war now!

Go to: http://localhost:8080/HBaseExplorer/

Add a new instance with "Quorum Client Port=2181"

Quorum Servers: 127.0.0.1

HBase Master URL: http://localhost:60010/

Have fun!!
