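Summary: Create a Hadoop HBase + Python HappyBase wrapper client application
Compared with traditional relational databases, there is very little material online about Hadoop HBase (NoSQL) and Python HappyBase (a Python HBase wrapper), and Traditional Chinese resources on using HBase from Python are almost nonexistent.
Even the official sites do not spell things out; they assume developers already know a great deal. So I put together this from-scratch guide to getting Python working with a Hadoop HBase setup.
HBase can be called the best choice for storing "massive" amounts of data, although I find MongoDB easier to use and better supported by third parties.
HappyBase is a third-party Python wrapper around HBase + Thrift that simplifies wiring up the complex Thrift and HBase architecture.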
Thrift creator, users and brief introduction: http://xahxy.blog.hexun.com.tw/83023794_d.html
As you can see, Facebook, Twitter, Evernote, LINE, ... all use it.
NoSQL Database: http://nosql-database.org/
HBase experience sharing from NHN LINE:
You need to understand the underlying architecture of HBase so that you can design your own service according to best practices.
[Hadoop HBase] (See reference: https://hbase.apache.org/book/quickstart.html)
HBase runs in standalone mode by default.
HBase standalone does not require HDFS.
This is a standalone example. If you want a distributed setup, please check the docs: https://hbase.apache.org/book/standalone_dist.html
useradd -m hadoop (I suggest creating a separate hadoop user and running HBase under that account)
passwd hadoop
su - hadoop
cd ~
wget [The Stable version of HBase (http://apache.cdpa.nsysu.edu.tw/hbase/)]
tar xvf hbase-xxx.tar.gz
ln -s hbase-0.9x.xx hbase (make a symbolic link so there is less to type. XD)
cd hbase
If Java is already installed, use "which java" to check the installation location.
=> /usr/bin/java
It might be a symbolic link, so check the real installation location, because it might not be the same on different computers:
cd /usr/bin
ls -al | grep java
=> Output: java -> /etc/alternatives/java
cd /etc/alternatives
ls -al | grep java
=> Output: java -> /usr/lib/jvm/jdkxxxxxxxx/bin/
vi conf/hbase-env.sh :
uncomment the JAVA_HOME line and set it to your Java home: "export JAVA_HOME=/usr/lib/jvm/jdkxxxxxxxx"
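The symlink chase above (which java, then /etc/alternatives, then /usr/lib/jvm) can also be done in one step. The following is a minimal Python sketch, not part of HBase itself; the function name guess_java_home is made up for illustration, and it assumes the real binary lives in a directory named "bin" directly under the Java home.

```python
import os
import shutil


def guess_java_home(java_path=None):
    """Follow every symlink from the `java` launcher and return its JDK/JRE root."""
    # Equivalent of `which java` when no explicit path is given.
    java_path = java_path or shutil.which("java")
    if java_path is None:
        return None  # java is not on PATH
    # realpath resolves the whole chain, e.g.
    # /usr/bin/java -> /etc/alternatives/java -> /usr/lib/jvm/.../bin/java
    real_binary = os.path.realpath(java_path)
    bin_dir = os.path.dirname(real_binary)
    if os.path.basename(bin_dir) != "bin":
        return None  # unexpected layout (assumption violated); inspect it by hand
    return os.path.dirname(bin_dir)
```

Calling print(guess_java_home()) prints a candidate value for JAVA_HOME, or None if java cannot be found. On some systems the result may point at a "jre" subdirectory, so treat it as a starting point and verify it before pasting it into conf/hbase-env.sh.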
vi conf/hbase-site.xml (set the paths to where you want the DB to store its data, for example):
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/hadoop/HBASE_DATASTORE/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hadoop/HBASE_DATASTORE/zookeeper</value>
  </property>
</configuration>
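If you set up several machines, typing this file by hand gets tedious. Here is a small Python sketch that renders the same two properties from a single data directory. The helper name render_hbase_site is hypothetical, and the "/hbase" and "/zookeeper" subdirectory names simply mirror the example above.

```python
import xml.etree.ElementTree as ET


def render_hbase_site(data_root):
    """Build the text of a minimal standalone hbase-site.xml for the given data directory."""
    config = ET.Element("configuration")
    properties = {
        # Standalone mode stores data on the local filesystem, hence file://
        "hbase.rootdir": "file://" + data_root + "/hbase",
        "hbase.zookeeper.property.dataDir": data_root + "/zookeeper",
    }
    for name, value in properties.items():
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(config, encoding="unicode")
```

For example, render_hbase_site("/home/hadoop/HBASE_DATASTORE") returns XML with the same property names and values as the file shown above (without the indentation), which you can then write to conf/hbase-site.xml.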
Start HBase (in standalone mode there is no need to start Hadoop first):
./bin/start-hbase.sh (success when showing "starting master, ..." with no error message)
Use HBase shell to add some example data:
./bin/hbase shell
create 'test', 'cf' (successful when you see the message: "0 row(s) in 1.6040 seconds")
create 'test2', 'cf'
list
exit
Stop HBase (in standalone mode there is no Hadoop to stop):
./bin/stop-hbase.sh
[HBase Web Interface (close the related ports if you do not want to attract attackers)]
master=>http://localhost:60010
region server=>http://localhost:60030
--------------------------------------------------------------------------------------------------------
[Python+HappyBase]
sudo apt-get install python-virtualenv
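virtualenv hadoop_hbase (without sudo; otherwise the environment is owned by root and the later "pip install" fails with a permission error)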
source hadoop_hbase/bin/activate
=> Output: (hadoop_hbase)
"virtualenv" creates an isolated working environment for Python.
pip install happybase
python -c 'import happybase'
=> On success: no output (it failed if you see the "ImportError: No module named happybase" error message)
Merely starting HBase is not enough; you also need to start the Thrift server first, because HappyBase transmits data through Thrift:
./bin/hbase thrift start & (&: put in the background, default port: 9090, default: threadpool on)
(Note: in my VMware VM the Thrift server crashes with an OutOfMemoryException, but on my dedicated server it works normally. Make sure you have enough memory for it, or adjust the HBase or Thrift configuration.)
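Because the Thrift server is started in the background, a script that connects right away may run before the port is ready (or after the server has crashed). A minimal sketch of a readiness check using only the standard library follows; wait_for_port is a made-up helper name, and 9090 is the default port mentioned above.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=9090, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port() before opening a HappyBase connection gives a clear True/False answer. It only proves that something accepts TCP connections on the port, not that it is a healthy Thrift server.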
Python test script (hbase_test.py):
import happybase
connection = happybase.Connection('127.0.0.1')
# The Connection class is the main entry point for application developers. It connects to the HBase Thrift server and provides methods for table management.
connection.open()
print(connection.tables())
connection.close()
------------------------------------------------
python hbase_test.py
=>['test', 'test2']
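Listing tables only proves the connection works. As a next step, here is a hedged sketch of writing and reading a cell in the 'test' table and its 'cf' column family created earlier in the HBase shell. The row key "row1", the column "cf:greeting", and the function names are made up for illustration; table(), put(), row() and scan() are HappyBase calls. On Python 3, HappyBase works with bytes, so keys and values are written as byte strings.

```python
def roundtrip(connection, table_name="test"):
    """Write one cell into column family 'cf', then read it back two ways."""
    table = connection.table(table_name)
    # Column names have the form "family:qualifier".
    table.put(b"row1", {b"cf:greeting": b"hello"})
    single = table.row(b"row1")                     # one row as a dict
    scanned = list(table.scan(row_prefix=b"row"))   # list of (row key, dict) pairs
    return single, scanned


def main():
    # Imported here so the file can be loaded even where HappyBase is missing.
    import happybase  # needs `pip install happybase` and a running Thrift server

    connection = happybase.Connection("127.0.0.1", port=9090)
    try:
        print(roundtrip(connection))
    finally:
        connection.close()
```

Call main() once HBase and the Thrift server are both running. Keeping roundtrip() separate from the connection setup makes it easy to try against a stub object before pointing it at a real cluster.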
You can see some ports open to the public now:
- ZooKeeper: 2181
- HBase Master Web interface: 60010
- Thrift web interface: 9095
- Thrift communication: 9090
- HBase region server: 60030
But usually the DB servers sit on an internal network.
You should really take care of it if you're building a standalone server with a standalone HBase DB.
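To see which of these ports are actually reachable, you can probe them from another machine. Below is a minimal sketch using only the standard library; the port numbers are the ones listed above, while the HBASE_PORTS name, the labels, and the reachable_ports helper are made up for illustration.

```python
import socket

# Port numbers as listed above; the labels are only for display.
HBASE_PORTS = {
    "ZooKeeper": 2181,
    "HBase Master web interface": 60010,
    "Thrift web interface": 9095,
    "Thrift communication": 9090,
    "HBase region server": 60030,
}


def reachable_ports(host, ports=HBASE_PORTS, timeout=1.0):
    """Return {label: True/False}: whether each TCP port accepts connections on host."""
    results = {}
    for label, port in ports.items():
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            # connect_ex returns 0 on success instead of raising an exception.
            results[label] = sock.connect_ex((host, port)) == 0
        except OSError:
            results[label] = False  # e.g. the host name could not be resolved
        finally:
            sock.close()
    return results
```

Run it with the server's public address as host, from outside the server. Any entry that comes back True is exposed to that network and is a candidate for a firewall rule.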
--------------------------------------------------------------------------------------------------------
[HBase Management Tool Recommendation 1: HBase Manager]
java -jar HBase-Manager-xxx.jar
hbase.zookeeper.quorum: 127.0.0.1
hbase.zookeeper.property.clientPort: 2181
hbase.master: 127.0.0.1
--------------------------------------------------------------------------------------------------------
[HBase Management Tool Recommendation 2: HBaseExplorer]
I like Microsoft because of the great, easy-to-use tools they provide. But for HBase I would suggest HBaseExplorer.
It is deployed as a Java "WAR" application, so you need to set up a Java web server first.
Take Tomcat7 for example:
- sudo apt-get install tomcat7 (if it cannot be installed, first run apt-get remove tomcat6-common)
- sudo apt-get install tomcat7-admin
- sudo apt-get install tomcat7-common
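/usr/share/tomcat7: the CATALINA_HOME directory; under bin you can configure which JVM Tomcat runs on, how much memory it uses, etc.
/var/lib/tomcat7: the CATALINA_BASE directory, containing log, conf, and webapps; this is the main directory
/var/log/tomcat7/: the log directory (access URLs, errors, etc.)
/var/log/tomcat7/catalina.out: the main log file
Web UI Manager installation directory: /usr/share/tomcat7-admin/manager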
Because the HBaseExplorer war exceeds the 50 MB upload limit, edit: "/usr/share/tomcat7-admin/manager/WEB-INF/web.xml"
Modify the <multipart-config> element to raise the maximum to 100 MB
sudo service tomcat7 restart
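The limits inside <multipart-config> are given in bytes, so 100 MB is 100 * 1024 * 1024 = 104857600. Here is a small sketch of the edit as a text substitution. It assumes the limits are stored in <max-file-size> and <max-request-size> elements, so check your own web.xml before relying on it; raise_upload_limit is a made-up helper name.

```python
import re


def raise_upload_limit(web_xml_text, limit_mb=100):
    """Return web.xml text with both multipart size limits set to limit_mb megabytes."""
    limit_bytes = limit_mb * 1024 * 1024
    # Match only the two size-limit elements; other settings are left untouched.
    pattern = re.compile(r"<(max-file-size|max-request-size)>\s*\d+\s*</\1>")

    def replace(match):
        return "<{0}>{1}</{0}>".format(match.group(1), limit_bytes)

    return pattern.sub(replace, web_xml_text)
```

Read the file, pass its text through raise_upload_limit(), and write the result back (the file is usually root-owned, so keep a backup copy first). Tomcat must be restarted for the change to take effect.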
Also, set up a web manager user so that you can log in to the web UI at: http://localhost:8080/manager
Deploy the "HBaseExplorer" war now!
Add a new instance with "Quorum Client Port=2181"
Quorum Servers: 127.0.0.1
HBase Master URL: http://localhost:60010/
Have fun!!