Hosted Private Cloud


0. Prepare the Gluster storage nodes

Prepare two PCs to act as Gluster storage nodes, called s1 and s2.
Each PC has two physical NICs and two disks (sda for the OS / sdb reserved for Gluster).
Install Scientific Linux 6.3 ( /home/build/iso/Base/CAKE-2014-04-11.iso )

On each PC, split one of the NICs, eth0, into three VLAN sub-interfaces

# s1
vconfig add eth0 101; vconfig add eth0 1501; vconfig add eth0 1502
ifconfig eth0.101 netmask
ifconfig eth0.1501 netmask
ifconfig eth0.1502 netmask

# s2
vconfig add eth0 101; vconfig add eth0 1501; vconfig add eth0 1502
ifconfig eth0.101 netmask
ifconfig eth0.1501 netmask
ifconfig eth0.1502 netmask
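The two stanzas above can be generated with a short dry-run script. The 172.31.x subnets and the host octet below are assumptions (inferred from the auth.allow rules set later in this document), not values from the original setup; review the printed commands before running them:

```shell
# Dry-run generator for the VLAN setup above.
# VLAN-to-subnet mapping and HOST_OCTET are hypothetical.
NIC=eth0
HOST_OCTET=11          # hypothetical: s1 = .11, s2 = .12
CMDS=""
for PAIR in 101:172.31.0 1501:172.31.1 1502:172.31.2; do
    VLAN=${PAIR%%:*}       # VLAN id before the colon
    SUBNET=${PAIR#*:}      # /24 subnet after the colon
    CMDS="$CMDS
vconfig add $NIC $VLAN
ifconfig $NIC.$VLAN $SUBNET.$HOST_OCTET netmask 255.255.255.0 up"
done
echo "$CMDS"
```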

Install Gluster on both PCs
wget -P /etc/yum.repos.d/
yum install gdisk glusterfs{,-server,-fuse,-geo-replication} -y

The remaining clients only need the Gluster client packages
wget -P /etc/yum.repos.d/
yum install glusterfs{,-fuse} -y


1. Create the Gluster volumes (multi-tenancy)

Prepare the Gluster bricks

mkfs.xfs -i size=512 -n size=8192 /dev/sdb
mkdir /mnt/gfbrick
mount /dev/sdb /mnt/gfbrick
echo "/dev/sdb /mnt/gfbrick xfs defaults 1 2" >> /etc/fstab

mkdir /mnt/gfbrick/RootCAKE
mkdir /mnt/gfbrick/Tenant1
mkdir /mnt/gfbrick/Tenant2
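The three mkdir calls above can be made idempotent with a small loop. In this sketch BRICK_ROOT defaults to a harmless /tmp path so it can be tried without a dedicated disk; use /mnt/gfbrick on the real nodes:

```shell
# Idempotent form of the brick-directory step above.
BRICK_ROOT=${BRICK_ROOT:-/tmp/gfbrick}   # /mnt/gfbrick on the real nodes
for VOL in RootCAKE Tenant1 Tenant2; do
    mkdir -p "$BRICK_ROOT/$VOL"          # -p: no error if it already exists
done
```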

(This assumes sdb has already been formatted.)

Change the SELinux mode

# Put SELinux in permissive mode
setenforce 0
# See the current mode of SELinux
getenforce
# To make the change permanent, edit /etc/selinux/config and set
# SELINUX=disabled or SELINUX=permissive

Add firewall rules (on both nodes)
# vim /usr/local/virus/iptables/iptables.allow
# based on the number of glusterd management ports
iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24008 -j ACCEPT
# based on the number of volumes in use
iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 49152:49156 -j ACCEPT

If you would rather not deal with this, just run `service iptables stop`.
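Since Gluster 3.4, each brick listens on its own TCP port starting at 49152, so the second rule's range should cover one port per brick hosted on the node. A small sketch that derives the range (the brick count of 3, one per volume per node, is an assumption about this setup):

```shell
# Compute the brick --dport range for the volume rule above.
NUM_BRICKS=3          # assumption: one brick per volume per node
BASE=49152            # first brick port since Gluster 3.4
LAST=$((BASE + NUM_BRICKS - 1))
RULE="iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport ${BASE}:${LAST} -j ACCEPT"
echo "$RULE"
```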

Join the Gluster trusted pool
echo " s1" >> /etc/hosts
echo " s2" >> /etc/hosts
# s1
gluster peer probe s2
# s2
gluster peer probe s1

Strictly speaking, probing from one node to the other is enough; doing it from both sides is just so that both peers are known by hostname.
When using hostnames, the first server needs to be probed from one other server to set its hostname

Create the Gluster volumes

gluster volume create RootCAKE replica 2 s1:/mnt/gfbrick/RootCAKE s2:/mnt/gfbrick/RootCAKE
gluster volume create Tenant1 replica 2 s1:/mnt/gfbrick/Tenant1 s2:/mnt/gfbrick/Tenant1
gluster volume create Tenant2 replica 2 s1:/mnt/gfbrick/Tenant2 s2:/mnt/gfbrick/Tenant2

Start the Gluster volumes
gluster volume start RootCAKE
gluster volume start Tenant1
gluster volume start Tenant2

gluster volume set RootCAKE auth.allow 172.31.0.*
gluster volume set Tenant1 auth.allow 172.31.1.*
gluster volume set Tenant2 auth.allow 172.31.2.*

Enable quotas
gluster volume quota RootCAKE enable
gluster volume quota Tenant1 enable
gluster volume quota Tenant2 enable

Set the quota size
gluster volume quota RootCAKE limit-usage / 500GB
gluster volume quota Tenant1 limit-usage / 500GB
gluster volume quota Tenant2 limit-usage / 500GB

Fix the incorrect size seen on the client (it should equal the quota size)
gluster volume set RootCAKE quota-deem-statfs on
gluster volume set Tenant1 quota-deem-statfs on
gluster volume set Tenant2 quota-deem-statfs on
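All of section 1's per-volume commands follow the same pattern, so they can be replayed from a loop. This dry-run sketch only prints the commands; review the output before pasting it into a shell on one storage node:

```shell
# Dry-run generator for the per-volume sequence above:
# create, start, per-tenant auth.allow subnet, 500GB quota, statfs fix.
i=0
CMDS=""
for VOL in RootCAKE Tenant1 Tenant2; do
    CMDS="$CMDS
gluster volume create $VOL replica 2 s1:/mnt/gfbrick/$VOL s2:/mnt/gfbrick/$VOL
gluster volume start $VOL
gluster volume set $VOL auth.allow 172.31.$i.*
gluster volume quota $VOL enable
gluster volume quota $VOL limit-usage / 500GB
gluster volume set $VOL quota-deem-statfs on"
    i=$((i + 1))          # next tenant, next /24 subnet
done
echo "$CMDS"
```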


2. Mount and use on the clients

Each client must be given different /etc/hosts entries
for RootCAKE

echo " s1" >> /etc/hosts
echo " s2" >> /etc/hosts

for Tenant1
echo " s1" >> /etc/hosts
echo " s2" >> /etc/hosts

for Tenant2
echo " s1" >> /etc/hosts
echo " s2" >> /etc/hosts



Mount the Gluster volumes

# on the RootCAKE client
mount -t glusterfs s1:/RootCAKE /mnt/storage
# on the Tenant1 client
mount -t glusterfs s1:/Tenant1 /mnt/storage
# on the Tenant2 client
mount -t glusterfs s1:/Tenant2 /mnt/storage
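Mounting via s1 makes s1 a single point of failure at mount time (the client fetches the volume file from it; after that, replication takes over). The FUSE mount helper accepts a fallback volfile server; a hedged /etc/fstab sketch for the RootCAKE client (the option is backupvolfile-server on older releases, backup-volfile-servers on 3.4+; verify against your version):

```
s1:/RootCAKE  /mnt/storage  glusterfs  defaults,_netdev,backupvolfile-server=s2  0 0
```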

Because of the per-volume access restrictions (auth.allow), each client can only mount its own volume, which achieves multi-tenancy.


Because the CAKE starter currently flushes /etc/hosts from time to time,
it is recommended to lock the file with `chattr +i /etc/hosts`.


3. Testing HA

 1. kill one brick pid
 2. power the PC off directly
 3. unplug its network cable from the switch


gluster volume status
kill -9 xxx

Check heal progress

gluster volume heal <VOL> info

Note: For larger files the self-heal operation may take a while to complete.
   You can check the heal status using the command above.




Appendix 1 - Gluster Dashboard

For reference, the oVirt Gluster Dashboard tracks events such as:
  • volume created
  • volume deleted
  • volume started/stopped
  • brick(s) added
  • brick(s) removed
  • brick(s) replaced
  • new option set
  • value of existing option changed
  • option reset
  • Server removed (peer detach)
  • brick process went down / came up


Appendix 2 - ping-timeout

The default is the mysterious number 42 (seconds)

Change it directly from the command line (to 5 seconds)

gluster volume set <VOL> network.ping-timeout 5

In testing, this had no effect.


vim /etc/glusterfs/glusterd.vol
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option off
    option ping-timeout 30
#  option base-port 49152

volume RootCAKE
    type protocol/client
    option ping-timeout 5

After editing, glusterd must be restarted.


Appendix 3 - preventing split-brain


To prevent split-brain in the trusted storage pool, you must configure server-side and client-side quorum.

The quorum configuration in a trusted storage pool determines the number of server failures that the trusted storage pool can sustain.

If too many server failures occur, or if there is a problem with communication between the trusted storage pool nodes, it is essential that the trusted storage pool be taken offline to prevent data loss.

Configuring Server-Side Quorum

# configure the quorum ratio for a trusted storage pool
gluster volume set all cluster.server-quorum-ratio <PERCENTAGE>
gluster volume set all cluster.server-quorum-ratio 51%

In this example, the quorum ratio setting of 51% means that more than half of the nodes in the trusted storage pool must be online and have network connectivity between them at any given time.

If a network disconnect happens to the storage pool, then the bricks running on those nodes are stopped to prevent further writes.

For a replicated volume with two nodes and one brick on each machine, if the server-side quorum is enabled and one of the nodes goes offline, the other node will also be taken offline because of the quorum configuration. As a result, the high availability provided by the replication is ineffective.
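The ratio translates into a minimum number of live nodes via ceil(N * R / 100), which is exactly why the two-node caveat above bites. A small sketch:

```shell
# Minimum live nodes implied by a server-quorum ratio: ceil(N * R / 100).
quorum_min() {
    NODES=$1
    RATIO=$2
    # integer ceiling division
    echo $(( (NODES * RATIO + 99) / 100 ))
}
quorum_min 2 51    # prints 2: with two nodes, no node may fail
quorum_min 3 51    # prints 2: with three nodes, one may fail
```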

Configuring Client-Side Quorum

Replication in Red Hat Storage Server allows modifications as long as at least one of the bricks in a replica group is online.

In a network-partition scenario, different clients connect to different bricks in the replicated environment.

In this situation different clients may modify the same file on different bricks.

When a client is witnessing brick disconnections, a file could be modified on different bricks at different times while the other brick is off-line in the replica.

For example, in a 1 X 2 replicate volume, while modifying the same file, it can so happen that client C1 can connect only to brick B1 and client C2 can connect only to brick B2.

These situations lead to split-brain and the file becomes unusable and manual intervention is required to fix this issue.
In such a scenario, when the client-side quorum is not met for a replica group, only that replica group becomes read-only; the other replica groups continue to allow data modifications.

gluster volume set VOLNAME quorum-type auto


Appendix 4 - split-brain recovery made easy


git clone splitmount
cd splitmount
python setup.py install

splitmount <GF-server-1> <VOL> /tmp/sbfix

Look at r1 and r2, pick the copy you do not want, and delete it
rm /tmp/sbfix/r1/ImpPool/6603b030-a225-4ac2-a8eb-dc26ec0a9bbf/1a2be703-f9c3-4770-bf0d-212789366208.img

gluster volume heal <VOL> full

If that's all you have to heal, just umount and clean up.
umount /tmp/sbfix/r*
rm -rf /tmp/sbfix


Red Hat Storage 3 - Administration Guide

NOTE: The examples below use two storage servers.


NOTE: The examples below use two tenants.


 When using hostnames to join the trusted storage pool,
 the first server must also be "probed" from another server so that its hostname gets set.


TBD full
diagnostics.brick-log-level INFO 64


# gluster volume profile your-volume start
# gluster volume profile your-volume info > /tmp/dontcare
# sleep 60
# gluster volume profile your-volume info > profile-for-last-minute.log
"sar -n DEV 2" will show you network utilization
, and "iostat -mdx /dev/sd? 2" on your server will show block device queue depth
(latter two tools require sysstat rpm)


100% cpu on brick replication

Subject: Self-heal and high load (2013-May)
Cause: Occasionally the connection between these 2 Gluster servers breaks or drops momentarily.
Result: While the self-heal runs, one of the gluster servers is inaccessible from the client, meaning no files can be read or written, causing problems for our users.
Setup: 1x2 replicated volume (two nodes)

Subject: 100% cpu on brick replication (2015-May)
Cause: Creating a new replicated brick, or bringing an old one back online.
Result: The CPU load on the online brick is so high that normal operations are impossible.
Setup: 1x2 replicated volume (two nodes)

Subject: 100% CPU WAIT (2014-October)
Cause: Self-heal running after bad communication between the 2 nodes, or after a node crashed.
Result: In our case, we have several million files in our Gluster cluster, and when a self-heal hits, we can kiss our Gluster goodbye for a couple of hours.
Setup: 1x2 replicated volume (two nodes)


100% cpu on brick replication

> Setting to 5 is generally not recommended.
> As a matter of fact, it would not be advisable to alter the ping timeout
> from the default value.
# options mentioned for reducing self-heal load
cluster.self-heal-daemon off
cluster.metadata-self-heal off
# enables/disables directory self-heal
cluster.entry-self-heal off
Please don't use 3.6.1 with EXT4
> This particular readdir issue is present because of the way gluster 
> is handling EXT4's 64 bit offsets in readdir.

apt-get install xfsprogs
mkfs.xfs -i size=512 /dev/sdb1
12-18 02:41:23.557523] I [dht-common.c:1822:dht_lookup_cbk]
0-gv0-dht: Entry /html/some_site/some_folder/asdf.php missing on subvol gv0-replicate-0

> Because this log entry appeared to just be informational,
> I didn't pay much attention to it.
> However I began to notice many of them for one particular site
> that is hosted on this cluster.
> I finally decided to remove that site temporarily from the cluster
> and much to my surprise AND delight, the problem went away!


glusterfs nfs

By default, a GlusterFS client can only mount an entire volume.

# mount -t glusterfs node1:/gv1 /mnt/test1
# mount -t nfs node1:/gv1 /mnt/test1

Enable nfs.export-dirs (subdirectories are not exported by default)
gluster v set gv1 nfs.export-dirs true



# a subdirectory can now be mounted over NFS
# mount -t nfs node1:/gv1/III /mnt/test1


At this point you can also turn nfs.export-volumes off (note: this affects all volumes)


gluster v set gv1 nfs.export-volumes off
gluster v set gv1 nfs.export-dir /Bob,/Kevin,/Stuart
# showmount -e node1
Export list for node1:
/gv1/Bob      *
/gv1/Kevin       *
/gv1/Stuart  *


Unless otherwise stated, the content of this page is licensed under the Creative Commons Attribution-ShareAlike 3.0 License