Hosted Private Cloud

 

0. Prepare the Gluster storage nodes

Prepare two PCs to act as Gluster storage nodes, referred to below as s1 and s2.

Each PC has two physical NICs and two disks (sda holds the operating system; sdb is reserved for Gluster).

Install Scientific Linux 6.3 ( /home/build/iso/Base/CAKE-2014-04-11.iso ).

On each PC, split one NIC (eth0) into three VLAN sub-interfaces:

# s1
vconfig add eth0 101; vconfig add eth0 1501; vconfig add eth0 1502
ifconfig eth0.101 172.31.0.250 netmask 255.255.255.0
ifconfig eth0.1501 172.31.1.250 netmask 255.255.255.0
ifconfig eth0.1502 172.31.2.250 netmask 255.255.255.0

# s2
vconfig add eth0 101; vconfig add eth0 1501; vconfig add eth0 1502
ifconfig eth0.101 172.31.0.251 netmask 255.255.255.0
ifconfig eth0.1501 172.31.1.251 netmask 255.255.255.0
ifconfig eth0.1502 172.31.2.251 netmask 255.255.255.0
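
The vconfig/ifconfig settings above do not survive a reboot. A minimal sketch of making them persistent on Scientific Linux 6 with ifcfg files (one file per VLAN interface; values follow the s1 example above, adjust for s2):

# /etc/sysconfig/network-scripts/ifcfg-eth0.101  (repeat for eth0.1501 / eth0.1502)
DEVICE=eth0.101
VLAN=yes
ONBOOT=yes
BOOTPROTO=none
IPADDR=172.31.0.250
NETMASK=255.255.255.0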

 
Install Gluster on both PCs
wget -P /etc/yum.repos.d/ https://download.gluster.org/pub/gluster/glusterfs/LATEST/RHEL/glusterfs-epel.repo
yum install gdisk glusterfs{,-server,-fuse,-geo-replication} -y
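
The following steps assume glusterd is running on both storage nodes; if the package does not start it for you, start it and enable it at boot:

service glusterd start
chkconfig glusterd on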

 
All other clients only need to install the Gluster client packages
wget -P /etc/yum.repos.d/ https://download.gluster.org/pub/gluster/glusterfs/LATEST/RHEL/glusterfs-epel.repo
yum install glusterfs{,-fuse} -y

 
 

1. Create Gluster volumes (multi-tenancy)

Prepare the Gluster bricks

mkfs.xfs -i size=512 -n size=8192 /dev/sdb
mkdir /mnt/gfbrick
mount /dev/sdb /mnt/gfbrick
echo "/dev/sdb /mnt/gfbrick xfs defaults 1 2" >> /etc/fstab

mkdir /mnt/gfbrick/RootCAKE
mkdir /mnt/gfbrick/Tenant1
mkdir /mnt/gfbrick/Tenant2

(This assumes sdb has already been formatted.)

 
Change the SELinux mode

# To put SELinux in permissive mode
setenforce 0
 
# to see the current mode of SELinux
getenforce
 
# To make the change permanent, edit /etc/selinux/config and set SELINUX=disabled or SELINUX=permissive
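
A one-liner for the permanent change described above (a sketch, assuming the file currently contains SELINUX=enforcing):

sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config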

 
Add firewall rules (on both nodes)
# vim /usr/local/virus/iptables/iptables.allow
 
# based on the number of glusterd daemons running
iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24008 -j ACCEPT
 
# based on the number of volumes in use
iptables -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 49152:49156 -j ACCEPT

If you are not sure how to set this up, just run `service iptables stop`.
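
If you do keep iptables running with the rules above, one possible way to persist and verify them (assuming the stock Red Hat iptables service rather than the local iptables.allow script):

service iptables save
iptables -nL RH-Firewall-1-INPUT | grep -E '24007|49152'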


 
Join the Gluster trusted pool
echo "172.31.0.250 172.31.1.250 172.31.2.250 s1" >> /etc/hosts
echo "172.31.0.251 172.31.1.251 172.31.2.251 s2" >> /etc/hosts
# s1
gluster peer probe s2
# s2
gluster peer probe s1

Strictly speaking, probing from one node to the other is enough; probing in both directions here ensures both sides know each other by hostname.
When using hostnames, the first server needs to be probed from one other server to set its hostname
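
After probing, it is worth confirming that both nodes see each other (run on either node):

gluster peer status
# expect "Number of Peers: 1" and State: Peer in Cluster (Connected)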

 
Create the Gluster volumes

gluster volume create RootCAKE replica 2 s1:/mnt/gfbrick/RootCAKE s2:/mnt/gfbrick/RootCAKE
gluster volume create Tenant1 replica 2 s1:/mnt/gfbrick/Tenant1 s2:/mnt/gfbrick/Tenant1
gluster volume create Tenant2 replica 2 s1:/mnt/gfbrick/Tenant2 s2:/mnt/gfbrick/Tenant2

 
Start the Gluster volumes
gluster volume start RootCAKE
gluster volume start Tenant1
gluster volume start Tenant2
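
A quick sanity check after starting the volumes: volume info should list both bricks and any reconfigured options, and volume status should show every brick online.

gluster volume info RootCAKE
gluster volume status RootCAKE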

 
Set access restrictions
gluster volume set RootCAKE auth.allow 172.31.0.*
gluster volume set Tenant1 auth.allow 172.31.1.*
gluster volume set Tenant2 auth.allow 172.31.2.*

 
Enable quotas
gluster volume quota RootCAKE enable
gluster volume quota Tenant1 enable
gluster volume quota Tenant2 enable

 
Set the quota size
gluster volume quota RootCAKE limit-usage / 500GB
gluster volume quota Tenant1 limit-usage / 500GB
gluster volume quota Tenant2 limit-usage / 500GB

 
Fix the incorrect size reported on the client side (it should equal the quota size)
gluster volume set RootCAKE quota-deem-statfs on
gluster volume set Tenant1 quota-deem-statfs on
gluster volume set Tenant2 quota-deem-statfs on
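
To confirm the quota and the statfs fix took effect, you can list the quota on a storage node and check df on a client once the volume is mounted (see section 2); a sketch:

# on a storage node
gluster volume quota RootCAKE list
# on a client with the volume mounted
df -h /mnt/storage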

 

2. Mounting on clients

First, because users on three different subnets need to be kept separate:
RootCAKE: 172.31.0.*
Tenant1: 172.31.1.*
Tenant2: 172.31.2.*
each group of clients must add different entries to /etc/hosts.
 
for RootCAKE

echo "172.31.0.250 s1" >> /etc/hosts
echo "172.31.0.251 s2" >> /etc/hosts

 
for Tenant1
echo "172.31.1.250 s1" >> /etc/hosts
echo "172.31.1.251 s2" >> /etc/hosts

 
for Tenant2
echo "172.31.2.250 s1" >> /etc/hosts
echo "172.31.2.251 s2" >> /etc/hosts

 

As a reminder, the firewall must be stopped (or the rules above added).

 
Mount the Gluster volume

mount -t glusterfs s1:/RootCAKE /mnt/storage
mount -t glusterfs s1:/Tenant1 /mnt/storage
mount -t glusterfs s1:/Tenant2 /mnt/storage
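
To make a mount persistent across reboots, an /etc/fstab entry along these lines can be used; this is a sketch, and the fallback option is spelled backupvolfile-server or backup-volfile-servers depending on the GlusterFS release:

s1:/RootCAKE /mnt/storage glusterfs defaults,_netdev,backupvolfile-server=s2 0 0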

 
Because of the auth.allow restrictions above, multi-tenancy is achieved:

clients on different subnets cannot access each other's data.
 

 
Because the CAKE starter currently flushes /etc/hosts from time to time,
it is recommended to lock the file with `chattr +i /etc/hosts`.
 

 

3. Testing HA

Three methods:
 1. kill one brick pid
 2. power off the PC
 3. unplug the network cable from the switch
 

Method 1

gluster volume status
kill -9 xxx
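
Here xxx stands for the PID of one brick process (glusterfsd); it can be read from the PID column of the status output, e.g.:

gluster volume status RootCAKE | grep gfbrick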

 
Check the heal progress

gluster volume heal <VOL> info

Note: For larger files it may take a while for the self-heal operation to complete.
   You can check the heal status with the command above.
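
A related check while testing: files that ended up in split-brain can be listed with

gluster volume heal <VOL> info split-brain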

Method 2

Method 3

 
 
 
 
 

Appendix 1 - Gluster Dashboard

See the oVirt Gluster Dashboard for reference.
 
Gluster_Dashboard.png
 
Possible features include:
  • volume created
  • volume deleted
  • volume started/stopped
  • brick(s) added
  • brick(s) removed
  • brick(s) replaced
  • new option set
  • value of existing option changed
  • option reset
  • Server removed (peer detach)
  • brick process went down / came up

 

Appendix 2 - network.ping-timeout

See: http://thornelabs.net/2015/02/24/change-gluster-volume-connection-timeout-for-glusterfs-native-client.html
 
The default is the mysterious number 42 (seconds).
 


 
Change it directly with a command (set to 5 seconds):

gluster volume set <VOL> network.ping-timeout 5

 
In testing, this had no effect.

The workaround from the article referenced above is as follows:

vim /etc/glusterfs/glusterd.vol
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 30
#  option base-port 49152
end-volume

volume RootCAKE
    type protocol/client
    option ping-timeout 5
end-volume

 
After the change, glusterd must be restarted for it to take effect.
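
On Scientific Linux 6 that is the usual SysV command, run on each storage node:

service glusterd restart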
 

Appendix 3 - Preventing split-brain

See: https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3/html/Administration_Guide/sect-Managing_Split-brain.html
 


 
To prevent split-brain in the trusted storage pool, you must configure both server-side and client-side quorum.

The quorum configuration in a trusted storage pool determines the number of server failures that the trusted storage pool can sustain.

If too many server failures occur, or if there is a problem with communication between the trusted storage pool nodes, it is essential that the trusted storage pool be taken offline to prevent data loss.
 


 
Configuring Server-Side Quorum

# configure the quorum ratio for a trusted storage pool
gluster volume set all cluster.server-quorum-ratio <PERCENTAGE>
gluster volume set all cluster.server-quorum-ratio 51%

In this example, the quorum ratio setting of 51% means that more than half of the nodes in the trusted storage pool must be online and have network connectivity between them at any given time.

If a network disconnect happens to the storage pool, then the bricks running on those nodes are stopped to prevent further writes.

For a replicated volume with two nodes and one brick on each machine, if the server-side quorum is enabled and one of the nodes goes offline, the other node will also be taken offline because of the quorum configuration. As a result, the high availability provided by the replication is ineffective.
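
Per the Red Hat guide referenced above, server-side quorum also has to be switched on per volume; a sketch for one of the volumes in this setup:

gluster volume set RootCAKE cluster.server-quorum-type server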
 


 
Configuring Client-Side Quorum

Replication in Red Hat Storage Server allows modifications as long as at least one of the bricks in a replica group is online.

In a network-partition scenario, different clients connect to different bricks in the replicated environment.

In this situation different clients may modify the same file on different bricks.

When a client experiences brick disconnections, a file may be modified on different bricks at different times while the other brick in the replica is offline.

For example, in a 1 X 2 replicate volume, while modifying the same file, it can so happen that client C1 can connect only to brick B1 and client C2 can connect only to brick B2.

These situations lead to split-brain and the file becomes unusable and manual intervention is required to fix this issue.
 
6355.png
 
In the above scenario, when the client-side quorum is not met for replica group A, only replica group A becomes read-only.
Replica groups B and C continue to allow data modifications.
 
gluster volume set VOLNAME cluster.quorum-type auto

 

Appendix 4 - Split-brain recovery made easy

Reference: https://joejulian.name/blog/glusterfs-split-brain-recovery-made-easy/


git clone https://github.com/joejulian/glusterfs-splitbrain.git splitmount 
cd splitmount
python setup.py install

 
splitmount <GF-server-1> <VOL> /tmp/sbfix

 
Look at r1 and r2, pick the copy you do not want, and delete it:
rm /tmp/sbfix/r1/ImpPool/6603b030-a225-4ac2-a8eb-dc26ec0a9bbf/1a2be703-f9c3-4770-bf0d-212789366208.img

 
gluster volume heal <VOL> full

 
If that's all you have to heal, just umount and clean up.
umount /tmp/sbfix/r*
rm -rf /tmp/sbfix

  

 
Red Hat Storage 3 - Administration Guide
 

 
NOTE: The following uses two storage servers as the example.
 

 

 
NOTE: The following uses two tenants as the example.
 

 

 
NOTE:
 When hostnames are used to join the Trusted Storage Pool,
 the first server must also be "probed" from another server so that its hostname is set.
 

 

TBD

cluster.data-self-heal-algorithm full
diagnostics.brick-log-level INFO
performance.io-thread-count 64
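
These would be applied per volume with gluster volume set; a sketch with a placeholder volume name:

gluster volume set <VOL> cluster.data-self-heal-algorithm full
gluster volume set <VOL> diagnostics.brick-log-level INFO
gluster volume set <VOL> performance.io-thread-count 64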
 

TBD

# gluster volume profile your-volume start
# gluster volume profile your-volume info > /tmp/dontcare
# sleep 60
# gluster volume profile your-volume info > profile-for-last-minute.log
"sar -n DEV 2" will show you network utilization
, and "iostat -mdx /dev/sd? 2" on your server will show block device queue depth
(latter two tools require sysstat rpm)

 

100% cpu on brick replication

Subject: Self-heal and high load (2013-May)
Cause: Occasionally the connection between these 2 Gluster servers breaks or drops momentarily.
Result: While the self-heal happens, one of the gluster servers is inaccessible from the client, meaning no files can be read or written, causing problems for our users.
Setup: 1x2 replicated volume (two nodes)

Subject: 100% cpu on brick replication (2015-May)
Cause: Happens when you create a new replicated brick or when you bring an old one back online.
Result: The cpu load on the online brick is so high that normal operations are impossible.
Setup: 1x2 replicated volume (two nodes)
 
Subject: 100% CPU WAIT (2014-October)
Cause: Self-heal runs after bad communication between the 2 nodes, or after a node has crashed.
Result: In our case we have several million files in our Gluster cluster, and when a self-heal hits, we can kiss our Gluster goodbye for a couple of hours.
Setup: 1x2 replicated volume (two nodes)

 

100% cpu on brick replication

> Setting network.ping-timeout to 5 is generally not recommended.
> As a matter of fact, it would not be advisable to alter the ping timeout
> from the default value.
cluster.self-heal-daemon off
cluster.metadata-self-heal off

# enables/disables directory self-heal
cluster.entry-self-heal off
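
One possible way to apply the workarounds quoted above, if the trade-off is acceptable (self-heal then has to be triggered manually with gluster volume heal):

gluster volume set <VOL> cluster.self-heal-daemon off
gluster volume set <VOL> cluster.metadata-self-heal off
gluster volume set <VOL> cluster.entry-self-heal off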
Please don't use 3.6.1 with EXT4
> This particular readdir issue is present because of the way gluster 
> is handling EXT4's 64 bit offsets in readdir.

apt-get install xfsprogs
mkfs.xfs -i size=512 /dev/sdb1
12-18 02:41:23.557523] I [dht-common.c:1822:dht_lookup_cbk]
0-gv0-dht: Entry /html/some_site/some_folder/asdf.php missing on subvol gv0-replicate-0

> Because this log entry appeared to just be informational,
> I didn't pay much attention to it.
> However I began to notice many of them for one particular site
> that is hosted on this cluster.
> I finally decided to remove that site temporarily from the cluster
> and much to my surprise AND delight, the problem went away!

 

GlusterFS NFS

By default, a GlusterFS client can only mount an entire volume:

# mount -t glusterfs node1:/gv1 /mnt/test1
# mount -t nfs node1:/gv1 /mnt/test1

 
Enable nfs.export-dirs (subdirectories are not exported by default):
gluster v set gv1 nfs.export-dirs true

Now, in addition to the volume itself, subdirectories under the volume can also be mounted:

 

# mount -t nfs node1:/gv1/III /mnt/test1

 
In some cases, however, you do not want the whole volume to be mountable by clients, and only want to expose subdirectories to different clients.

In that case you can turn off nfs.export-volumes (note: this affects all volumes),

and then configure the subdirectories you want to export:

gluster v set gv1 nfs.export-volumes off
gluster v set gv1 nfs.export-dir /Bob,/Kevin,/Stuart
# showmount -e node1
Export list for node1:
/gv1/Bob      *
/gv1/Kevin       *
/gv1/Stuart  *

Different clients can now mount different subdirectories as needed, but the whole volume can no longer be mounted.
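
Gluster's built-in NFS server only speaks NFSv3, so clients that default to NFSv4 may need to force the version; a sketch of mounting one of the exported subdirectories:

mount -t nfs -o vers=3 node1:/gv1/Bob /mnt/test1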
