Leasing Routable IP Addresses with Podman Containers

Image: "Relic" by BFS Man, licensed under CC BY 2.0

Old Southern Pacific RR caboose sitting beside US 90 just east of Luling, TX. This was near another old railway car converted into a roadside diner, which had gone out of business. The 'FOR LEASE' sign is actually for the diner, but I suppose the caboose comes with it.

Source: https://www.flickr.com/photos/bfs_man/6265442738/in/photostream/

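This post was inspired mainly by the Red Hat article "Leasing routable IP addresses with Podman containers". The cover image is borrowed from that article as well.

The title is really just another way of saying: use a macvlan network.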

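In certain scenarios, for example legacy applications or monitoring applications that need direct access to the host's physical network, you can use the macvlan driver provided by the kernel. macvlan creates multiple sub-interfaces on top of a host NIC, each with its own IP address and MAC address. Assigning a sub-interface to a container connects that container directly to the physical network while keeping it isolated from other containers. When the host receives a packet, it forwards it to the sub-interface whose MAC address matches the destination, so from the outside the host looks like several separate machines. macvlan requires the physical NIC to support promiscuous (promisc) mode and a kernel of v3.9-3.19 or 4.0+. Because packets are forwarded directly through the sub-interface and do not pass through NAT, performance is generally better than with a bridge network.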
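To make the forwarding step concrete, here is a minimal Python sketch of a parent interface dispatching frames to sub-interfaces by destination MAC address. It is a toy model, not kernel code: the `ParentInterface` class, the owner names, and the second MAC address are hypothetical, while the interface name and the first MAC address are taken from the transcripts in this post.

```python
from typing import Dict, Optional


class ParentInterface:
    """Toy model of a NIC carrying macvlan sub-interfaces (illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Maps a sub-interface MAC address (lowercase) to its owner.
        self.sub_interfaces: Dict[str, str] = {}

    def add_macvlan(self, mac: str, owner: str) -> None:
        """Register a sub-interface with its own MAC address."""
        self.sub_interfaces[mac.lower()] = owner

    def deliver(self, dst_mac: str) -> Optional[str]:
        """Return the owner whose MAC matches the frame, or None if no match."""
        return self.sub_interfaces.get(dst_mac.lower())


nic = ParentInterface("enp0s31f6")
nic.add_macvlan("5a:cf:de:c3:0a:27", "container-a")
nic.add_macvlan("5a:cf:de:c3:0a:28", "container-b")  # hypothetical second container

print(nic.deliver("5A:CF:DE:C3:0A:27"))  # container-a
print(nic.deliver("00:11:22:33:44:55"))  # None (no sub-interface owns this MAC)
```

The lookup is keyed purely on the MAC address, which is why each container appears to the rest of the network as an independent host.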

Creating a macvlan network

To create a macvlan network in podman, run:

sudo podman network create -d macvlan --macvlan enp0s31f6

/etc/cni/net.d/cni-podman1.conflist

enp0s31f6 is the physical host's Ethernet interface. You can find it with ip addr show, or with nmcli, for example:

❯ nmcli connect 
NAME                UUID                                  TYPE      DEVICE      
Wired connection 1  5b263355-a1ee-3aa5-b34d-c36eb142a04c  ethernet  enp0s31f6   
cni-podman0         bf710dcd-26b5-47ec-ac99-53673fef2c27  bridge    cni-podman0 
docker0             dff45e06-a6e5-4510-a636-255f3f00d55c  bridge    docker0     
virbr0              be3e2c28-1028-41f7-bfdc-1aed730a59e7  bridge    virbr0 

As the path printed by the create command shows, podman is actually using a CNI configuration under the hood.

What is CNI

Container Network Interface - networking for Linux containers

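The CNI repository has been around for about six years at the time of writing. CNI is now a CNCF project and is used mainly by Kubernetes and similar systems.

CNI also maintains a set of core plugins in a separate repository: https://github.com/containernetworking/plugins

The core CNI plugins include: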

Plugins supplied:

Main: interface-creating

  • bridge: Creates a bridge, adds the host and the container to it.
  • ipvlan: Adds an ipvlan interface in the container.
  • loopback: Set the state of loopback interface to up.
  • macvlan: Creates a new MAC address, forwards all traffic to that to the container.
  • ptp: Creates a veth pair.
  • vlan: Allocates a vlan device.
  • host-device: Move an already-existing device into a container.

Windows: Windows specific

  • win-bridge: Creates a bridge, adds the host and the container to it.
  • win-overlay: Creates an overlay interface to the container.

IPAM: IP address allocation

  • dhcp: Runs a daemon on the host to make DHCP requests on behalf of the container
  • host-local: Maintains a local database of allocated IPs
  • static: Allocate a static IPv4/IPv6 addresses to container and it's useful in debugging purpose.

Meta: other plugins

  • tuning: Tweaks sysctl parameters of an existing interface
  • portmap: An iptables-based portmapping plugin. Maps ports from the host's address space to the container.
  • bandwidth: Allows bandwidth-limiting through use of traffic control tbf (ingress/egress).
  • sbr: A plugin that configures source based routing for an interface (from which it is chained).
  • firewall: A firewall plugin which uses iptables or firewalld to add rules to allow traffic to/from the container.

Sample

The sample plugin provides an example for building your own plugin.

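The two plugins used here are macvlan (an interface-creating plugin) and dhcp (an IPAM plugin).

IPAM stands for IP address management, that is, IP address allocation. The work is split up: creating the interface and assigning its IP address are handled by different plugins.

With that covered, let's look at the CNI configuration that podman created.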

  ~
❯ bat /etc/cni/net.d/cni-podman1.conflist
{
   "cniVersion": "0.4.0",
   "name": "cni-podman1",
   "plugins": [
      {
         "type": "macvlan",
         "master": "enp0s31f6",
         "ipam": {
            "type": "dhcp"
         }
      }
   ]
}

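Indeed. When the Red Hat authors wrote their article, podman network did not yet support creating macvlan networks, so the CNI configuration had to be written by hand. I am running a fairly recent podman version, where creating a macvlan network is much more convenient: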

  ~ 
sudo podman version
Version:      3.2.2
API Version:  3.2.2
Go Version:   go1.16.5
Git Commit:   d577c44e359f9f8284b38cf984f939b3020badc3
Built:        Fri Jul  9 20:54:04 2021
OS/Arch:      linux/amd64

You can use inspect to view a network's configuration. The content is essentially the same as the raw CNI configuration file:

  ~ 
sudo podman network inspect cni-podman1
[
    {
        "cniVersion": "0.4.0",
        "name": "cni-podman1",
        "plugins": [
            {
                "ipam": {
                    "type": "dhcp"
                },
                "master": "enp0s31f6",
                "type": "macvlan"
            }
        ]
    }
]

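First, this is a .conflist file, which means it can hold multiple plugin configurations (plugins is an array). Here there is actually only one (the first level of braces inside the []), and it has three keys: ipam, master, and type. So the main plugin is macvlan, but it also depends on the dhcp plugin.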
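The structure just described can be checked programmatically. The following is a minimal Python sketch that parses the same configuration and lists, for each entry in plugins, the main plugin type, the master interface, and the IPAM plugin it relies on. The `summarize` helper is a hypothetical name introduced for illustration, and the JSON is embedded as a string rather than read from /etc/cni/net.d so that the example runs anywhere.

```python
import json

# Same content as /etc/cni/net.d/cni-podman1.conflist shown in this post.
CONFLIST = """
{
   "cniVersion": "0.4.0",
   "name": "cni-podman1",
   "plugins": [
      {
         "type": "macvlan",
         "master": "enp0s31f6",
         "ipam": {
            "type": "dhcp"
         }
      }
   ]
}
"""


def summarize(conflist_text):
    """Return one small dict per plugin entry in a CNI .conflist document."""
    conf = json.loads(conflist_text)
    return [
        {
            "type": plugin["type"],
            "master": plugin.get("master"),
            # The IPAM plugin is nested inside the main plugin's configuration.
            "ipam": plugin.get("ipam", {}).get("type"),
        }
        for plugin in conf["plugins"]
    ]


summary = summarize(CONFLIST)
print(summary)
```

For this file the list has a single entry, with macvlan as the interface-creating plugin and dhcp as its IPAM plugin.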

Use podman network ls to list the networks currently available:

sudo podman network ls
NETWORK ID    NAME         VERSION     PLUGINS
2f259bab93aa  podman       0.4.0       bridge,portmap,dnsname,firewall,tuning
95eef1959fbc  macvlan001   0.4.0       macvlan
b932778640d3  cni-podman1  0.4.0       macvlan

cni-podman1 is the network just created with the command above; macvlan001 is one I created earlier by writing a CNI configuration file by hand.
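For reference, a hand-written configuration of this kind could be generated with a few lines of Python. This is only a sketch: the actual contents of the macvlan001 file are not shown in this post, so the output below simply mirrors cni-podman1, and `make_macvlan_conflist` is a hypothetical helper. The optional mode key is the one documented for the CNI macvlan plugin. The script prints the JSON instead of writing to /etc/cni/net.d, which would require root.

```python
import json
from typing import Optional


def make_macvlan_conflist(name: str, master: str, mode: Optional[str] = None) -> str:
    """Build a CNI .conflist document for a macvlan network using DHCP for IPAM."""
    plugin = {
        "type": "macvlan",         # interface-creating plugin
        "master": master,          # parent (physical) interface
        "ipam": {"type": "dhcp"},  # address allocation is delegated to the dhcp plugin
    }
    if mode is not None:
        # Optional; the macvlan plugin defaults to "bridge" when omitted.
        plugin["mode"] = mode
    conf = {"cniVersion": "0.4.0", "name": name, "plugins": [plugin]}
    return json.dumps(conf, indent=3)


text = make_macvlan_conflist("macvlan001", "enp0s31f6")
print(text)
```

Saving the printed output as a .conflist file in /etc/cni/net.d is, as far as I understand it, what makes the network show up in podman network ls.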

Creating a container on the macvlan network

When creating the container, simply specify the cni-podman1 network created earlier.

sudo podman run -it --rm --network cni-podman1 docker.io/library/alpine sh
/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether 5a:cf:de:c3:0a:27 brd ff:ff:ff:ff:ff:ff
    inet 192.168.8.51/24 brd 192.168.8.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::58cf:deff:fec3:a27/64 scope link 
       valid_lft forever preferred_lft forever
/ # ping -c4 192.168.8.123
PING 192.168.8.123 (192.168.8.123): 56 data bytes
64 bytes from 192.168.8.123: seq=0 ttl=42 time=0.186 ms
64 bytes from 192.168.8.123: seq=1 ttl=42 time=0.267 ms
64 bytes from 192.168.8.123: seq=2 ttl=42 time=0.323 ms
64 bytes from 192.168.8.123: seq=3 ttl=42 time=0.282 ms

--- 192.168.8.123 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.186/0.264/0.323 ms
/ # 

The container has been assigned the IP address 192.168.8.51, which is in the same subnet as the host. The router's DHCP server also shows this address as leased.
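The "same subnet" claim is easy to verify with Python's standard ipaddress module. In this small sketch the container address and its /24 prefix come from the ip a output above; the host address 192.168.8.100 appears in the later transcripts of this post, and its /24 prefix is an assumption.

```python
import ipaddress

container = ipaddress.ip_interface("192.168.8.51/24")
host = ipaddress.ip_interface("192.168.8.100/24")  # prefix length assumed

# Two interfaces share a subnet when their network addresses and prefixes match.
same_subnet = container.network == host.network

print(container.network)  # 192.168.8.0/24
print(same_subnet)        # True
```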

From the container we can ping other hosts in the same subnet directly, such as 192.168.8.123 above. Likewise, other hosts on the same LAN can ping the container's address:

❯ ssh [email protected]
Warning: Permanently added '192.168.8.123' (ED25519) to the list of known hosts.
Activate the web console with: systemctl enable --now cockpit.socket

Last login: Fri Jul  9 21:07:11 2021 from 192.168.8.100
  root in homenas in ~ 
❯ ping 192.168.8.51
PING 192.168.8.51 (192.168.8.51) 56(84) bytes of data.
64 bytes from 192.168.8.51: icmp_seq=1 ttl=64 time=0.456 ms
64 bytes from 192.168.8.51: icmp_seq=2 ttl=64 time=0.227 ms
64 bytes from 192.168.8.51: icmp_seq=3 ttl=64 time=0.249 ms
64 bytes from 192.168.8.51: icmp_seq=4 ttl=64 time=0.180 ms
^C
--- 192.168.8.51 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3105ms
rtt min/avg/max/mdev = 0.180/0.278/0.456/0.105 ms

Note: the host itself cannot ping this address. This is a limitation of macvlan.

  ~ took 23s 
ping 192.168.8.51
PING 192.168.8.51 (192.168.8.51) 56(84) bytes of data.
From 192.168.8.100 icmp_seq=1 Destination Host Unreachable
From 192.168.8.100 icmp_seq=2 Destination Host Unreachable
From 192.168.8.100 icmp_seq=3 Destination Host Unreachable
^C
--- 192.168.8.51 ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4066ms
pipe 3

The four macvlan communication modes

One configuration option did not show up in the podman output above: the macvlan plugin has an optional mode parameter, which defaults to bridge.

mode (string, optional): one of “bridge”, “private”, “vepa”, “passthru”. Defaults to “bridge”. (https://www.cni.dev/plugins/current/main/macvlan/#network-configuration-reference)

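macvlan is a NIC virtualization technique that can present a single physical NIC as multiple virtual NICs. It supports four communication modes, of which bridge is the most commonly used: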

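  • private: sub-interfaces are not allowed to communicate with each other. They can communicate with the physical network, and all packets pass through the parent interface eth0.

  • vepa (Virtual Ethernet Port Aggregator): sub-interfaces can communicate with each other and with the physical network. All packets enter and leave through eth0, and traffic between sub-interfaces must be reflected back by the external switch, so the switch has to support hairpin mode (reflective relay, IEEE 802.1Qbg).

  • bridge: sub-interfaces communicate with each other directly without going through the parent interface eth0, so performance is higher. If the parent interface goes down, however, they lose connectivity as well.

  • passthru: allows a single VM to be connected directly to the physical interface. The advantage of this mode is that the VM is then able to change the MAC address and other interface parameters.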

In none of these modes can a sub-interface communicate with the parent interface eth0 itself, and macvlan is poorly supported on public clouds.
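The rules above can be condensed into a small lookup. The Python sketch below is a simplified model of the four modes, not a description of the kernel implementation: the function names are hypothetical, and the `switch_hairpin` flag is my shorthand for the vepa switch requirement, meaning the switch reflects frames back out of the port they arrived on.

```python
from enum import Enum


class Mode(Enum):
    PRIVATE = "private"
    VEPA = "vepa"
    BRIDGE = "bridge"
    PASSTHRU = "passthru"


def siblings_can_talk(mode: Mode, switch_hairpin: bool = False) -> bool:
    """Can two sub-interfaces on the same parent reach each other?"""
    if mode is Mode.BRIDGE:
        return True            # switched locally, no trip through the parent
    if mode is Mode.VEPA:
        return switch_hairpin  # only if the external switch reflects the frames
    # private: explicitly blocked; passthru: there is only a single endpoint.
    return False


def can_reach_parent(mode: Mode) -> bool:
    """Per the text above, no mode lets a sub-interface reach the parent itself."""
    return False


for mode in Mode:
    print(mode.value, siblings_can_talk(mode), siblings_can_talk(mode, switch_hairpin=True))
```

The second function is deliberately trivial: it encodes why the host in this post could not ping 192.168.8.51 even though every other machine on the LAN could.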

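Use cases

macvlan networking is usually restricted on cloud platforms, so in most cases it is only usable when you have direct control over the physical host.

Because the leased IP addresses are routable, macvlan is well suited to router-related tasks, such as running an OpenWrt container.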

Refs

Main reference: https://www.redhat.com/sysadmin/leasing-ips-podman

https://github.com/containernetworking/cni

https://github.com/containernetworking/plugins

https://www.cni.dev/plugins/current/main/macvlan/

https://www.cni.dev/plugins/current/ipam/dhcp/

https://docs.docker.com/network/network-tutorial-macvlan/

http://yangjunsss.github.io/2018-07-29/Docker-%E7%BD%91%E7%BB%9C-Host-Bridge-Macvlan-%E5%9F%BA%E6%9C%AC%E5%8E%9F%E7%90%86%E5%92%8C%E9%AA%8C%E8%AF%81/

https://cloud.tencent.com/developer/article/1432601

https://cloud.tencent.com/developer/news/563351