Introduction to the Linux Kernel
Namhyung Kim
2016-06-21 (Tue)

Outline
1. Introduction
2. cgroup
3. Namespace
4. Kernel changes
5. Q & A

Who am I
- Namhyung Kim
- Linux kernel developer
- Contributing to open source development since 2010
- LG Electronics, open source contribution team
- perf / tracing subsystem

cgroup
- control group
  - core: task group management
  - controller: resource management (distribution)
- cgroup core
  - cgroupfs filesystem
  - easy to manage from a shell
  - v2 supported since v4.5

cgroup resource distribution models
- weight
  - resources are distributed in proportion to each group's weight
  - ex) cpu.weight
- limit / protection
  - maximum / minimum amount (over-commit allowed)
  - ex) io.max / memory.low
- allocation
  - fixed allotment (over-commit not allowed)
  - ex) cpu.rt.max

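The weight model can be illustrated with a little arithmetic. The sketch below is not kernel code; the function name and the group names are hypothetical, and it only shows how a total is split among sibling groups in proportion to their weights.

```python
def distribute_by_weight(total, weights):
    """Split `total` among groups in proportion to their weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one group needs a positive weight")
    return {name: total * w / weight_sum for name, w in weights.items()}


# Two sibling groups (hypothetical names) competing for 100% of a CPU.
shares = distribute_by_weight(100, {"cgrp-a": 100, "cgrp-b": 300})
print(shares)
```

With weights 100 and 300 the groups receive one quarter and three quarters of the resource. A real controller only applies these ratios while the groups are actually contending.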
cgroup v1 vs v2
- multiple vs unified hierarchy
- thread vs process granularity
- whether tasks are allowed in internal (non-leaf) groups
- whether mount options are used

cgroup controllers
- control each resource
  - cpu (v2 WIP)
  - memory
  - block IO
  - device
  - network

cgroup information (v1)

    $ cat /proc/cgroups
    #subsys_name hierarchy num_cgroups enabled
    debug        0         1           1
    cpu          2         3           1
    cpuacct      1         170         1
    freezer      0         1           1

    $ mount | grep cgroup
    none /acct cgroup rw,relatime,cpuacct 0 0
    none /dev/cpuctl cgroup rw,relatime,cpu 0 0

    $ cat /proc/self/cgroup
    2:cpu:/
    1:cpuacct:/uid/2000

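The /proc/cgroups table is easy to consume from a script. The following is a minimal sketch that parses the sample output shown on the slide; the `Subsys` tuple and the function name are illustrative choices, not an existing API.

```python
from collections import namedtuple

Subsys = namedtuple("Subsys", "name hierarchy num_cgroups enabled")


def parse_proc_cgroups(text):
    """Parse the contents of /proc/cgroups into a list of Subsys tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header line
        name, hierarchy, num_cgroups, enabled = line.split()
        entries.append(
            Subsys(name, int(hierarchy), int(num_cgroups), enabled == "1")
        )
    return entries


# Sample taken from the slide; on a live system read /proc/cgroups instead.
SAMPLE = """\
#subsys_name hierarchy num_cgroups enabled
debug 0 1 1
cpu 2 3 1
cpuacct 1 170 1
freezer 0 1 1
"""

entries = parse_proc_cgroups(SAMPLE)
# Hierarchy ID 0 means the controller is not attached to a v1 hierarchy.
mounted = [e.name for e in entries if e.hierarchy != 0]
print(mounted)
```

In the sample, only cpu and cpuacct have a non-zero hierarchy ID, which matches the two mounts and the two lines of /proc/self/cgroup.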
cgroup v2 interface

    # mount -t cgroup2 none /sys/fs/cgroup
    # cd /sys/fs/cgroup
    # mkdir cgrp-a
    # echo `pidof firefox` > cgrp-a/cgroup.procs
    # cat cgrp-a/cgroup.controllers
    pid memory io
    # echo "+memory" > cgroup.subtree_control
    # echo 512M > cgrp-a/memory.low

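Because the v2 interface is just files, the same steps can be scripted. The sketch below assumes the interface file names cgroup.procs and cgroup.subtree_control and a hypothetical helper `create_cgroup`. Against a real cgroup2 mount it would need root, so the demo runs on a temporary directory as a stand-in and only exercises the file operations.

```python
import os
import tempfile


def create_cgroup(root, name, pid, memory_low=None):
    """Create child cgroup `name` under `root`, move `pid` into it,
    and optionally set its memory.low protection."""
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    # Enable the memory controller for the children of `root`.
    with open(os.path.join(root, "cgroup.subtree_control"), "w") as f:
        f.write("+memory")
    # Writing a PID to cgroup.procs migrates that process into the group.
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    if memory_low is not None:
        with open(os.path.join(path, "memory.low"), "w") as f:
            f.write(memory_low)
    return path


# Stand-in for /sys/fs/cgroup so the sketch runs without privileges.
with tempfile.TemporaryDirectory() as fake_root:
    path = create_cgroup(fake_root, "cgrp-a", os.getpid(), "512M")
    with open(os.path.join(path, "memory.low")) as f:
        written = f.read()
print(written)
```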
network cgroup controllers
- traffic control setting
  - works only with cgroup v1
  - net_cls
    - for classful qdisc
    - set specific class id
  - net_prio
    - same effect as SO_PRIORITY
- xt_cgroup
  - xtables netfilter matching module
  - cgroup path matching

Namespace
- resource management (isolation)
  - mainly used in container environments
  - light-weight virtualization
- managed as files (file descriptors)
  - life-time management
  - persistency ensured through bind mounts

Namespace types
- Available namespaces
  - mount (CLONE_NEWNS)
  - ipc (CLONE_NEWIPC)
  - uts (CLONE_NEWUTS)
  - network (CLONE_NEWNET)
  - pid (CLONE_NEWPID)
  - user (CLONE_NEWUSER)
  - cgroup (CLONE_NEWCGROUP)

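Each namespace type corresponds to one bit in the flags argument of clone(2) and unshare(2). The sketch below lists the bit values as defined in the Linux UAPI header linux/sched.h and decodes a flag word back into names; the helper name is hypothetical.

```python
# Flag values from <linux/sched.h>.
CLONE_FLAGS = {
    "CLONE_NEWNS": 0x00020000,      # mount namespace
    "CLONE_NEWCGROUP": 0x02000000,  # cgroup namespace
    "CLONE_NEWUTS": 0x04000000,     # hostname / domain name
    "CLONE_NEWIPC": 0x08000000,     # System V IPC, POSIX message queues
    "CLONE_NEWUSER": 0x10000000,    # user and group IDs
    "CLONE_NEWPID": 0x20000000,     # process IDs
    "CLONE_NEWNET": 0x40000000,     # network stack
}


def decode_namespace_flags(flags):
    """Return the names of all namespace flags set in `flags`."""
    return [name for name, bit in CLONE_FLAGS.items() if flags & bit]


# A container-like setup that wants new pid and network namespaces.
flags = CLONE_FLAGS["CLONE_NEWPID"] | CLONE_FLAGS["CLONE_NEWNET"]
names = decode_namespace_flags(flags)
print(hex(flags), names)
```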
Namespace information

    $ ls -la /proc/self/ns
    total 0
    lrwxrwxrwx 1 namhyung users 0 Jun  1 22:23 ipc -> 'ipc:[4026531839]'
    lrwxrwxrwx 1 namhyung users 0 Jun  1 22:23 mnt -> 'mnt:[4026531840]'
    lrwxrwxrwx 1 namhyung users 0 Jun  1 22:23 net -> 'net:[4026531969]'
    lrwxrwxrwx 1 namhyung users 0 Jun  1 22:23 pid -> 'pid:[4026531836]'
    lrwxrwxrwx 1 namhyung users 0 Jun  1 22:23 uts -> 'uts:[4026531838]'

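The link targets under /proc/<pid>/ns have the form type:[inode], and two processes share a namespace when their links match. The following is a minimal sketch for parsing and comparing such targets; the function names are hypothetical.

```python
import re

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'net:[4026531969]' into (type, inode)."""
    m = NS_LINK.match(target)
    if m is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return m.group(1), int(m.group(2))


def same_namespace(link_a, link_b):
    """True if both link targets name the same namespace."""
    return parse_ns_link(link_a) == parse_ns_link(link_b)


# Value taken from the listing on the slide.
kind, inode = parse_ns_link("net:[4026531969]")
print(kind, inode)
```

On a Linux system the input would typically come from `os.readlink("/proc/self/ns/net")`.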
Namespace API
- system call
  - clone
  - unshare
  - setns
- util-linux
  - unshare (-i/-m/-n/-p/-u/-U/-C) <command>
  - nsenter (-i/-m/-n/-p/-u/-U/-C/-t) <command>
- iproute / tc
  - ip netns <command>
  - tc -n <ns> <do something>

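Namespace usage: pid namespace
- pids are allocated when a new process is created
  - a pid does not change while the process is running
  - a container can therefore keep the same pids
- the first process created in the namespace is treated like init (pid 1)
  - reaps orphan processes (wait)
  - signal handling exceptions
  - once it exits, no new process can be created in the namespace
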
Namespace usage: user namespace
- can be created by an ordinary (non-privileged) user
  - the other namespaces require root privilege (CAP_SYS_ADMIN)
- new uid/gid (root) can be assigned
  - the default is overflowuid/gid (65534)
- user/group mapping is required
  - one-time operation
  - affects only that namespace
- capabilities are dropped when a process with a non-0 uid calls exec()

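The mapping is written to /proc/<pid>/uid_map as lines of the form "inside-ID outside-ID count". The sketch below shows how an ID from the parent namespace would appear inside the new namespace, falling back to the overflow ID when no mapping covers it. The helper names and the sample mapping are assumptions for illustration.

```python
OVERFLOW_UID = 65534  # default value of /proc/sys/kernel/overflowuid


def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside, outside, count) extents."""
    extents = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(x) for x in line.split())
            extents.append((inside, outside, count))
    return extents


def to_namespace_id(outside_id, extents):
    """Translate a parent-namespace ID to the ID seen inside the namespace."""
    for inside, outside, count in extents:
        if outside <= outside_id < outside + count:
            return inside + (outside_id - outside)
    return OVERFLOW_UID  # unmapped IDs show up as the overflow ID


# Hypothetical mapping: uid 1000 in the parent becomes root (0) inside.
extents = parse_id_map("0 1000 1\n")
print(to_namespace_id(1000, extents), to_namespace_id(1001, extents))
```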
Namespace usage: network namespace

    # ip netns add myns
    # ip netns exec myns ip link set dev lo up
    # ip link add veth0 type veth peer name veth1
    # ip link set veth1 netns myns
    # ip netns exec myns ifconfig veth1 10.1.1.1/24 up
    # ifconfig veth0 10.1.1.2/24 up

ebpf
- extended Berkeley Packet Filter
  - in-kernel virtual machine
  - JIT compiler
  - bpf(2) system call
  - used for a variety of purposes
- LLVM backend
  - uses a subset of the C language

cbpf vs ebpf
- socket only vs general purpose
- efficiency + generality
- cbpf is converted to ebpf internally
- JIT compilation is available depending on the CPU architecture

ebpf virtual machine
- 64-bit machine
- JIT-friendly ISA
- 10 general-purpose registers
- 512-byte stack
- map: kernel-user data sharing

ebpf ISA
- maps 1:1 to the instructions of recent CPUs
- at load time the program is verified and then JIT compiled
  - /proc/sys/net/core/bpf_jit_enable
  - interpreter mode
- restricted calls to kernel functions

ebpf verifier
- two-pass program verification
  1. CFG validation
  2. DFA - simulated execution
- loops are not allowed
- prevents invalid memory accesses

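The first pass (CFG validation) can be sketched as a depth-first search that rejects back-edges, out-of-range jumps, and unreachable instructions. This is a toy model rather than the kernel's code: the instruction encoding (`op`, `jmp`, `jcond`, `exit` tuples with a relative offset) and the function names are hypothetical.

```python
def successors(prog, pc):
    """Return the instruction indexes that can follow `pc`."""
    op = prog[pc][0]
    if op == "exit":
        return []
    if op == "jmp":                      # unconditional relative jump
        return [pc + 1 + prog[pc][1]]
    if op == "jcond":                    # fall through or jump
        return [pc + 1, pc + 1 + prog[pc][1]]
    return [pc + 1]                      # ordinary instruction


def check_cfg(prog):
    """Return None if the control flow graph is acceptable, else an error."""
    if not prog:
        return "empty program"
    WHITE, GREY, BLACK = 0, 1, 2         # unvisited / on DFS stack / done
    color = [WHITE] * len(prog)
    color[0] = GREY
    stack = [(0, iter(successors(prog, 0)))]
    while stack:
        pc, it = stack[-1]
        nxt = next(it, None)
        if nxt is None:                  # all successors explored
            color[pc] = BLACK
            stack.pop()
        elif not 0 <= nxt < len(prog):
            return f"control flow leaves the program at insn {pc}"
        elif color[nxt] == GREY:         # target is an ancestor: a loop
            return f"back-edge from insn {pc} to {nxt}"
        elif color[nxt] == WHITE:
            color[nxt] = GREY
            stack.append((nxt, iter(successors(prog, nxt))))
    for pc, c in enumerate(color):
        if c == WHITE:
            return f"unreachable insn {pc}"
    return None


ok_prog = [("op",), ("jcond", 1), ("op",), ("exit",)]
loop_prog = [("op",), ("jmp", -2), ("exit",)]
print(check_cfg(ok_prog), "|", check_cfg(loop_prog))
```

The second pass, simulated execution along every path with register and stack state tracking, is considerably more involved and is not modeled here.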
ebpf program
- organized as ELF sections
- context argument
  - socket buffer
  - seccomp data
  - kprobe registers
  - tracepoint arguments

ebpf map
- kernel <-> user communication
- created through the bpf(2) system call
- managed as files
- persistent map
  - /sys/fs/bpf filesystem

ebpf use cases
- Socket filtering (tcpdump, wireshark, xt_bpf, ...)
- Traffic control (tc-ebpf): classifier and action
- SO_REUSEPORT: packet -> socket mapping
- KCM (Kernel Connection Multiplexer): determine length
- Seccomp filter: allow system calls
- Dynamic tracing

ebpf compilers
- tcpdump
- bpf_asm
- LLVM
- BCC
- PLY
- systemtap

ebpf example

    SEC("socket")
    int bpf_prog(struct __sk_buff *skb)
    {
        long *value;
        /* IP protocol number, used as the map key */
        int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));

        /* count outgoing packets only */
        if (skb->pkt_type != PACKET_OUTGOING)
            return 0;

        value = bpf_map_lookup_elem(&my_map, &index);
        if (value)
            __sync_fetch_and_add(value, skb->len); /* add bytes atomically */

        return 0;
    }

    char _license[] SEC("license") = "GPL";

Q & A
Thank You

References
- http://www.kernel.org/doc/documentation/cgroup-v2.txt
- http://lwn.net/articles/531114/
- http://www.kernel.org/doc/documentation/networking/filter.txt