路由查找树的节点为key_vector,对于非叶子节点,tnode存储该节点的子节点,数据结构如下:
- struct key_vector {
- t_key key;
- unsigned char pos; /* 2log(KEYLENGTH) bits needed */
- unsigned char bits; /* 2log(KEYLENGTH) bits needed */
- unsigned char slen;
- union {
- /* This list pointer if valid if (pos | bits) == 0 (LEAF) */
- struct hlist_head leaf;
- /* This array is valid if (pos | bits) > 0 (TNODE) */
- struct key_vector __rcu *tnode[0];
- };
- };
key_vector各字段的意义参考https://www.kernel.org/doc/html/latest/networking/fib_trie.html。
为子节点的索引,共有bits位二进制用于索引子节点,也就是该节点下能存储
个子节点;
下图拷贝自https://vincent.bernat.ch/en/blog/2017-ipv4-route-lookup-linux,从上到下、从左到右,以第一个
的节点为例:
,那么该节点可以存储
也就是4个子节点;
,那么子节点索引就是从第4位开始,又因为
,所以,子节点索引就是
,下图中可以看到,该节点可以存储4个子节点,除去一个没有指向任何子节点的空指针外,共有3个子节点,分别为192.0.2.0、192.0.2.47、192.0.2.48,这几个子节点的
二进制位分别为0、2、3,也就是在父节点对应的索引,key_vector使用数组存储子节点指针,对应子节点的索引实际也是数组的索引,子节点指针存储在tnode指针数组里面。
key的子节点的索引的计算是通过get_cindex宏来实现的,cindex是child index的缩写,get_cindex除了计算key的子节点索引外,还计算了key是否与节点的前缀匹配,get_cindex定义如下:
#define get_cindex(key, kv) (((key) ^ (kv)->key) >> (kv)->pos)
,节点
的
位为0,亦或操作
,也就是相当于取
的值;
,节点
的
为该节点匹配的前缀,如果
与
的前缀相同,那么
就为0,否则就不为0,不为0就有
,也就是通过比较get_cindex的值是否大于等于
(
)就可以判断出前缀是否相同。
- static struct key_vector *fib_find_node(struct trie *t,
- struct key_vector **tp, u32 key)
- {
- struct key_vector *pn, *n = t->kv; // n: 查找树的根节点, pn: 查找树节点的父节点
- unsigned long index = 0; // 子节点索引(0.0.0.0的第一个子节点也是0.0.0.0)
-
- do {
- pn = n; // 记录父节点
- n = get_child_rcu(n, index); // 获取index子节点(第1次循环时,n指向0.0.0.0的根节点,index为0,根节点下只有一个子节点,索引就是0)
-
- if (!n) // 子节点为空,那么父节点就是最长匹配的节点
- break; // 退出循环
-
- index = get_cindex(key, n); // 子节点n不为空,需要在子节点n里面查找更长的匹配节点
-
- /* This bit of code is a bit tricky but it combines multiple
- * checks into a single check. The prefix consists of the
- * prefix plus zeros for the bits in the cindex. The index
- * is the difference between the key and this value. From
- * this we can actually derive several pieces of data.
- * if (index >= (1ul << bits))
- * we have a mismatch in skip bits and failed
- * else
- * we know the value is cindex
- *
- * This check is safe even if bits == KEYLENGTH due to the
- * fact that we can only allocate a node with 32 bits if a
- * long is greater than 32 bits.
- */
- if (index >= (1ul << n->bits)) { // index大于等于1ul << n->bits,那么key与kv的前缀不匹配,父节点就是最长匹配的节点
- n = NULL; // n为NULL,表示是查找中间节点时退出循环,非查找到叶子节点之后导致没有更长匹配而退出循环的
- break;
- }
-
- /* keep searching until we find a perfect match leaf or NULL */
- } while (IS_TNODE(n)); // n是非叶子节点,那么继续在n的子节点里面进行匹配查找
-
- *tp = pn; // tp记录key所在的最小非叶子节点
-
- return n;
- }
prefix_mismatch用于 检查key的前缀是否不匹配:
为0,那么
,0 & 任何数都为0,prefix_mismatch返回0,也就是key匹配;
不为0,那么key的后缀不为0或者前缀不匹配,
为
的各位取反加1, 首先
后缀为0的部分先变成了1,加1之后又变成了0,并且向
最低为1的位进1(取反之后该位变成0了)最终该位有变成1了,最终结果就是
后缀为0的部分还是为0,其次
不为0的前缀部分取反,其中夹杂的0的二进制位变成了1,
就是使前缀部分变为1,把不为0的部分当作子网的话,这里一定程度上就是计算子网掩码,
最终就是检查
的不为0的前缀是否不同。- static inline t_key prefix_mismatch(t_key key, struct key_vector *n)
- {
- t_key prefix = n->key;
-
- return (key ^ prefix) & (prefix | -prefix);
- }
fib_table_lookup主要原理就是深度优先查找最长匹配的路由,没有查找的路由的情况下使用回溯加分支限界减少查找路径查找父节点、兄弟节点,更详细的说明查看如下代码中的注释说明,fib_table_lookup代码实现如下:
- int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
- struct fib_result *res, int fib_flags)
- {
- struct trie *t = (struct trie *) tb->tb_data;
- #ifdef CONFIG_IP_FIB_TRIE_STATS
- struct trie_use_stats __percpu *stats = t->stats;
- #endif
- const t_key key = ntohl(flp->daddr);
- struct key_vector *n, *pn;
- struct fib_alias *fa;
- unsigned long index;
- t_key cindex;
-
- trace_fib_table_lookup(tb->tb_id, flp);
-
- pn = t->kv; // 获取路由表的根节点
- cindex = 0; // 根节点的子节点索引(根节点只有一个子节点,索引为0)
-
- n = get_child_rcu(pn, cindex); // 获取根节点的第0个索引的子节点
- if (!n) // 如果子节点为空,此时应该还没建立好路由表,返回-EAGAIN,需要再次查找路由表
- return -EAGAIN;
-
- #ifdef CONFIG_IP_FIB_TRIE_STATS
- this_cpu_inc(stats->gets);
- #endif
-
- /* Step 1: Travel to the longest prefix match in the trie */ // 在查找树里面遍历最长的前缀匹配
- for (;;) {
- index = get_cindex(key, n); // key的子节点索引
-
- /* This bit of code is a bit tricky but it combines multiple
- * checks into a single check. The prefix consists of the
- * prefix plus zeros for the "bits" in the prefix. The index
- * is the difference between the key and this value. From
- * this we can actually derive several pieces of data.
- * if (index >= (1ul << bits))
- * we have a mismatch in skip bits and failed
- * else
- * we know the value is cindex
- *
- * This check is safe even if bits == KEYLENGTH due to the
- * fact that we can only allocate a node with 32 bits if a
- * long is greater than 32 bits.
- */
- if (index >= (1ul << n->bits)) // key与n前缀不同(与n的子节点也不可能相同,没必要查找子节点),退出当前循环
- break;
-
- /* we have found a leaf. Prefixes have already been compared */
- if (IS_LEAF(n)) // 如果n是叶子节点,n指向fib_alias路由表项
- goto found; // 找到了key的路由表项,跳转到found
-
- /* only record pn and cindex if we are going to be chopping
- * bits later. Otherwise we are just wasting cycles.
- */
- if (n->slen > n->pos) { // 节点n的后缀长度slen大于pos,那么记录回溯的节点为当前节点(put_child如果子节点的slen大于父节点的slen,那么使用子节点的slen作为父节点的slen,也就是父节点比子节点匹配更长的前缀)
- pn = n; // 记录父节点为n节点,backtrace回溯的时候,从pn开始回溯(剪掉不需要回溯的分支)
- cindex = index; // 记录key在pn中的子节点索引(回溯的时候,需要往更小索引的子节点查找)
- }
-
- n = get_child_rcu(n, index); // 获取key在n的子节点
- if (unlikely(!n)) // 子节点为空,找不到匹配的路由
- goto backtrace; // 跳转到backtrace,回溯查找路由
- }
-
- /* Step 2: Sort out leaves and begin backtracing for longest prefix */
- for (;;) {
- /* record the pointer where our next node pointer is stored */
- struct key_vector __rcu **cptr = n->tnode;
-
- /* This test verifies that none of the bits that differ
- * between the key and the prefix exist in the region of
- * the lsb and higher in the prefix.
- */
- if (unlikely(prefix_mismatch(key, n)) || (n->slen == n->pos)) // 前缀不匹配需要回溯其他节点;不为0的前缀匹配但是slen等于pos,n的子节点的slen小于等于n节点的slen,n的子节点的前缀长度大于等于n的前缀长度,那么也需要回溯,这里比较绕,以192.168.2.0/24子网为例,192.168.3.1的前23位与192.168.2.0的前23位是匹配的,但是子网是24位,192.168.3.1并不在192.168.2.0/24这个子网里面,192.168.3.1也不会在192.168.2.0下面划分的更小的子网里面,所以需要回溯其他子网
- goto backtrace;
-
- /* exit out and process leaf */
- if (unlikely(IS_LEAF(n))) // 是叶子节点
- break; // 跳出当前循环,也就是跳转到了found
-
- /* Don't bother recording parent info. Since we are in
- * prefix match mode we will have to come back to wherever
- * we started this traversal anyway
- */
-
- while ((n = rcu_dereference(*cptr)) == NULL) {
- backtrace:
- #ifdef CONFIG_IP_FIB_TRIE_STATS
- if (!n)
- this_cpu_inc(stats->null_node_hit);
- #endif
- /* If we are at cindex 0 there are no more bits for
- * us to strip at this level so we must ascend back
- * up one level to see if there are any more bits to
- * be stripped there.
- */
- while (!cindex) { // cindex为0,也就是cindex是pn的最小子节点索引,回溯的时候是从当前子节点往更小的子节点回溯,因为当前已经是最小的索引的子节点了,那么只能往上一层节点回溯;回溯的过程就是缩短前缀的匹配
- t_key pkey = pn->key; // 获取父节点的key
-
- /* If we don't have a parent then there is
- * nothing for us to do as we do not have any
- * further nodes to parse.
- */
- if (IS_TRIE(pn)) // (n)->pos >= KEYLENGTH,pn没有父节点
- return -EAGAIN; // 找不到匹配的节点,返回-EAGAIN
- #ifdef CONFIG_IP_FIB_TRIE_STATS
- this_cpu_inc(stats->backtrack);
- #endif
- /* Get Child's index */
- pn = node_parent_rcu(pn); // 获取pn的父节点
- cindex = get_index(pkey, pn); // 获取key在父节点的索引(回溯从更新索引的子节点查找)
- }
-
- /* strip the least significant bit from the cindex */
- cindex &= cindex - 1; // 清除cindex最低的1的比特位,减少前缀匹配的位数(减1操作,后面的为0的二进制位都需要向前借1,导致后面的0都变成1了,最后向最低的为1的二进制位借1之后,最低的为1的二进制位就变成0了,旧的1&新的0最终还是使原来最低的为1的二进制位变成了0,低位的0借位变成1之后呢,因为原来是0,与操作之后又回到了0,整个操作就是将最低位的1变成0)
-
- /* grab pointer for next child node */
- cptr = &pn->tnode[cindex]; // 下一个查找的子节点
- }
- }
-
- found:
- /* this line carries forward the xor from earlier in the function */
- index = key ^ n->key; // 获取key的后缀(前缀部分相同,亦或之后就变成0了,n->key后缀部分为0,任何数与0亦或还是该数)
-
- /* Step 3: Process the leaf, if that fails fall back to backtracing */
- hlist_for_each_entry_rcu(fa, &n->leaf, fa_list) { // 遍历fib_alias下面的路由,查找符合flp条件的路由,fib_alias数据结构参考机械工业出版社《Linux内核源码剖析:TCP/IP实现(上册)》"19.3.4 fib_alias结构"
- struct fib_info *fi = fa->fa_info;
- int nhsel, err;
-
- if ((BITS_PER_LONG > KEYLENGTH) || (fa->fa_slen < KEYLENGTH)) {
- if (index >= (1ul << fa->fa_slen))
- continue;
- }
- if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
- continue;
- if (fi->fib_dead)
- continue;
- if (fa->fa_info->fib_scope < flp->flowi4_scope)
- continue;
- fib_alias_accessed(fa);
- err = fib_props[fa->fa_type].error;
- if (unlikely(err < 0)) {
- #ifdef CONFIG_IP_FIB_TRIE_STATS
- this_cpu_inc(stats->semantic_match_passed);
- #endif
- return err;
- }
- if (fi->fib_flags & RTNH_F_DEAD)
- continue;
- for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
- const struct fib_nh *nh = &fi->fib_nh[nhsel];
- struct in_device *in_dev = __in_dev_get_rcu(nh->nh_dev);
-
- if (nh->nh_flags & RTNH_F_DEAD)
- continue;
- if (in_dev &&
- IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) &&
- nh->nh_flags & RTNH_F_LINKDOWN &&
- !(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
- continue;
- if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {
- if (flp->flowi4_oif &&
- flp->flowi4_oif != nh->nh_oif)
- continue;
- }
-
- if (!(fib_flags & FIB_LOOKUP_NOREF))
- atomic_inc(&fi->fib_clntref);
-
- res->prefixlen = KEYLENGTH - fa->fa_slen; // 填充返回结果
- res->nh_sel = nhsel;
- res->type = fa->fa_type;
- res->scope = fi->fib_scope;
- res->fi = fi; // fib_info包含fib_nh,fib_nh包含具体的网关、输出网卡等路由信息
- res->table = tb;
- res->fa_head = &n->leaf;
- #ifdef CONFIG_IP_FIB_TRIE_STATS
- this_cpu_inc(stats->semantic_match_passed);
- #endif
- trace_fib_table_lookup_nh(nh);
-
- return err;
- }
- }
- #ifdef CONFIG_IP_FIB_TRIE_STATS
- this_cpu_inc(stats->semantic_match_miss);
- #endif
- goto backtrace;
- }
如下图所示,下图是一个简要的查找路径,红色线条从跟节点开始深度优先查找,找到最底层的时候找到空节点,然后绿箭头回溯父节点,第2层子节点索引为2,存在更小的子节点索引,缩小前缀匹配位数之后,可能匹配更小子节点的索引,所以往更小的子节点索引的兄弟节点查找路由。

https://www.kernel.org/doc/html/latest/networking/fib_trie.html
https://vincent.bernat.ch/en/blog/2017-ipv4-route-lookup-linux