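kr Stage 3 (Part 5): 32-bit Reverse Engineering

How to find the main function

For old VC compilers (VC 6.0), main is the third-to-last function call inside mainCRTStartup, the PE entry point, and it takes 3 arguments (wmain takes 4).
For newer VC compilers:
- The entry point mainCRTStartup calls __scrt_common_main. In release builds __scrt_common_main is inlined into mainCRTStartup.
- __scrt_common_main calls two functions, __security_init_cookie and __scrt_common_main_seh. The first initializes the cookie used by GS protection. The second is the key function that leads to main (in release builds the call is optimized into a jmp).
- __scrt_common_main_seh calls invoke_main, which in turn calls main. To locate invoke_main, look near the end of __scrt_common_main_seh for the function without arguments that is called right after two consecutive if checks. In release builds invoke_main is inlined, so look directly for a call with 3 arguments near the end of the function.
For other compilers, compile a test program first, debug it, and inspect the call stack to find identifying features.
IDA ships with signature files for common libraries, so it can usually recognize main from code patterns.

Creating IDA signature files

Taking VC 6.0 as an example, the VC98\Lib directory under the installation directory contains many lib files. Among them is a group of libc libraries with the LIBC prefix that hold many library functions. The suffixes of these libc libraries have the following meanings:

I: import version (no implementation, it only references the dynamic link library)
D: debug version
MT: multithreaded
P: C++

Here, based on the characteristics of the program being reversed, the library in use is determined to be LIBC.lib.

LIBC.lib actually consists of many obj files, each corresponding to a library function. At link time the linker recursively looks up only the library functions that are needed and links them into the program, so no unnecessary code is pulled in. The linker that ships with VC 6.0 can list the obj files inside LIBC.lib.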

> link -lib /list .\LIBC.LIB
Microsoft (R) Library Manager Version 6.00.8168
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

..\build\intel\st_obj\util.obj
..\build\intel\st_obj\matherr.obj
..\build\intel\st_obj\ldexp.obj
..\build\intel\st_obj\ieeemisc.obj
..\build\intel\st_obj\frexp.obj
..\build\intel\st_obj\fpexcept.obj
..\build\intel\st_obj\bessel.obj
...

We can also extract a specific obj file from the library.

link -lib /extract:build\intel\st_obj\printf.obj .\LIBC.LIB

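Once the obj file has been extracted, the FLAIR toolkit can be used to build a signature file from it.

First use pcf (pelf for ELF files) to extract patterns from printf.obj, which produces printf.pat.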

pcf .\printf.obj

Then use sigmake to turn the pattern file printf.pat into the IDA signature file printf.sig. The -n option attaches a descriptive name to the signature.

sigmake -n"TestSig" .\printf.pat printf.sig

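For IDA 7.7 the signature file has to be placed in the appropriate subfolder of the sig folder in the installation directory; here it goes into the pc folder. Then, in IDA, press Shift + F5 to open the signatures window and right-click to add the signature file.
printf is now recognized, and the code of printf is highlighted in light blue, which means IDA has marked it as library code.

The steps above add a signature for a single function only. In practice the whole process can be automated with a lib2sig.bat script.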

md %1_objs
cd %1_objs
for /f %%i in ('link -lib /list %1.lib') do link -lib /extract:%%i %1.lib
for %%i in (*.obj) do pcf %%i
sigmake -n"%1.lib" *.pat %1.sig
if exist %1.exc for %%i in (%1.exc) do find /v ";" %%i > abc.exc
if exist %1.exc for %%i in (%1.exc) do > abc.exc more +2 "%%i"
copy abc.exc %1.exc
del abc.exc
sigmake -n"%1.lib" *.pat %1.sig
copy %1.sig ..\%1.sig
cd ..
del %1_objs /s /q
rd %1_objs

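The following command builds LIBC.sig, the signature file for LIBC.LIB. Note that the script refers to the library with a lowercase extension, so the file has to be renamed to LIBC.lib by hand first.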

.\lib2sig.bat LIBC

However, LIBC.lib contains some 16-bit obj files that pcf cannot process, and pcf blocks on getchar before exiting, so the script hangs. The fix is to unpack pcf and patch out the getchar call.

> pcf nset.obj
nset.obj is not ar/coff file

press enter to exit.

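Expressions

Basic concepts

Types of expressions

Arithmetic expressions can be written in three notations: Polish (prefix), infix, and reverse Polish (postfix). Take $a + b \times c - d / f$ as an example. Its expression tree has $-$ at the root; the left subtree is $+$ with children $a$ and $\times$ (whose children are $b$ and $c$), and the right subtree is $/$ with children $d$ and $f$.
The three notations correspond to the three traversals of this tree:

Polish notation: pre-order traversal, $-+a\times b\ c/d\ f$.
Infix notation: in-order traversal, $a + b \times c - d / f$.
Reverse Polish notation: post-order traversal, $a\ b\ c\times+\ d\ f/-$.

In an expression tree, values are always leaves and operators are never leaves. Polish and reverse Polish notation directly identify the two operands attached to each operator, which fixes the evaluation order, whereas infix notation needs extra precedence information to recover it. For this reason computers usually represent expressions in Polish notation. Written as function calls, the example above is sub(add(a, mul(b, c)), div(d, f)), which is a Polish expression.

Most compilers convert arithmetic expressions into Polish notation.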
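
The relation between the tree and the two operator-first/operator-last notations can be sketched in C. This is only an illustration: the `Node` type and the function names are hypothetical, and `*` and `/` stand in for $\times$ and $/$.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node type: operators are inner nodes, operands are leaves. */
typedef struct Node {
    char sym;
    const struct Node *left, *right;
} Node;

/* Pre-order traversal: operator first, then both subtrees (Polish notation). */
static void prefix(const Node *n, char *out) {
    if (n == NULL) return;
    strncat(out, &n->sym, 1);
    prefix(n->left, out);
    prefix(n->right, out);
}

/* Post-order traversal: both subtrees first, operator last (reverse Polish). */
static void postfix(const Node *n, char *out) {
    if (n == NULL) return;
    postfix(n->left, out);
    postfix(n->right, out);
    strncat(out, &n->sym, 1);
}

/* Builds the tree for a + b * c - d / f and writes both traversals. */
static void traverse_example(char *pre, char *post) {
    const Node a = {'a', NULL, NULL}, b = {'b', NULL, NULL}, c = {'c', NULL, NULL};
    const Node d = {'d', NULL, NULL}, f = {'f', NULL, NULL};
    const Node mul = {'*', &b, &c};
    const Node add = {'+', &a, &mul};
    const Node dvd = {'/', &d, &f};
    const Node sub = {'-', &add, &dvd};
    pre[0] = post[0] = '\0';
    prefix(&sub, pre);
    postfix(&sub, post);
}

static void self_check(void) {
    char pre[16], post[16];
    traverse_example(pre, post);
    assert(strcmp(pre, "-+a*bc/df") == 0);
    assert(strcmp(post, "abc*+df/-") == 0);
}
```

Neither traversal needs parentheses or precedence rules, which is the property the text attributes to Polish and reverse Polish notation.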

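Expression optimization

An operation must deliver its result somewhere, otherwise no code is generated for it. There are three common ways to deliver a value: assignment, passing a function argument, and returning a value.
If all values in an expression are constants, constant folding is triggered, i.e. the expression is replaced by its result. Folding can also happen when not all values are constants; for example 3 * n + 6 * n can be optimized to 9 * n.
Release builds also apply peephole optimization to expressions: the compiler tries its optimization patterns on a small local piece of code; if one applies, it rescans and tries again, and when none applies the optimization ends.
Release builds also apply constant propagation: if one value in an expression is a variable whose value can be proven to be a known constant, the variable is replaced by that constant. This works well together with constant folding.
Copy propagation: similar to constant propagation, except that a variable is propagated instead of a constant. For example x = x + 8; x = x + 8; is optimized to x = (x + 8) + 8, i.e. x = x + 16.
Strength reduction: replace an expensive instruction sequence with a cheaper one, for example lea and shift instructions instead of multiplication and division instructions.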

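Rounding

There are three kinds of rounding:

Rounding up: $\left \lceil \frac{a}{b} \right \rceil$
Rounding down: $\left \lfloor \frac{a}{b} \right \rfloor$
Rounding toward zero: $\left [ \frac{a}{b} \right ]$

C/C++ and the vast majority of programming languages round toward zero (there are exceptions; Python, for example, rounds down).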
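
The three rounding modes can be compared with a small C sketch. The helper names are made up for this example; it assumes C99 or later, where `/` on integers truncates toward zero and `%` takes the sign of the dividend.

```c
#include <assert.h>

/* Rounding toward zero: what C's / operator does for integers. */
static int div_trunc(int a, int b) { return a / b; }

/* Rounding down: step one below the truncated quotient when there is
   a remainder and the operands have different signs. */
static int div_floor(int a, int b) {
    int q = a / b, r = a % b;
    return (r != 0 && ((r < 0) != (b < 0))) ? q - 1 : q;
}

/* Rounding up: step one above the truncated quotient when there is
   a remainder and the operands have the same sign. */
static int div_ceil(int a, int b) {
    int q = a / b, r = a % b;
    return (r != 0 && ((r < 0) == (b < 0))) ? q + 1 : q;
}

static void self_check(void) {
    assert(div_trunc(-7, 2) == -3);
    assert(div_floor(-7, 2) == -4);
    assert(div_ceil(-7, 2) == -3);
    assert(div_floor(7, 2) == 3);
    assert(div_ceil(7, 2) == 4);
}
```

For non-negative quotients truncation and rounding down agree; the modes only differ when the exact quotient is negative and not an integer, which is exactly the case the division optimizations below have to handle.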

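Division

When the divisor is a variable, no optimization is possible; div or idiv is chosen according to the operand type.
When the divisor is a constant, the division can be optimized.

Unsigned dividend, divisor is a power of 2

A plain shr is enough.

Signed dividend, divisor is a power of 2

When the dividend is signed and the divisor is a power of 2, the division is turned into an arithmetic shift. A negative dividend, however, needs special handling.
[Figure: bit pattern of $-x$, with the part that is shifted marked in blue]
Consider a negative number $- x$ (with $x > 0$). An arithmetic right shift by $n$ bits fills the high bits with the sign bit. The shift can be viewed as shifting the blue part in the figure, so after the shift $- x$ becomes $-\left \lfloor \frac{x-1}{2^n} \right \rfloor -1=-\left \lfloor \frac{x+2^n-1}{2^n} \right \rfloor=-\left \lceil \frac{x}{2^n} \right \rceil$ .

However, the result we want is $\left [ \frac{-x}{2^n} \right ]=-\left \lfloor \frac{x}{2^n} \right \rfloor$ .

So before shifting we add $2^n-1$ to the numerator, which reduces the blue part in the figure by $2^n -1$ .

The expression then becomes $-\left \lfloor \frac{x-1-(2^n-1)}{2^n} \right \rfloor -1=-\left \lfloor \frac{x}{2^n} \right \rfloor=\left [ \frac{-x}{2^n} \right ]$ .

For example, x / 8 can be optimized to the following assembly: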

mov eax, x
cdq
and edx, 7
add eax, edx
sar eax, 3

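The cdq instruction fills edx with the top bit of eax: if x < 0 then edx = 0xFFFFFFFF, otherwise edx = 0.
If the divisor is negative, rounding toward zero allows the sign to be factored out. In the assembly this appears as the normal optimized sequence followed by a neg on the final result.

As a special case, when the divisor is 2 a negative dividend needs 1 added to it, so the code simply subtracts edx from the dividend.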

mov eax, x
cdq
sub eax, edx
sar eax, 1

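Note: if the divisor is a power of 2 and the division is unsigned, the result is rounded down regardless of the sign the dividend would have as a signed value.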
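
The two sequences above can be mirrored in C to check them against the `/` operator. This is a sketch under the assumption of 32-bit operands; `sar32`, `div8` and `div2` are names invented here, and `sar32` emulates the sar instruction portably because C leaves `>>` on negative values implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic right shift (sar): rounds toward negative infinity. */
static int32_t sar32(int32_t v, unsigned n) {
    return v < 0 ? ~(~v >> n) : v >> n;
}

/* Mirrors: cdq / and edx, 7 / add eax, edx / sar eax, 3 */
static int32_t div8(int32_t x) {
    int32_t edx = x < 0 ? -1 : 0;   /* cdq: all ones if x is negative */
    edx &= 7;                       /* bias 2^n - 1, only for negative x */
    return sar32(x + edx, 3);
}

/* Mirrors: cdq / sub eax, edx / sar eax, 1 */
static int32_t div2(int32_t x) {
    int32_t edx = x < 0 ? -1 : 0;
    return sar32(x - edx, 1);       /* subtracting -1 adds the bias 1 */
}

static void self_check(void) {
    for (int32_t x = -1000; x <= 1000; x++) {
        assert(div8(x) == x / 8);
        assert(div2(x) == x / 2);
    }
    assert(div8(INT32_MIN) == INT32_MIN / 8);
}
```

Without the bias, `sar32(-9, 3)` gives -2, while C requires -9 / 8 == -1; the added 7 is what moves the result from rounding down to rounding toward zero.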

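Unsigned dividend, divisor is not a power of 2

When the dividend is unsigned and the divisor is not a power of 2, the division instruction can be avoided with the following transformation:
$\left \lfloor \frac{a}{b} \right \rfloor =\left \lfloor \frac{a\times \left \lceil \frac{2^n}{b} \right \rceil }{2^n} \right \rfloor\ (n\ge \left \lceil \log_2 a \right \rceil )$
The proof is as follows.

Let
$2^n=b\times k+r\ (0< r < b,k\in \mathbb{N} )$
so that $\left \lceil \frac{2^n}{b} \right \rceil = k+1$ and
$\frac{a\times \left \lceil \frac{2^n}{b} \right \rceil }{2^n}=\frac{a\times k+a}{b\times k+r}$
For the original equation to hold, we need
$\frac{a}{b} \le \frac{a\times k+a}{b\times k+r} <\frac{a+b}{b}$
First, because $r < b$, the following inequality clearly holds:
$\frac{a\times k+a}{b\times k+r}> \frac{a\times k+a}{b\times k+b}$
The right-hand side equals $\frac{a}{b}$, therefore
$\frac{a\times k+a}{b\times k+r}>\frac{a}{b}$
The inequality
$\frac{a\times k+a}{b\times k+r} <\frac{a+b}{b}$
is equivalent to
$b\times (2^n-a)+a\times r>0$
which clearly holds as well.

This proves the claim. Strictly speaking, the floor is preserved only if the right-hand side stays below $\left \lfloor \frac{a}{b} \right \rfloor + 1$; since the error term is $\frac{a\times (b-r)}{b\times 2^n}$, it is sufficient that $a\times (b-r) < 2^n$, and the compiler picks $n$ large enough for this to hold.

For 32-bit programs $n$ is larger than 32, so both the eax and edx registers are involved. For example, the assembly for $\left \lfloor \frac{a}{23} \right \rfloor$ is: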

.text:00401000 mov     eax, 0B21642C9h
.text:00401005 mul     [esp+a]
.text:00401009 shr     edx, 4
.text:0040100C push    edx

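Here 0B21642C9h is $\left \lceil \frac{2^{36}}{23} \right \rceil$ , which we call the MagicNumber. The product is wider than 32 bits and edx holds its high 32 bits, which already amounts to a right shift by 32; this is why edx is then shifted right by only 4 more bits.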
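
A minimal C sketch of this sequence, assuming 32-bit unsigned operands; `magic` and `udiv23` are illustrative names, not something from the original code.

```c
#include <assert.h>
#include <stdint.h>

/* ceil(2^n / b), valid as long as 2^n + b - 1 fits in 64 bits. */
static uint64_t magic(unsigned n, uint32_t b) {
    return (((uint64_t)1 << n) + b - 1) / b;
}

/* Mirrors: mov eax, 0B21642C9h / mul a / shr edx, 4 */
static uint32_t udiv23(uint32_t a) {
    uint64_t product = (uint64_t)a * 0xB21642C9u;  /* mul: edx:eax */
    uint32_t edx = (uint32_t)(product >> 32);      /* high half = >> 32 */
    return edx >> 4;                               /* 32 + 4 = 36 bits */
}

static void self_check(void) {
    assert(magic(36, 23) == 0xB21642C9u);
    for (uint32_t a = 0; a < 100000u; a++)
        assert(udiv23(a) == a / 23u);
    assert(udiv23(0xFFFFFFFFu) == 0xFFFFFFFFu / 23u);
}
```

When reversing, the same relation works backwards: with a total shift of $n$ bits and a MagicNumber $M$, the divisor is approximately $2^n / M$, rounded up to the next integer.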

As a special case, for some small divisors the MagicNumber itself exceeds 32 bits. For example $\left \lfloor \frac{a}{7} \right \rfloor$ is optimized to $\left \lfloor \frac{a\times \left \lceil \frac{2^{35}}{7} \right \rceil }{2^{35}} \right \rfloor$ , but the value of $\left \lceil \frac{2^{35}}{7} \right \rceil$ is 124924925h, which is larger than $2^{32}$ , so the following transformation is used:
$\left \lfloor \frac{a\times \left \lceil \frac{2^{35}}{7} \right \rceil }{2^{35}} \right \rfloor=\left \lfloor \frac{\frac{a-\left \lfloor \frac{a\times (\left \lceil \frac{2^{35}}{7} \right \rceil-2^{32} )}{2^{32}} \right \rfloor }{2}+\left \lfloor \frac{a\times (\left \lceil \frac{2^{35}}{7} \right \rceil-2^{32} )}{2^{32}} \right \rfloor }{2^2} \right \rfloor$
The derivation is as follows.

First
$\left \lfloor \frac{a\times (\left \lceil \frac{2^{35}}{7} \right \rceil-2^{32} )}{2^{32}} \right \rfloor =\left \lfloor \frac{a\times \left \lceil \frac{2^{35}}{7} \right \rceil }{2^{32}} \right \rfloor -a$
Therefore

$\begin{aligned} \left \lfloor \frac{\frac{a-\left \lfloor \frac{a\times (\left \lceil \frac{2^{35}}{7} \right \rceil-2^{32} )}{2^{32}} \right \rfloor }{2}+\left \lfloor \frac{a\times (\left \lceil \frac{2^{35}}{7} \right \rceil-2^{32} )}{2^{32}} \right \rfloor }{2^2} \right \rfloor & = \left \lfloor \frac{\frac{2\times a-\left \lfloor \frac{a\times \left \lceil \frac{2^{35}}{7} \right \rceil }{2^{32}} \right \rfloor }{2}+\left \lfloor \frac{a\times \left \lceil \frac{2^{35}}{7} \right \rceil }{2^{32}} \right \rfloor -a}{2^2} \right \rfloor \\ & = \left \lfloor \frac{\left \lfloor \frac{a\times \left \lceil \frac{2^{35}}{7} \right \rceil }{2^{32}} \right \rfloor }{2^3} \right \rfloor \\ & = \left \lfloor \frac{a\times \left \lceil \frac{2^{35}}{7} \right \rceil }{2^{35}} \right \rfloor \end{aligned}$

Here $\left \lceil \frac{2^{35}}{7} \right \rceil -2^{32}<2^{32}$ , so no constant taking part in the computation exceeds $2^{32}$ .

The corresponding assembly:

.text:00401000 mov     ecx, [esp+a]
.text:00401004 mov     eax, 24924925h
.text:00401009 mul     ecx
.text:0040100B sub     ecx, edx
.text:0040100D shr     ecx, 1
.text:0040100F add     ecx, edx
.text:00401011 shr     ecx, 2
.text:00401014 push    ecx
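
The sub / shr / add / shr sequence can be checked with a small C model. This is a sketch assuming 32-bit unsigned operands; `magic` and `udiv7` are names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* ceil(2^n / b), valid as long as 2^n + b - 1 fits in 64 bits. */
static uint64_t magic(unsigned n, uint32_t b) {
    return (((uint64_t)1 << n) + b - 1) / b;
}

/* Mirrors: mov eax, 24924925h / mul ecx / sub ecx, edx /
            shr ecx, 1 / add ecx, edx / shr ecx, 2 */
static uint32_t udiv7(uint32_t a) {
    /* t = floor(a * (M - 2^32) / 2^32), the high half of the product */
    uint32_t edx = (uint32_t)(((uint64_t)a * 0x24924925u) >> 32);
    uint32_t ecx = a - edx;   /* t <= a, so this cannot underflow */
    ecx >>= 1;                /* (a - t) / 2 */
    ecx += edx;               /* (a - t) / 2 + t, still below 2^32 */
    return ecx >> 2;          /* total shift: 32 + 1 + 2 = 35 bits */
}

static void self_check(void) {
    /* The stored constant is the 33-bit MagicNumber minus 2^32. */
    assert(magic(35, 7) - ((uint64_t)1 << 32) == 0x24924925u);
    for (uint32_t a = 0; a < 100000u; a++)
        assert(udiv7(a) == a / 7u);
    assert(udiv7(0xFFFFFFFFu) == 0xFFFFFFFFu / 7u);
}
```

The sub / shr 1 / add pattern after a mul is therefore a strong hint that the real MagicNumber is the visible constant plus 100000000h.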

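Signed dividend, divisor is not a power of 2

When the dividend is signed, the assembly looks as follows (the constants here correspond to a / 9, since 38E38E39h is $\left \lceil \frac{2^{33}}{9} \right \rceil$ ):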

.text:00401000 mov     ecx, [esp+a]
.text:00401004 mov     eax, 38E38E39h
.text:00401009 imul    ecx
.text:0040100B sar     edx, 1
.text:0040100D mov     eax, edx
.text:0040100F shr     eax, 31			; extract the sign bit of the shifted high half, i.e. of the quotient so far
.text:00401012 add     edx, eax			; add 1 if the result is negative
.text:00401014 push    edx

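From the assembly, the first 4 instructions resemble the unsigned case; the difference is that the multiply and the shift are the signed variants. The last 3 instructions are specific to a signed dividend: if the dividend is negative, the result has to be incremented by 1. The reason is the same as for a signed dividend with a power-of-2 divisor, except that here the code bluntly adds 1 to the final result. The compiler does this without a branch by adding the sign bit of the result to the result itself.

This shortcut relies on MagicNumber * a never being an exact multiple of $2^n$ for a negative $a$ : if it were, the shift would already give the exact quotient and adding 1 would be wrong. The compiler prevents this by choosing the $n$ in ${\left \lceil \frac{2^n}{b} \right \rceil }$ so that it cannot happen within the range of $a$ . In the example above, $n = 33$ makes the MagicNumber odd, so $|a|$ would have to be at least $2^{33}$ for the product to be a multiple of $2^{33}$ , and $a$ cannot be that large.

The assembly for a / 7 is shown below. The difference is that a is added to the result after multiplying by the MagicNumber. This is because 92492493h is negative when read as a signed 32-bit value, and a signed multiply is used to preserve the sign of a, while the MagicNumber only makes sense as an unsigned number. mov eax, 92492493h; imul ecx; is really mov eax, -6DB6DB6Dh; imul ecx;, and -6DB6DB6Dh = 92492493h - 100000000h, so to get the right result 100000000h has to be added back to the MagicNumber, which means adding a to the high 32 bits of MagicNumber * a.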

.text:00401000 mov     ecx, [esp+a]
.text:00401004 mov     eax, 92492493h
.text:00401009 imul    ecx
.text:0040100B add     edx, ecx		; add a * 100000000h (a added to the high half)
.text:0040100D sar     edx, 2
.text:00401010 mov     eax, edx
.text:00401012 shr     eax, 1Fh
.text:00401015 add     edx, eax		; negative result: add 1 to round toward zero
.text:00401017 push    edx

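Going further, the assembly for a / -7 is shown below. Because rounding is toward zero, the minus sign can be moved into the numerator, i.e. onto the MagicNumber. As a result the former -6DB6DB6Dh (92492493h - 100000000h) becomes 6DB6DB6Dh (100000000h - 92492493h). The MagicNumber should be treated as -92492493h (a value that cannot be represented as a 32-bit signed number), so where the constant used to be 100000000h too small it is now 100000000h too large, and a is subtracted instead of added.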

.text:00401001 mov     esi, [esp+a]
.text:00401005 mov     eax, 6DB6DB6Dh
.text:0040100A imul    esi
.text:0040100C sub     edx, esi		; subtract a * 100000000h (a subtracted from the high half)
.text:0040100E sar     edx, 2
.text:00401011 mov     eax, edx
.text:00401013 shr     eax, 1Fh
.text:00401016 add     edx, eax		; negative result: add 1 to round toward zero
.text:00401018 push    edx

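Note that the correction above is needed when the MagicNumber should take part in the computation as a positive number but turns negative once it is read as a signed value. If the MagicNumber is meant to be negative in the first place, no correction is needed. For example, a / 17 and a / -17 below: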

.text:00401001 mov     esi, [esp+a]
.text:00401005 mov     eax, 78787879h
.text:0040100A imul    esi
.text:0040100C sar     edx, 3
.text:0040100F mov     eax, edx
.text:00401011 shr     eax, 1Fh
.text:00401014 add     edx, eax
.text:00401016 push    edx
...
.text:00401021 mov     eax, -78787879h	; the minus sign of -17 is moved onto the MagicNumber
.text:00401026 imul    esi
.text:00401028 sar     edx, 3
.text:0040102B mov     ecx, edx
.text:0040102D shr     ecx, 1Fh
.text:00401030 add     edx, ecx
.text:00401032 push    edx

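This leads to the following rules:

MagicNumber positive, with an adjustment between imul and sar that subtracts the multiplicand from the high half of the product: the divisor is a negative constant. The MagicNumber is the two's complement of the real one, so negate it again to recover the absolute value of the divisor.
MagicNumber negative, with an adjustment between imul and sar that adds the multiplicand to the high half of the product: the divisor is a positive constant, and the MagicNumber simply exceeds the range of a signed integer.
MagicNumber negative, with no adjustment between imul and sar: the divisor is a negative constant.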
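
The first two rules can be checked with a C model of the a / 7 and a / -7 sequences shown earlier. This is a sketch assuming 32-bit signed operands; `sar32`, `imul_hi`, `sdiv7` and `sdiv_neg7` are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic right shift (sar), written portably. */
static int32_t sar32(int32_t v, unsigned n) {
    return v < 0 ? ~(~v >> n) : v >> n;
}

/* High 32 bits (edx) of the signed 64-bit product computed by imul. */
static int32_t imul_hi(int32_t a, int32_t m) {
    int64_t p = (int64_t)a * m;
    return (int32_t)(p < 0 ? ~(~p >> 32) : p >> 32);
}

/* a / 7: the constant 92492493h is -6DB6DB6Dh when read as signed. */
static int32_t sdiv7(int32_t a) {
    int32_t edx = imul_hi(a, -0x6DB6DB6D);
    edx += a;                               /* add edx, ecx */
    edx = sar32(edx, 2);                    /* sar edx, 2 */
    return edx + (int32_t)((uint32_t)edx >> 31);   /* add the sign bit */
}

/* a / -7: positive constant 6DB6DB6Dh with a subtraction afterwards. */
static int32_t sdiv_neg7(int32_t a) {
    int32_t edx = imul_hi(a, 0x6DB6DB6D);
    edx -= a;                               /* sub edx, esi */
    edx = sar32(edx, 2);
    return edx + (int32_t)((uint32_t)edx >> 31);
}

static void self_check(void) {
    for (int32_t a = -100000; a <= 100000; a++) {
        assert(sdiv7(a) == a / 7);
        assert(sdiv_neg7(a) == a / -7);
    }
    assert(sdiv7(INT32_MIN) == INT32_MIN / 7);
    assert(sdiv7(INT32_MAX) == INT32_MAX / 7);
    assert(sdiv_neg7(INT32_MIN) == INT32_MIN / -7);
}
```

Negating 6DB6DB6Dh modulo $2^{32}$ gives 92492493h again, and $2^{34}$ divided by that value rounds up to 7, which is how the second-to-last step of the first rule recovers the absolute value of the divisor.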

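Modulo

When the modulus (divisor) is a variable, no optimization is possible; div or idiv is chosen according to the operand type.
When the modulus (divisor) is a constant, the operation can be optimized (by some newer compilers, for example VS2019).

Unsigned dividend, modulus is a power of 2

If the dividend is unsigned, $a\bmod 2^n$ is equivalent to $a\&(2^n-1)$ .

Signed dividend, modulus is a power of 2

Newer versions use a jns-based method similar to the one for a positive divisor and a signed dividend. VS 2022, however, uses the same strategy as the old versions.

If the dividend is signed, the case where it is negative has to be handled, so the sign bit is kept while taking the remainder. If the dividend is negative, the final result is negative or 0, and a negative result needs its high (sign) bits filled in. To avoid another branch, the result is first decremented by 1 so that it is guaranteed to be negative once the sign bits are filled in, and finally incremented by 1 to restore the correct value.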

.text:00401000 mov     eax, [esp+a]
.text:00401004 and     eax, 10000000000000000000000001111111b 	; & 127, keeping the sign bit
.text:00401009 jns     short loc_401010                			; if a is non-negative, jump straight to the end
.text:0040100B dec     eax                             			; subtract 1 so that a zero remainder is handled correctly
.text:0040100C or      eax, 11111111111111111111111110000000b 	; fill in the sign bits
.text:0040100F inc     eax                             			; undo the earlier decrement
.text:00401010 loc_401010:                             
.text:00401010 push    eax

Old VC compilers instead use a branchless strategy based on the following formula.
$a \bmod 2^n = \begin{cases} a \mathbin{\&} (2^n - 1) & (a \geq 0) \\ -((-a) \mathbin{\&} (2^n - 1)) & (a < 0) \end{cases}$

.text:00401000 mov     eax, [esp+a]
.text:00401004 cdq
.text:00401005 xor     eax, edx
.text:00401007 sub     eax, edx              ; if a < 0, invert a and add 1 (a = abs(a))
.text:00401009 and     eax, 7                ; mod 8
.text:0040100C xor     eax, edx
.text:0040100E sub     eax, edx              ; if a < 0, invert the remainder and add 1
.text:00401010 push    eax
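
Both signed power-of-2 modulo sequences can be modelled in C and compared with the `%` operator. This is a sketch: the function names are made up, registers are modelled as `uint32_t`, and the final conversion back to `int32_t` is implementation-defined before C23 (it wraps on mainstream compilers).

```c
#include <assert.h>
#include <stdint.h>

/* Branchless form: cdq / xor / sub / and 7 / xor / sub */
static int32_t mod8_branchless(int32_t a) {
    uint32_t edx = a < 0 ? 0xFFFFFFFFu : 0u;    /* cdq */
    uint32_t eax = ((uint32_t)a ^ edx) - edx;   /* abs(a) */
    eax &= 7u;                                  /* mod 8 of the magnitude */
    eax = (eax ^ edx) - edx;                    /* restore the sign */
    return (int32_t)eax;                        /* assumes wrapping conversion */
}

/* jns form: and 8000007Fh / jns / dec / or FFFFFF80h / inc */
static int32_t mod128_jns(int32_t a) {
    uint32_t eax = (uint32_t)a & 0x8000007Fu;   /* keep sign bit and low 7 bits */
    if (eax & 0x80000000u) {                    /* jns not taken: a is negative */
        eax -= 1u;                              /* dec */
        eax |= 0xFFFFFF80u;                     /* fill in the sign bits */
        eax += 1u;                              /* inc */
    }
    return (int32_t)eax;                        /* assumes wrapping conversion */
}

static void self_check(void) {
    for (int32_t a = -1000; a <= 1000; a++) {
        assert(mod8_branchless(a) == a % 8);
        assert(mod128_jns(a) == a % 128);
    }
    assert(mod8_branchless(INT32_MIN) == 0);
}
```

The dec / inc pair matters for negative multiples of the modulus: for a = -128 the masked value is 80000000h, and without the decrement the or would produce FFFFFF80h (-128) instead of 0.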

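Unsigned dividend, modulus is not a power of 2

When the dividend is unsigned and the modulus is not a power of 2, $a \bmod b$ is optimized by the compiler to $a-\left \lfloor \frac{a}{b} \right \rfloor \times b$ , where $\left \lfloor \frac{a}{b} \right \rfloor$ is computed with the division optimization for an unsigned dividend and a non-power-of-2 divisor.

Taking x % 7 as an example, the assembly is:

; unsigned division: ecx = x / 7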
.text:004010AF mov     esi, [ebp+x]
.text:004010B2 mov     eax, 24924925h
.text:004010B7 mul     esi
.text:004010B9 mov     ecx, esi
.text:004010BB sub     ecx, edx
.text:004010BD shr     ecx, 1
.text:004010BF add     ecx, edx
.text:004010C1 shr     ecx, 2
; eax = x / 7 * 8 - x / 7 = x / 7 * 7
.text:004010C4 lea     eax, ds:0[ecx*8]
.text:004010CB sub     eax, ecx
; esi = x - x / 7 * 7 = x % 7
.text:004010CD sub     esi, eax
.text:004010CF push    esi

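Signed dividend, modulus is not a power of 2

If the modulus is positive, $a \bmod b$ is likewise optimized as $a-\left [ \frac{a}{b} \right ] \times b$ (with the quotient rounded toward zero), except that the division uses the optimization for a signed dividend and a non-power-of-2 divisor.

; signed division: ecx = x / 7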
.text:004010AF mov     esi, [ebp+x]
.text:004010B2 mov     eax, 92492493h
.text:004010B7 imul    esi
.text:004010B9 add     edx, esi
.text:004010BB sar     edx, 2
.text:004010BE mov     ecx, edx
.text:004010C0 shr     ecx, 1Fh
.text:004010C3 add     ecx, edx
; eax = x / 7 * 8 - x / 7 = x / 7 * 7
.text:004010C5 lea     eax, ds:0[ecx*8]
.text:004010CC sub     eax, ecx
; esi = x - x / 7 * 7 = x % 7
.text:004010CE sub     esi, eax
.text:004010D0 push    esi

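If the modulus is negative, the result is the same as for the positive modulus, i.e. $a \bmod b = a \bmod \left | b \right |$ .

Control flow

Ternary operator

Without optimization a ternary is compiled like an if-else, but with optimization enabled the compiler applies several tricks to reduce the number of branches.

Equality form

An equality-form ternary has the shape x == a ? b : c. In fact every equality-form ternary can be derived from x == 0 ? 0 : -1.

The assembly for x == 0 ? 0 : -1 is: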

.text:00401003 mov     eax, [ebp+x]		; eax = x
.text:00401006 neg     eax				; CF = eax == 0 ? 0 : 1; eax = -x; 
.text:00401008 sbb     eax, eax			; eax = eax - eax - CF = -CF = x == 0 ? 0 : -1

For x == a ? b : c the assembly is:

.text:00401000 mov     eax, [esp+x]
.text:00401004 sub     eax, a
.text:00401007 neg     eax
.text:00401009 sbb     eax, eax
.text:0040100B and     eax, c - b		; x == a ? 0 : c - b
.text:0040100E add     eax, b			; x == a ? b : c
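
The sub / neg / sbb / and / add sequence can be modelled in C. This is a sketch with an invented function name; registers are modelled as `uint32_t` so the arithmetic wraps like the hardware does, and the final conversion back to `int32_t` is implementation-defined before C23 (it wraps on mainstream compilers).

```c
#include <assert.h>
#include <stdint.h>

/* Branchless x == a ? b : c */
static int32_t select_eq(int32_t x, int32_t a, int32_t b, int32_t c) {
    uint32_t eax = (uint32_t)x - (uint32_t)a;    /* sub eax, a */
    /* neg sets CF when eax != 0; sbb eax, eax then yields 0 or all ones */
    uint32_t mask = eax != 0 ? 0xFFFFFFFFu : 0u;
    mask &= (uint32_t)c - (uint32_t)b;           /* and eax, c - b */
    mask += (uint32_t)b;                         /* add eax, b */
    return (int32_t)mask;                        /* assumes wrapping conversion */
}

static void self_check(void) {
    assert(select_eq(5, 5, 10, 20) == 10);
    assert(select_eq(4, 5, 10, 20) == 20);
    assert(select_eq(1, 2, -3, 7) == 7);
    assert(select_eq(2, 2, -3, 7) == -3);
}
```

The mask is either 0 or all ones, so the and keeps either nothing or the full difference c - b, and the final add turns that into b or c.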

As a special case, if b + 1 == c the difference of 1 can be produced with setnz. For example, the assembly for x == a ? b : b + 1 is:

.text:00401003 xor     eax, eax
.text:00401005 cmp     [ebp+x], a
.text:00401009 setnz   al				; x == a ? 0 : 1
.text:0040100C add     eax, b			; x == a ? b : b + 1

Newer compilers optimize ternaries with the cmovxx conditional move instructions:

.text:00401003 cmp     [ebp+x], a
.text:00401007 mov     eax, c
.text:0040100C mov     ecx, b
.text:00401011 cmovz   eax, ecx

Inequality form

Taking x > a ? b : c as an example, the assembly is:

.text:00401000 mov     ecx, [esp+x]		; ecx = x
.text:00401004 xor     eax, eax			; eax = 0
.text:00401006 cmp     ecx, a
.text:00401009 setle   al				; al = x <= a ? 1 : 0
.text:0040100C dec     eax				; eax = x <= a ? 0 : -1
.text:0040100D and     eax, b - c		; eax = x <= a ? 0 : b - c
.text:0040100F add     eax, c			; eax = x <= a ? c : b
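
A C model of the setle / dec / and / add idea, written so that the mask constant and the added constant are consistent with x > a ? b : c. This is a sketch with an invented function name; the conversion back to `int32_t` is implementation-defined before C23 (it wraps on mainstream compilers).

```c
#include <assert.h>
#include <stdint.h>

/* Branchless x > a ? b : c */
static int32_t select_gt(int32_t x, int32_t a, int32_t b, int32_t c) {
    uint32_t eax = (x <= a) ? 1u : 0u;   /* setle al (eax was zeroed before) */
    eax -= 1u;                           /* dec: 0 if x <= a, all ones otherwise */
    eax &= (uint32_t)b - (uint32_t)c;    /* keep b - c only when x > a */
    eax += (uint32_t)c;                  /* c, or c + (b - c) = b */
    return (int32_t)eax;                 /* assumes wrapping conversion */
}

static void self_check(void) {
    assert(select_gt(6, 5, 10, 20) == 10);
    assert(select_gt(5, 5, 10, 20) == 20);
    assert(select_gt(-1, 0, -7, 3) == 3);
    assert(select_gt(1, 0, -7, 3) == -7);
}
```

The inverted condition (setle for a > comparison) followed by dec is the recognizable part: it turns the 0 / 1 flag value into an all-ones / zero mask without a branch.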

Newer compilers optimize ternaries with the cmovxx conditional move instructions:

.text:00401003 cmp     [ebp+x], a
.text:00401007 mov     eax, c			; eax = c
.text:0040100C mov     ecx, b			; ecx = b
.text:00401011 cmovg   eax, ecx			; eax = x > a ? ecx : eax

Expression form

We call a ternary of the form condition ? expression1 : expression2 an expression-form ternary.

Take the following code as an example.

int main(int argc, char *argv[]) {
    return argc < 8 ? argc / 8 : argc / 2;
}

The assembly generated by VC6.0 is shown below. This compiler compiles an expression-form ternary like an if-else and does not try to reduce the number of branches.

.text:00401000 mov     eax, [esp+x]
.text:00401004 cmp     eax, 8
.text:00401007 cdq							; edx = x < 0 ? -1 : 0
.text:00401008 jge     short loc_401013
.text:0040100A and     edx, 7				; edx = x < 0 ? 7 : 0
.text:0040100D add     eax, edx				; eax = x < 0 ? eax + 7 : eax
.text:0040100F sar     eax, 3				; eax >>= 3
.text:00401012 retn
.text:00401013 loc_401013: 
.text:00401013 sub     eax, edx				; eax = x < 0 ? eax + 1 : eax
.text:00401015 sar     eax, 1				; eax >>= 1
.text:00401017 retn

Newer compilers still use the cmovxx conditional move instructions here:

.text:00401003 push    esi					; save esi
.text:00401004 mov     esi, [ebp+x]			; esi = x
.text:00401007 mov     eax, esi
.text:00401009 cdq							; edx = x < 0 ? -1 : 0
.text:0040100A sub     eax, edx				; eax = x < 0 ? x + 1 : x
.text:0040100C mov     ecx, eax				; ecx = x < 0 ? x + 1 : x
.text:0040100E mov     eax, esi
.text:00401010 cdq							; edx = x < 0 ? -1 : 0
.text:00401011 and     edx, 7				; edx = x < 0 ? 7 : 0
.text:00401014 sar     ecx, 1				; ecx = (x < 0 ? x + 1 : x) >> 1
.text:00401016 add     eax, edx				; eax = x < 0 ? x + 7 : x
.text:00401018 sar     eax, 3				; eax =  (x < 0 ? x + 7 : x) >> 3
.text:0040101B cmp     esi, 8
.text:0040101E pop     esi					; restore esi
.text:0040101F cmovge  eax, ecx				; eax = x >= 8 ? (x < 0 ? x + 1 : x) >> 1 : (x < 0 ? x + 7 : x) >> 3

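Absolute-value form

For ternaries of the form x >= 0 ? x : -x, newer VC versions optimize the expression into abs(x) (abs is implemented the same way in newer and older versions and is inlined). The assembly for abs(x) is: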

.text:00401000 mov     eax, [esp+x]
.text:00401004 cdq							; edx = x < 0 ? -1 : 0
.text:00401005 xor     eax, edx				; eax = x < 0 ? ~x : x
.text:00401007 sub     eax, edx				; eax = x < 0 ? (~x) + 1 : x
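
The cdq / xor / sub idiom translates to C as follows. This is a sketch with an invented function name; the register is modelled as `uint32_t`, and the conversion back to `int32_t` is implementation-defined before C23 (it wraps on mainstream compilers).

```c
#include <assert.h>
#include <stdint.h>

/* Branchless abs: x ^ mask inverts x when it is negative,
   and subtracting the mask (-1) then adds 1. */
static int32_t abs_branchless(int32_t x) {
    uint32_t edx = x < 0 ? 0xFFFFFFFFu : 0u;    /* cdq */
    uint32_t eax = ((uint32_t)x ^ edx) - edx;   /* xor eax, edx / sub eax, edx */
    return (int32_t)eax;                        /* assumes wrapping conversion */
}

static void self_check(void) {
    assert(abs_branchless(-5) == 5);
    assert(abs_branchless(5) == 5);
    assert(abs_branchless(0) == 0);
    assert(abs_branchless(-2147483647) == 2147483647);
}
```

The same xor / sub pair also appears in the branchless signed modulo shown earlier, once to take the magnitude and once to restore the sign.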

VC6.0 compiles it directly as an if statement.

.text:00401000 mov     eax, [esp+x]
.text:00401004 test    eax, eax
.text:00401006 jge     short locret_40100A
.text:00401008 neg     eax
.text:0040100A locret_40100A:
... 

if statements

if

An assembly example:

.text:00401003 cmp     [ebp+x], 0
.text:00401007 jle     short loc_401016

.text:00401009 push    offset string				; "x > 0"
.text:0040100E call    _puts
.text:00401013 add     esp, 4

.text:00401016 loc_401016: 

if-else

An assembly example:

.text:00401003 cmp     [ebp+x], 0
.text:00401007 jle     short loc_401018

.text:00401009 push    offset string1                  ; "x > 0"
.text:0040100E call    _puts
.text:00401013 add     esp, 4
.text:00401016 jmp     short loc_401025

.text:00401018 loc_401018:
.text:00401018 push    offset string2                  ; "x <= 0"
.text:0040101D call    _puts
.text:00401022 add     esp, 4

.text:00401025 loc_401025:      

if-else if-else

An assembly example:

.text:00401003 cmp     [ebp+x], 0
.text:00401007 jle     short loc_401018

.text:00401009 push    offset string1                  ; "x > 0"
.text:0040100E call    _puts
.text:00401013 add     esp, 4
.text:00401016 jmp     short loc_40103A

.text:00401018 loc_401018:
.text:00401018 cmp     [ebp+x], 0
.text:0040101C jnz     short loc_40102D

.text:0040101E push    offset string2                  ; "x == 0"
.text:00401023 call    _puts
.text:00401028 add     esp, 4
.text:0040102B jmp     short loc_40103A

.text:0040102D loc_40102D:
.text:0040102D push    offset string3                  ; "x <= 0"
.text:00401032 call    _puts
.text:00401037 add     esp, 4

.text:0040103A loc_40103A:    

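switch statements

Few cases

Unlike the if statements above, a switch performs all comparisons first and places the code of the case bodies together after them. This preserves the behavior that execution falls through to the next case when there is no break.

An assembly example: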

.text:00401004 mov     eax, [ebp+x]
.text:00401007 mov     [ebp+val], eax

.text:0040100A cmp     [ebp+val], 0
.text:0040100E jz      short loc_40101E

.text:00401010 cmp     [ebp+val], 1
.text:00401014 jz      short loc_40102D

.text:00401016 cmp     [ebp+val], 2
.text:0040101A jz      short loc_40103C

.text:0040101C jmp     short loc_40104B

.text:0040101E loc_40101E:
.text:0040101E push    offset string1                  ; "x == 0"
.text:00401023 call    _puts
.text:00401028 add     esp, 4
.text:0040102B jmp     short loc_401058

.text:0040102D loc_40102D:
.text:0040102D push    offset string2                  ; "x == 1"
.text:00401032 call    _puts
.text:00401037 add     esp, 4
.text:0040103A jmp     short loc_401058

.text:0040103C loc_40103C:
.text:0040103C push    offset string3                  ; "x == 2"
.text:00401041 call    _puts
.text:00401046 add     esp, 4
.text:00401049 jmp     short loc_401058

.text:0040104B loc_40104B:
.text:0040104B push    offset string4                  ; "default"
.text:00401050 call    _puts
.text:00401055 add     esp, 4

.text:00401058 loc_401058:

Many cases, fairly contiguous

An assembly example:

.text:00401074 jpt_40101C dd offset $LN4               ; jump table for switch statement
.text:00401074 dd offset $LN5
.text:00401074 dd offset def_40101C
.text:00401074 dd offset $LN6
.text:00401074 dd offset $LN7


.text:00401004 mov     eax, [ebp+x]
.text:00401007 mov     [ebp+val], eax

.text:0040100A mov     ecx, [ebp+val]
.text:0040100D sub     ecx, 3                          ; switch 5 cases
.text:00401010 mov     [ebp+val], ecx

.text:00401013 cmp     [ebp+val], 4
.text:00401017 ja      short def_40101C                ; jumptable 0040101C default case, case 5

.text:00401019 mov     edx, [ebp+val]
.text:0040101C jmp     ds:jpt_40101C[edx*4]            ; switch jump

.text:00401023 $LN4:
.text:00401023 push    offset string1                  ; jumptable 0040101C case 3
.text:00401028 call    _puts
.text:0040102D add     esp, 4
.text:00401030 jmp     short loc_40106C

.text:00401032 $LN5:
.text:00401032 push    offset string2                  ; jumptable 0040101C case 4
.text:00401037 call    _puts
.text:0040103C add     esp, 4
.text:0040103F jmp     short loc_40106C

.text:00401041 $LN6:
.text:00401041 push    offset string3                  ; jumptable 0040101C case 6
.text:00401046 call    _puts
.text:0040104B add     esp, 4
.text:0040104E jmp     short loc_40106C

.text:00401050 $LN7:
.text:00401050 push    offset string4                  ; jumptable 0040101C case 7
.text:00401055 call    _puts
.text:0040105A add     esp, 4
.text:0040105D jmp     short loc_40106C

.text:0040105F def_40101C:
.text:0040105F push    offset string5                  ; jumptable 0040101C default case, case 5
.text:00401064 call    _puts
.text:00401069 add     esp, 4

.text:0040106C loc_40106C:

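Many cases, fairly sparse

When there are many cases, they are fairly sparse, and the difference between the largest and the smallest case value is within 256, the jump table is deduplicated and an extra byte array stores, for each value, the index into the jump table.

An assembly example: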

.text:0040107C jpt_401026 dd offset $LN4
.text:0040107C dd offset $LN6                          ; jump table for switch statement
.text:0040107C dd offset $LN7
.text:0040107C dd offset $LN5
.text:0040107C dd offset $LN8

.text:00401090 byte_401090 db 0, 4, 4, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
.text:00401090 db 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4 ; indirect table for switch statement
.text:00401090 db 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
.text:00401090 db 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
.text:00401090 db 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
.text:00401090 db 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
.text:00401090 db 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
.text:00401090 db 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3


.text:00401004 mov     eax, [ebp+x]
.text:00401007 mov     [ebp+val], eax

.text:0040100A mov     ecx, [ebp+val]
.text:0040100D sub     ecx, 3                          ; switch 254 cases
.text:00401010 mov     [ebp+val], ecx

.text:00401013 cmp     [ebp+val], 253
.text:0040101A ja      short $LN8                      ; jumptable 00401026 default case, cases 4,5,7-79,81-255

.text:0040101C mov     edx, [ebp+val]
.text:0040101F movzx   eax, ds:byte_401090[edx]
.text:00401026 jmp     ds:jpt_401026[eax*4]            ; switch jump

.text:0040102D $LN4:
.text:0040102D push    offset string1                  ; jumptable 00401026 case 3
.text:00401032 call    _puts
.text:00401037 add     esp, 4
.text:0040103A jmp     short loc_401076

.text:0040103C $LN5:
.text:0040103C push    offset string2                  ; jumptable 00401026 case 256
.text:00401041 call    _puts
.text:00401046 add     esp, 4
.text:00401049 jmp     short loc_401076

.text:0040104B $LN6:
.text:0040104B push    offset string3                  ; jumptable 00401026 case 6
.text:00401050 call    _puts
.text:00401055 add     esp, 4
.text:00401058 jmp     short loc_401076

.text:0040105A $LN7:
.text:0040105A push    offset string4                  ; jumptable 00401026 case 80
.text:0040105F call    _puts
.text:00401064 add     esp, 4
.text:00401067 jmp     short loc_401076

.text:00401069 $LN8:
.text:00401069 push    offset string5                  ; jumptable 00401026 default case, cases 4,5,7-79,81-255
.text:0040106E call    _puts
.text:00401073 add     esp, 4

.text:00401076 loc_401076:
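
The two-level lookup can be sketched in C for the same case values (3, 6, 80, 256). The table and function names are hypothetical, and strings stand in for the jump targets.

```c
#include <assert.h>
#include <string.h>

/* Indices into the deduplicated jump table; 4 is the default target. */
enum { T_CASE3, T_CASE6, T_CASE80, T_CASE256, T_DEFAULT };

/* Stand-in for the dd jump table: one entry per distinct target. */
static const char *const jump_table[] = {
    "case 3", "case 6", "case 80", "case 256", "default"
};

/* Stand-in for the byte array: one byte per value in [3, 256]. */
static unsigned char index_table[254];

static void build_index_table(void) {
    memset(index_table, T_DEFAULT, sizeof index_table);
    index_table[3 - 3] = T_CASE3;
    index_table[6 - 3] = T_CASE6;
    index_table[80 - 3] = T_CASE80;
    index_table[256 - 3] = T_CASE256;
}

static const char *dispatch(int x) {
    unsigned idx = (unsigned)x - 3u;        /* sub ecx, 3 */
    if (idx > 253u)                         /* cmp 253 / ja default */
        return jump_table[T_DEFAULT];
    return jump_table[index_table[idx]];    /* movzx, then indirect jmp */
}

static void self_check(void) {
    build_index_table();
    assert(strcmp(dispatch(3), "case 3") == 0);
    assert(strcmp(dispatch(80), "case 80") == 0);
    assert(strcmp(dispatch(4), "default") == 0);
    assert(strcmp(dispatch(-1), "default") == 0);
}
```

The unsigned comparison after subtracting the smallest case value handles both bounds at once: values below 3 wrap around to large unsigned numbers and take the default path together with values above 256.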

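Many cases, very sparse

In this case the compiler builds a binary-tree-like nest of if-else branches, with table lookups embedded inside it, which is considerably more complex.

Loops

do-while

An assembly example: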

.text:00401003 loc_401003:
.text:00401003 mov     eax, [ebp+x]
.text:00401006 add     eax, 1
.text:00401009 mov     [ebp+x], eax

.text:0040100C mov     ecx, [ebp+x]
.text:0040100F cmp     ecx, [ebp+y]
.text:00401012 jl      short loc_401003

while

An assembly example:

.text:00401003 loc_401003:
.text:00401003 mov     eax, [ebp+x]
.text:00401006 cmp     eax, [ebp+y]
.text:00401009 jge     short loc_401016

.text:0040100B mov     ecx, [ebp+x]
.text:0040100E add     ecx, 1
.text:00401011 mov     [ebp+x], ecx
.text:00401014 jmp     short loc_401003

.text:00401016 loc_401016:  

for

An assembly example:

.text:00401004 mov     eax, [ebp+x]
.text:00401007 mov     [ebp+i], eax
.text:0040100A jmp     short loc_401015

.text:0040100C loc_40100C:
.text:0040100C mov     ecx, [ebp+i]
.text:0040100F add     ecx, 1
.text:00401012 mov     [ebp+i], ecx

.text:00401015 loc_401015:
.text:00401015 mov     edx, [ebp+i]
.text:00401018 cmp     edx, [ebp+y]
.text:0040101B jge     short loc_401030

.text:0040101D mov     eax, [ebp+i]
.text:00401020 push    eax
.text:00401021 push    offset _Format                  ; "%d\n"
.text:00401026 call    _printf
.text:0040102B add     esp, 8
.text:0040102E jmp     short loc_40100C

.text:00401030 loc_401030:                     

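Loop optimization

Control-flow structure optimization

With optimization enabled, both while and for loops are turned into the if + do...while form.
An assembly example follows (instruction reordering caused by pipeline optimization has been undone, so the instruction addresses do not line up):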

.text:00401004 mov     esi, [ebp+x]
.text:00401008 mov     edi, [ebp+y]

.text:0040100B cmp     esi, edi
.text:0040100D jge     short loc_401023

.text:00401010 loc_401010:
.text:00401010 push    esi
.text:00401011 push    offset _Format                  ; "%d\n"
.text:00401016 call    _printf
.text:0040101C add     esp, 8

.text:0040101B inc     esi

.text:0040101F cmp     esi, edi
.text:00401021 jl      short loc_401010

.text:00401023 loc_401023: 
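
At source level the transformation corresponds to the following pair of functions. This is an illustrative sketch with made-up names; the loop body is a simple sum so that the two forms can be compared.

```c
#include <assert.h>

/* Source form: the condition is tested before every iteration,
   and an unconditional jump returns to the top of the loop. */
static int sum_while(int x, int y) {
    int sum = 0;
    while (x < y) {
        sum += x;
        x++;
    }
    return sum;
}

/* Optimized form: one guard test up front, then a do-while whose
   back edge is the only jump executed per iteration. */
static int sum_rotated(int x, int y) {
    int sum = 0;
    if (x < y) {
        do {
            sum += x;
            x++;
        } while (x < y);
    }
    return sum;
}

static void self_check(void) {
    for (int x = -5; x <= 5; x++)
        for (int y = -5; y <= 5; y++)
            assert(sum_while(x, y) == sum_rotated(x, y));
}
```

When reversing, a conditional jump that skips a do-while-shaped loop and tests the same condition as the loop's back edge therefore usually indicates an original while or for loop.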

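Pipeline optimization (newer compilers)

The compiler sometimes unrolls a loop to reduce the number of iterations.

When the loop condition is a constant expression, as in the following function: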

int Fun() {
    int nSum = 0;
    for (int i = 0; i < 100; i++) {
        nSum += i;
    }
    return nSum;
}

After optimization the compiler produces the following code:

int __cdecl Fun()
{
  int nSum0; // eax
  int i; // ecx
  int nSum2; // edx
  int nSum1; // esi
  int nSum3; // edi

  nSum0 = 0;
  i = 0;
  nSum2 = 0;
  nSum1 = 0;
  nSum3 = 0;
  do
  {
    nSum0 += i;
    nSum3 += i + 1;
    nSum1 += i + 2;
    nSum2 += i + 3;
    i += 4;
  }
  while ( i < 100 );
  return nSum3 + nSum2 + nSum1 + nSum0;
}

The corresponding assembly:

.text:00401002 xor     eax, eax
.text:00401004 xor     ecx, ecx
.text:00401006 xor     edx, edx
.text:00401008 xor     esi, esi
.text:0040100A xor     edi, edi

.text:00401010 loc_401010:
.text:00401010 inc     edi
.text:00401011 add     esi, 2
.text:00401014 add     edx, 3
.text:00401017 add     eax, ecx				; eax = eax + ecx + 0
.text:00401019 add     edi, ecx				; edi = edi + ecx + 1
.text:0040101B add     esi, ecx				; esi = esi + ecx + 2
.text:0040101D add     edx, ecx				; edx = edx + ecx + 3

.text:0040101F add     ecx, 4				; ecx = ecx + 4
.text:00401022 cmp     ecx, 64h
.text:00401025 jl      short loc_401010

.text:00401027 lea     ecx, [edx+esi]		; ecx = edx + esi
.text:0040102A add     ecx, edi				; ecx = edx + esi + edi
.text:0040102C pop     edi
.text:0040102D add     eax, ecx				; eax = eax +  edx + esi + edi
.text:00401030 retn

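When the loop condition involves a variable, the compiler cannot determine the number of iterations, so the unrolled loop has to check the boundary. For example: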

int Fun(int x) {
    int nSum = 0;
    for (int i = 0; i < x; i++) {
        nSum += i;
    }
    return nSum;
}

After optimization the compiler produces the following code:

int __fastcall Fun(int x)
{
  int nSum1; // edx
  int nSum2; // esi
  int i; // eax

  nSum1 = 0;
  nSum2 = 0;
  i = 0;
  if ( x >= 2 )
  {
    do
    {
      nSum1 += i;
      nSum2 += i + 1;
      i += 2;
    }
    while ( i < x - 1 );
  }
  if ( i >= x )
    i = 0;
  return nSum1 + nSum2 + i;
}

The corresponding assembly:

.text:00401001 xor     edx, edx
.text:00401003 xor     esi, esi
.text:00401005 xor     eax, eax

.text:00401008 cmp     ecx, 2
.text:0040100B jl      short loc_40101C

.text:0040100D lea     edi, [ecx-1]

.text:00401010 loc_401010:
.text:00401010 inc     esi
.text:00401011 add     edx, eax				; edx = edx + eax + 0
.text:00401013 add     esi, eax				; esi = esi + eax + 1

.text:00401015 add     eax, 2				; eax = eax + 2
.text:00401018 cmp     eax, edi
.text:0040101A jl      short loc_401010

.text:0040101C loc_40101C:
.text:0040101C xor     edi, edi				; edi = 0
.text:0040101E cmp     eax, ecx
.text:00401020 cmovge  eax, edi				; eax = eax >= ecx ? 0 : eax
.text:00401023 add     eax, esi
.text:00401026 add     eax, edx				; eax = eax + esi + edx
.text:00401029 retn

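Code hoisting

If the loop condition is an expression whose value does not depend on the iteration, the compiler moves the code for that expression out of the loop and stores the result in a local variable, so it is not recomputed on every iteration.

For example: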

int Fun(int x) {
    int nSum = 0;
    for (int i = 0; i < x / 7; i++) {
        nSum += i;
    }
    return nSum;
}

After optimization the compiler produces the following code:

int __cdecl Fun(int x)
{
  int nSum; // esi
  int i; // ecx
  int tmp; // eax

  nSum = 0;
  tmp = x / 7;	// hoisted out of the loop
  i = 0;
  if (tmp > 0) {
    do {
      nSum += i;
      ++i;
    } while (i < tmp);
  }
  return nSum;
}

The corresponding assembly:

.text:00401000 push    ebp
.text:00401001 mov     ebp, esp

.text:00401003 mov     eax, [ebp+x]			; eax = x
.text:00401007 push    7
.text:00401009 pop     ecx					; ecx = 7
.text:0040100A cdq							; sign-extend eax into edx, because idiv divides edx:eax
.text:0040100B xor     esi, esi				; esi = 0 (nSum = 0)
.text:0040100D idiv    ecx					; eax = eax / ecx (tmp = x / 7)
.text:0040100F mov     ecx, esi				; ecx = 0 (i = 0)

.text:00401011 test    eax, eax
.text:00401013 jle     short loc_40101C

.text:00401015 loc_401015:
.text:00401015 add     esi, ecx				; esi += ecx
.text:00401017 inc     ecx					; ecx++

.text:00401018 cmp     ecx, eax
.text:0040101A jl      short loc_401015

.text:0040101C loc_40101C:
.text:0040101C mov     eax, esi

.text:0040101F pop     ebp
.text:00401020 retn

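Note that if the loop condition calls a function, the call cannot be hoisted unless the function is inlined and the compiler can prove that its return value is fixed.

Strength reduction

Inside loops the compiler replaces expensive, high-latency instructions with cheaper ones, for example multiplication and shifts instead of division, or addition instead of multiplication.

For example: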

int Fun(int x) {
    int nSum = 0;
    for (int i = 0; i < x; i++) {
        nSum += i * x;
    }
    return nSum;
}

After optimization the compiler produces code equivalent to:

int __cdecl Fun(int x)
{
  int nSum; // eax
  int tmp; // edx
  int i; // esi

  nSum = 0;
  if ( x > 0 )
  {
    tmp = 0;
    i = x;
    do
    {
      nSum += tmp;
      tmp += x;
      --i;
    }
    while ( i );
  }
  return nSum;
}

The corresponding assembly:

.text:00401000 push    ebp
.text:00401001 mov     ebp, esp

.text:00401006 xor     eax, eax				; eax = 0

.text:00401003 mov     ecx, [ebp+x]			; ecx = x
.text:00401008 test    ecx, ecx
.text:0040100A jle     short loc_40101B

.text:0040100D mov     edx, eax				; edx = 0
.text:0040100F mov     esi, ecx				; esi = x

.text:00401011 loc_401011:
.text:00401011 add     eax, edx				; eax += edx
.text:00401013 add     edx, ecx				; edx += x
.text:00401015 sub     esi, 1				; esi--
.text:00401018 jnz     short loc_401011

.text:0040101B loc_40101B:
.text:0040101B pop     ebp
.text:0040101C retn
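
As a rough check of the idea, the following sketch compares the multiplying loop with a hand-written version that keeps a running term instead. The names `fun_mul` and `fun_add` are hypothetical, and the second function is only assumed to mirror the optimized listing above.

```cpp
#include <cassert>

// Original form: one multiplication per iteration.
int fun_mul(int x) {
    int sum = 0;
    for (int i = 0; i < x; i++) {
        sum += i * x;
    }
    return sum;
}

// Strength-reduced form: term always equals i * x, but is updated by addition.
int fun_add(int x) {
    int sum = 0;
    int term = 0;
    for (int n = x; n > 0; --n) {   // counts down like esi in the assembly
        sum += term;
        term += x;
    }
    return sum;
}
```

The down-counting loop variable is why the assembly ends with `sub esi, 1` / `jnz` rather than a compare against `x`.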

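Recognizing break and continue

break

When break is not guarded by an if, the loop structure disappears and the code degenerates into straight-line code.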

int Fun(int x) {
    int nSum = 0;
    for (int i = 0; i < x; i++) {
        nSum += i;
        break;
    }
    return nSum;
}
/*
.text:00401000 xor     eax, eax
.text:00401002 retn
*/

When break is guarded by an if, the loop structure is kept. Once the break condition holds, execution jumps to the end of the loop.

int Fun(int x) {
    int nSum = 0;
    for (int i = 0; i < x; i++) {
        nSum += i;
        if (i % 8 == 0) { break; }
    }
    return nSum;
}

The corresponding assembly:

.text:00401000 push    ebp
.text:00401001 mov     ebp, esp

.text:00401003 xor     edx, edx
.text:00401005 mov     ecx, edx
.text:00401007 cmp     [ebp+x], ecx
.text:0040100A jle     short loc_401026

; edx += i
.text:0040100C loc_40100C:
.text:0040100E add     edx, ecx

; if (ecx % 8 == 0)
.text:0040100C mov     eax, ecx
.text:00401010 and     eax, 10000000000000000000000000000111b
.text:00401015 jns     short loc_40101E
.text:00401017 dec     eax
.text:00401018 or      eax, 0FFFFFFF8h
.text:0040101B add     eax, 1
.text:0040101E loc_40101E:
.text:0040101E jz      short loc_401026

.text:00401020 inc     ecx
.text:00401021 cmp     ecx, [ebp+x]
.text:00401024 jl      short loc_40100C

.text:00401026 loc_401026:
.text:00401026 mov     eax, edx
.text:00401028 pop     ebp
.text:00401029 retn
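
The `and` / `jns` / `dec` / `or` / `add` sequence in this listing computes the signed remainder `i % 8` without a division. A minimal sketch of the same steps follows; `mod8_like_listing` is a name invented for this example, and the final conversion assumes the usual two's-complement behaviour.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the instruction sequence used for the signed remainder x % 8.
int32_t mod8_like_listing(int32_t x) {
    // and eax, 80000007h: keep the sign bit and the low three bits
    uint32_t eax = static_cast<uint32_t>(x) & 0x80000007u;
    if (eax & 0x80000000u) {     // jns not taken: x was negative
        eax -= 1;                // dec eax
        eax |= 0xFFFFFFF8u;      // or  eax, 0FFFFFFF8h
        eax += 1;                // add eax, 1
    }
    return static_cast<int32_t>(eax);
}
```

For non-negative inputs the mask alone is enough; the extra three instructions only fix up negative values so the result keeps the sign of the dividend, as C requires.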

continue

Without an if guard, all code after continue is optimized away.

int Fun(int x) {
    int nSum = 0;
    for (int i = 0; i < x; i++) {
        nSum += i;
        continue;
        nSum -= i;
    }
    return nSum;
}

The corresponding assembly:

.text:00401000 push    ebp
.text:00401001 mov     ebp, esp

.text:00401003 xor     eax, eax
.text:00401005 mov     ecx, eax
.text:00401007 cmp     [ebp+x], eax
.text:0040100A jle     short loc_401014

.text:0040100C loc_40100C:
.text:0040100C add     eax, ecx
.text:0040100E inc     ecx
.text:0040100F cmp     ecx, [ebp+x]
.text:00401012 jl      short loc_40100C

.text:00401014 loc_401014:
.text:00401014 pop     ebp
.text:00401015 retn

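When continue is guarded by an if and the condition holds, execution jumps straight to the part of the loop that updates the loop variable and re-checks the loop condition.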

int Fun(int x) {
    int nSum = 0;
    for (int i = 0; i < x; i++) {
        nSum += i;
        if (i % 8 == 0) { continue; }
        puts("not continue");
    }
    return nSum;
}

The corresponding assembly:

.text:00401000 push    ebp
.text:00401001 mov     ebp, esp
.text:00401003 push    ebx
.text:00401004 push    esi

.text:00401005 xor     esi, esi
.text:00401007 mov     ebx, esi
.text:00401009 cmp     [ebp+x], ebx
.text:0040100C jle     short loc_401026

.text:0040100E loc_40100E:
.text:0040100E add     esi, ebx
.text:00401010 test    bl, 7
.text:00401013 jz      short loc_401020

.text:00401015 push    offset string                   ; "not continue"
.text:0040101A call    _puts
.text:0040101F pop     ecx

.text:00401020 loc_401020:
.text:00401020 inc     ebx
.text:00401021 cmp     ebx, [ebp+x]
.text:00401024 jl      short loc_40100E

.text:00401026 loc_401026:
.text:00401026 mov     eax, esi

.text:00401028 pop     esi
.text:00401029 pop     ebx
.text:0040102A pop     ebp
.text:0040102B retn

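Functions

Calling conventions

__cdecl
- The default calling convention for C/C++. The caller cleans up the stack, arguments are pushed right to left, and variadic functions can use it.
- Decorated name: _name, e.g. _foo_c for char __cdecl foo_c(int n1, int n2, int n3)
__stdcall
- The callee cleans up the stack, arguments are pushed right to left, and variadic functions cannot use it.
- Decorated name: _name@total argument bytes, e.g. _foo_std@12 for short int __stdcall foo_std(int n1, int n2, int n3)
__fastcall
- The first two arguments are passed in registers (ecx for the first, edx for the second); the remaining arguments are pushed on the stack right to left. The callee cleans up the stack, and variadic functions cannot use it.
- Decorated name: @name@total argument bytes, e.g. @foo_fst@12 for __int64 __fastcall foo_fst(int n1, int n2, int n3)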
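
The naming rules can be written down as a small helper. This is only a sketch for 32-bit C-style names; `CallConv` and `decorate` are hypothetical names introduced here and are not part of any toolchain.

```cpp
#include <cassert>
#include <string>

enum class CallConv { Cdecl, Stdcall, Fastcall };

// Builds the 32-bit decorated name for a C function.
// argBytes is the total size of the arguments in bytes.
std::string decorate(const std::string& name, CallConv cc, unsigned argBytes) {
    assert(!name.empty());
    switch (cc) {
    case CallConv::Cdecl:    return "_" + name;
    case CallConv::Stdcall:  return "_" + name + "@" + std::to_string(argBytes);
    case CallConv::Fastcall: return "@" + name + "@" + std::to_string(argBytes);
    }
    return name;  // not reached for valid enum values
}
```

When reading an import or symbol table, the `@12` suffix is a quick hint that the function takes 12 bytes of arguments, that is, three 4-byte parameters.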

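Function inlining

Functions with complex structure, such as switch...case or recursion, are not inlined.
Recursive functions are not inlined regardless of options, unless they are so simple that the compiler turns them into a loop (rare).
Newer compilers enable inlining by default.

Identifying parameters

Look at the code at the call site: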

push        3					
push        2					
push        1					
call       0040100f			

Then find the stack-cleanup code to confirm the parameter count:
```
call        0040100f					
add         esp,0Ch		
```
or, inside the callee:
```
ret 4/8/0xC/0x10		
```
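Note that the compiler may merge several stack cleanups: instead of adjusting the stack right after each call, it may call several functions and then adjust the stack once.
Parameters are not always passed on the stack; registers may be used as well, for example: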
```
push ebx					
push eax					
mov ecx,dword ptr ds:[esi]					
mov edx,dword ptr ds:[edi]					
push 45					
push 33					
call <function address>
```
So you also need to check whether the callee reads registers that it has not initialized itself.

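Identifying the return value

Check whether the function writes to eax just before it returns.
Check how the caller uses the value in eax after the call returns.

Variables

Local variables

With ebp-based addressing, [ebp - xxx] is a local variable.
With esp-based addressing, local variables are accessed as [esp + xx].
In IDA, with esp-based addressing a local variable is labeled by its offset from the return address, i.e. [esp + xxx + var_x], where [esp + xxx] is the location of the return address and var_x (a negative value) is the offset of the local variable from it.

Global variables

Example program: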

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

extern "C" {
int x;
int y = 1;
int z = (srand(time(NULL)), rand());
}
int main() {
    printf("%d %d %d\n", x, y, z);
    return 0;
}

The assembly of main is shown below; the global variables x, y and z all live in the .data section.

.data:00403000 _y dd 1 

.data:00403020 _x dd 0
.data:00403024 _z dd 0 

.text:00401030 push    _z
.text:00401036 push    _y
.text:0040103C push    _x
.text:00401042 push    offset _Format                  ; "%d %d %d\n"
.text:00401047 call    _printf
.text:00401047
.text:0040104C add     esp, 10h
.text:0040104F xor     eax, eax
.text:00401051 retn

z is initialized in the function dynamic_initializer_for__z__.

int __cdecl dynamic_initializer_for__z__()
{
  unsigned int v0; // eax
  int result; // eax

  v0 = __time64(0);
  _srand(v0);
  result = _rand();
  z = result;
  return result;
}

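Before calling main, __scrt_common_main_seh calls the C++ initialization routine _initterm, which calls in turn every function pointer stored between __xc_a and __xc_z, including dynamic_initializer_for__z__.

The C initialization routine is _initterm_e(__xi_a, __xi_z), and it is called before _initterm. The fact that the pointer to dynamic_initializer_for__z__ is not placed between __xi_a and __xi_z also shows that C does not support initializing a global such as z in this way.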

// Calls each function in [first, last).  [first, last) must be a valid range of
// function pointers.  Each function is called, in order.
extern "C" void __cdecl _initterm(_PVFV* const first, _PVFV* const last)
{
    for (_PVFV* it = first; it != last; ++it)
    {
        if (*it == nullptr)
            continue;

        (**it)();
    }
}

// Calls each function in [first, last).  [first, last) must be a valid range of
// function pointers.  Each function must return zero on success, nonzero on
// failure.  If any function returns nonzero, iteration stops immediately and
// the nonzero value is returned.  Otherwise all functions are called and zero
// is returned.
//
// If a nonzero value is returned, it is expected to be one of the runtime error
// values (_RT_{NAME}, defined in the internal header files).
extern "C" int __cdecl _initterm_e(_PIFV* const first, _PIFV* const last)
{
    for (_PIFV* it = first; it != last; ++it)
    {
        if (*it == nullptr)
            continue;

        int const result = (**it)();
        if (result != 0)
            return result;
    }

    return 0;
}


    if ( _initterm_e(__xi_a, __xi_z) )
      return 0xFF;
    _initterm(__xc_a, __xc_z);

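Static local variables

First, a static local variable whose initial value is a constant is equivalent to an initialized global variable.

Example: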

int main() {
    static int x = 10;
    return (int) & x;
}

The corresponding assembly:

.data:00414000 x dd 0Ah 

.text:00401000 mov     eax, offset x
.text:00401005 retn

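When the initial value of a static local variable is not known at compile time, the compiler creates an extra flag that records whether the variable has been initialized, so that it is not initialized twice.

For example: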

int main(int argc) {
    static int x = argc;
    return (int) & x;
}

The decompiled output for VC 6.0:

int __cdecl main(int argc, const char **argv, const char **envp)
{
  if ( (byte_405284 & 1) == 0 )
  {
    byte_405284 |= 1u;
    dword_405280 = argc;
  }
  return (int)&dword_405280;
}
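
Written out by hand, the guard looks roughly like the sketch below. The names `g_initFlags`, `g_x` and `get_x` are invented for this illustration and simply stand in for byte_405284, dword_405280 and the body of main.

```cpp
#include <cassert>

static unsigned char g_initFlags = 0;  // stands in for byte_405284
static int g_x = 0;                    // stands in for dword_405280

// Hand-written equivalent of `static int x = argc; return &x;`
int* get_x(int argc) {
    if ((g_initFlags & 1) == 0) {   // bit 0: has x been initialized yet?
        g_initFlags |= 1;
        g_x = argc;                 // executed only on the first call
    }
    return &g_x;
}
```

Calling `get_x(5)` and then `get_x(9)` leaves the stored value at 5, which is exactly the behaviour the flag is there to guarantee. This simple scheme is not thread-safe.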

In fact, to save space each static local variable takes up only one bit of the flag byte. For example:

int main(int argc) {
    static int x = argc;
    static int y = argc;
    return (int) &x + (int) &y;
}

The decompiled output for VC 6.0:

int __cdecl main(int argc, const char **argv, const char **envp)
{
  char v3; // al

  v3 = byte_405288;
  if ( (byte_405288 & 1) == 0 )
  {
    v3 = byte_405288 | 1;
    dword_405284 = argc;
    byte_405288 |= 1u;
  }
  if ( (v3 & 2) == 0 )
  {
    dword_405280 = argc;
    byte_405288 = v3 | 2;
  }
  return (int)&dword_405284 + (_DWORD)&dword_405280;
}

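Newer compilers avoid races on the flag by making the initialization thread-safe: the guard variable ($TSS0) is compared against a per-thread epoch kept in TLS, and _Init_thread_header / _Init_thread_footer serialize the actual initialization.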

.data:00416000 __Init_global_epoch dd 80000000h

.data:004168DC $TSS0 dd 0

void __cdecl _Init_thread_header(int *pOnce)
{
  AcquireSRWLockExclusive(&g_tss_srw);
  while ( 1 )
  {
    if ( !*pOnce )
    {
      *pOnce = -1;
      goto LABEL_7;
    }
    if ( *pOnce != -1 )
      break;
    _Init_thread_wait_v2();
  }
  *(_DWORD *)(*((_DWORD *)NtCurrentTeb()->ThreadLocalStoragePointer + _tls_index) + 4) = _Init_global_epoch;
LABEL_7:
  ReleaseSRWLockExclusive(&g_tss_srw);
}

void __cdecl _Init_thread_footer(int *pOnce)
{
  AcquireSRWLockExclusive(&g_tss_srw);
  *pOnce = ++_Init_global_epoch;
  *(_DWORD *)(*((_DWORD *)NtCurrentTeb()->ThreadLocalStoragePointer + _tls_index) + 4) = _Init_global_epoch;
  ReleaseSRWLockExclusive(&g_tss_srw);
  WakeAllConditionVariable(&g_tss_cv);
}

int __cdecl main(int argc, const char **argv, const char **envp)
{
  if ( _TSS0 > *(_DWORD *)(*((_DWORD *)NtCurrentTeb()->ThreadLocalStoragePointer + _tls_index) + 4) )
  {
    _Init_thread_header(&_TSS0);
    if ( _TSS0 == -1 )
    {
      x = argc;
      _Init_thread_footer(&_TSS0);
    }
  }
  return (int)&x;
}

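Multimedia instructions

~~When n² passes a million, brute force still crushes the intended solution~~

x87 floating-point instructions

The Intel x87 FPU is dedicated to scalar floating-point computation. It operates on single-precision (32-bit), double-precision (64-bit) and extended double-precision (80-bit) values and conforms to the IEEE 754 standard.

x87 floating-point registers

The x87 coprocessor has eight floating-point registers, ST(0) ~ ST(7).

The floating-point registers form a circular stack with ST(0) at the top. When a value is popped, the register that held it becomes the bottom of the stack, ST(7).
Pushing onto a full register stack raises an error.
Choosing where results are kept affects efficiency.

Common floating-point instructions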

Instruction	Format	Description
`FLD`	`FLD IN`	Pushes the floating-point value `IN` onto `ST(0)`. `IN` (`mem 32/64/80`)
`FILD`	`FILD IN`	Pushes the integer `IN` onto `ST(0)`. `IN` (`mem 16/32/64`)
`FLDZ`	`FLDZ`	Pushes `0.0` onto `ST(0)`.
`FLD1`	`FLD1`	Pushes `1.0` onto `ST(0)`.
`FST`	`FST OUT`	Stores `ST(0)` as a floating-point value at address `OUT`. `OUT` (`mem 32/64`)
`FSTP`	`FSTP OUT`	Same as `FST`, but also pops the stack once.
`FIST`	`FIST OUT`	Stores `ST(0)` as an integer at address `OUT`. `OUT` (`mem 16/32`)
`FISTP`	`FISTP OUT`	Same as `FIST`, but also pops the stack once.
`FCOM`	`FCOM IN`	Compares the value at address `IN` with `ST(0)` as real numbers and sets the corresponding flags.
`FTST`	`FTST`	Compares `ST(0)` with 0.0 and sets the corresponding flags.
`FADD`	`FADD IN`	Adds the value at address `IN` to `ST(0)` and stores the result in `ST(0)`.
`FADDP`	`FADDP ST(N), ST`	Adds `ST(0)` to `ST(N)` (`N` is any of 0~7), stores the result in `ST(N)`, then pops the stack once.

The FADDP ST(N), ST instruction executes as follows:

ST(N) = ST(N) + ST(0)
Pop the register stack once.

Example:

int main() {
    // brace initialization forbids narrowing conversions, avoiding the silent precision loss that = allows
    float f1{1.1f};
    float f2{2.2f};
    float ret = 0.0f;       // the result variable must be initialized

    // Before optimization: register-to-register operation
    __asm {
        fld f1;             // st(0) = 1.1f
        fld f2;             // st(0) = 2.2f; st(1) = 1.1f
        faddp st(1), st;    // st(0) = 3.3f; st(7) = 2.2f
        fstp ret;           // st(6) = 2.2f; st(7) = 3.3f; ret = 3.3f
    }

    // After optimization: add straight from memory, saving one push and one pop
    __asm {
        fld f1;             // st(0) = 1.1f
        fadd f2;            // st(0) = 1.1f + 2.2f (from memory) = 3.3f
        fstp ret;           // st(7) = 3.3f; ret = 3.3f
    }
}

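To make Visual Studio emit x87 floating-point instructions: Project Properties → C/C++ → Code Generation → Enable Enhanced Instruction Set → No Enhanced Instructions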

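MMX instruction set

MMX (Multi Media eXtension) is a multimedia instruction enhancement technology that Intel announced in 1996.

MMX contains 57 multimedia instructions. They process several data elements at once and can still produce a sensible result when a value exceeds what an element can hold, so with suitable software support they deliver higher performance.
Parallel computation: with MMX, two int variables can be processed at the same time, which greatly improves throughput.
MMX does not support floating-point arithmetic.

MMX registers

Eight 64-bit registers, MM0~MM7.
They are aliased onto the x87 registers ST(0)~ST(7): changing MM0 also changes x87 register state. In practice, once MMX instructions are in use you no longer look at the x87 registers.

Common multimedia instructions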

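Instruction	Format	Description
`MOVD`	`MOVD mmx, reg/mem32` `MOVD reg/mem32, mmx`	Copies the low doubleword of an `MMX` register to a general-purpose register or memory, or copies a general-purpose register or memory into the low doubleword of an `MMX` register.
`MOVQ`	`MOVQ mmx1, mmx2/mem64` `MOVQ mmx1/mem64, mmx2`	Copies one `MMX` register to another. It can also load an `MMX` register from memory or store an `MMX` register to memory.
`PADDB`	`PADDB mmx1, mmx2/mem64`	Wraparound mode; parallel addition of 1-byte integers.
`PADDD`	`PADDD mmx1, mmx2/mem64`	Wraparound mode; parallel addition of 4-byte integers.
`PADDSB`	`PADDSB mmx1, mmx2/mem64`	Saturating mode; parallel addition of signed 1-byte integers.
`PADDSW`	`PADDSW mmx1, mmx2/mem64`	Saturating mode; parallel addition of signed 2-byte integers.
`PADDUSB`	`PADDUSB mmx1, mmx2/mem64`	Saturating mode; parallel addition of unsigned 1-byte integers.
`PADDUSW`	`PADDUSW mmx1, mmx2/mem64`	Saturating mode; parallel addition of unsigned 2-byte integers.
`PSUBB`	`PSUBB mmx1, mmx2/mem64`	Wraparound mode; parallel subtraction of 1-byte integers.
`PSUBW`	`PSUBW mmx1, mmx2/mem64`	Wraparound mode; parallel subtraction of 2-byte integers.
`PSUBD`	`PSUBD mmx1, mmx2/mem64`	Wraparound mode; parallel subtraction of 4-byte integers.
`PSUBSB`	`PSUBSB mmx1, mmx2/mem64`	Saturating mode; parallel subtraction of signed 1-byte integers.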
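
The difference between wraparound and saturating mode is easiest to see on a single 8-bit lane. The scalar sketch below models one lane only; the helper names are made up for this example and no MMX instruction is actually executed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// One lane of PADDB: the sum wraps around modulo 256.
uint8_t add_wrap_u8(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>(a + b);
}

// One lane of PADDUSB: the sum is clamped to the unsigned range [0, 255].
uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>(std::min(a + b, 255));
}

// One lane of PADDSB: the sum is clamped to the signed range [-128, 127].
int8_t add_sat_s8(int8_t a, int8_t b) {
    return static_cast<int8_t>(std::max(std::min(a + b, 127), -128));
}
```

For example, 200 + 100 gives 44 in wraparound mode but 255 in unsigned saturating mode, which is why saturating adds are preferred for pixel and audio data.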

Example:

int main()
{
    // 4-byte elements (each case gets its own scope so the names can be reused)
    {
        int ary1[] = {1,2};
        int ary2[] = {3,4};
        int ret[2] = {0};

        // parallel add
        __asm
        {
            movq mm1, ary1;	// MM1 = 0000000200000001
            movq mm3, ary2;	// MM1 unchanged, MM3 = 0000000400000003
            paddd mm1, mm3;	// parallel add: 1+3 and 2+4 go into mm1
                           	// MM3 unchanged, MM1 = 0000000600000004
            movq ret, mm1;	// ret[0] = 00000004; ret[1] = 00000006
        }
    }

    // 1-byte elements: eight integers can be added at once
    {
        char ary1[] = {1,1,1,1,1,1,1,1};
        char ary2[] = {2,2,2,2,2,2,2,2};
        char ret[8] = {0};

        // parallel add
        __asm
        {
            movq mm1, ary1;	// MM1 = 0101010101010101
            movq mm3, ary2;	// MM1 unchanged, MM3 = 0202020202020202
            paddb mm1, mm3;	// parallel add: each 1+2 goes into mm1
                           	// MM3 unchanged, MM1 = 0303030303030303
            // pfadd mm1, mm3  (3DNow!)
            movq ret, mm1;	// ret[0] ~ ret[7] = 0x3
        }
    }
    return 0;
}

To avoid inline assembly, Microsoft provides intrinsics and macros for the MMX instructions in mmintrin.h. For example:

#include <stdio.h>
#include <mmintrin.h>

int main() {
    int ary1[] = {1, 2};
    int ary2[] = {3, 4};
    int ret[2] = {0};

    *(__m64 *) &ret = _m_paddd(*(__m64 *) &ary1, *(__m64 *) &ary2);
    
    printf("%d %d\n", ret[0], ret[1]);

    return 0;
}

The corresponding assembly (instruction order slightly rearranged for readability):

.text:00401029 push    ebp
.text:0040102A mov     ebp, esp
.text:0040102C sub     esp, 10h

.text:0040102F mov     dword ptr [ebp+ary1], 1
.text:00401036 mov     dword ptr [ebp+ary1+4], 2
.text:00401041 mov     dword ptr [ebp+ary2], 3
.text:00401048 mov     dword ptr [ebp+ary2+4], 4

.text:0040103D movq    mm1, [ebp+ary1]
.text:0040104F movq    mm0, [ebp+ary2]
.text:00401053 paddd   mm1, mm0
.text:00401056 movq    [ebp+ary2], mm1

.text:0040105A push    dword ptr [ebp+ary2+4]
.text:0040105D push    dword ptr [ebp+ary2]
.text:00401060 push    offset _Format                  ; "%d %d\n"
.text:00401065 call    _printf

.text:0040106A add     esp, 0Ch
.text:0040106D xor     eax, eax
.text:0040106F leave
.text:00401070 retn

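AMD 3DNow! instruction set

3DNow! (said to stand for "3D No Waiting!") is a SIMD multimedia instruction set developed by AMD. It supports vector operations on single-precision floating-point numbers and was meant to improve the 3D graphics performance of x86 machines.

New instructions include:

PFADD
PFSUB

In response to AMD's 3DNow!, Intel introduced the SSE instruction set, which is not compatible with 3DNow!. AMD later dropped 3DNow! as well.

SSE instruction set

SSE (Streaming SIMD Extensions) is the instruction set Intel introduced with the Pentium III after AMD released 3DNow!; it is the successor to MMX. SSE provides 70 new instructions. AMD later added support for it in the Athlon XP.

Besides speeding up floating-point computation, SSE also makes memory access more efficient. It noticeably improves game performance, and according to Intel its impact is especially clear in the following areas: 3D geometry and animation, image processing (such as Photoshop), video editing/compression/decompression (such as MPEG and DVD), speech recognition, and audio compression and synthesis.

SSE1 mainly covers single-precision floating-point operations.
SSE2 mainly covers double-precision floating-point operations.
SSE2 uses the same registers as SSE1.

SSE registers

SSE has 8 independent 128-bit registers (XMM0~XMM7).

MM refers to the 64-bit MMX registers.
XMM refers to the 128-bit XMM registers.

Common instructions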

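Instruction	Format	Description
`MOVSS`	`MOVSS xmm1, xmm2` `MOVSS xmm1, mem32` `MOVSS xmm2/mem32, xmm1`	Moves a single-precision value
`MOVSD`	`MOVSD xmm1, xmm2` `MOVSD xmm1, mem64` `MOVSD xmm2/mem64, xmm1`	Moves a double-precision value
`MOVAPS`	`MOVAPS xmm1, xmm2/mem128` `MOVAPS xmm1/mem128, xmm2`	Moves aligned packed single-precision values
`MOVAPD`	`MOVAPD xmm1, xmm2/mem128` `MOVAPD xmm1/mem128, xmm2`	Moves aligned packed double-precision values
`ADDSS`	`ADDSS xmm1, xmm2/mem32`	Single-precision addition
`ADDSD`	`ADDSD xmm1, xmm2/mem64`	Double-precision addition
`ADDPS`	`ADDPS xmm1, xmm2/mem128`	Parallel addition of 4 single-precision values
`ADDPD`	`ADDPD xmm1, xmm2/mem128`	Parallel addition of 2 double-precision values
`SUBSS`	`SUBSS xmm1, xmm2/mem32`	Single-precision subtraction
`SUBSD`	`SUBSD xmm1, xmm2/mem64`	Double-precision subtraction
`SUBPS`	`SUBPS xmm1, xmm2/mem128`	Parallel subtraction of 4 single-precision values
`SUBPD`	`SUBPD xmm1, xmm2/mem128`	Parallel subtraction of 2 double-precision values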

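Notes:

MOVAPS requires the address of its memory operand to be aligned to 0x10 bytes; otherwise the CPU raises a fault. MOVUPS has no such requirement.
MOVAPS is preferable for performance, but it requires strict control over memory addresses, so MOVUPS is used more often.

As with MMX, Microsoft provides intrinsics for the SSE instructions in the following headers:

xmmintrin.h: SSE
emmintrin.h: SSE2
pmmintrin.h: SSE3
smmintrin.h: SSE4.1

#include <stdio.h>
#include <xmmintrin.h>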

int main() {
    int ary1[] = {1, 2, 3, 4};
    int ary2[] = {5, 6, 7, 8};
    int ret[4] = {0};

    *(__m128 *) &ret = _mm_add_ps(*(__m128 *) &ary1, *(__m128 *) &ary2);

    printf("%d %d %d %d\n", ret[0], ret[1], ret[2], ret[3]);

    return 0;
}

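The corresponding decompiled code is shown below. Note that _mm_add_ps is a single-precision floating-point add; the integer arrays are merely reinterpreted as floats here, and an integer add would normally use _mm_add_epi32 from emmintrin.h.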

int __cdecl main(int argc, const char **argv, const char **envp)
{
  __m128 ary1; // [esp-34h] [ebp-40h]
  __m128 ary2; // [esp-24h] [ebp-30h]
  __m128 ret; // [esp-14h] [ebp-20h]

  ary1.m128_u64[0] = 0x200000001i64;
  ary1.m128_u64[1] = 0x400000003i64;
  ary2.m128_u64[0] = 0x600000005i64;
  ary2.m128_u64[1] = 0x800000007i64;
  ret = _mm_add_ps(ary1, ary2);
  printf("%d %d %d %d\n", ret.m128_i32[0], ret.m128_i32[1], ret.m128_i32[2], ret.m128_i32[3]);
  return 0;
}

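Later AMD proposed SSE5; Intel instead introduced a completely new instruction set, AVX, and AMD subsequently abandoned SSE5 as originally proposed.

AVX instruction set

AVX (Advanced Vector Extensions) is the new instruction set of the Sandy Bridge and Larrabee architectures. AVX widens SIMD from the previous 128 bits to 256 bits. Because Sandy Bridge widened its SIMD execution units to 256 bits and improved data transfer at the same time, the floating-point performance of a CPU core is, in theory, doubled.

Intel AVX improves SIMD performance while continuing the MMX/SSE lineage. Unlike MMX/SSE, however, the AVX instruction format itself changed considerably: a new prefix (the VEX prefix) was added on top of the x86 (IA-32/Intel 64) encoding, which makes new and more complex instructions possible and thereby improves x86 CPU performance.

AVX has 16 independent 256-bit registers (YMM0~YMM15; only 8 are available in 32-bit mode), doubling the width of the XMM registers.
- XMM0 ~ XMM15: 128 bit
- YMM0 ~ YMM15: 256 bit
Arithmetic instructions support three operands.
Compatibility: AVX is fully compatible with SSE instructions; an SSE instruction only needs a V prefixed to its mnemonic.

Example:

#include <stdio.h>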

int main() {
    int ary1[]{1, 2, 3, 4, 5, 6, 7, 8};
    int ary2[]{8, 7, 6, 5, 4, 3, 2, 1};
    int ret[8]{0};

    __asm {
        vmovups YMM0, ary1
        vmovups YMM1, ary2
        vaddps YMM2, YMM0, YMM1
        vmovups ret, YMM2
    }

    for (int i = 0; i < 8; i++) {
        printf("%d%c", ret[i], i == 7 ? '\n' : ' ');
    }

    return 0;
}

The same can be done with the intrinsics in immintrin.h.

#include <stdio.h>
#include <immintrin.h>

int main() {
    int ary1[]{1, 2, 3, 4, 5, 6, 7, 8};
    int ary2[]{8, 7, 6, 5, 4, 3, 2, 1};
    int ret[8]{0};

    // _mm256_add_ps matches the vaddps used in the inline-assembly version above
    *(__m256 *) ret = _mm256_add_ps(*(__m256 *) ary1, *(__m256 *) ary2);

    for (int i = 0; i < 8; i++) {
        printf("%d%c", ret[i], i == 7 ? '\n' : ' ');
    }

    return 0;
}

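Arrays

We define an array as a data structure that is both contiguous and homogeneous.

Identifying arrays

Identifying an array means proving that a data structure is contiguous and homogeneous. We usually focus on two things:

Scaled-index addressing: [ebp + ecx * 4 + N]
- Over the range of ecx, the data is contiguous and homogeneous.
- The addressing expression gives the array base address and the element size.
  - Array base address: ebp + N
  - Element size: 4
- In practice, scaled-index addressing is subject to various optimizations.
  - For example, ary[5] is constant-folded: [ebp - 40 + 5 * 4] → [ebp - 20]
Loop structure: in ordinary code, where there is an array there is almost always a loop.
- Observing how array elements are processed inside the loop shows that the data is contiguous and homogeneous.
- Observing the loop's initial value, final value and step gives the array base address and extent.

Array initialization

For older compilers or Debug builds, array initialization looks like this: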

.text:00401046 mov     [ebp+ary], 1
.text:0040104D mov     [ebp+ary+4], 2
.text:00401054 mov     [ebp+ary+8], 3
.text:0040105B mov     [ebp+ary+0Ch], 4
.text:00401062 mov     [ebp+ary+10h], 5
.text:00401069 mov     [ebp+ary+14h], 6
.text:00401070 mov     [ebp+ary+18h], 7
.text:00401077 mov     [ebp+ary+1Ch], 8

Newer compilers with optimization enabled initialize arrays with multimedia instructions.

.rdata:00444D70 __xmm@00000004000000030000000200000001 xmmword 4000000030000000200000001h
.rdata:00444D80 __xmm@00000008000000070000000600000005 xmmword 8000000070000000600000005h

.text:00401046 movaps  xmm0, ds:__xmm@00000004000000030000000200000001
.text:0040104D movups  xmmword ptr [ebp+ary], xmm0
.text:00401052 movaps  xmm0, ds:__xmm@00000008000000070000000600000005
.text:0040105B movups  xmmword ptr [ebp+ary+10h], xmm0
.text:0040105F mov     [ebp+ary+20h], 9
.text:00401066 mov     [ebp+ary+24h], 0Ah

One-dimensional arrays

Scaled-index addressing: access to a single array element usually uses scaled-index addressing:

 	printf("%d\n", ary[x]);
.text:004010C6 mov     eax, [ebp+x]
.text:004010C9 push    [ebp+eax*4+ary]
.text:004010CD push    offset aD                       ; "%d\n"
.text:004010D2 call    _printf

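Pointer addressing: when traversing an array, older compilers avoid scaled-index addressing for efficiency and walk a pointer instead.

.text:00401005 mov     edi, 0Ah						; edi: number of array elements
.text:00401056 lea     esi, [esp+30h+ary]			; esi: pointer used to access the elements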

.text:0040105A loc_40105A:
.text:0040105A mov     eax, [esi]
.text:0040105C push    eax
.text:0040105D push    offset Format                   ; "%d\n"
.text:00401062 call    _printf
.text:00401067 add     esp, 8
.text:0040106A add     esi, 4
.text:0040106D dec     edi
.text:0040106E jnz     short loc_40105A

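When the element size exceeds what scaled-index addressing can express (the scale factor can only be 1, 2, 4 or 8), as with an array of user-defined structs, the compiler first folds part of the element size into the index and then uses scaled-index addressing. For example: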
```
struct Struct {
    int i, j, k;
};

int main(int argc) {
    Struct a[5];
    return a[argc].j;
}
```
The code that accesses the struct array compiles to:
```
.text:00401010 mov     eax, [ebp+argc]
.text:00401018 lea     eax, [eax+eax*2]
.text:0040101B mov     eax, [ebp+eax*4+a.j]
```

Multidimensional arrays

Take a two-dimensional array as an example. A single element ary[x][y] is accessed as [array base + element size * (x * number of columns + y)].

.text:004010E0 imul    eax, [ebp+x], 0Ah
.text:004010E4 add     eax, [ebp+y]
.text:004010E7 push    [ebp+eax*4+ary]
.text:004010EB push    offset aD                       ; "%d\n"
.text:004010F0 call    _printf
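
The address formula can be checked with a few lines of code. This is a minimal sketch; `element_offset` is a hypothetical helper that returns the byte offset from the array base for a row-major layout.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of ary[x][y] from the array base for a row-major 2D array
// with `cols` columns and elements of `elemSize` bytes.
std::size_t element_offset(std::size_t elemSize, std::size_t cols,
                           std::size_t x, std::size_t y) {
    assert(y < cols);
    return elemSize * (x * cols + y);
}
```

In the listing above the `imul eax, [ebp+x], 0Ah` and `add eax, [ebp+y]` pair computes `x * cols + y` with 10 columns, and the `*4` scale in the `push` operand supplies the element size.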

Traversal again uses a pointer.

.text:00401129 lea     esi, [ebp+ary]				; pointer used to access the elements
.text:0040112C mov     ebx, 2						; number of rows: 2
.text:00401131 loc_401131:
.text:00401131 mov     edi, 0Ah						; number of columns: 10
.text:00401140 loc_401140:
.text:00401140 push    dword ptr [esi]
.text:00401142 push    offset aD                       ; "%d\n"
.text:00401147 call    _printf
.text:0040114C add     esp, 8
.text:0040114F add     esi, 4
.text:00401152 sub     edi, 1
.text:00401155 jnz     short loc_401140
.text:00401157 sub     ebx, 1
.text:0040115A jnz     short loc_401131

Strings

strlen

Newer compiler (VS2019):

.text:00401003 lea     eax, [ebp+s]			; eax points to string s
.text:00401009 lea     edx, [eax+1]			; edx = eax + 1

.text:00401010 loc_401010:
.text:00401010 mov     cl, [eax]
.text:00401012 inc     eax
.text:00401013 test    cl, cl
.text:00401015 jnz     short loc_401010		; on exit eax points one past the terminating '\0' (eax is incremented after the byte is read)

.text:00401017 sub     eax, edx				; eax - edx gives the string length

Older compiler (VC 6.0):

.text:00401003 or      ecx, 0FFFFFFFFh		; ecx = -1
.text:00401006 xor     eax, eax				; eax = 0
.text:00401009 lea     edi, [esp+68h+s]		; edi points to string s
.text:0040100D repne scasb					; compares al with [edi] and stops once ZF is set; on exit edi points one past the terminating '\0' and ecx = -(length + 2) (ecx starts at -1 and is decremented length + 1 times)
.text:0040100F not     ecx
.text:00401011 dec     ecx					; not followed by dec computes -(ecx + 2), which is the length
.text:00401013 mov     eax, ecx
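
The register arithmetic of the VC 6.0 pattern can be replayed in plain code. The sketch below is an emulation written for this note (the name `strlen_scasb` is invented); it follows the instructions one by one rather than calling strlen.

```cpp
#include <cassert>
#include <cstdint>

// Emulates: or ecx,-1 / xor eax,eax / repne scasb / not ecx / dec ecx
uint32_t strlen_scasb(const char* s) {
    assert(s != nullptr);
    uint32_t ecx = 0xFFFFFFFFu;        // or ecx, 0FFFFFFFFh
    const char* edi = s;
    while (ecx != 0) {                 // repne scasb
        --ecx;                         // one count per scanned byte
        if (*edi++ == '\0') break;     // ZF set: the terminator was consumed
    }
    ecx = ~ecx;                        // not ecx -> length + 1
    --ecx;                             // dec ecx -> length
    return ecx;
}
```

For "hello" the scan consumes six bytes, so ecx ends at -7; `not` turns that into 6 and `dec` into 5.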

strcpy

Newer compiler (VS2019): a simple loop copies one byte at a time until \x00 has been copied.

.text:00401040 loc_401040:
.text:00401040 mov     cl, [ebp+eax+src]
.text:00401044 lea     eax, [eax+1]
.text:00401047 mov     [ebp+eax-1+dst], cl
.text:0040104E test    cl, cl
.text:00401050 jnz     short loc_401040
1
2
3
4
5
6

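Older compiler (VC 6.0): the length of src is first computed as in strlen(src), then the data is copied 4 bytes at a time, and the remainder 1 byte at a time.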

; ecx = strlen(src) + 1
.text:00401012 lea     edi, [esp+0D4h+src]
.text:00401016 or      ecx, 0FFFFFFFFh
.text:00401019 xor     eax, eax
.text:0040101F repne scasb
.text:00401021 not     ecx

; memcpy(dst, src, ecx)
.text:0040101B lea     edx, [esp+0D4h+dst]			; edx = dst
.text:00401023 sub     edi, ecx						; edi = src
.text:00401025 mov     eax, ecx						; eax = strlen(src) + 1
.text:00401027 mov     esi, edi						; esi = src
.text:00401029 mov     edi, edx						; edi = dst
.text:0040102B shr     ecx, 2						; ecx = (strlen(src) + 1) / 4
.text:0040102E rep movsd							; copy 4 bytes per iteration
.text:00401030 mov     ecx, eax
.text:00401032 and     ecx, 3						; ecx = (strlen(src) + 1) % 4
.text:00401035 rep movsb							; copy 1 byte per iteration
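
The same split into a dword part and a byte tail can be sketched in portable code. `strcpy_like_vc6` is a made-up name, and `std::memcpy` stands in for `rep movsd` / `rep movsb`; the destination is assumed to be large enough.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Emulates the inlined VC 6.0 strcpy: length scan, dword copy, byte tail.
char* strcpy_like_vc6(char* dst, const char* src) {
    assert(dst != nullptr && src != nullptr);
    std::size_t total  = std::strlen(src) + 1;   // repne scasb / not ecx
    std::size_t dwords = total >> 2;             // shr ecx, 2
    std::size_t tail   = total & 3;              // and ecx, 3
    std::memcpy(dst, src, dwords * 4);                      // rep movsd
    std::memcpy(dst + dwords * 4, src + dwords * 4, tail);  // rep movsb
    return dst;
}
```

Seeing `shr ecx, 2` / `rep movsd` followed by `and ecx, 3` / `rep movsb` is therefore a strong hint that a block copy of `ecx` bytes has been inlined.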

strcmp

Newer compiler (VS2019):

.text:00401076 lea     eax, [ebp+src]
.text:00401079 lea     ecx, [ebp+dst]

.text:00401080 loc_401080:
.text:00401080 mov     dl, [ecx]					; load one byte into dl
.text:00401082 cmp     dl, [eax]					; first comparison: [ecx] against [eax]
.text:00401084 jnz     short loc_4010A0

.text:00401086 test    dl, dl						; if equal, check whether the byte is '\x00'
.text:00401088 jz      short loc_40109C

.text:0040108A mov     dl, [ecx+1]					; load one byte into dl
.text:0040108D cmp     dl, [eax+1]					; second comparison: [ecx+1] against [eax+1]
.text:00401090 jnz     short loc_4010A0

.text:00401092 add     ecx, 2
.text:00401095 add     eax, 2						; advance the pointers ecx and eax
.text:00401098 test    dl, dl						; if equal, check whether the byte is '\x00'
.text:0040109A jnz     short loc_401080

.text:0040109C loc_40109C:
.text:0040109C xor     eax, eax						; the strings are equal: set eax to 0 and leave the loop
.text:0040109E jmp     short loc_4010A5

.text:004010A0 loc_4010A0:
.text:004010A0 sbb     eax, eax						; eax = eax - eax - CF: -1 if the dst byte is below the src byte (CF = 1), otherwise 0
.text:004010A2 or      eax, 1						; turns 0 into 1 and leaves -1 unchanged

.text:004010A5 loc_4010A5:
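
The two flag tricks used to turn the carry flag into -1 or 1 can be modelled as follows. This is a sketch under the assumption that the two bytes differ (the code is only reached after a failed comparison); the function names are invented, and the second variant models the `sbb eax, 0FFFFFFFFh` form that VC 6.0 emits.

```cpp
#include <cassert>

// cmp a, b sets CF when a < b (unsigned); then: sbb eax, eax / or eax, 1
int sign_via_sbb_or(unsigned char a, unsigned char b) {
    assert(a != b);
    int cf  = (a < b) ? 1 : 0;
    int eax = 0 - cf;          // sbb eax, eax -> 0 or -1
    return eax | 1;            // or eax, 1    -> 1 or -1
}

// cmp a, b; then: sbb eax, eax / sbb eax, 0FFFFFFFFh
int sign_via_sbb_sbb(unsigned char a, unsigned char b) {
    assert(a != b);
    int cf  = (a < b) ? 1 : 0;
    int eax = 0 - cf;          // sbb eax, eax leaves CF as it was
    return eax + 1 - cf;       // eax - (-1) - CF -> 1 or -1
}
```

Both variants return -1 when the first byte is smaller and 1 when it is larger, which matches the return-value convention of strcmp and memcmp.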

Older compiler (VC 6.0):

; strcmp(dst, src)

.text:0040101F lea     esi, [esp+0D0h+src]
.text:00401023 lea     eax, [esp+0D0h+dst]

.text:00401027 loc_401027:
.text:00401027 mov     dl, [eax]
.text:00401029 mov     bl, [esi]
.text:0040102B mov     cl, dl					; save the byte in cl
.text:0040102D cmp     dl, bl					; first comparison: [eax] against [esi]
.text:0040102F jnz     short loc_40104F

.text:00401031 test    cl, cl					; if equal, check whether the byte is '\x00'
.text:00401033 jz      short loc_40104B

.text:00401035 mov     dl, [eax+1]
.text:00401038 mov     bl, [esi+1]
.text:0040103B mov     cl, dl					; save the byte in cl
.text:0040103D cmp     dl, bl					; second comparison: [eax+1] against [esi+1]
.text:0040103F jnz     short loc_40104F

.text:00401041 add     eax, 2
.text:00401044 add     esi, 2					; advance the pointers eax and esi
.text:00401047 test    cl, cl					; if equal, check whether the byte is '\x00'
.text:00401049 jnz     short loc_401027

.text:0040104B loc_40104B:
.text:0040104B xor     eax, eax					; the strings are equal: set eax to 0 and leave the loop
.text:0040104D jmp     short loc_401054

.text:0040104F loc_40104F:
.text:0040104F sbb     eax, eax					; eax = eax - eax - CF: -1 if the dst byte is below the src byte (CF = 1), otherwise 0
.text:00401051 sbb     eax, 0FFFFFFFFh			; eax = eax + 1 - CF: 1 if the previous result was 0, otherwise -1

.text:00401054 loc_401054:

memcpy

Newer compilers call the memcpy function directly; IDA can recognize it through its signatures.

.text:0040107C push    [ebp+size]                      ; count
.text:00401082 lea     eax, [ebp+src]
.text:00401085 push    eax                             ; src
.text:00401086 lea     eax, [ebp+dst]
.text:0040108C push    eax                             ; dst
.text:0040108D call    _memcpy

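In VC 6.0 the inlined code is essentially the same as strcpy, except that the length is given explicitly instead of being obtained through strlen.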

.text:00401021 mov     ecx, [esp+0E0h+size]
.text:00401025 lea     esi, [esp+0E0h+src]
.text:00401029 mov     edx, ecx
.text:0040102B lea     edi, [esp+0E0h+dst]
.text:0040102F shr     ecx, 2
.text:00401032 rep movsd
.text:00401034 mov     ecx, edx
.text:00401036 lea     eax, [esp+0E0h+dst]
.text:0040103A and     ecx, 3
.text:0040103D push    eax                             ; Buffer
.text:0040103E rep movsb

memcmp

Newer compiler (VS2019):

.text:004010CD mov     esi, [ebp+size]
.text:004010D3 lea     ecx, [ebp+dst]
.text:004010DC lea     edx, [ebp+src]
.text:004010DF sub     esi, 4					; special-case size < 4
.text:004010E2 jb      short loc_4010F5

.text:004010E4 loc_4010E4:
.text:004010E4 mov     eax, [ecx]
.text:004010E6 cmp     eax, [edx]				; compare in a loop, 4 bytes at a time
.text:004010E8 jnz     short loc_4010FA

.text:004010EA add     ecx, 4
.text:004010ED add     edx, 4
.text:004010F0 sub     esi, 4					; advance both pointers and decrease size; leave the loop when fewer than 4 bytes remain
.text:004010F3 jnb     short loc_4010E4

.text:004010F5 loc_4010F5:
.text:004010F5 cmp     esi, -4					; no bytes left to compare: return 0 (equal)
.text:004010F8 jz      short loc_40112E

.text:004010FA loc_4010FA:						; compare the remaining bytes one by one (3, 2 or 1 left; also reached when a 4-byte comparison fails)
.text:004010FA mov     al, [ecx]
.text:004010FC cmp     al, [edx]
.text:004010FE jnz     short loc_401127

.text:00401100 cmp     esi, -3
.text:00401103 jz      short loc_40112E

.text:00401105 mov     al, [ecx+1]
.text:00401108 cmp     al, [edx+1]
.text:0040110B jnz     short loc_401127

.text:0040110D cmp     esi, -2
.text:00401110 jz      short loc_40112E

.text:00401112 mov     al, [ecx+2]
.text:00401115 cmp     al, [edx+2]
.text:00401118 jnz     short loc_401127

.text:0040111A cmp     esi, -1
.text:0040111D jz      short loc_40112E

.text:0040111F mov     al, [ecx+3]
.text:00401122 cmp     al, [edx+3]
.text:00401125 jz      short loc_40112E

.text:00401127 loc_401127:
.text:00401127 sbb     eax, eax					; eax = eax - eax - CF: -1 if the dst byte is below the src byte (CF = 1), otherwise 0
.text:00401129 or      eax, 1					; turns 0 into 1 and leaves -1 unchanged
.text:0040112C jmp     short loc_401130

.text:0040112E loc_40112E:
.text:0040112E xor     eax, eax					; the buffers are equal: set eax to 0

.text:00401130 loc_401130:

Older compiler (VC 6.0):

.text:0040102B mov     ecx, [esp+0E4h+size]
.text:00401032 lea     edi, [esp+0D4h+src]
.text:00401036 lea     esi, [esp+0D4h+dst]
.text:0040103A xor     eax, eax					; clear eax first
.text:0040103C repe cmpsb
.text:00401040 jz      short loc_401047			; if equal, eax = 0 and we jump straight out

.text:00401042 sbb     eax, eax					; eax = eax - eax - CF: -1 if the dst byte is below the src byte (CF = 1), otherwise 0
.text:00401044 sbb     eax, 0FFFFFFFFh			; eax = eax + 1 - CF: 1 if the previous result was 0, otherwise -1


.text:00401047 loc_401047:

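Structs

Struct alignment

Setting the struct alignment

Method 1: In Visual Studio, set the alignment under Project Properties -> Configuration Properties -> C/C++ -> All Options -> Struct Member Alignment.

Method 2: Use #pragma pack(n). To change the alignment of a single struct only, save the current packing value first and restore it afterwards.

#pragma pack(push)  // save the current packing value
#pragma pack(2)     // set the packing value to 2
struct Struct {     // sizeof(Struct) = 6
    char x;
    int y;
};
#pragma pack(pop)   // restore the previous packing value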

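Method 3: From C++11 onwards, the alignas keyword can set the alignment of a struct. Note that the argument of alignas must be a constant expression, and the alignment must be a power of 2 and must not be smaller than the largest member of the struct.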
```
struct alignas(32) Struct {     // sizeof(Struct) = 32
    char x;
    int y;
};
```
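
The sizes quoted in the comments above can be confirmed with sizeof, alignof and offsetof. The sketch below assumes a 4-byte int; the struct names `Packed2`, `Natural` and `Aligned32` are invented for the comparison.

```cpp
#include <cassert>
#include <cstddef>

#pragma pack(push, 2)          // save the packing value and set it to 2
struct Packed2 { char x; int y; };
#pragma pack(pop)              // restore the previous packing value

struct Natural { char x; int y; };                // default alignment
struct alignas(32) Aligned32 { char x; int y; };  // forced 32-byte alignment

// Compile-time checks, assuming sizeof(int) == 4.
static_assert(sizeof(Packed2) == 6, "y starts at offset 2 under pack(2)");
static_assert(sizeof(Natural) == 8, "y starts at offset 4 by default");
static_assert(sizeof(Aligned32) == 32, "size is rounded up to the alignment");
static_assert(alignof(Aligned32) == 32, "alignas sets the struct alignment");
```

When reversing, the offset of `y` (2 versus 4) is what reveals the packing value: `offsetof(Packed2, y)` is 2 while `offsetof(Natural, y)` is 4.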

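Struct alignment rules

Suppose a struct has $n$ members, member $i$ has size $a_i(1\le i\le n)$, and the packing value is $k$ bytes. The size of the struct is then computed as follows:

#include <algorithm>
#include <cassert>
#include <iostream>
#include <vector>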

int main() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(nullptr);

    int n, k;
    std::cin >> n >> k;
    assert(__builtin_popcount(k) == 1);

    std::vector<int> a(n);
    for (int i = 0; i < n; i++) {
        std::cin >> a[i];
        assert(__builtin_popcount(a[i]) == 1);
    }

    k = std::min(k, *std::max_element(a.begin(), a.end()));

    int ans = 0;
    for (int i = 0; i < n; i++) {
        if ((ans + a[i] - 1) / a[i] * a[i] + a[i] <= (ans + k - 1) / k * k) {
            ans = (ans + a[i] - 1) / a[i] * a[i] + a[i];
        } else {
            ans = (ans + k - 1) / k * k + a[i];
        }
    }

    ans = (ans + k - 1) / k * k;

    std::cout << ans << std::endl;

    return 0;
}

注意以下特殊情况：

如果是 alignas 设置的对齐值则结构体严格按照对齐值对齐（IDA 中设置的结构体 align 属性也按照这个规则对齐结构体），否则按照对齐值和结构体最大成员两者较小的那个值进行对齐。
如果是结构体套结构体则内部的结构体的成员需要看做是外部结构体的成员进行对齐，而不是内部的结构体整个参与到外部结构体的对齐中去。

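Recognizing structs

Struct members are usually accessed as [base + offset].

If all members have the same size, the struct's initialization code is equivalent to array initialization code, and the two cannot be told apart.
If the members differ in size or type (integer versus floating point), they end up non-contiguous in memory or are accessed with different instructions, which makes the struct recognizable.
If the accesses use [esp + xxx] or [ebp - xxx], a struct cannot be distinguished from a group of separate local variables.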
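
To make the [base + offset] pattern concrete, the sketch below reads struct members through a base address plus a byte offset, which is roughly what the generated loads do. Record and load_at are hypothetical names, not taken from any analyzed binary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Members of different sizes and types: the mix that makes a struct
// stand out from a plain array in disassembly.
struct Record {
    int id;
    float score;
    short level;
};

// Reads a T located `offset` bytes past `base`, like `mov reg, [base + offset]`.
template <typename T>
T load_at(const void *base, std::size_t offset) {
    T value;
    std::memcpy(&value, static_cast<const char *>(base) + offset, sizeof(T));
    return value;
}

// Usage: every member is reachable from the same base with a fixed offset.
void check_record_access() {
    Record r = {7, 1.5f, 3};
    assert(load_at<int>(&r, offsetof(Record, id)) == 7);
    assert(load_at<float>(&r, offsetof(Record, score)) == 1.5f);
    assert(load_at<short>(&r, offsetof(Record, level)) == 3);
}
```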

Struct copy

If the struct is small, it is copied through registers.

    Struct b = *a;
006B186C  mov         eax,dword ptr [a]  
006B186F  mov         ecx,dword ptr [eax]  
006B1871  mov         dword ptr [b],ecx  
006B1874  mov         edx,dword ptr [eax+4]  
006B1877  mov         dword ptr [ebp-18h],edx	; [ebp - 18h] is [b + 4] 
006B187A  mov         eax,dword ptr [eax+8]  
006B187D  mov         dword ptr [ebp-14h],eax	; [ebp - 14h] is [b + 8] 

If the struct is large, the copy is implemented with a rep movs instruction.

    Struct b = *a;
00F8186C  mov         ecx,0Ch  
00F81871  mov         esi,dword ptr [a]  
00F81874  lea         edi,[b]  
00F81877  rep movs    dword ptr es:[edi],dword ptr [esi]
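In the listing above, ecx = 0Ch means twelve dwords, that is 48 bytes, are copied. A minimal emulation of the instruction is sketched below, assuming the direction flag is clear; Large and rep_movsd are hypothetical names, and Large is just some 48-byte struct.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct Large { int values[12]; };  // 48 bytes = 0Ch dwords (assumes 4-byte int)

// Emulates `rep movs dword ptr es:[edi], dword ptr [esi]` with DF = 0:
// copy ecx dwords from esi to edi, advancing both pointers by 4 each step.
void rep_movsd(void *edi, const void *esi, std::uint32_t ecx) {
    unsigned char *dst = static_cast<unsigned char *>(edi);
    const unsigned char *src = static_cast<const unsigned char *>(esi);
    while (ecx != 0) {
        std::memcpy(dst, src, 4);
        dst += 4;
        src += 4;
        --ecx;
    }
}

// Usage: `Struct b = *a;` for a 48-byte struct.
void check_rep_movsd() {
    Large a;
    for (int i = 0; i < 12; ++i) a.values[i] = i * 3;
    Large b = {};
    rep_movsd(&b, &a, sizeof(Large) / 4);  // ecx = 0Ch
    assert(std::memcmp(&a, &b, sizeof(Large)) == 0);
}
```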

Passing structs as arguments

Take the following code as an example:

#include <stdio.h>

struct Struct {
    int x;
    int y;
};

void foo(Struct a) {
    printf("%d %d\n", a.x, a.y);
}

int main() {
    Struct a;
    scanf_s("%d%d", &a.x, &a.y);
    foo(a);
}

When the struct has only a few members, the call to foo pushes the members onto the stack one by one, much like ordinary argument passing.

    foo(a);
007C45E4  mov         eax,dword ptr [ebp-0Ch]	; [ebp - 0Ch] is [a + 4]
007C45E7  push        eax  
007C45E8  mov         ecx,dword ptr [a]  
007C45EB  push        ecx  
007C45EC  call        foo (07C13CFh)  
007C45F1  add         esp,8  

Change the definition of Struct to the following:

struct Struct {
    int x;
    int y;
    int z[10];
};

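Now the argument is passed to foo by copying the whole struct onto the stack with a rep movs instruction. An array argument, by contrast, is passed as the array's address, which is one way to tell arrays and structs apart.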

    foo(a);
005345E4  sub         esp,30h  
005345E7  mov         ecx,0Ch  
005345EC  lea         esi,[a]  
005345EF  mov         edi,esp  
005345F1  rep movs    dword ptr es:[edi],dword ptr [esi]  
005345F3  call        foo (05313CFh)  
005345F8  add         esp,30h  
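The difference between the two kinds of argument is also visible at the source level. In the sketch below (by_value and by_array are hypothetical functions), the struct parameter is a private 0x30-byte copy, while the array parameter is only a pointer to the caller's storage.

```cpp
#include <cassert>

struct Struct {
    int x;
    int y;
    int z[10];
};

// Matches `sub esp, 30h` in the listing (assumes 4-byte int).
static_assert(sizeof(Struct) == 0x30, "12 dwords are copied onto the stack");

// The callee works on its own copy; the caller's struct is untouched.
void by_value(Struct s) { s.x = 100; }

// An array parameter decays to a pointer, so the caller's array is modified.
void by_array(int values[12]) { values[0] = 100; }

// Usage: only the array write is visible to the caller.
void check_argument_passing() {
    Struct s = {};
    s.x = 1;
    by_value(s);
    assert(s.x == 1);

    int arr[12] = {1};
    by_array(arr);
    assert(arr[0] == 100);
}
```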

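If the argument is a struct reference or a struct pointer, the struct's address is passed, just as with an array argument. In that case the only way to decide whether the parameter is a struct is to look at how the function accesses its members.

    foo(a); // a is a struct reference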
006017F8  lea         eax,[a]  
006017FB  push        eax  
006017FC  call        foo (060105Fh)  
00601801  add         esp,4  

Struct return values

First, give the struct a single member variable:

#include <stdio.h>

struct Struct {
    int x;
};

Struct bar() {
    Struct a;
    printf("%d\n", a.x);
    return a;
}

int main() {
    Struct a = bar();
    printf("%d\n", a.x);
    return 0;
}

In this case the struct is returned in the eax register.

    Struct a = bar();
00AC1B93  call        bar (0AC10D2h)  
00AC1B98  mov         dword ptr [ebp-48h],eax  
00AC1B9B  mov         eax,dword ptr [ebp-48h]  
00AC1B9E  mov         dword ptr [a],eax  

Add a member variable y to the struct.

struct Struct {
    int x, y;
};

The two members of the returned struct are now held in eax and edx respectively. This is similar to how a function returns a 64-bit value in 32-bit code.

    Struct a = bar();
009A1B93  call        bar (09A10D2h)  
009A1B98  mov         dword ptr [ebp-50h],eax  
009A1B9B  mov         dword ptr [ebp-4Ch],edx  
009A1B9E  mov         eax,dword ptr [ebp-50h]  
009A1BA1  mov         ecx,dword ptr [ebp-4Ch]  
009A1BA4  mov         dword ptr [a],eax  
009A1BA7  mov         dword ptr [ebp-4],ecx  

So a struct of at most 8 bytes is returned by value in registers.
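
The comparison with a 64-bit return value can be modelled as follows. This is only an illustration of the register pairing seen in the listing (x in eax, y in edx); Pair, pack_edx_eax and unpack_edx_eax are hypothetical names, and the signed conversions assume the usual two's-complement wraparound.

```cpp
#include <cassert>
#include <cstdint>

struct Pair {
    std::int32_t x;
    std::int32_t y;
};

// Builds the 64-bit value edx:eax that would carry the struct back:
// x travels in eax (low half), y in edx (high half).
std::uint64_t pack_edx_eax(Pair p) {
    std::uint64_t eax = static_cast<std::uint32_t>(p.x);
    std::uint64_t edx = static_cast<std::uint32_t>(p.y);
    return (edx << 32) | eax;
}

// What the caller does after the call: store eax and edx back into the struct.
Pair unpack_edx_eax(std::uint64_t edx_eax) {
    Pair p;
    p.x = static_cast<std::int32_t>(static_cast<std::uint32_t>(edx_eax));
    p.y = static_cast<std::int32_t>(static_cast<std::uint32_t>(edx_eax >> 32));
    return p;
}

// Usage: the round trip preserves both members.
void check_register_return() {
    Pair p = {1, -2};
    assert(pack_edx_eax(p) == 0xFFFFFFFE00000001ull);
    Pair q = unpack_edx_eax(pack_edx_eax(p));
    assert(q.x == 1 && q.y == -2);
}
```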

Add one more member variable, z, to the struct.

struct Struct {
    int x, y, z;
};

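Registers are no longer used to hold the return value; instead, the address ebp - 0x24 is passed to the function as an argument.

After bar returns, the 12 bytes pointed to by the return value in eax are first copied to the memory at ebp - 0x0C, and then the memory at ebp - 0x0C is copied to ebp - 0x18, which is where the local variable b lives.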

    Struct b = bar();
.text:00401146 lea     eax, [ebp+a]
.text:00401149 push    eax                             ; a
.text:0040114A call    ?bar@@YA?AUStruct@@XZ           ; bar(void)
.text:0040114A
.text:0040114F add     esp, 4
.text:00401152 mov     ecx, [eax+Struct.x]
.text:00401154 mov     [ebp+temp.x], ecx
.text:00401157 mov     edx, [eax+Struct.y]
.text:0040115A mov     [ebp+temp.y], edx
.text:0040115D mov     eax, [eax+Struct.z]
.text:00401160 mov     [ebp+temp.z], eax
.text:00401163 mov     ecx, [ebp+temp.x]
.text:00401166 mov     [ebp+b.x], ecx
.text:00401169 mov     edx, [ebp+temp.y]
.text:0040116C mov     [ebp+b.y], edx
.text:0040116F mov     eax, [ebp+temp.z]
.text:00401172 mov     [ebp+b.z], eax

Inside bar, the passed-in pointer is used directly as the local variable a for the assignments.

Struct *__cdecl bar(Struct *a)
{
  a->x = 1;
  a->y = 0;
  a->z = 0;
  scanf_s("%d%d%d\n", a, &a->y, &a->z);
  return a;
}

So two struct copies take place over the whole process.
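
Written back as source code, the hidden-pointer return and the two copies look roughly like the sketch below. It mirrors the debug-build listing under stated assumptions: bar_lowered and call_bar are hypothetical names, and the values 1, 2, 3 are placeholders.

```cpp
#include <cassert>

struct Struct {
    int x, y, z;
};

// `Struct bar()` after lowering: the caller supplies the address of the
// return slot, and the same address comes back in eax.
Struct *bar_lowered(Struct *ret) {
    ret->x = 1;  // placeholder values for illustration
    ret->y = 2;
    ret->z = 3;
    return ret;
}

// Caller side of `Struct b = bar();` in an unoptimized build.
Struct call_bar() {
    Struct ret_slot;                       // stack slot whose address is pushed
    Struct *eax = bar_lowered(&ret_slot);  // lea / push / call / add esp, 4
    Struct temp = *eax;                    // first copy: *eax -> temporary
    Struct b = temp;                       // second copy: temporary -> b
    return b;
}

// Usage: b ends up with the values written through the hidden pointer.
void check_hidden_pointer_return() {
    Struct b = call_bar();
    assert(b.x == 1 && b.y == 2 && b.z == 3);
}
```

An optimizing build would normally construct the result directly in b and drop both copies, so this pattern is mostly a sign of unoptimized code.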

If bar takes arguments of its own, the address of the struct (the local variable a) is passed as the first argument.

    Struct b = bar(x);
00761161  mov         ecx,dword ptr [x]		; x  
00761164  push        ecx  
00761165  lea         edx,[ebp-2Ch]			; a  
00761168  push        edx  
00761169  call        bar (0761100h)  
0076116E  add         esp,8  

Make the struct larger still: two copies are still performed, but they are now implemented with rep movs instructions.

    Struct b = bar();
00A8114B  lea         eax,[ebp-90h]  
00A81151  push        eax  
00A81152  call        bar (0A81100h)  
00A81157  add         esp,4  
00A8115A  mov         ecx,0Ch  
00A8115F  mov         esi,eax  
00A81161  lea         edi,[ebp-30h]  
00A81164  rep movs    dword ptr es:[edi],dword ptr [esi]  
00A81166  mov         ecx,0Ch  
00A8116B  lea         esi,[ebp-30h]  
00A8116E  lea         edi,[b]  
00A81171  rep movs    dword ptr es:[edi],dword ptr [esi] 


Original article: https://blog.csdn.net/qq_45323960/article/details/133574262