Python 3 object serialization: saving Python objects from memory to local files and loading them back from local files (continuously updated...)


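This post focuses on how to save various kinds of Python objects to local files and, conversely, how to load Python objects from local files.
(Note: in general, the same tool should be used for both writing and reading. Where a file can be read or written across tools, I point it out at the relevant place.)

Databases are out of scope here; I have covered Python's interaction with databases in separate posts.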
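1. Reading and writing file streams with Python 3 built-in functions

The built-in function open() opens a local file directly and returns a file stream (a file object).
Parameters:
- The file path
- The mode, which defaults to r (read-only). Other options: w (write, which truncates existing content) and a (append); the b in rb/wb means the file is handled as binary¹
- encoding: the text encoding, commonly utf-8 or gbk
There are two common ways to use it. One is to call open() as an ordinary statement, work with the returned file stream, and remember to call close() at the end. The other is to use open() as a context manager, as in with open('file.txt') as f:, where the file stream stays open while the code inside the with block runs and is closed automatically once the block finishes.
(Performing I/O on f outside the with block raises ValueError: I/O operation on closed file.)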

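Operations on a file stream:
- readlines(): for a text file, returns the entire content as a list in which each line is one element (the trailing newline is kept)
- read(): for a text file, returns the entire content as a single string
- write(str): writes a string
- writelines(obj): writes every element of an iterable; the elements must be strings. Notes: 1. No newlines are added automatically. 2. A set can also be written, but its iteration order is arbitrary; a dict with string keys can be written too, but only the keys are written, in the dict's insertion order.
- flush()²: flushes the internal buffer, that is, pushes the pending writes out to the file (this also happens automatically when the file is closed)
- close(): closes the file stream (no explicit call is needed when using with open())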
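The operations listed above can be sketched as follows. This is a minimal illustration rather than a recommended pattern; the scratch file name demo.txt and the sample lines are assumptions made up for the example.

```python
import os
import tempfile

# Hypothetical scratch file; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

lines = ["first line", "second line"]

# "w" creates or truncates the file; writelines() adds no newlines itself.
with open(path, "w", encoding="utf-8") as f:
    f.write("header\n")
    f.writelines(line + "\n" for line in lines)

# read() returns the whole content as one string.
with open(path, encoding="utf-8") as f:
    whole_text = f.read()

# readlines() returns a list with one element per line, newline included.
with open(path, encoding="utf-8") as f:
    line_list = f.readlines()

# "a" appends to the end instead of overwriting.
with open(path, "a", encoding="utf-8") as f:
    f.write("appended\n")

with open(path, encoding="utf-8") as f:
    final_lines = [line.rstrip("\n") for line in f]

# Outside the with block the stream is already closed.
try:
    f.read()
except ValueError as err:
    closed_error = str(err)
```

Iterating over the file object directly, as in the last read, avoids loading a large file into memory all at once.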
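2. JSON

Python 3 ships with the json package.

Load a local file into memory: json.load(file_stream)
Save a Python object to a local file: json.dump(python_object, file_stream)
(The file stream is the one returned by open().)

Parse a JSON string into a Python object (a dict for a JSON object): json.loads(str)
Serialize a Python object such as a dict into a JSON string: json.dumps(obj)

Parameters shared by dump() and dumps():
- ensure_ascii: defaults to True, which escapes every non-ASCII character as \uXXXX, so text such as Chinese is not human-readable in the output. It is therefore usually set to False explicitly.
The advantage of storing data as JSON is that it is cross-platform and cross-language.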
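A minimal sketch of the four functions and of the effect of ensure_ascii. The record contents and the file name record.json are assumptions for illustration; indent=2 is an optional extra that only affects readability.

```python
import json
import os
import tempfile

# Hypothetical sample data containing non-ASCII text.
record = {"k": "A1", "全文": "内容1", "scores": [1, 2, 3]}

# Default ensure_ascii=True: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(record)
# ensure_ascii=False keeps the original characters readable.
readable = json.dumps(record, ensure_ascii=False)

path = os.path.join(tempfile.mkdtemp(), "record.json")

# dump() writes to a file stream opened with open().
with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False, indent=2)

# load() reads it back into a Python object.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

# loads() parses a JSON string; both string forms decode to the same dict.
from_escaped = json.loads(escaped)
from_readable = json.loads(readable)
```

Both forms are valid JSON and load back to the same object; ensure_ascii only changes how the text looks on disk.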

Some additional JSON tools:
- JSON formatter (it can format even JSON that is not strictly valid, which is far more readable than the raw text): https://www.bejson.com/
3. XML

Online formatter: https://c.runoob.com/front-end/710/

1. xml.etree.ElementTree

A simple example XML file

Suppose we have the following XML file named example.xml, which contains some book information:
```
<library>
    <book>
        <title>Python入门</title>
        <author>张三</author>
        <year>2021</year>
    </book>
    <book>
        <title>深入理解计算机系统</title>
        <author>李四</author>
        <year>2019</year>
    </book>
</library>
```
Reading the XML file

First we need to read and parse the XML file. The parse() function in the xml.etree.ElementTree module does this:
```
import xml.etree.ElementTree as ET

# Parse the XML file
tree = ET.parse('example.xml')
root = tree.getroot()
```
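Extracting data

Once the XML file is parsed, extracting the data we need is easy. For example, to get the title and author of every book:
```
for book in root.findall('book'):
    title = book.find('title').text
    author = book.find('author').text
    print(f'Title: {title}, Author: {author}')
```
Expected output:
```
Title: Python入门, Author: 张三
Title: 深入理解计算机系统, Author: 李四
```
↑ Note that with a plain tag name, find and findall only search the direct children of the current element, so a lookup that skips a level (for example root.findall('title')) finds nothing.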
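When a lookup across levels is needed, ElementTree's limited XPath support (the .// prefix) and the iter() method both search all descendants. A minimal sketch, assuming an inline XML string that mirrors part of example.xml so the snippet runs without the file:

```python
import xml.etree.ElementTree as ET

# Inline stand-in for example.xml (assumption: same structure, fewer fields).
xml_text = """
<library>
    <book>
        <title>Python入门</title>
        <author>张三</author>
    </book>
    <book>
        <title>深入理解计算机系统</title>
        <author>李四</author>
    </book>
</library>
""".strip()

root = ET.fromstring(xml_text)

# A plain tag name only matches direct children of root, so this is empty.
direct = root.findall('title')

# './/title' matches title elements at any depth below root.
descendants = [t.text for t in root.findall('.//title')]

# iter() walks the whole subtree and filters by tag.
via_iter = [a.text for a in root.iter('author')]
```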

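Modifying the XML file

If we want to modify the XML file, for example to add a new book entry, we can do this:
```
# Create a new book element
new_book = ET.Element('book')
new_title = ET.SubElement(new_book, 'title')
new_title.text = '机器学习'
new_author = ET.SubElement(new_book, 'author')
new_author.text = '王五'
new_year = ET.SubElement(new_book, 'year')
new_year.text = '2022'

# Append the new book to library
root.append(new_book)

# Save the changes to a new file; encoding='utf-8' keeps the Chinese text
# readable instead of writing it as numeric character references
tree.write('modified_example.xml', encoding='utf-8', xml_declaration=True)
```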
4. Using the pickle package

Official pickle documentation: https://docs.python.org/3/library/pickle.html

Save a Python object to a local file:
```
import pickle

data = [1, 2, 3, {'k': 'A1', '全文': '内容1'}]  # your data

with open('data.pkl', 'wb') as file:
    pickle.dump(data, file)
```
Load a local file into memory:
```
import pickle

with open('data.pkl', 'rb') as file:
    loaded_data = pickle.load(file)

print(loaded_data)  # Output: [1, 2, 3, {'k': 'A1', '全文': '内容1'}]
```
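5. Handling CSV: the csv package

Official csv documentation: csv — CSV File Reading and Writing — Python 3.11.0 documentation

Read data from a CSV file:
```
import csv

csv_path = 'data.csv'  # placeholder: path of your CSV file
with open(csv_path, newline='') as csvfile:
    spamreader = csv.reader(csvfile)
    for row in spamreader:
        # Each row is a list; each element is one cell of that row
        print(row)
```
Write data to a CSV file:
```
import csv

csv_path = 'data.csv'  # placeholder: path of your CSV file
row = ['cell 1', 'cell 2', 'cell 3']  # one row as a list; each element is one cell
with open(csv_path, 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile)
    spamwriter.writerow(row)
```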
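A round trip through csv.writer and csv.reader shows one thing that is easy to miss: every cell comes back as a string, so numbers must be converted again after reading. A minimal sketch; the file name books.csv and the sample rows are assumptions for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows: a header plus two records, one with a comma in a cell.
rows = [
    ["name", "year", "note"],
    ["Python入门", 2021, "contains, a comma"],
    ["深入理解计算机系统", 2019, "plain"],
]

path = os.path.join(tempfile.mkdtemp(), "books.csv")

# newline='' lets the csv module control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as csvfile:
    csv.writer(csvfile).writerows(rows)

with open(path, newline="", encoding="utf-8") as csvfile:
    loaded = list(csv.reader(csvfile))

# Cells are always read back as str, so convert numeric columns explicitly.
years = [int(row[1]) for row in loaded[1:]]
```

The cell containing a comma is quoted automatically on writing and restored intact on reading.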
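6. Handling Excel

1 xlsx: the openpyxl package

Official Chinese documentation for openpyxl: openpyxl - A Python library to read/write Excel 2010 xlsx/xlsm files (openpyxl 3.0.7 documentation)
Install openpyxl: pip install openpyxl
1. Create a Workbook object (equivalent to one Excel file)
  1. Create a new one (a sheet is created by default):
```
from openpyxl import Workbook
wb=Workbook()
```
  2. Open an existing Excel file:
```
from openpyxl import load_workbook
wb2 = load_workbook('test.xlsx')
```
2. Get the active sheet (the first sheet by default): ws = wb.active
3. Create a sheet: ws = wb.create_sheet(sheet_name) (returns the sheet)
4. Get a sheet by name: ws = wb[sheet_name]
5. List all sheet names: wb.sheetnames (returns a list)
6. Access a cell (a cell that does not exist yet is created automatically with an empty value)
  Row and column indexes start from 1
  1. ws.cell(row=line_index, column=column_index, value=new_value): if value is given, it is assigned to the cell
  2. ws['A1'] (access directly by cell coordinate)
7. ws.iter_rows(min_row=None, max_row=None, min_col=None, max_col=None, values_only=False): iterates row by row, starting from A1 by default
  With values_only=True, each row is yielded as a tuple of cell values instead of Cell objects
8. Insert objects into Excel:
  1. Append a list as a new row: ws.append(list_object)
  2. Append the values of a dict as a new row: ws.append(list(dict_object.values()))
9. Save the Workbook object to an Excel file (warning: this overwrites an existing file without any prompt): wb.save('an_excel.xlsx')
10. Other notes
  1. When programming with openpyxl on Linux, I found that the only character rejected in a sheet name was /. After downloading the Excel file to my local machine, though, it turned out that the full-width ／ is not allowed either, and Office silently deletes the illegal characters. So it is best not to create such names in code in the first place.
  2. Trying to open a .xls file with openpyxl gives the following error: InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.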
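The steps above can be strung together into one round trip. This is a minimal sketch that assumes openpyxl is installed; the file name demo.xlsx, the sheet names, and the sample values are made up for illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")

# Create a workbook and take its default sheet.
wb = Workbook()
ws = wb.active
ws.title = "books"

# Append a list as one row, then the values of a dict as the next row.
ws.append(["title", "year"])
record = {"title": "Python入门", "year": 2021}
ws.append(list(record.values()))

# Write single cells: by 1-based row/column index, or by coordinate.
ws.cell(row=3, column=1, value="机器学习")
ws["B3"] = 2022

# Add a second, empty sheet and save (this overwrites an existing file).
wb.create_sheet("notes")
wb.save(path)

# Load the file again and read the rows back as tuples of plain values.
wb2 = load_workbook(path)
sheet_names = wb2.sheetnames
rows = list(wb2["books"].iter_rows(values_only=True))
```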
2 xls: the xlrd package

xlrd — xlrd 2.0.1 documentation

Install: pip install xlrd

Usage example:
```
import xlrd
book = xlrd.open_workbook("myfile.xls")
print("The number of worksheets is {0}".format(book.nsheets))
print("Worksheet name(s): {0}".format(book.sheet_names()))
sh = book.sheet_by_index(0)
print("{0} {1} {2}".format(sh.name, sh.nrows, sh.ncols))
print("Cell D30 is {0}".format(sh.cell_value(rowx=29, colx=3)))
for rx in range(sh.nrows):
    print(sh.row(rx))
```
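7. Word files: the python-docx library

See this blog post: Word 神器 python-docx - 知乎

8. Using the numpy package

1 Serializing several objects at once

By convention the file is saved with the .npz suffix.

Official documentation: https://numpy.org/devdocs/reference/generated/numpy.savez.html
https://numpy.org/devdocs/reference/generated/numpy.savez_compressed.html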
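A minimal sketch of saving several arrays into one .npz file and loading them back. The array contents, the keyword names features and labels, and the file name arrays.npz are assumptions for illustration; numpy.savez takes the same arguments and simply skips the compression.

```python
import os
import tempfile

import numpy as np

# Hypothetical arrays to store together.
features = np.arange(6, dtype=np.float32).reshape(2, 3)
labels = np.array([0, 1])

path = os.path.join(tempfile.mkdtemp(), "arrays.npz")

# Each keyword argument becomes the name of one array inside the archive.
np.savez_compressed(path, features=features, labels=labels)

# np.load returns a lazy archive; index it by name to get each array.
with np.load(path) as data:
    names = sorted(data.files)
    loaded_features = data["features"]
    loaded_labels = data["labels"]
```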

9. Using the scipy package

1 scipy.sparse

By convention the file is saved with the .npz suffix.

Save an object: save_npz() (official documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html )
```
import numpy as np
import scipy.sparse

sparse_matrix = scipy.sparse.csc_matrix(np.array([[0, 0, 3], [4, 0, 0]]))
scipy.sparse.save_npz('/tmp/sparse_matrix.npz', sparse_matrix)
```
Load a local object: load_npz() (official documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.load_npz.html )
```
import scipy.sparse
sparse_matrix = scipy.sparse.load_npz('/tmp/sparse_matrix.npz')
```
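10. Using the pandas package

1 CSV/Excel files

11. Using the sklearn package

12. Using the PyTorch package

By convention the file is saved with the .pt or .pth suffix.

Official PyTorch tutorial on saving and loading models: Saving and Loading Models — PyTorch Tutorials 1.12.1+cu102 documentation
Other reference: python - How do I save a trained model in PyTorch? - Stack Overflow

Save an object to disk: torch.save(obj, path)
Load an object from disk into memory: torch.load(path)
(path can be a path string or a file stream)

Parameters of load():
- map_location: a function, torch.device, string, or dict that specifies the device the stored tensors are mapped to when loaded.
Get the model parameters: model.state_dict() and optimizer.state_dict(). The returned state_dict is a dict that maps each layer to its parameter tensors; for a model it only has entries for layers with learnable parameters and for registered buffers. Optimizer objects have a state_dict as well.
Load the parameters back into the model: model.load_state_dict(state_dict)

So saving the model parameters directly is: torch.save(model.state_dict(), path)
And loading them directly is: model.load_state_dict(torch.load(path))

One situation to watch out for: suppose that after every epoch you keep the checkpoint of the epoch with the best metric so far. The state_dict is a mutable OrderedDict whose tensors share storage with the model's parameters, so a direct reference (best_state = model.state_dict()) keeps changing as the model continues to train. A deep copy is therefore needed (best_state = copy.deepcopy(model.state_dict()))³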
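The difference between a direct reference and a deep copy can be checked on a tiny model. This is a minimal sketch that assumes PyTorch is installed; the nn.Linear model, the in-place add_ that stands in for a training step, and the file name best.pt are all assumptions for illustration.

```python
import copy
import os
import tempfile

import torch
from torch import nn

# Hypothetical tiny model standing in for a real network.
model = nn.Linear(2, 1)

# A direct reference shares storage with the live parameters...
reference = model.state_dict()
# ...while a deep copy is an independent snapshot.
snapshot = copy.deepcopy(model.state_dict())

# Stand-in for further training: change the weights in place.
with torch.no_grad():
    model.weight.add_(1.0)

reference_follows_model = torch.equal(reference["weight"], model.weight)
snapshot_is_frozen = not torch.equal(snapshot["weight"], model.weight)

# Save the snapshot and load it into a fresh model of the same shape.
path = os.path.join(tempfile.mkdtemp(), "best.pt")
torch.save(snapshot, path)

restored = nn.Linear(2, 1)
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored_matches_snapshot = torch.equal(restored.weight, snapshot["weight"])
```

In a real training loop the deepcopy would run only when the validation metric improves, and the saved file would be written from that copy.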

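13. Using the transformers package

14. Handling PDF

I still do not know how to extract formatting directly from a PDF with Python. GPT-4 said it had no solution either.

(Update 2023.6.21: from a quick look, PyMuPDF seems to be able to do it; I have not tried it yet.)

1 PyPDF2: extracting text page by page

See my earlier post: PyPDF2: Working with PDF files in Python

2 tabula-py: extracting tables

tabula-py · PyPI

Nothing outside of tables can be extracted (so my own solution is to combine it with other PDF packages).
Install: pip install tabula-py
Notes: 1. This package requires Java to be installed. 2. It cannot extract every table in a document and sometimes misses some, so check the results yourself where possible.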
```
import tabula
# The PDF file to read
file = "中国计算机学会推荐国际学术会议和期刊目录-2022（拟定）.pdf"
# Read the tables in the PDF with tabula
tables = tabula.read_pdf(file, pages="all", multiple_tables=True)  # pages can be an int, a list of ints, or 'all'
# `tables` is now a list of all extracted tables; each table is a pandas DataFrame
# Each table can be accessed by index, for example:
table = tables[20]
# Print the table at index 20 (the 21st table)
print(table)
```
Output:
```
    序号     会议简称                                               会议全称        出版社                                                 网址
0    1       CF  ACM International Conference on Computing Fron...        ACM                http://dblp.uni-trier.de/db/conf/cf
1    2   SYSTOR   ACM International Systems and Storage Conference        ACM  http://dblp.uni-trier.de/db/conf/systor/index....
2    3     NOCS  ACM/IEEE International Symposium on Networks-o...   ACM/IEEE              http://dblp.uni-trier.de/db/conf/nocs
3    4     ASAP  Application-Specific Systems, Architectures, a...       IEEE              http://dblp.uni-trier.de/db/conf/asap
4    5  ASP-DAC  Asia and South Pacific Design Automation Confe...   ACM/IEEE            http://dblp.uni-trier.de/db/conf/aspdac
5    6      ETS                            European Test Symposium       IEEE              http://dblp.uni-trier.de/db/conf/ets/
6    7      FPL          Field Programmable Logic and Applications       IEEE              http://dblp.uni-trier.de/db/conf/fpl/
7    8     FCCM       Field-Programmable Custom Computing Machines       IEEE             http://dblp.uni-trier.de/db/conf/fccm/
8    9  GLSVLSI                      Great Lakes Symposium on VLSI   ACM/IEEE           http://dblp.uni-trier.de/db/conf/glvlsi/
9   10      ATS                          IEEE Asian Test Symposium       IEEE              http://dblp.uni-trier.de/db/conf/ats/
10  11     HPCC  IEEE International Conference on High Performa...       IEEE             http://dblp.uni-trier.de/db/conf/hpcc/
11  12     HiPC  IEEE International Conference on High Performa...  IEEE/ ACM   http://dblp.uni-trier.de/db/conf/hipc/index.html
12  13  MASCOTS  IEEE International Symposium on Modeling, Anal...       IEEE          http://dblp.uni-trier.de/db/conf/mascots/
```
3 PDFMiner

pdfminer · PyPI

Install: pip install pdfminer
```
pdf2txt.py [-P password] [-o output] [-t text|html|xml|tag]
             [-O output_dir] [-c encoding] [-s scale] [-R rotation]
             [-Y normal|loose|exact] [-p pagenos] [-m maxpages]
             [-S] [-C] [-n] [-A] [-V]
             [-M char_margin] [-L line_margin] [-W word_margin]
             [-F boxes_flow] [-d]
             input.pdf
```
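It is fairly complicated. I tried the txt and html output formats. TXT output is a plain conversion to text; in my opinion PyPDF2 is the better choice for that, since it at least keeps track of which page the text came from. HTML output does not seem to reproduce the background colors and fonts of the original file, although it does represent them in some form.
Perhaps because my HTML is not good enough, the result looked too messy to me; I did not investigate further and gave up on this package.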

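4 pdfplumber

jsvine/pdfplumber: Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Install: pip install pdfplumber

Extracting tables (in my experience it does not work as well as tabula):
```
import pdfplumber

pdf_file_path = 'example.pdf'  # placeholder: path of your PDF file
page_index = 0  # placeholder: zero-based page index

with pdfplumber.open(pdf_file_path) as pdf:
    page = pdf.pages[page_index]
    if page.extract_table() is not None:  # a table was extracted, do something!
        pass
```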
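5 PyMuPDF

Official documentation: PyMuPDF 1.22.3 documentation

Install: pip install pymupdf
1. Open the file (the returned document behaves like a sequence of fitz.fitz.Page objects)
```
import fitz

pdf_path = 'example.pdf'  # placeholder: path of your PDF file
doc = fitz.open(pdf_path)
```
2. Extract text from a fitz.fitz.Page: page.get_text() returns a string
  Optional argument:
  - "blocks": returns a list of tuples, each of which looks like (107.57489776611328, 246.74249267578125, 505.5909118652344, 282.74249267578125, '专业学位硕士学位论文 \n', 0, 0) and includes the position of the block
  - "words": returns each word together with its position
3. Extract images from a fitz.fitz.Page: page.get_images() returns the list of images referenced by the page. To save the images as PNG: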
```
image_list = page.get_images()
for image_index, img in enumerate(image_list, start=1):  # enumerate the image list
    xref = img[0]  # get the XREF of the image
    pix = fitz.Pixmap(doc, xref)  # create a Pixmap

    if pix.n - pix.alpha > 3:  # CMYK: convert to RGB first
        pix = fitz.Pixmap(fitz.csRGB, pix)

    pix.save("image_" + str(image_index) + ".png")  # save the image as png
    pix = None
```
6 pdfreader

PyPI page: pdfreader · PyPI

15. Handling LaTeX

The pylatex package
Reference posts:
1. Pylatex module in python - GeeksforGeeks
16. Other references used while writing this post
Original post: https://blog.csdn.net/PolarisRisingWar/article/details/126407274
