• Batch-downloading protein .pdb files by UniProt ID / PDB ID


    1. Batch-downloading protein .pdb files by UniProt ID

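    A single UniProt ID may correspond to several NCBI IDs, but AlphaFold provides exactly one PDB file for each UniProt ID. The .pdb files can therefore be fetched in batch with the following code: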

    import requests
    import urllib3

    # Requests made with verify=False raise an InsecureRequestWarning.
    # It does not affect the download, but it clutters the output, so silence it.
    urllib3.disable_warnings()

    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0'}

    def read_file(file_name):
        """Collect the IDs from the header lines (those starting with '>')."""
        pro_swissProt = []
        with open(file_name, 'r') as fp:
            for line in fp:
                if line.startswith('>'):
                    # Drop the leading '>' and the trailing newline.
                    pro_swissProt.append(line[1:].strip())
        return pro_swissProt

    file1 = '/AD/all1.csv'
    ID = read_file(file1)
    not_exist_list = []
    for j, i in enumerate(ID, start=1):
        print(j)
        print(i)
        # The version suffix (model_v1) depends on the AlphaFold DB release and may need updating.
        url = 'https://alphafold.ebi.ac.uk/files/AF-'+i+'-F1-model_v1'+'.pdb'
        print(url)
        r = requests.get(url, headers=headers, verify=False)
        # A missing entry returns an error status or an XML error body starting with '<?'.
        if r.status_code != 200 or r.text.startswith('<?'):
            print(i + ' has no .pdb file.')
            not_exist_list.append(i)
            continue
        with open('/AD/Information/PDB/'+i+'.pdb', 'w') as files:
            for lines in r.text.splitlines():
                files.write(lines)
                files.write('\n')

    # IDs whose .pdb file was not found; look these up manually on the website in case some were missed.
    print(not_exist_list)
    print(len(not_exist_list))
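The "no .pdb file" test in the loop above could also be pulled out into a small helper. The following is only a sketch: the function name `looks_like_pdb` is hypothetical, and it assumes that a usable file is returned with HTTP status 200 and contains at least one ATOM or HETATM record.

```python
def looks_like_pdb(status_code, body):
    """Return True if a response appears to contain PDB coordinates.

    Hypothetical helper: status_code and body are meant to be the
    status_code and text attributes of a requests response.
    """
    if status_code != 200:
        return False
    # Coordinate files in PDB format carry ATOM (or HETATM) records;
    # an XML/HTML error page does not.
    return any(
        line.startswith(("ATOM", "HETATM"))
        for line in body.splitlines()
    )
```

Inside the loop it could be called as `looks_like_pdb(r.status_code, r.text)` before anything is written to disk, so that error pages are never saved as .pdb files.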

    Here, file1 has the following format:

    >Q8BH75
    MGYDVTRFQGDVDEDLICPICSGVLEEPVQAPHCEHAFCNACITQWFSQQQTCPVDRSVVTVAHLRPVPRIMRNMLSKLQIACDNAVFGCSAVVRLDNLMSHLSDCEHNPKRPVTCEQGCGLEMPKDELPNHNCIKHLRSVVQQQQSRIAELEKTSAEHKHQLAEQKRDIQLLKAYMRAIRSVNPNLQNLEETIEYNEILEWVNSLQPARVTRWGGMISTPDAVLQAVIKRSLVESGCPASIVNELIENAHERSWPQGLATLETRQMNRRYYENYVAKRIPGKQAVVVMACENQHMGDDMVQEPGLVMIFAHGVEEI
    >P06727
    MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNALFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEVSQKIGDNLRELQQRLEPYADQLRTQVSTQAEQLRRQLTPYAQRMERVLRENADSLQASLRPHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEGLTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLRGNTEGLQKSLAELGGHLDQQVEEFRRRVEPYGENFNKALVQQMEQLRQKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKEKESQDKTLSLPELEQQQEQQQEQQQEQVQMLAPLES
    >Q60770
    MAPPVSERGLKSVVWRKIKTAVFDDCRKEGEWKIMLLDEFTTKLLSSCCKMTDLLEEGITVIENIYKNREPVRQMKALYFISPTPKSVDCFLRDFGSKSEKKYKAAYIYFTDFCPDSLFNKIKASCSKSIRRCKEINISFIPQESQVYTLDVPDAFYYCYSPDPSNASRKEVVMEAMAEQIVTVCATLDENPGVRYKSKPLDNASKLAQLVEKKLEDYYKIDEKGLIKGKTQSQLLIIDRGFDPVSTVLHELTFQAMAYDLLPIENDTYKYKTDGKEKEAVLEEDDDLWVRVRHRHIAVVLEEIPKLMKEISSTKKATEGKTSLSALTQLMKKMPHFRKQISKQVVHLNLAEDCMNKFKLNIEKLCKTEQDLALGTDAEGQRVKDSMLVLLPVLLNKNHDNCDKIRAVLLYIFGINGTTEENLDRLIHNVKIEDDSDMIRNWSHLGVPIVPPSQQAKPLRKDRSAEETFQLSRWTPFIKDIMEDAIDNRLDSKEWPYCSRCPAVWNGSGAVSARQKPRTNYLELDRKNGSRLIIFVIGGITYSEMRCAYEVSQAHKSCEVIIGSTHILTPRKLLDDIKMLNKSKDKVSFKDE
    >P70452
    MRDRTHELRQGDNISDDEDEVRVALVVHSGAARLGSPDDEFFQKVQTIRQTMAKLESKVRELEKQQVTILATPLPEESMKQGLQNLREEIKQLGREVRAQLKAIEPQKEEADENYNSVNTRMKKTQHGVLSQQFVELINKCNSMQSEYREKNVERIRRQLKITNAGMVSDEELEQMLDSGQSEVFVSNILKDTQVTRQALNEISARHSEIQQLERSIRELHEIFTFLATEVEMQGEMINRIEKNILSSADYVERGQEHVKIALENQKKARKKKVMIAICVSVTVLILAVIIGITITVG
    >P63044
    MSATAATVPPAAPAGEGGPPAPPPNLTSNRRLQQTQAQVDEVVDIMRVNVDKVLERDQKLSELDDRADALQAGASQFETSAAKLKRKYWWKNLKMMIILGVICAIILIIIIVYFST
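The download script only needs the header lines, but the same file can also be read into an ID-to-sequence mapping. Below is a minimal sketch of that, assuming the layout shown above (a `>` header line followed by one or more sequence lines). The name `read_fasta` is hypothetical, and the two sequences in `sample` are cut to their first ten residues for brevity.

```python
def read_fasta(lines):
    """Map each ID (header line without '>') to its sequence string."""
    records = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current_id = line[1:]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are appended, so wrapped sequences also work.
            records[current_id] += line
    return records


# Shortened sample in the same layout as file1.
sample = [">Q8BH75", "MGYDVTRFQG", ">P06727", "MFLKAVVLTL"]
records = read_fasta(sample)
```

To use it on a real file, pass an open file object, for example `with open(file1) as fp: records = read_fasta(fp)`. `list(records)` then gives the same IDs that `read_file` returns.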

    2. Fetching .pdb files from RCSB by PDB ID

    Replace the URL in the first code block with:

    url = 'http://www.rcsb.org/pdb/files/'+i+'.pdb'
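Rather than editing the URL by hand each time, the two address patterns from this post could be wrapped in one function. This is only a sketch: `build_url` and its `source` parameter are hypothetical names, and the URL patterns are copied unchanged from the code above.

```python
def build_url(protein_id, source="alphafold"):
    """Return the download URL for one ID.

    source="alphafold" expects a UniProt ID, source="rcsb" a PDB ID.
    """
    if source == "alphafold":
        return 'https://alphafold.ebi.ac.uk/files/AF-' + protein_id + '-F1-model_v1' + '.pdb'
    if source == "rcsb":
        return 'http://www.rcsb.org/pdb/files/' + protein_id + '.pdb'
    raise ValueError("unknown source: " + source)
```

In the download loop, the line that builds `url` would then read `url = build_url(i, source="rcsb")` when the input file lists PDB IDs (for example 1CRN), and `url = build_url(i)` when it lists UniProt IDs.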

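    PS: I have recently been learning how to work with DSSP but have made no progress so far. Does anyone have a Linux installation package and tutorial for it?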


  • Original article: https://blog.csdn.net/Daisy4/article/details/126088485