• How to convert a dataset annotated with word-level indices into one annotated with char-level indices


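    Problem description:

    In the original dataset, annotations are indexed by word, and counting starts from 1, for example '[13&&did 14&&n't , 20&&as 21&&well 22&&as]'.

    The goal is to convert these annotations to a char-level format whose indices start from 0.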
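
    To make the target format concrete, the following minimal sketch computes the 0-based character offset of a word from its 1-based word index. The sample sentence and the helper name word_to_char_start are illustrative assumptions, not taken from the dataset or the original script, and the sketch assumes tokens are separated by single spaces.

```python
# Hypothetical sample sentence (not from the dataset); tokens are separated by single spaces.
sentence = "the zoom is not as good"


def word_to_char_start(sentence: str, word_index_1based: int) -> int:
    """Return the 0-based character offset of a word given its 1-based word index."""
    words = sentence.split(' ')
    # Characters before the word: each preceding word plus the single space that follows it.
    return sum(len(w) + 1 for w in words[:word_index_1based - 1])


start = word_to_char_start(sentence, 4)  # the 4th word is "not"
print(start, sentence[start:start + 3])  # prints: 12 not
```

    The word "not" is word 4 in the 1-based word format, and its characters occupy offsets 12, 13 and 14 in the 0-based char format.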

    Solution:

    # Convert an English dataset stored with word-level indices into char-level indices,
    # so that it has the same format as the Chinese dataset.
    path = '/home/qtxu/Sentiment-SPN/data/Camera-COQE/dev.txt'
    path_w = '/home/qtxu/Sentiment-SPN/data/Camera-COQE/dev_char.txt'


    def obtain_index(cur_ele):
        """Parse one element such as '[13&&did 14&&n't , 20&&as 21&&well 22&&as]'.

        Other examples: '[10&&would 11&&not , 17&&difference]', '[10&&not , 16&&clearer]'.
        Returns the 0-based word indices and the span text joined by spaces.
        """
        len_ele = len(cur_ele)
        if ' ,' in cur_ele:  # special case: keep only the part after the first comma
            start_index = cur_ele.find(',')
            cur_ele = '[' + cur_ele[start_index + 2:]
        if len_ele == 2:  # empty element '[]'
            return [], ''
        cur_ele = cur_ele[1:-1]
        # Word indices in the source file start from 1, so subtract 1.
        index_list = [int(ele.split('&&')[0]) - 1 for ele in cur_ele.split(' ')]
        span_str = ' '.join(ele.split('&&')[1] for ele in cur_ele.split(' '))
        return index_list, span_str


    def word_to_char(sentence, span, span_index):
        """Re-index a span character by character, using 0-based offsets into the sentence."""
        if len(span) == 0:
            return '[]'
        span_start_index = span_index[0]
        # The length of the text preceding the span gives the span's offset.
        front_str = ' '.join(sentence.split(' ')[:span_start_index])
        span_len = len(front_str)
        # Skip the space between the preceding text and the span
        # (there is none when the span starts the sentence).
        i = 0 if span_start_index == 0 else 1
        result_str = ""
        for char in span:  # note: spaces inside the span are indexed as well
            result_str += f"{span_len + i}&&{char} "
            i += 1
        # Strip the trailing space
        return '[' + result_str.rstrip() + ']'


    with open(path, 'r') as fr, open(path_w, 'w') as fw:
        for line in fr:
            try:
                # Sentence lines have the form "<sentence>\t<label>" and are copied unchanged.
                sent, label = line.strip().split('\t')
                fw.write(line)
            except ValueError:
                # Any other line is a quintuple belonging to the most recent sentence.
                if '[[];[];[];[];[]]' in line:
                    fw.write(line)
                else:
                    cur_line = line.strip()[1:-1]
                    sub, obj, asp, op, polarity = cur_line.split(';')
                    sub_index, sub_span = obtain_index(sub)
                    obj_index, obj_span = obtain_index(obj)
                    asp_index, asp_span = obtain_index(asp)
                    op_index, op_span = obtain_index(op)
                    sub_char = word_to_char(sent, sub_span, sub_index)
                    obj_char = word_to_char(sent, obj_span, obj_index)
                    asp_char = word_to_char(sent, asp_span, asp_index)
                    op_char = word_to_char(sent, op_span, op_index)
                    # The polarity field is copied as-is.
                    char_quintuple = '[' + ';'.join([sub_char, obj_char, asp_char, op_char, polarity]) + ']'
                    fw.write(char_quintuple + '\n')
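
    One way to sanity-check the converted output is to verify that every 'index&&char' pair really points at that character in the sentence. The helper check_char_span and the sample sentence below are hypothetical; this is a sketch under those assumptions, not part of the original script.

```python
import re


def check_char_span(sentence: str, char_span: str) -> bool:
    """Return True if every 'index&&char' pair in char_span matches sentence[index]."""
    # Each pair is a run of digits, the '&&' separator, and exactly one character.
    pairs = re.findall(r'(\d+)&&(.)', char_span)
    return all(
        int(idx) < len(sentence) and sentence[int(idx)] == ch
        for idx, ch in pairs
    )


# Hypothetical sample sentence, not taken from the dataset.
sentence = "the zoom is not as good"
print(check_char_span(sentence, "[12&&n 13&&o 14&&t]"))  # offsets match "not"
print(check_char_span(sentence, "[13&&n 14&&o 15&&t]"))  # shifted by one, so the check fails
```

    An empty span '[]' contains no pairs and therefore passes the check trivially.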

  • Original article: https://blog.csdn.net/weixin_41862755/article/details/133255072