
First, a small script that creates two example XML files in the `output/` directory:

```python
import xml.etree.ElementTree as ET


def _create_word_object(sentence_object, number, word_string):
    word = ET.SubElement(sentence_object, 'word', number=str(number))
    string = ET.SubElement(word, 'string', number=str(number))
    string.text = word_string


xml_doc_1 = ET.Element('paragraph', number='1')
xml_doc_2 = ET.Element('paragraph', number='1')
sentence_1 = ET.SubElement(xml_doc_1, 'sentence', number='1')
sentence_2 = ET.SubElement(xml_doc_2, 'sentence', number='1')

_create_word_object(sentence_1, 1, 'This')
_create_word_object(sentence_2, 1, 'This')
_create_word_object(sentence_1, 2, 'is')
_create_word_object(sentence_2, 2, 'is')
_create_word_object(sentence_1, 3, 'first')
_create_word_object(sentence_2, 3, 'second')
_create_word_object(sentence_1, 4, 'example')
_create_word_object(sentence_2, 4, 'example')
_create_word_object(sentence_1, 5, 'sentence')
_create_word_object(sentence_2, 5, 'sentence')

tree_1 = ET.ElementTree(xml_doc_1)
tree_2 = ET.ElementTree(xml_doc_2)
tree_1.write('output/example_1.xml', encoding='UTF-8', xml_declaration=True)
tree_2.write('output/example_2.xml', encoding='UTF-8', xml_declaration=True)
```

Then run this script, which iterates through the `example_1.xml` and `example_2.xml` files (using `iglob`) and creates an `output.jsonl` file (saved in the root directory of your project) with the data from the two XML files created in the first step:

```python
import os
from glob import iglob

import jsonlines
import xml.etree.ElementTree as ET


class Word:
    # Minimal model implied by the parser: one word and its position number.
    def __init__(self, number, string):
        self.number = number
        self.string = string

    def create_word_dict(self):
        return {'number': self.number, 'string': self.string}


def parse_xml(file_path):
    # iterparse streams the file instead of loading the whole tree at once.
    for event, element in ET.iterparse(file_path, events=("start", "end",)):
        if event == 'end' and element.tag == 'string':
            yield Word(element.get('number'), element.text)
    return


def write_dicts_from_xmls_in_directory_to_jsonlines_file(parsing_generator):
    path = os.path.abspath(os.path.dirname(os.path.abspath(__file__))) + '/output/*'
    for xml_file_name in iglob(path):
        with jsonlines.open('output.jsonl', mode='a') as writer:
            for next_word in parsing_generator(xml_file_name):
                writer.write(next_word.create_word_dict())


write_dicts_from_xmls_in_directory_to_jsonlines_file(parse_xml)
```
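With very large XML files, `iterparse` only keeps memory flat if already-processed elements are freed as you go. A minimal sketch of this standard `element.clear()` trick, using an in-memory document in place of a real file (the sample XML here is illustrative, not from the article):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Stand-in for a large XML file on disk.
xml_bytes = BytesIO(
    b"<paragraph><sentence>"
    b"<word number='1'><string number='1'>This</string></word>"
    b"<word number='2'><string number='2'>is</string></word>"
    b"</sentence></paragraph>"
)

words = []
for event, element in ET.iterparse(xml_bytes, events=("end",)):
    if element.tag == "string":
        words.append((element.get("number"), element.text))
    element.clear()  # drop the element's contents so memory stays flat

print(words)  # [('1', 'This'), ('2', 'is')]
```

Clearing after each `end` event means the parser never holds more than a small window of the document in memory, which is what makes this approach viable for multi-gigabyte inputs.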

To make the generated files human-readable, the elements can be pretty-printed before writing, by walking the tree iteratively with a queue and rewriting each element's `.text` and `.tail` to carry the right indentation:

```python
def pretty_print(element, indent='    '):
    queue = [(0, element)]  # (level, element)
    while queue:
        level, element = queue.pop(0)
        children = [(level + 1, child) for child in list(element)]
        if children:
            element.text = '\n' + indent * (level + 1)  # for child open
        if queue:
            element.tail = '\n' + indent * queue[0][0]  # for sibling open
        else:
            element.tail = '\n' + indent * (level - 1)  # for parent close
        queue[0:0] = children  # prepend so children come before siblings
```

This file format lets you store every JSON object as a single line of the file, without any extra whitespace, which eventually let me shrink the initial data set to around 450 MB. Each such paragraph object starts with an opening `<paragraph>` tag.
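The compactness and streamability of JSON Lines can be seen with the standard library alone: each record is one `json.dumps` call with whitespace-free separators, and reading back is a plain line-by-line loop. A minimal sketch (the sample records are illustrative):

```python
import json

words = [{'number': '1', 'string': 'This'}, {'number': '2', 'string': 'is'}]

# Write: one compact JSON document per line, no surrounding array or commas.
lines = [json.dumps(w, separators=(',', ':')) for w in words]
jsonl_text = '\n'.join(lines)
print(jsonl_text)
# {"number":"1","string":"This"}
# {"number":"2","string":"is"}

# Read back: parse each line independently, so huge files can be streamed.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
assert parsed == words
```

Because each line stands alone, a consumer never needs the whole file in memory, which is what makes the format a good fit for large data sets like the one described above.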