
First, a small script that creates two example XML files in the `output/` directory:

```python
import xml.etree.ElementTree as ET


def _create_word_object(sentence_object, number, word_string):
    word = ET.SubElement(sentence_object, 'word', number=str(number))
    string = ET.SubElement(word, 'string', number=str(number))
    string.text = word_string


xml_doc_1 = ET.Element('paragraph', number='1')
xml_doc_2 = ET.Element('paragraph', number='1')
sentence_1 = ET.SubElement(xml_doc_1, 'sentence', number='1')
sentence_2 = ET.SubElement(xml_doc_2, 'sentence', number='1')

_create_word_object(sentence_1, 1, 'This')
_create_word_object(sentence_2, 1, 'This')
_create_word_object(sentence_1, 2, 'is')
_create_word_object(sentence_2, 2, 'is')
_create_word_object(sentence_1, 3, 'first')
_create_word_object(sentence_2, 3, 'second')
_create_word_object(sentence_1, 4, 'example')
_create_word_object(sentence_2, 4, 'example')
_create_word_object(sentence_1, 5, 'sentence')
_create_word_object(sentence_2, 5, 'sentence')

tree_1 = ET.ElementTree(xml_doc_1)
tree_2 = ET.ElementTree(xml_doc_2)
tree_1.write('output/example_1.xml', encoding='UTF-8', xml_declaration=True)
tree_2.write('output/example_2.xml', encoding='UTF-8', xml_declaration=True)
```

Then run this script, which iterates through the `example_1.xml` and `example_2.xml` files (using `iglob`) and creates an `output.jsonl` file (saved in the root directory of your project) with the data from the two XML files created in the first step:

```python
import os
from glob import iglob

import jsonlines
import xml.etree.ElementTree as ET


class Word:
    # Minimal model implied by the parser: one word and its position number.
    def __init__(self, number, string):
        self.number = number
        self.string = string

    def create_word_dict(self):
        return {'number': self.number, 'string': self.string}


def parse_xml(file_path):
    # iterparse streams the file instead of loading the whole tree at once.
    for event, element in ET.iterparse(file_path, events=("start", "end",)):
        if event == 'end' and element.tag == 'string':
            yield Word(element.get('number'), element.text)
    return


def write_dicts_from_xmls_in_directory_to_jsonlines_file(parsing_generator):
    path = os.path.abspath(os.path.dirname(os.path.abspath(__file__))) + '/output/*'
    for xml_file_name in iglob(path):
        with jsonlines.open('output.jsonl', mode='a') as writer:
            for next_word in parsing_generator(xml_file_name):
                writer.write(next_word.create_word_dict())


write_dicts_from_xmls_in_directory_to_jsonlines_file(parse_xml)
```
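With very large XML files, `iterparse` only keeps memory flat if already-processed elements are freed as you go. A minimal sketch of this standard `element.clear()` trick, using an in-memory document in place of a real file (the sample XML here is illustrative, not from the article):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Stand-in for a large XML file on disk.
xml_bytes = BytesIO(
    b"<paragraph><sentence>"
    b"<word number='1'><string number='1'>This</string></word>"
    b"<word number='2'><string number='2'>is</string></word>"
    b"</sentence></paragraph>"
)

words = []
for event, element in ET.iterparse(xml_bytes, events=("end",)):
    if element.tag == "string":
        words.append((element.get("number"), element.text))
    element.clear()  # drop the element's contents so memory stays flat

print(words)  # [('1', 'This'), ('2', 'is')]
```

Clearing after each `end` event means the parser never holds more than a small window of the document in memory, which is what makes this approach viable for multi-gigabyte inputs.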

To make the generated files human-readable, the elements can be pretty-printed before writing, by walking the tree iteratively with a queue and rewriting each element's `.text` and `.tail` to carry the right indentation:

```python
def pretty_print(element, indent='    '):
    queue = [(0, element)]  # (level, element)
    while queue:
        level, element = queue.pop(0)
        children = [(level + 1, child) for child in list(element)]
        if children:
            element.text = '\n' + indent * (level + 1)  # for child open
        if queue:
            element.tail = '\n' + indent * queue[0][0]  # for sibling open
        else:
            element.tail = '\n' + indent * (level - 1)  # for parent close
        queue[0:0] = children  # prepend so children come before siblings
```

This file format lets you store every JSON object as a single line of the file, without any extra whitespace, which eventually let me shrink the initial data set to around 450 MB. Each such paragraph object starts with an opening `<paragraph>` tag.
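The compactness and streamability of JSON Lines can be seen with the standard library alone: each record is one `json.dumps` call with whitespace-free separators, and reading back is a plain line-by-line loop. A minimal sketch (the sample records are illustrative):

```python
import json

words = [{'number': '1', 'string': 'This'}, {'number': '2', 'string': 'is'}]

# Write: one compact JSON document per line, no surrounding array or commas.
lines = [json.dumps(w, separators=(',', ':')) for w in words]
jsonl_text = '\n'.join(lines)
print(jsonl_text)
# {"number":"1","string":"This"}
# {"number":"2","string":"is"}

# Read back: parse each line independently, so huge files can be streamed.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
assert parsed == words
```

Because each line stands alone, a consumer never needs the whole file in memory, which is what makes the format a good fit for large data sets like the one described above.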