Python lxml iterparse function: memory not getting freed while parsing a huge XML
I am parsing big XML files (~500 MB) with the lxml library in Python. I have used BeautifulSoup with the lxml-xml parser for small files, but for huge XML files it is inefficient, because it reads the whole file at once and only then parses it.
I need to parse an XML file to get its root-to-leaf paths (except the outermost tag).
E.g.:
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE a>
    <a>
        <b>
            <c> abc </c>
            <d> abd </d>
        </b>
    </a>
The above XML should give the following keys and values as output (root-to-leaf paths):
a.b.c = abc
a.b.d = abd
Here's the code I've written to parse it:
(ignore1 and ignore2 are the tags that need to be ignored, and tu.clean_text() is a function that removes unnecessary characters.)
    from lxml import etree

    def fast_parser(filename, keys, values, ignore1, ignore2):
        context = etree.iterparse(filename, events=('start', 'end',))
        path = list()    # tags of the currently open elements
        count = 0        # number of events seen so far
        lastevent = ""
        for event, elem in context:
            count += 1
            # strip a "{namespace}" prefix from the tag, if present
            tag = elem.tag if "}" not in elem.tag else elem.tag.split('}', 1)[1]
            if tag == ignore1 or tag == ignore2:
                pass
            elif event == "start":
                path.append(tag)
            elif event == "end":
                # an "end" directly after a "start" means this is a leaf
                if lastevent == "start":
                    keys.append(".".join(path))
                    values.append(tu.clean_text(elem.text))
                # free memory
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]
                if len(path) > 0:
                    path.pop()
            lastevent = event
        del context
        return keys, values
I have referred to the following article on parsing a large file: ibm.com/developerworks/xml/library/x-hiperfparse/#listing4
Here's what the top command shows: memory usage goes beyond 2 GB for a ~500 MB XML file. I suspect that memory is not getting freed.
I have gone through a few Stack Overflow questions, but they didn't help. Please advise.
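One way to check the suspicion without watching top is to log the process's peak resident memory from inside the script. The following is only a minimal sketch: it assumes a Unix-like system (the standard-library resource module is not available on Windows), the helper name peak_rss_mb is made up for illustration, and the list allocation merely stands in for the real parsing work.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
# Stand-in for the real work, e.g. a call to the parsing function.
data = [bytes(1024) for _ in range(50_000)]
after = peak_rss_mb()
print(f"peak RSS: {before:.1f} MB -> {after:.1f} MB")
```

Because this value is a high-water mark, it never goes down. It is useful for seeing whether the peak grows with the size of the input, not for seeing whether memory is handed back to the operating system.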
I took the code from https://stackoverflow.com/a/7171543/131187, chopped out the comments and print statements, and added a suitable func to get this. I wouldn't like to guess how much time it would take to process a 500 MB file!
Even in writing func I have done nothing original, having adopted the original authors' use of the XPath expression 'ancestor-or-self::*' to provide the absolute path that you want.
However, since this code conforms more closely to the original scripts, it might not leak memory.
    import lxml.etree as et

    input_xml = 'temp.xml'
    # echo the input file for reference
    for line in open(input_xml).readlines():
        print(line[:-1])

    def mod_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            func(elem, *args, **kwargs)
            # drop the element's content, then delete already-processed
            # preceding siblings of the element and of each of its ancestors
            elem.clear()
            for ancestor in elem.xpath('ancestor-or-self::*'):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        del context

    def func(elem):
        content = '' if not elem.text else elem.text.strip()
        if content:
            # the tags from the root down to this element form the key
            ancestors = elem.xpath('ancestor-or-self::*')
            print('%s=%s' % ('.'.join([_.tag for _ in ancestors]), content))

    print('\nresult:\n')
    context = et.iterparse(open(input_xml, 'rb'), events=('end', ))
    mod_fast_iter(context, func)
Output:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE a>
    <a>
        <b>
            <c> abc </c>
            <d> abd </d>
        </b>
    </a>

    result:

    a.b.c=abc
    a.b.d=abd
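If evaluating an XPath expression for every element turns out to be slow on a large file, the path can instead be tracked with an explicit stack of open elements. The sketch below is a variant, not the code from either post: it uses the standard library's xml.etree.ElementTree rather than lxml, so getprevious() and getparent() are not available, and it frees memory by detaching each finished element from its parent. The function name iter_leaf_paths and the inline sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_leaf_paths(source):
    """Yield (dotted_path, text) for each element that has non-blank text."""
    stack = []  # currently open elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue
        # "end": elem is fully parsed and is the last item on the stack.
        text = (elem.text or "").strip()
        if text:
            yield ".".join(e.tag for e in stack), text
        stack.pop()
        if stack:
            # Detach the finished element so the partial tree stays small.
            stack[-1].remove(elem)


sample = io.BytesIO(b"<a><b><c> abc </c><d> abd </d></b></a>")
for key, value in iter_leaf_paths(sample):
    print(f"{key}={value}")
```

Keeping the stack costs memory proportional to the nesting depth only, and joining the tags replaces the per-element XPath call. Whether this is actually faster on a 500 MB file would need to be measured.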