Python LXML iterparse function: memory not getting freed while parsing a huge XML

April 15, 2013

I am parsing big XML files (~500 MB) with the lxml library in Python. I have used BeautifulSoup with the lxml-xml parser for small files, but with huge XML files it is inefficient because it reads the whole file at once and then parses it.

I need to parse an XML file to get its root-to-leaf paths (except the outermost tag).
E.g.:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a>
<a>
    <b>
        <c>
            abc
        </c>
        <d>
            abd
        </d>
    </b>
</a>

The above XML should give the following keys and values as output (root-to-leaf paths):

a.b.c = abc
a.b.d = abd

Here's the code I've written to parse it:
(ignore1 and ignore2 are tags that need to be ignored, and tu.clean_text() is a function that removes unnecessary characters.)

from lxml import etree
# tu is the asker's own helper module (not shown here)

def fast_parser(filename, keys, values, ignore1, ignore2):
    context = etree.iterparse(filename, events=('start', 'end',))

    path = list()
    i = 0
    lastevent = ""
    for event, elem in context:
        i += 1
        # strip the "{namespace}" prefix from the tag, if there is one
        tag = elem.tag if "}" not in elem.tag else elem.tag.split('}', 1)[1]

        if tag == ignore1 or tag == ignore2:
            pass
        elif event == "start":
            path.append(tag)
        elif event == "end":
            # an "end" right after a "start" means this element is a leaf
            if lastevent == "start":
                keys.append(".".join(path))
                values.append(tu.clean_text(elem.text))

            # free memory
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
            if len(path) > 0:
                path.pop()
        lastevent = event

    del context
    return keys, values

I have referred to the following article on parsing large files: ibm.com/developerworks/xml/library/x-hiperfparse/#listing4

Here's a screenshot of the top command. Memory usage goes beyond 2 GB for a ~500 MB XML file. I suspect that memory is not getting freed.

I have gone through a few StackOverflow questions, but they didn't help. Please advise.

I took the code from https://stackoverflow.com/a/7171543/131187, chopped out the comments and print statements, and added a suitable func to get this. I wouldn't like to guess how much time it would take to process a 500 MB file!

Even in writing func I have done nothing original, having adopted the original authors' use of the XPath expression 'ancestor-or-self::*' to provide the absolute path that you want.
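To see what that XPath expression returns, here is a minimal sketch on the small example document, parsed in one go with etree.fromstring rather than streamed. The variable names are only illustrative and are not from the original answer.

```python
from lxml import etree

# The sample document from the question, without the DOCTYPE line and whitespace.
xml = b"<a><b><c>abc</c><d>abd</d></b></a>"
root = etree.fromstring(xml)

# Pick the <c> leaf and ask for the element itself plus all of its ancestors.
leaf = root.find("b/c")
ancestors = leaf.xpath("ancestor-or-self::*")

# lxml returns the nodes in document order, so the outermost element comes first.
path = ".".join(node.tag for node in ancestors)
print(path)  # a.b.c
```

Joining the tags with dots gives the key format asked for in the question. For a namespaced document each tag would carry its "{namespace}" prefix, which this sketch does not handle.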

However, since this code conforms more closely to the original scripts, it might not leak memory.

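import lxml.etree as et

input_xml = 'temp.xml'
# echo the input file; line[:-1] drops the last character of each line
# (normally the newline, but a real character if the file has no final newline)
for line in open(input_xml).readlines():
    print(line[:-1])

def mod_fast_iter(context, func, *args, **kwargs):
    for event, elem in context:
        func(elem, *args, **kwargs)
        # free the element, then drop already-processed siblings
        # of the element and of each of its ancestors
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def func(elem):
    content = '' if not elem.text else elem.text.strip()
    if content:
        # build the dotted path from the root down to this element
        ancestors = elem.xpath('ancestor-or-self::*')
        print('%s=%s' % ('.'.join([_.tag for _ in ancestors]), content))

print('\nresult:\n')
context = et.iterparse(open(input_xml, 'rb'), events=('end',))
mod_fast_iter(context, func)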

Output:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE a>
<a>
    <b>
        <c>
            abc
        </c>
        <d>
            abd
        </d>
    </b>
</a

result:

a.b.c=abc
a.b.d=abd
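One way to check that this kind of pruning really happens, without needing a 500 MB file, is to count how many already-processed siblings are still attached each time an element ends. The sketch below is an illustration under assumptions: prune, leaf_paths, and the generated test document are hypothetical names and data, not part of the original answer. A low sibling count only shows that the parsed tree stays small; it says nothing about the total memory use of the process.

```python
import io
from lxml import etree


def prune(elem):
    """Clear elem and drop already-processed siblings of it and its ancestors."""
    elem.clear()
    for ancestor in elem.xpath("ancestor-or-self::*"):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]


def leaf_paths(source):
    """Return (pairs, max_left): the 'path=text' strings, and the largest number
    of preceding siblings found still attached at any 'end' event."""
    pairs, max_left = [], 0
    for _event, elem in etree.iterparse(source, events=("end",)):
        text = (elem.text or "").strip()
        if text:
            tags = [node.tag for node in elem.xpath("ancestor-or-self::*")]
            pairs.append("%s=%s" % (".".join(tags), text))
        # Count the earlier siblings that have not been removed yet.
        left = sum(1 for _ in elem.itersiblings(preceding=True))
        max_left = max(max_left, left)
        prune(elem)
    return pairs, max_left


# Hypothetical test data: 1000 repeated <b> records under a single root.
doc = b"<a>" + b"<b><c>abc</c><d>abd</d></b>" * 1000 + b"</a>"
pairs, max_left = leaf_paths(io.BytesIO(doc))
print(len(pairs), pairs[:2], max_left)
```

Note that leaf_paths collects every result in a list for the sake of the check. For a very large file that list itself grows with the input, so printing or writing out each pair as it is found, as func does above, keeps less in memory.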
