r - Fast XML Parsing & output to CSV (nodes with elements missing and duplicate elements) -
i need convert xml file csv r, xml structure such child nodes (assets) either
- (i) missing elements (as opposed blank)
- (ii) elements duplicated (same element name, different value) or
- (iii) combination of both (i) , (ii)
the xml file in question https://www.sec.gov/archives/edgar/data/1694010/000169401017000025/exh1025710062017.xml rather large (190mb), contains 1 parent node assetdata, on 45k child nodes assets contain relevant elements want collect.
below xsd:
<?xml version="1.0" encoding="utf-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/xmlschema" xmlns="http://www.sec.gov/edgar/document/absee/autoloan/assetdata" xmlns:mstns="http://www.sec.gov/edgar/document/absee/autoloan/assetdata" xmlns:ns1="http://www.sec.gov/edgar/common" xmlns:ns2="http://www.sec.gov/edgar/eis_abs_common" targetnamespace="http://www.sec.gov/edgar/document/absee/autoloan/assetdata" elementformdefault="qualified" attributeformdefault="unqualified"> <xs:import namespace="http://www.sec.gov/edgar/common" schemalocation="eis_common.xsd"/> <xs:import namespace="http://www.sec.gov/edgar/eis_abs_common" schemalocation="eis_abs_common.xsd"/> <xs:simpletype name="intr_calc_typ_code_type"> <xs:annotation> <xs:documentation> item 3(c)(7) - enumerated values , descriptions: 1: simple, 98: other </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="98"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="mod_type_code_type"> <xs:annotation> <xs:documentation> item 3(j)(1) - enumerated values , descriptions: 1: apr, 2: principal, 3: term, 4: extension, 98: other </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="3"/> <xs:enumeration value="4"/> <xs:enumeration value="98"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="oblgr_employ_vrfctn_code_type"> <xs:annotation> <xs:documentation> item 3(e)(4) - enumerated values , descriptions: 1: not stated, not verified, 2: stated, not verified, 3: stated, level 3 verified level 3 verified = direct independent verification third party of obligors current employment. </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="3"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="oblgr_incm_vrfctn_lvl_code_type"> <xs:annotation> <xs:documentation> item 3(e)(3) - enumerated values , descriptions: 1: not stated, not verified, 2: stated, not verified, 3: stated, verified not level 4 or level 5., 4: stated, "level 4" verifiedlevel 4 income verification = previous year w-2 or tax returns, , year-to-date pay stubs, if salaried. if self-employed, obligor provided 2 years of tax returns., 5: stated, "level 5" verifiedlevel 5 income verification = 24 months income verification (w-2s, pay stubs, bank statements and/or tax returns). if self-employed, obligor provided 2 years tax returns plus cpa certification of tax returns. </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="3"/> <xs:enumeration value="4"/> <xs:enumeration value="5"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="orig_intr_rt_typ_code_type"> <xs:annotation> <xs:documentation> item 3(c)(8) - enumerated values , descriptions: 1: fixed, 2: adjustable, 98: other </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="98"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="pymnt_typ_code_type"> <xs:annotation> <xs:documentation> item 3(c)(13) - enumerated values , descriptions: 1: bi-weekly, 2: monthly, 3: quarterly, 4: balloon, 98: other </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="3"/> <xs:enumeration value="4"/> <xs:enumeration value="98"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="reprch_asset_subj_dmand_code_type"> <xs:annotation> <xs:documentation> item 3(h)(1) - enumerated values , descriptions: 0: asset pending repurchase or replacement, 1: asset repurchased or replaced, 2: demand in dispute, 3: demand withdrawn, 4: demand rejected, 98: other </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="0"/> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="3"/> <xs:enumeration value="4"/> <xs:enumeration value="98"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="reprch_rplcmnt_reasn_code_type"> <xs:annotation> <xs:documentation> item 3(h)(5) - enumerated values , descriptions: 1: fraud, 2: payment default, 3: other recourse obligation, 4: reps/warrants breach, 5: servicer breach, 98: other, 99: unknown </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="3"/> <xs:enumeration value="4"/> <xs:enumeration value="5"/> <xs:enumeration value="98"/> <xs:enumeration value="99"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="srvc_adv_meth_code_type"> <xs:annotation> <xs:documentation> item 3(f)(4) - enumerated values , descriptions: 1: no advancing, 2: interest only, 3: principal only, 4: principal , interest, 99: unavailable </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="3"/> <xs:enumeration value="4"/> <xs:enumeration value="99"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="subvnt_code_type"> <xs:annotation> <xs:documentation> item 3(c)(14) - enumerated values , descriptions: 0: no, 1: yes - rate subvention, 2: yes - cash rebate, 98: yes - other </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="0"/> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="98"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="vhcl_new_used_code_type"> <xs:annotation> <xs:documentation> item 3(d)(3) - enumerated values , descriptions: 1: new, 2: used </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="2"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="vhcl_typ_code_type"> <xs:annotation> <xs:documentation> item 3(d)(5) - enumerated values , descriptions: 1: car, 2: truck, 3: suv, 4: motorcycle, 98: other, 99: unknown </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="3"/> <xs:enumeration value="4"/> <xs:enumeration value="98"/> <xs:enumeration value="99"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="vhcl_val_src_code_type"> <xs:annotation> <xs:documentation> item 3(d)(7) - enumerated values , descriptions: 1: invoice price, 2: msrp, 3: kelly blue book, 98: other </xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="3"/> <xs:enumeration value="98"/> </xs:restriction> </xs:simpletype> <xs:simpletype name="zero_bal_code_type"> <xs:annotation> <xs:documentation> item 3(f)(24)(ii) - enumerated values , descriptions: 1: prepaid or matured, 2: third-party sale, 3: repurchased or replaced, 4: charged-off, 5: servicing transfer, 99: unavailable</xs:documentation> </xs:annotation> <xs:restriction base="xs:token"> <xs:enumeration value="1"/> <xs:enumeration value="2"/> <xs:enumeration value="3"/> <xs:enumeration value="4"/> <xs:enumeration value="5"/> <xs:enumeration value="99"/> </xs:restriction> </xs:simpletype> <xs:complextype name="asset_type"> <xs:annotation> <xs:documentation>this main repeatable group element each major unit under assetdata.</xs:documentation> </xs:annotation> <xs:sequence> <!-- item 3(a)(1) - identify source of asset number used identify each asset in pool. --> <xs:element name="assettypenumber" type="ns2:string_100_type" minoccurs="1"/> <!-- item 3(a)(2) - provide unique id number of asset. --> <xs:element name="assetnumber" type="ns1:string_25_type" minoccurs="1"/> <!-- item 3(b)(1) - specify beginning date of reporting period. --> <xs:element name="reportingperiodbeginningdate" type="ns1:date_type" minoccurs="0"/> <!-- item 3(b)(2) - specify ending date of reporting period. --> <xs:element name="reportingperiodendingdate" type="ns1:date_type" minoccurs="0"/> <!-- item 3(c)(1) - identify name of entity originated loan. --> <xs:element name="originatorname" type="ns1:string_50_type" minoccurs="0"/> <!-- item 3(c)(2) - provide date loan originated. --> <xs:element name="originationdate" type="ns1:date_mm_yyyy_type" minoccurs="0"/> <!-- item 3(c)(3) - indicate amount of loan @ time loan originated. --> <xs:element name="originalloanamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(c)(4) - indicate term of loan in months @ time loan originated. --> <xs:element name="originalloanterm" type="ns1:integer_type_8_a" minoccurs="0"/> <!-- item 3(c)(5) - indicate month , year in final payment on loan scheduled made. --> <xs:element name="loanmaturitydate" type="ns1:date_mm_yyyy_type" minoccurs="0"/> <!-- item 3(c)(6) - provide rate of interest @ time loan originated. --> <xs:element name="originalinterestratepercentage" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(c)(7) - indicate whether interest rate calculation method simple or other. --> <xs:element name="interestcalculationtypecode" type="intr_calc_typ_code_type" minoccurs="0"/> <!-- item 3(c)(8) - indicate whether interest rate on loan fixed, adjustable or other. --> <xs:element name="originalinterestratetypecode" type="orig_intr_rt_typ_code_type" minoccurs="0"/> <!-- item 3(c)(9) - indicate number of months origination in obligor permitted pay interest on loan beginning when loan originated. --> <xs:element name="originalinterestonlytermnumber" type="ns1:integer_type_8_a" minoccurs="0"/> <!-- item 3(c)(10) - provide date of first scheduled payment due after loan originated. --> <xs:element name="originalfirstpaymentdate" type="ns1:date_mm_yyyy_type" minoccurs="0"/> <!-- item 3(c)(11) - indicate whether loan met criteria first level of solicitation, credit-granting or underwriting criteria used originate loan. --> <xs:element name="underwritingindicator" type="ns1:true_false_type" minoccurs="0"/> <!-- item 3(c)(12) - indicate number of months during time interest accrues no payments due obligor. --> <xs:element name="graceperiodnumber" type="ns1:integer_type_8_a" minoccurs="0"/> <!-- item 3(c)(13) - specify code indicating how payments required or if balloon payment due. --> <xs:element name="paymenttypecode" type="pymnt_typ_code_type" minoccurs="0"/> <!-- item 3(c)(14) - indicate yes or no whether form of subsidy received on loan, such cash incentives or favorable financing buyer. --> <xs:element name="subvented" type="subvnt_code_type" minoccurs="0" maxoccurs="unbounded"/> <!-- item 3(d)(1) - provide name of manufacturer of vehicle. --> <xs:element name="vehiclemanufacturername" type="ns1:string_30_type" minoccurs="0"/> <!-- item 3(d)(2) - provide name of model of vehicle. --> <xs:element name="vehiclemodelname" type="ns1:string_30_type" minoccurs="0"/> <!-- item 3(d)(3) - indicate whether vehicle financed new or used @ time of origination. --> <xs:element name="vehiclenewusedcode" type="vhcl_new_used_code_type" minoccurs="0"/> <!-- item 3(d)(4) - indicate model year of vehicle. --> <xs:element name="vehiclemodelyear" type="ns2:string_4_type" minoccurs="0"/> <!-- item 3(d)(5) - indicate code describing vehicle type. --> <xs:element name="vehicletypecode" type="vhcl_typ_code_type" minoccurs="0"/> <!-- item 3(d)(6) - indicate value of vehicle @ time of origination. --> <xs:element name="vehiclevalueamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(d)(7) - specify code describes source of vehicle value. --> <xs:element name="vehiclevaluesourcecode" type="vhcl_val_src_code_type" minoccurs="0"/> <!-- item 3(e)(1) - specify type of standardized credit score used evaluate obligor during loan origination process. --> <xs:element name="obligorcreditscoretype" type="ns2:string_35_type" minoccurs="0"/> <!-- item 3(e)(2) - provide standardized credit score of obligor used evaluate obligor during loan origination process. --> <xs:element name="obligorcreditscore" type="ns2:string_20_type" minoccurs="0"/> <!-- item 3(e)(3) - indicate code describing extent obligor's income verified during loan origination process. --> <xs:element name="obligorincomeverificationlevelcode" type="oblgr_incm_vrfctn_lvl_code_type" minoccurs="0"/> <!-- item 3(e)(4) - indicate code describing extent obligor's employment verified during loan origination process. --> <xs:element name="obligoremploymentverificationcode" type="oblgr_employ_vrfctn_code_type" minoccurs="0"/> <!-- item 3(e)(5) - indicate whether loan has co-obligor. --> <xs:element name="coobligorindicator" type="ns1:true_false_type" minoccurs="0"/> <!-- item 3(e)(6) - provide scheduled monthly payment amount percentage of total monthly income of obligor , other obligor @ origination date. provide methodology determining monthly income in prospectus. --> <xs:element name="paymenttoincomepercentage" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(e)(7) - specify location of obligor providing current u.s. state or territory. --> <xs:element name="obligorgeographiclocation" type="ns2:string_100_type" minoccurs="0"/> <!-- item 3(f)(1) - indicate yes or no whether asset added during reporting period. instruction: response data point required when assets added asset pool after final prospectus under securities act rule 424 (section 230.424 of chapter) filed --> <xs:element name="assetaddedindicator" type="ns1:true_false_type" minoccurs="0"/> <!-- item 3(f)(2) - indicate number of months end of reporting period loan maturity date. --> <xs:element name="remainingtermtomaturitynumber" type="ns1:integer_type_8_a" minoccurs="0"/> <!-- item 3(f)(3) - indicates yes or no whether loan modified original terms during reporting period. --> <xs:element name="reportingperiodmodificationindicator" type="ns1:true_false_type" minoccurs="0"/> <!-- item 3(f)(4) - specify code indicates servicer's responsibility advancing principal or interest on delinquent loans. --> <xs:element name="servicingadvancemethodcode" type="srvc_adv_meth_code_type" minoccurs="0"/> <!-- item 3(f)(5) - indicate outstanding principal balance of loan of beginning of reporting period. --> <xs:element name="reportingperiodbeginningloanbalanceamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(6) - indicate total payment due collected in next reporting period. --> <xs:element name="nextreportingperiodpaymentamountdue" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(7) - indicate interest rate loan in effect during reporting period. --> <xs:element name="reportingperiodinterestratepercentage" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(8) - loans have not been paid-off, indicate interest rate in effect next reporting period. --> <xs:element name="nextinterestratepercentage" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(9) - if servicing fee based on percentage, provide percentage used calculate aggregate servicing fee. --> <xs:element name="servicingfeepercentage" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(10) - if servicing fee based on flat-fee amount, indicate monthly servicing fee paid servicers. --> <xs:element name="servicingflatfeeamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(11) - provide amount of other fees earned loan administrators reduce amount of funds remitted issuing entity (including subservicing, master servicing, trustee fees, etc.). --> <xs:element name="otherservicerfeeretainedbyservicer" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(12) - provide cumulative amount of late charges , other fees have been assessed servicer, not paid obligor. --> <xs:element name="otherassesseduncollectedservicerfeeamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(13) - indicate interest payment amount scheduled collected during reporting period. --> <xs:element name="scheduledinterestamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(14) - indicate principal payment amount scheduled collected during reporting period. --> <xs:element name="scheduledprincipalamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(15) - indicate other amounts caused principal balance of loan decreased or increased during reporting period. --> <xs:element name="otherprincipaladjustmentamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(16) - indicate actual balance of loan of end of reporting period. --> <xs:element name="reportingperiodactualendbalanceamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(17) - indicate total payment amount scheduled collected during reporting period (including fees). --> <xs:element name="reportingperiodscheduledpaymentamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(18) - indicate total payment paid servicer during reporting period. --> <xs:element name="totalactualamountpaid" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(19) - indicate gross amount of interest collected during reporting period, whether or not obligor. --> <xs:element name="actualinterestcollectedamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(20) - indicate amount of principal collected during reporting period, whether or not obligor. --> <xs:element name="actualprincipalcollectedamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(21) - indicate total of amounts, other principal , interest, collected during reporting period, whether or not borrower. --> <xs:element name="actualothercollectedamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(22) - if amounts advanced servicer during reporting period, specify amount. --> <xs:element name="serviceradvancedamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(f)(23) - provide date through interest paid payment received during reporting period, effective date interest calculated application of next payment. --> <xs:element name="interestpaidthroughdate" type="ns1:date_type" minoccurs="0"/> <!-- item 3(f)(24)(i) - provide date on loan balance reduced zero. --> <xs:element name="zerobalanceeffectivedate" type="ns1:date_mm_yyyy_type" minoccurs="0"/> <!-- item 3(f)(24)(ii) - provide code indicates reason loan's balance reduced zero. --> <xs:element name="zerobalancecode" type="zero_bal_code_type" minoccurs="0" maxoccurs="unbounded"/> <!-- item 3(f)(25) - indicate number of days obligor delinquent past obligor's payment due date, determined governing transaction agreement. --> <xs:element name="currentdelinquencystatus" type="ns1:integer_type_8_a" minoccurs="0"/> <!-- item 3(g)(1) - provide name of entity services or have right service loan. --> <xs:element name="primaryloanservicername" type="ns2:string_100_type" minoccurs="0"/> <!-- item 3(g)(2) - if loan's servicing has been transferred, provide effective date of recent servicing transfer. --> <xs:element name="mostrecentservicingtransferreceiveddate" type="ns1:date_mm_yyyy_type" minoccurs="0"/> <!-- item 3(h) - indicate yes or no whether during reporting period loan subject of demand repurchase or replace breach of representations , warranties, including investor demands upon trustee. --> <xs:element name="assetsubjectdemandindicator" type="ns1:true_false_type" minoccurs="0"/> <!-- item 3(h)(1) - indicate code describes status of repurchase or replacement demand of end of reporting period. --> <xs:element name="assetsubjectdemandstatuscode" type="reprch_asset_subj_dmand_code_type" minoccurs="0"/> <!-- item 3(h)(2) - provide amount paid repurchase loan. --> <xs:element name="repurchaseamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(h)(3) - indicate date loan repurchase or replacement demand resolved. --> <xs:element name="demandresolutiondate" type="ns1:date_type" minoccurs="0"/> <!-- item 3(h)(4) - specify name of repurchaser. --> <xs:element name="repurchasername" type="ns1:string_30_type" minoccurs="0"/> <!-- item 3(h)(5) - indicate code describes reason repurchase or replacement. --> <xs:element name="repurchasereplacementreasoncode" type="reprch_rplcmnt_reasn_code_type" minoccurs="0" maxoccurs="unbounded"/> <!-- item 3(i)(1) - specify amount of uncollected principal charged-off. --> <xs:element name="chargedoffprincipalamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(i)(2) - if loan charged-off, specify amounts received after charge-off. --> <xs:element name="recoveredamount" type="ns1:decimal_type20_8" minoccurs="0"/> <!-- item 3(j)(1) - indicate code describes reason loan modified during reporting period. --> <xs:element name="modificationtypecode" type="mod_type_code_type" minoccurs="0" maxoccurs="unbounded"/> <!-- item 3(j)(2) - provide number of months loan extended during reporting period. --> <xs:element name="paymentextendednumber" type="ns1:integer_type_8_a" minoccurs="0"/> <!-- item 3(k) - indicate yes or no whether vehicle has been repossessed. if vehicle has been repossessed, provide information required in item 3(k)(1). --> <xs:element name="repossessedindicator" type="ns1:true_false_type" minoccurs="0"/> <!-- item 3(k)(1) - provide total amount of proceeds received on disposition (net of repossession fees , expenses). --> <xs:element name="repossessedproceedsamount" type="ns1:decimal_type20_8" minoccurs="0"/> </xs:sequence> </xs:complextype> <xs:element name="assetdata"> <xs:annotation> <xs:documentation>this root element abs asset data exhibit.</xs:documentation> </xs:annotation> <xs:complextype> <xs:sequence> <xs:element name="assets" type="asset_type" minoccurs="0" maxoccurs="unbounded"> <xs:unique name="uniquesubvented"> <xs:selector xpath="mstns:subvented"/> <xs:field xpath="."/> </xs:unique> <xs:unique name="uniquezerobalancecode"> <xs:selector xpath="mstns:zerobalancecode"/> <xs:field xpath="."/> </xs:unique> <xs:unique name="uniquerepurchasereplacementreasoncode"> <xs:selector xpath="mstns:repurchasereplacementreasoncode"/> <xs:field xpath="."/> </xs:unique> <xs:unique name="uniquemodificationtypecode"> <xs:selector xpath="mstns:modificationtypecode"/> <xs:field xpath="."/> </xs:unique> </xs:element> </xs:sequence> </xs:complextype> </xs:element> </xs:schema>
i've tried using https://stackoverflow.com/a/22901398 , although method seems efficient, not deal duplicate elements of same name within assets nodes. however, deal elements aren't present in each assets node.
ideally, i'd concatenate duplicate elements (e.g. element name="subvented") within same value given delimiter between each value (i.e. 1 , 98 merged 1,98). output csv columns each representing element , 1 assets node per row. given know possible elements can use build static structure (as opposed dynamically in linked code above). how can merge duplicate elements , still keep code efficient?
using code https://stackoverflow.com/a/22901398 , adding few lines to:
- specify elements extracted each node (this influences data.table structure , csv)
- specify elements may found more once in given node
merge duplicate elements within 1 node , delete other duplicates
xmltocsv_auto = function(datafile2, output) { require(xml) require(xlsx) require(rcurl) require(data.table) xmltest <- xmlparse(datafile2) d = xmlroot(xmltest) size = xmlsize(d) ##elements extract each node namess <- c("assettypenumber", "assetnumber", "reportingperiodbeginningdate", "reportingperiodendingdate", "originatorname", "originationdate", "originalloanamount", "originalloanterm", "loanmaturitydate", "originalinterestratepercentage", "interestcalculationtypecode", "originalinterestratetypecode", "originalinterestonlytermnumber", "originalfirstpaymentdate", "underwritingindicator", "graceperiodnumber", "paymenttypecode", "subvented", "vehiclemanufacturername", "vehiclemodelname", "vehiclenewusedcode", "vehiclemodelyear", "vehicletypecode", "vehiclevalueamount", "vehiclevaluesourcecode", "obligorcreditscoretype", "obligorcreditscore", "obligorincomeverificationlevelcode", "obligoremploymentverificationcode", "coobligorindicator", "paymenttoincomepercentage", "obligorgeographiclocation", "assetaddedindicator", "remainingtermtomaturitynumber", "reportingperiodmodificationindicator", "servicingadvancemethodcode", "reportingperiodbeginningloanbalanceamount", "nextreportingperiodpaymentamountdue", "reportingperiodinterestratepercentage", "nextinterestratepercentage", "servicingfeepercentage", "servicingflatfeeamount", "otherservicerfeeretainedbyservicer", "otherassesseduncollectedservicerfeeamount", "scheduledinterestamount", "scheduledprincipalamount", "otherprincipaladjustmentamount", "reportingperiodactualendbalanceamount", "reportingperiodscheduledpaymentamount", "totalactualamountpaid", "actualinterestcollectedamount", "actualprincipalcollectedamount", "actualothercollectedamount", "serviceradvancedamount", "interestpaidthroughdate", "zerobalanceeffectivedate", "zerobalancecode", "currentdelinquencystatus", "primaryloanservicername", "mostrecentservicingtransferreceiveddate", "assetsubjectdemandindicator", "assetsubjectdemandstatuscode", "repurchaseamount", "demandresolutiondate", "repurchasername", "repurchasereplacementreasoncode", "chargedoffprincipalamount", "recoveredamount", "modificationtypecode", "paymentextendednumber", "repossessedindicator", "repossessedproceedsamount") ##potential duplicates dupes <- c("repurchasereplacementreasoncode", "modificationtypecode", "subvented", "zerobalancecode") m = data.table(matrix(na,nc=length(namess), nr=size)) setnames(m, namess) (n in namess) mode(m[[n]]) = "character" for(i in 1:size){ v = getchildrenstrings(d[[i]]) w = names(v) for(j in dupes){ if (length(which(w %in% j))>1){ ##if duplicates found dupeslist <- (which(w %in% j)) ##get duplicate location changer <- c(xmlvalue(d[[i]][[dupeslist[1]]])) ##buffer value of first occurence for(k in dupeslist[-(1)]){ changer <- c(changer, xmlvalue(d[[i]][[k]])) ##add values of other duplicates buffer } for(k in rev(dupeslist[-(1)])){ removenodes(d[[i]][[k]]) ##remove duplicate nodes starting end of list } changer <- paste(changer, collapse = "-") xmlvalue(d[[i]][[dupeslist[1]]]) <- changer } } v = getchildrenstrings(d[[i]]) m[i, (names(v)):= as.list(v)] } (n in namess) m[, (n):= type.convert(m[[n]], as.is=true)] fwrite(m, output) free(xmltest) gc()
}
processing time approximately 10 minutes given xml file.
Comments
Post a Comment