python - PySpark: help filtering out rows which contain unwanted characters
Writing a parquet file gives me an error stating that the characters " ,;{}()\n\t=" are not allowed. I'd like to eliminate any rows that contain any of these characters anywhere. Should I use "like", "rlike", or something else?
I have tried this:
df = df.filter(df.account_number.rlike('*\n*', '*\ *','*,*','*;*','*{*','*}*','*)*','*(*','*\t*') == false)
This obviously doesn't work. I'm unsure what the right regex syntax is, or whether I even need a regex in this particular case.
You can use rlike, since it accepts regular expressions:
df.filter(~df.account_number.rlike("[ ,;{}()\n\t=]"))
When you put characters between [], the class matches any one of the listed characters. Note that rlike takes a single regex string, not a list of patterns, and you negate the condition with ~ rather than comparing to false.
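To see what that character class matches, the same pattern can be checked with Python's re module outside of Spark. The sample account numbers below are made up for illustration:

```python
import re

# The same character class used in the rlike() filter above:
# matches a space, comma, semicolon, brace, paren, newline, tab, or equals sign.
bad_chars = re.compile(r"[ ,;{}()\n\t=]")

samples = ["ACC12345", "ACC 123", "ACC,123", "ACC=123", "ACC{123}"]

# Keep only values with no forbidden character anywhere,
# mirroring ~df.account_number.rlike("[ ,;{}()\n\t=]")
kept = [s for s in samples if not bad_chars.search(s)]
print(kept)  # ['ACC12345']
```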
That said, I don't see why these characters wouldn't be allowed in DataFrame rows; the invalid character is more likely in a column name instead. You can use .withColumnRenamed() to rename it.