python - PySpark error handling with file name having spaces
I am using PySpark 2.1.
Problem statement: I need to validate an HDFS path and, if the file exists, copy the file name into a variable.
Below is the code I have used so far, after referring to a few websites and Stack Overflow:
import os
import subprocess
import sys
import pandas as pd
from datetime import datetime

def run_cmd(args_list):
    # Run a system command and return its exit code.
    print('running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    proc.communicate()
    return proc.returncode

today = datetime.now().date().strftime('%d%b%Y')
source_dir = '/user/dev/input/' + today
hdfs_file_path = source_dir + '\'student marks details.csv\''
cmd = ['hdfs', 'dfs', '-find', '{}', '-name', hdfs_file_path]
code = run_cmd(cmd)
if code != 1:
    print('file doesnot exist')
    sys.exit(1)
else:
    print('file exist')
With the above code I am getting the error "file doesn't exist", even though the file is present in the folder.
The problem is that I am able to run the command below in the shell console, and there I get the complete path:
hdfs dfs -find () -name /user/dev/input/08aug2017/'student marks details.csv'
When I tried the same thing in PySpark with the code detailed above, it did not work because there is a space in the file name. Please help me resolve this issue.
The problem
Your problem is on this line:
hdfs_file_path = source_dir + '\'student marks details.csv\''
You are adding two unneeded single quotes and forgetting to add a directory separator.
The reason the path works in this command:
hdfs dfs -find () -name /user/dev/input/08aug2017/'student marks details.csv'
is that it is a shell command. In the shell you are using (presumably bash), the following commands are equivalent:
echo '/user/dev/input/08aug2017/student marks details.csv'
echo /user/dev/input/08aug2017/'student marks details.csv'
Bash removes the quotes and merges the strings together, yielding the same string as a result: /user/dev/input/08aug2017/student marks details.csv. The quotes are not part of the path. They are only a way to tell bash not to split the string at the spaces, but to create a single string and then remove the quotes.
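To see this quote removal without opening a shell, here is a minimal sketch using Python's standard shlex module, which splits a command line following POSIX shell quoting rules. It only approximates what bash does (no variable expansion or globbing), and the variable names are my own, chosen for illustration.

```python
import shlex

# The two command lines from above, exactly as they would be typed in a shell.
quoted_whole = "echo '/user/dev/input/08aug2017/student marks details.csv'"
quoted_part = "echo /user/dev/input/08aug2017/'student marks details.csv'"

# shlex.split performs shell-style word splitting and removes the quotes.
args_whole = shlex.split(quoted_whole)
args_part = shlex.split(quoted_part)

print(args_whole)
print(args_part)
print(args_whole == args_part)
```

Both command lines split into the same two arguments: the command name and a single path string that contains spaces but no quote characters.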
When you write:
hdfs_file_path = source_dir + '\'student marks details.csv\''
the path you end up with is /user/dev/input/08aug2017'student marks details.csv' instead of the correct /user/dev/input/08aug2017/student marks details.csv.
The subprocess call requires plain strings that correspond exactly to the values you want. It does not process them the same way the shell does.
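As a rough illustration of that point, the sketch below passes a quoted and an unquoted file name to a child process through subprocess with a list of arguments. The child here is just the Python interpreter echoing back its first argument. It stands in for the hdfs command purely for demonstration, and the variable names are hypothetical.

```python
import subprocess
import sys

# A child process that prints the first argument it receives, unchanged.
child = [sys.executable, '-c', 'import sys; print(sys.argv[1])']

# With a list of arguments no shell is involved, so quotes are passed through literally.
received_with_quotes = subprocess.check_output(
    child + ["'student marks details.csv'"]).decode().strip()
received_without_quotes = subprocess.check_output(
    child + ["student marks details.csv"]).decode().strip()

print(received_with_quotes)
print(received_without_quotes)
```

The first call delivers the quote characters as part of the argument, which is why the path in the question never matches a real file. The second call delivers the file name with its spaces intact as one argument, with no quoting needed.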
Solution
In Python, joining paths is best done by calling os.path.join. I suggest you replace these lines:
source_dir = '/user/dev/input/' + today
hdfs_file_path = source_dir + '\'student marks details.csv\''
with the following:
source_dir = os.path.join('/user/dev/input/', today)
hdfs_file_path = os.path.join(source_dir, 'student marks details.csv')
os.path.join takes care to add a single directory separator (/ on Unix, \ on Windows) between its arguments, so you can't accidentally either forget the separator or add it twice.
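Putting the pieces together, here is a small sketch of how the corrected path could be built and checked. Two things in it are my own assumptions rather than part of the question: it uses posixpath.join, the always-forward-slash variant of os.path.join, since HDFS paths use / regardless of the local operating system, and it swaps the -find command for hdfs dfs -test -e, which exits with code 0 when the path exists. The helper name hdfs_path_exists is hypothetical, and the date is hard-coded for the example.

```python
import posixpath
import subprocess

def hdfs_path_exists(path):
    # Hypothetical helper: 'hdfs dfs -test -e' exits with 0 if the path exists.
    # The path is passed as a plain string, spaces included, with no extra quotes.
    return subprocess.call(['hdfs', 'dfs', '-test', '-e', path]) == 0

today = '08aug2017'  # hard-coded here; the question derives it from the current date
source_dir = posixpath.join('/user/dev/input/', today)
hdfs_file_path = posixpath.join(source_dir, 'student marks details.csv')

print(source_dir)
print(hdfs_file_path)
```

On a machine with the Hadoop client installed, calling hdfs_path_exists(hdfs_file_path) should then tell you whether the file is there. The sketch itself does not call it, so it can be run anywhere to inspect the resulting path.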