python - Using multiple manifest files to load to Redshift from S3?


I have a large manifest file containing 460,000 entries (all S3 files) that I wish to load into Redshift. Due to issues beyond my control, a few (maybe a dozen or more) of these entries contain bad JSON that causes the COPY command to fail if I pass in the entire manifest at once. Using COPY with a key prefix fails in the same way.

To get around this, I have written a Python script that goes through the manifest file one URL at a time and issues a COPY command for each one using psycopg2. The script additionally catches and logs errors, so it keeps running when it comes across a bad file and lets me locate and fix the bad files afterwards.
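
To illustrate, here is a minimal sketch of that one-file-at-a-time approach. The cluster address, credentials, table name, IAM role, and manifest path are placeholders, not values from the question.

    import json
    import logging

    import psycopg2

    logging.basicConfig(filename="copy_errors.log", level=logging.ERROR)

    # Placeholder connection details for an example cluster.
    conn = psycopg2.connect(
        host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="mydb",
        user="myuser",
        password="mypassword",
    )
    conn.autocommit = True

    # The manifest is assumed to use the standard Redshift layout:
    # {"entries": [{"url": "s3://bucket/key", "mandatory": true}, ...]}
    with open("manifest.json") as f:
        entries = json.load(f)["entries"]

    with conn.cursor() as cur:
        for entry in entries:
            url = entry["url"]
            try:
                # One COPY per file; psycopg2 renders the URL as a quoted literal.
                cur.execute(
                    "COPY my_table FROM %s "
                    "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
                    "FORMAT AS JSON 'auto';",
                    (url,),
                )
            except psycopg2.Error as exc:
                # Log the bad file and keep going so the load is not interrupted.
                logging.error("COPY failed for %s: %s", url, exc)

    conn.close()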

The script has been running for a little more than a week on a spare EC2 instance and is around 75% complete. I'd like to lower the run time, because this script will be used again.

My understanding of Redshift is that COPY commands are executed in parallel, which gave me an idea: would splitting the manifest file into smaller chunks and running the script against each of the chunks reduce the time it takes to load the files?

The COPY command can load multiple files in parallel very quickly and efficiently. When you run one COPY command per file from your Python script, that is going to take a lot of time, because you are not taking advantage of parallel loading.
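
For contrast, a single COPY pointed at a manifest loads every listed file in one command, and Redshift fetches the files in parallel across its slices. A minimal sketch, using the same placeholder cluster, table, and role names as above:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="mydb",
        user="myuser",
        password="mypassword",
    )
    conn.autocommit = True

    with conn.cursor() as cur:
        # One COPY, many files: the MANIFEST keyword tells Redshift to treat
        # the S3 object as a list of files to load in parallel.
        cur.execute(
            "COPY my_table "
            "FROM 's3://my-bucket/manifest.json' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
            "FORMAT AS JSON 'auto' "
            "MANIFEST;"
        )

    conn.close()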

So maybe you can write a script that finds the bad JSON files in the manifest, kicks them out, and then run a single COPY with the new, clean manifest?
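
One way to do that, sketched below, is to download each object, check that it parses as JSON, and write a cleaned manifest for a single COPY to load. The file names and the one-JSON-record-per-line layout are assumptions; with 460,000 objects this validation pass is itself slow on one thread, so it is worth running it with a thread pool or only over the files you have not already loaded.

    import json
    from urllib.parse import urlparse

    import boto3

    s3 = boto3.client("s3")

    def has_valid_json(s3_url):
        """Return True if every non-empty line of the object parses as JSON."""
        parsed = urlparse(s3_url)
        bucket, key = parsed.netloc, parsed.path.lstrip("/")
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        try:
            for line in body.splitlines():
                if line.strip():
                    json.loads(line)
            return True
        except ValueError:
            return False

    with open("manifest.json") as f:
        manifest = json.load(f)

    good, bad = [], []
    for entry in manifest["entries"]:
        (good if has_valid_json(entry["url"]) else bad).append(entry)

    # Write a cleaned manifest that a single COPY ... MANIFEST can load.
    with open("manifest_clean.json", "w") as f:
        json.dump({"entries": good}, f)

    print(f"kept {len(good)} entries, dropped {len(bad)} bad files")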

Or, as you suggested, I recommend splitting the manifest file into small chunks, so that each COPY can load multiple files at a time (rather than a single COPY command per file).
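
A minimal sketch of that chunking approach, assuming the chunk manifests are uploaded to a placeholder bucket and loaded with the same placeholder table and role as above:

    import json

    import boto3
    import psycopg2

    CHUNK_SIZE = 10000                       # files per chunk manifest
    MANIFEST_BUCKET = "my-manifest-bucket"   # placeholder bucket
    MANIFEST_PREFIX = "manifests/chunk"      # placeholder key prefix

    s3 = boto3.client("s3")

    with open("manifest.json") as f:
        entries = json.load(f)["entries"]

    # Write each chunk back to S3 as its own manifest file.
    chunk_urls = []
    for i in range(0, len(entries), CHUNK_SIZE):
        key = f"{MANIFEST_PREFIX}_{i // CHUNK_SIZE:05d}.json"
        s3.put_object(
            Bucket=MANIFEST_BUCKET,
            Key=key,
            Body=json.dumps({"entries": entries[i:i + CHUNK_SIZE]}).encode("utf-8"),
        )
        chunk_urls.append(f"s3://{MANIFEST_BUCKET}/{key}")

    conn = psycopg2.connect(
        host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="mydb",
        user="myuser",
        password="mypassword",
    )
    conn.autocommit = True

    with conn.cursor() as cur:
        for url in chunk_urls:
            try:
                cur.execute(
                    f"COPY my_table FROM '{url}' "
                    "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
                    "FORMAT AS JSON 'auto' MANIFEST;"
                )
            except psycopg2.Error as exc:
                # A failed chunk narrows the bad JSON down to CHUNK_SIZE files.
                print(f"COPY failed for {url}: {exc}")

    conn.close()

Each COPY still loads thousands of files in parallel, and a failure only costs you one chunk rather than the whole manifest, which makes the bad files much easier to isolate.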

