amazon s3 - Adapting to the disappearance of the IMDb datasets
So the freely available IMDb datasets will disappear at the end of 2017.
From what I understand, you must:
- identify yourself (register a personal account for access)
- pay money (once the free quota is used up, though the actual price may be minuscule)
- write code (though it looks like you're just downloading .gz files, so it should be simple)
Some questions arise from this:
- What is the data format like? There's a brief example on the page, but is there an actual file showing how titles, years, votes, etc. are formatted and linked?
- What are the options if I don't want to go along with this regime? Are there freely available copies of the datasets somewhere? What other freely available film databases exist that at least cover movies and TV series of minimal interest released from 2017 onward?
Regarding the paywall
The new files amount to 360 megabytes of data which, as far as I understand S3 pricing, is inside the free cap unless you download them many times a month.
What is the data format like?
They seem to be dumps of database tables.
As an example, here is the beginning of title.basics.tsv.gz:
tconst     titletype  primarytitle            originaltitle           isadult  startyear  endyear  runtimeminutes  genres
tt0000001  short      carmencita              carmencita              0        1894       \n       1               documentary,short
tt0000002  short      le clown et ses chiens  le clown et ses chiens  0        1892       \n       5               animation,short
tt0000003  short      pauvre pierrot          pauvre pierrot          0        1892       \n       4               animation,comedy,romance
tt0000004  short      un bon bock             un bon bock             0        1892       \n       \n              animation,short
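A minimal sketch of reading such a file in Python, using only the standard library. It assumes the columns are tab-separated (as the .tsv extension suggests) and that the `\n` entries in the sample mark missing values; my recollection is that the files themselves spell this `\N`, so the sketch accepts both. Header names are lowercased so they match the names as written in this answer. The function names are made up for illustration.

```python
import csv
import gzip
import io

# Placeholders treated as "no value": a literal backslash followed by N (or n).
MISSING = {r"\N", r"\n"}


def read_imdb_tsv(fileobj):
    """Yield one dict per data row from a text-mode, tab-separated file."""
    # QUOTE_NONE: treat quote characters inside titles as ordinary text.
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lower() for name in next(reader)]
    for row in reader:
        yield {k: (None if v in MISSING else v) for k, v in zip(header, row)}


def read_imdb_gz(path_or_fileobj):
    """Same as read_imdb_tsv, but decompresses a .gz file on the fly."""
    with gzip.open(path_or_fileobj, "rt", encoding="utf-8", newline="") as f:
        yield from read_imdb_tsv(f)


# Demo on two rows taken from the sample above, compressed in memory
# so that no download is needed to try the code.
SAMPLE_ROWS = [
    ["tconst", "titletype", "primarytitle", "originaltitle", "isadult",
     "startyear", "endyear", "runtimeminutes", "genres"],
    ["tt0000001", "short", "carmencita", "carmencita", "0",
     "1894", r"\N", "1", "documentary,short"],
    ["tt0000004", "short", "un bon bock", "un bon bock", "0",
     "1892", r"\N", r"\N", "animation,short"],
]
text = "\n".join("\t".join(r) for r in SAMPLE_ROWS) + "\n"
blob = gzip.compress(text.encode("utf-8"))

rows = list(read_imdb_gz(io.BytesIO(blob)))
for row in rows:
    print(row["tconst"], row["primarytitle"], row["startyear"], row["runtimeminutes"])
```

For a downloaded file, `read_imdb_gz("title.basics.tsv.gz")` should work the same way, since `gzip.open` accepts either a path or a file object. Because the function is a generator, the whole file never has to fit in memory.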
The available files are: title.basics.tsv.gz, title.crew.tsv.gz, title.episode.tsv.gz, title.principals.tsv.gz, title.ratings.tsv.gz, and name.basics.tsv.gz.
In terms of contained data, these are the fields in each file:
name.basics.tsv.gz       nconst primaryname birthyear deathyear primaryprofession knownfortitles
title.basics.tsv.gz      tconst titletype primarytitle originaltitle isadult startyear endyear runtimeminutes genres
title.crew.tsv.gz        tconst directors writers
title.episode.tsv.gz     tconst parenttconst seasonnumber episodenumber
title.principals.tsv.gz  tconst principalcast
title.ratings.tsv.gz     tconst averagerating numvotes
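Since every title.* file carries a tconst column, the files can be joined on it like database tables. Below is a rough sketch that attaches ratings to titles. It works on rows already parsed into dicts with lowercase keys; the function name `join_ratings` is hypothetical, and the rating and vote numbers in the demo are invented purely for illustration, they are not real IMDb values.

```python
def join_ratings(basics_rows, ratings_rows, min_votes=0):
    """Join title.basics rows with title.ratings rows on tconst.

    Titles without a ratings row, or with fewer than min_votes votes,
    are skipped. The result is sorted by rating, best first.
    """
    # Index the ratings by tconst for constant-time lookups.
    ratings = {
        r["tconst"]: (float(r["averagerating"]), int(r["numvotes"]))
        for r in ratings_rows
    }
    joined = []
    for b in basics_rows:
        hit = ratings.get(b["tconst"])
        if hit is None or hit[1] < min_votes:
            continue
        joined.append({
            "tconst": b["tconst"],
            "primarytitle": b["primarytitle"],
            "startyear": b["startyear"],
            "averagerating": hit[0],
            "numvotes": hit[1],
        })
    joined.sort(key=lambda r: r["averagerating"], reverse=True)
    return joined


# Titles as listed in title.basics; rating values below are made up.
basics = [
    {"tconst": "tt0000001", "primarytitle": "carmencita", "startyear": "1894"},
    {"tconst": "tt0000002", "primarytitle": "le clown et ses chiens", "startyear": "1892"},
    {"tconst": "tt0000003", "primarytitle": "pauvre pierrot", "startyear": "1892"},
]
ratings = [
    {"tconst": "tt0000001", "averagerating": "5.8", "numvotes": "1000"},  # invented
    {"tconst": "tt0000002", "averagerating": "6.5", "numvotes": "100"},   # invented
]

everything = join_ratings(basics, ratings)
popular = join_ratings(basics, ratings, min_votes=500)
for row in popular:
    print(row["primarytitle"], row["averagerating"], row["numvotes"])
```

Building the lookup dict from the ratings side and streaming the titles past it keeps memory use bounded by the smaller table, assuming the ratings file is the smaller of the two.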
As for the number of lines in each file, as of 2017-08-21 we have:
name.basics.tsv.gz       8086560
title.basics.tsv.gz      4466246
title.crew.tsv.gz        4466246
title.episode.tsv.gz     2934335
title.principals.tsv.gz  3957899
title.ratings.tsv.gz     757412
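Line counts like these can be reproduced without unpacking the files to disk. The sketch below streams a .gz file and counts its lines; the function name `count_lines` is made up, and the count includes the header row (whether the figures above include it is not stated). The demo compresses three lines of the title.basics sample in memory.

```python
import gzip
import io


def count_lines(path_or_fileobj):
    """Count the lines of a gzip-compressed file, header row included."""
    # Binary mode avoids decoding costs; iteration splits on b"\n".
    with gzip.open(path_or_fileobj, "rb") as f:
        return sum(1 for _ in f)


# Header plus two data rows from the title.basics sample.
sample = (
    b"tconst\ttitletype\tprimarytitle\toriginaltitle\tisadult\t"
    b"startyear\tendyear\truntimeminutes\tgenres\n"
    b"tt0000001\tshort\tcarmencita\tcarmencita\t0\t1894\t\\N\t1\tdocumentary,short\n"
    b"tt0000003\tshort\tpauvre pierrot\tpauvre pierrot\t0\t1892\t\\N\t4\t"
    b"animation,comedy,romance\n"
)
n_lines = count_lines(io.BytesIO(gzip.compress(sample)))
print(n_lines)
```

With a downloaded file, `count_lines("title.ratings.tsv.gz")` would be the equivalent call.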
What are the options if I don't want to go along with this regime?
Not many, I fear. If the price is a concern, see above.
All of my findings on the new format are in this thread on the imdbpy-devel mailing list.
What other freely available film databases exist?
I think the best alternatives are https://www.themoviedb.org/ and http://www.omdbapi.com/, but I'm not familiar with either.