Quick Music Video Fact
MTV From 1992
Or: The shit I put up with for good TV in this house.
I had two conversations today about how I basically optimize statistical analysis and data reconfiguration...blah blah. Each conversation just wound up with me wanting to tell the same story so I thought I should write it down.
Some time ago I made a colossal mistake. I have a lot of music videos. Like a lot a lot, to me anyway. And there's organization and logic to it to build different playlists and whatever.
A couple of years ago I saw That Link to "every video from MTV's 120 Minutes" playlist. Man that thing sucked, but it did give me a plan. There are lists of every video from 120 Minutes. You can grab a lot of full episodes from the Archive.
In the course of grabbing all the 120 Minutes videos I found the Internet Music Video Database. They aim to catalog all the videos ever by year and I believe have standards and try to catch videos of text and other junk content. So I just scraped the site and downloaded every single video for the 17 years I cared about so far.
This presents me with a new problem. It added about 4980 videos and took me to 12,000 something total.
I'm in the "1990" directory now. There are 430 odd videos in here. I just checked 1992, that's another 520. I don't have time in my life to deal with this. Basically they'll get played in the overall "MTV" playlist that I almost never use (but which is of course now suddenly really good), but they won't show up in the specialty 120 Minutes or "Arcade / Pizzeria" type playlists.
For each one I need to figure out:
I give a lot of leeway on this but a filename should at least have the artist and song.
So suppose I just take a really broad stroke, weed out the obvious "shit I don't care about", and just throw it all in there and let Nature Take Its Course. I immediately need to create top-level "dance" "metal" and "top 40" directories. Bad videos and triplicates will bubble up as time goes on. But even this approach is way bigger than I can deal with right now. Digging this hole took about 10 hours or so one day. So of course I've been sitting on this taking no action for like 18 months so far because I'm way too burnt out to deal with it.
Even scripting to find out if a file is a duplicate isn't great, because different download methods will make differently named files. I guess the fastest 5 minute thing would be to take a checksum of every file outside the IMVDB area I created (xray no dummy) and then compare the checksum of each new file against that list and remove any duplicates.
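For what it's worth, the 5 minute version could look something like the rough sketch below. It's not something I've run against the real collection, and the function names and directory layout are made up for illustration. It assumes the scraped IMVDB files live in their own directory somewhere under the main library. It only reports matches and leaves the deleting to a human. It will also only catch byte-for-byte identical files, so the same video re-encoded or grabbed from a different source will sail right through.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so big videos don't eat all the RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(library_root, new_root):
    """Return (new_file, existing_file) pairs whose contents are identical.

    library_root: top of the whole collection (hypothetical layout).
    new_root:     the freshly scraped area, skipped while indexing the library.
    """
    library_root = Path(library_root).resolve()
    new_root = Path(new_root).resolve()

    # Pass 1: checksum every file outside the new area.
    known = {}
    for path in library_root.rglob("*"):
        if path.is_file() and new_root not in path.parents:
            known.setdefault(file_checksum(path), path)

    # Pass 2: checksum each new file and look it up in that list.
    duplicates = []
    for path in sorted(new_root.rglob("*")):
        if path.is_file():
            existing = known.get(file_checksum(path))
            if existing is not None:
                duplicates.append((path, existing))
    return duplicates


# Example use (paths are placeholders, nothing gets deleted):
# for new_file, existing in find_duplicates("/media/videos", "/media/videos/imvdb"):
#     print(f"{new_file} duplicates {existing}")
```

Filenames are ignored entirely, which is the whole point, since the different download methods name things differently.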
I mean at the end of the day it's no different than any of a thousand mass web-scrape and data-munging missions I set myself on / let myself get roped into.
The shit I put up with for good TV in this house. I have several similar magnitude media backlogs. At least in terms of mental strain.
Time flies.
- xrayspx's blog