The upcoming 2011 KDD Cup data mining competition, run by Yahoo! Labs, poses an interesting challenge: predict users' ratings for individual songs in the company's huge music database. Unlike previous KDD Cups, which came with so many variables that dimension reduction was a serious concern, Yahoo! Labs provides only a few: artist, genre, and album. No demographic or geographic information is disclosed. It is interesting to forecast the behavior of web users from such limited web records, and digging out clues that could support later direct marketing is rewarding as well. With competition datasets that contain up to 1 million users and 600 thousand songs, the project is a real-world, web-scale analytics problem.
Predicting a song's rating may require two-stage modeling. The relationship looks straightforward: rating = (genre + album + artist) * user. But building 1 million times 600 thousand models, one for each of the 1 million users on each of the 600K songs, would be a formidable job. The first stage therefore has to group users and songs into a manageable number of levels with unsupervised methods such as clustering. In the second stage, supervised models are trained for the songs and the users separately. The relationship among rating, genre, and artist, viewed as a 3D cube, would decide which group each song or user is most likely to fit in. After scoring, each song in the test dataset would have a song group ID and each user a user group ID. The association between song group and user group then predicts the rating values.
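The second stage can be sketched in a few lines of SAS, assuming the first stage has already attached a group ID to every user and every song. The dataset and variable names (ratings, test, scorecard, songgroup, usergroup) and the toy values below are hypothetical, and the group-pair mean is only one simple choice of predictor, not a method prescribed by the competition.

```sas
/* Hypothetical toy training data: one row per rating.        */
/* The group IDs are assumed to come from the clustering stage */
data ratings;
   input userid songid usergroup songgroup rating;
   datalines;
1 101 1 1 90
2 101 1 1 70
3 102 2 1 30
4 103 2 2 50
5 104 1 2 10
;
run;

/* Scorecard: mean rating for each (song group, user group) pair */
proc sql;
   create table scorecard as
   select songgroup, usergroup,
          mean(rating) as pred_rating,
          count(*) as n_ratings
   from ratings
   group by songgroup, usergroup;
quit;

/* Hypothetical test pairs, already scored into groups */
data test;
   input userid songid usergroup songgroup;
   datalines;
6 105 1 1
7 106 2 2
;
run;

/* Look up the predicted rating by group pair */
proc sql;
   create table scored as
   select t.userid, t.songid, s.pred_rating
   from test as t
        left join scorecard as s
        on t.songgroup = s.songgroup and t.usergroup = s.usergroup;
quit;
```

With these toy rows, user 6 falls in the (song group 1, user group 1) cell and is predicted the mean of 90 and 70, while user 7 gets the single rating observed in the (2, 2) cell. The left join leaves pred_rating missing for any group pair that never occurs in the training data, so a real scorecard would need a fallback such as the overall mean.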
Song listeners are connected by their preferences, whether on purpose or not. In the broad picture, those relationships form an artist-centered social network, with listeners scattered around those axes at various distances. As a result, music fans interact with each other and dwell in many social neighborhoods. In this project, the scorecard used to predict a song's rating will look like a huge matrix of song groups by user groups, with ratings as the cell values. Biologists who study the interactions among thousands of genes with DNA microarrays would be familiar with this scenario: a DNA microarray is typically used for high-throughput discovery of the relationships among many genes. Similarly, for the music social network, this microarray-like matrix would show the affinity between users and songs. Increasing the number of cells in the matrix, that is, the number of song groups or user groups, should bring higher precision. However, every added group also consumes more computing resources, so prediction accuracy and resource consumption are the foremost considerations for this particular problem.
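To make the matrix layout concrete, here is a small sketch that pivots a long table of group-pair ratings into the wide, microarray-like form, with one row per user group and one column per song group. The dataset names and the four toy values are hypothetical.

```sas
/* Hypothetical long-format scorecard: one row per group pair */
data scorecard;
   input songgroup $ usergroup $ rating;
   datalines;
song1 user1 80
song1 user2 30
song2 user1 10
song2 user2 50
;
run;

/* PROC TRANSPOSE with a BY statement needs sorted input */
proc sort data=scorecard;
   by usergroup songgroup;
run;

/* Wide layout: one row per user group, one column per song group */
proc transpose data=scorecard out=matrix(drop=_name_);
   by usergroup;
   id songgroup;
   var rating;
run;

proc print data=matrix noobs;
run;
```

The long format is the one a heat map procedure usually wants as input, while the wide format is easier to read as a lookup table, so keeping both is cheap at this scale.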
Statisticians call data mining statistical learning, while computer scientists refer to it as machine learning. Eventually both roads lead to the same place. The datasets for the KDD Cup 2011 data mining competition will be released on March 15. Although the 263 million ratings for Track 1 and 63 million ratings for Track 2 look like a lot, statisticians can probably gear up with their high-level weapons, such as SAS and R, to explore this field usually dominated by computer geeks.
Reference: 1. 'KDD CUP 2011 from Yahoo! Lab'. http://kddcup.yahoo.com/
********(0) DOWNLOAD PREVIEW DATASET TRACK2*****;
* http://kddcup.yahoo.com/dataset.track2.sample.tar.bz2;

******(1) INTEGRATE PREVIEW SONG DATA*********;
data trackdata;
   infile 'C:\trackData2.txt' dlm='|' missover dsd;
   informat TrackId AlbumId ArtistId GenreId1-GenreId15 $6.;
   input TrackId AlbumId ArtistId GenreId1-GenreId15;
   /* Treat the literal string 'None' as a missing value */
   if AlbumId = 'None' then call missing(AlbumId);
   if ArtistId = 'None' then call missing(ArtistId);
run;

/* Count the distinct artists */
proc sql;
   select count(distinct ArtistId) from trackdata;
quit;

********(2) NORMALIZE SONG DATA*********;
/* One output row per (artist, genre) occurrence */
data three;
   set trackdata(where=(missing(ArtistId)=0));
   array genre[*] GenreId1-GenreId15;
   do i = 1 to dim(genre);
      if missing(genre[i]) = 0 then do;
         songgenre = genre[i];
         output;
      end;
   end;
   keep ArtistId songgenre;
run;

*******(3) FAST CLUSTERING THE SONGS ONLY BY GENRES******;
proc sort data=three out=four;
   by ArtistId;
run;

/* Genre frequency per artist */
proc sql;
   create table five as
   select ArtistId, songgenre, count(songgenre) as freq
   from four
   group by ArtistId, songgenre
   order by ArtistId, songgenre;
quit;

/* One row per artist, one frequency column per genre */
proc transpose data=five out=five_t(drop=_name_) prefix=genre;
   by ArtistId;
   id songgenre;
   var freq;
run;

/* Genres that never occur for an artist get a frequency of 0 */
data six;
   set five_t;
   array x[*] _numeric_;
   do i = 1 to dim(x);
      if missing(x[i]) = 1 then x[i] = 0;
   end;
   drop i;
run;

proc fastclus data=six maxc=10 maxiter=100 out=clus;
   var genre:;
run;

*****(4) Simulate data for Genomic Heat Map********;
data music(keep=songgroup usergroup rate);
   do i = 1 to 20;
      songgroup = cats('song', i);
      do j = 1 to 20;
         usergroup = cats('user', j);
         rate = ranuni(0)*100;
         output;
      end;
   end;
run;

/* Graph template: a grid of filled squares colored by rate */
proc template;
   define statgraph myheat.Grid;
      begingraph;
         layout overlay / border=true
               xaxisopts=(label='Song groups')
               yaxisopts=(label='User groups')
               pad=(top=5px bottom=0px right=15px);
            scatterplot x=songgroup y=usergroup /
               markercolorgradient=rate
               markerattrs=(symbol=squarefilled size=30)
               colormodel=threecolorramp name='s2';
            continuouslegend 's2' / orient=vertical location=outside
               valign=center halign=right valuecounthint=10;
         endlayout;
      endgraph;
   end;
run;

/* Style template: fonts and the three-color ramp for the heat map */
proc template;
   define Style HeatMapStyle;
      parent = styles.harvest;
      style GraphFonts from GraphFonts "Fonts used in graph styles" /
         'GraphFootnoteFont' = (", ",8pt)
         'GraphLabelFont' = (", ",7pt)
         'GraphValueFont' = (", ",10pt)
         'GraphDataFont' = (", ",5pt);
      style GraphColors from graphcolors /
         "gcdata1" = cxaf1515
         "gcdata2" = cxeabb14
         "gcdata3" = cxffffff
         "gramp3cend" = cxaa081b
         "gramp3cneutral" = cx000000
         "gramp3cstart" = cx1ba808;
   end;
run;

/* Render the heat map to HTML */
ods html style=HeatMapStyle image_dpi=300 file='heatmap.html' path='d:\myfun';
ods graphics on / reset imagename='GTLHandout_Heatmaps' imagefmt=gif;
proc sgrender data=music template=myheat.grid;
run;
ods html close;