Feb 26, 2014

Fetching data from HTML source


Parsing HTML to get the data we need can be very frustrating. Luckily, Mathematica has a powerful HTML import function that can import raw HTML into several different formats. In my experience, importing HTML as "XMLObject" is usually the best way to go.
Here is an example using the Oscar nominees page:
xml = Import["http://oscar.go.com/nominees", "XMLObject"];   
We are interested in the list of nominated films:

body = Cases[xml, XMLElement["div", {"class" -> "nominee-by-film"}, ___], Infinity];
Extract titles:
title = Cases[body, XMLElement["span", {"class" -> "title"}, value_] :> value, Infinity] 
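The pattern above matches only when "class" is the element's sole attribute, and it binds value to the whole list of children rather than to the string inside it. A minimal sketch of a more tolerant variant, run on a hand-built XMLElement so it works without the network (the film names and the extra "id" attribute are made up for illustration, not taken from the page):

```mathematica
(* Hypothetical sample standing in for a fragment of the imported page *)
sample = XMLElement["div", {"class" -> "nominee-by-film"}, {
    XMLElement["span", {"id" -> "f1", "class" -> "title"}, {"Film A"}],
    XMLElement["h1", {"class" -> "numberOfNominations"}, {"10 Nominations"}],
    XMLElement["span", {"class" -> "title"}, {"Film B"}],
    XMLElement["h1", {"class" -> "numberOfNominations"}, {"9 Nominations"}]}];

(* ___ on both sides lets "class" appear among other attributes;
   {t_String} extracts the string itself instead of a one-element list *)
titles = Cases[sample,
   XMLElement["span", {___, "class" -> "title", ___}, {t_String}] :> t,
   Infinity]
```

With this form the titles come back as a flat list of strings, so no later Flatten is needed. Whether the real page carries extra attributes is an assumption; the stricter pattern is fine if it does not.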
Extract the number of nominations for each film:
nominee =
  Cases[body,
   XMLElement["h1", {"class" -> "numberOfNominations"}, value_] :>
    StringCases[value, x : NumberString :> ToExpression[x]], Infinity] ;
Put these two together:
result = Sort[Transpose[{title, Flatten@nominee}], #1[[2]] > #2[[2]] &]
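Sort with a custom ordering function works here. As an alternative sketch, SortBy with a negated key gives the same descending order and reads a little more directly. The pairs below are made-up placeholders, not real nomination counts:

```mathematica
(* Hypothetical {title, nominations} pairs, not real data *)
pairs = {{"Film A", 5}, {"Film B", 10}, {"Film C", 7}};

(* Sort in descending order of the second element of each pair *)
sorted = SortBy[pairs, -Last[#] &]

(* Take at most 10 rows without failing on shorter lists;
   UpTo requires Mathematica 10.3 or later *)
top = Take[sorted, UpTo[10]]
```

The Take with UpTo step is an optional safeguard: a fixed span such as 1 ;; 10 raises an error when the page lists fewer than ten films.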
Let's draw a bar chart of the 10 most-nominated films:
oscar = Import["http://www.oscars.org/awards/academyawards/about/awards/images/side_oscar.jpg"];
BarChart[result[[1 ;; 10, 2]],
 ChartLabels -> Placed[Flatten@result[[1 ;; 10, 1]], After],
 BarOrigin -> Left, Background -> LightBlue, ChartElements -> {oscar, {1, 1}},
  Axes -> None, LabelStyle -> {Bold, Darker@Blue, 14}] 

For this particular example, you can also try to get the same information directly from Wolfram|Alpha.

Related post: A discussion on Mathematica Stack Exchange

1 comment:

Anonymous said...

Welcome back!
This new post really came after a long while.
Please continue to keep this blog going!