Lunchtime Playground: Fun with Mathematica: Fetching data from HTML source

Feb 26, 2014

Fetching data from HTML source

Parsing HTML to get the data we need can be very frustrating. Luckily, Mathematica has a powerful HTML import function that can turn raw HTML into several different formats. In my experience, importing HTML as "XMLObject" is usually the best way to go.
Here is an example using the Oscar nominees page:

xml = Import["http://oscar.go.com/nominees", "XMLObject"];
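To see what the "XMLObject" form looks like without depending on a live page, here is a minimal sketch that parses a small hand-written HTML string with ImportString. The snippet, its class names, and the variable names htmlDemo and xmlDemo are made up for illustration; they only mimic the structure that the rest of this post assumes.

```mathematica
(* Hypothetical HTML snippet, not taken from any real page *)
htmlDemo = "<html><body><div class=\"nominee-by-film\"><span class=\"title\">Film A</span><h1 class=\"numberOfNominations\">10 nominations</h1></div></body></html>";

(* Parse the string into symbolic XML, the same form Import returns for a URL *)
xmlDemo = ImportString[htmlDemo, {"HTML", "XMLObject"}];

(* Every tag becomes XMLElement[tag, attributes, children], so Cases can search it *)
Cases[xmlDemo, XMLElement["span", _, children_] :> children, Infinity]
```

Looking at the imported expression with InputForm is a quick way to find the tag and class names worth matching on.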

We are interested in the list of nominated films:

body = Cases[xml, XMLElement["div", {"class" -> "nominee-by-film"}, ___], Infinity];

Extract titles:

title = Cases[body, XMLElement["span", {"class" -> "title"}, value_] :> value, Infinity]
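One thing to watch with patterns like these: an attribute list written as {"class" -> "title"} only matches elements whose attributes are exactly that list. The sketch below, on a hand-built tree with made-up attributes, compares the exact pattern with a more tolerant one. The names tree, strict, and loose are illustrative only.

```mathematica
(* Hand-built symbolic XML; the second span carries an extra, made-up "id" attribute *)
tree = XMLElement["div", {"class" -> "nominee-by-film"}, {
    XMLElement["span", {"class" -> "title"}, {"Film A"}],
    XMLElement["span", {"id" -> "x1", "class" -> "title"}, {"Film B"}]}];

(* Exact attribute list: only the first span matches *)
strict = Cases[tree, XMLElement["span", {"class" -> "title"}, v_] :> v, Infinity];

(* Tolerant attribute list: any span whose attributes include class="title" *)
loose = Cases[tree,
   XMLElement["span", {___, "class" -> "title", ___}, v_] :> v, Infinity];

{strict, loose}
```

If a page adds attributes to its tags later, the tolerant form is less likely to silently return an empty list.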

Extract the number of nominations:

nominee =
  Cases[body,
   XMLElement["h1", {"class" -> "numberOfNominations"}, value_] :>
    StringCases[value, x : NumberString :> ToExpression[x]], Infinity];
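As a small illustration of the StringCases step on its own, here is a sketch using made-up strings in the shape the page is assumed to use. The names samples and counts are hypothetical.

```mathematica
(* Made-up heading texts; the last one deliberately contains no number *)
samples = {"10 nominations", "1 nomination", "no number here"};

(* One sublist per input string; a string without a number yields an empty list *)
counts = StringCases[samples, x : NumberString :> ToExpression[x]];

(* Flatten drops the empty sublist, so the result is shorter than the input *)
Flatten[counts]
```

Because Flatten drops empty matches, it is worth checking that the title list and the flattened counts have the same length before pairing them up.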

Put these two together:

result = Sort[Transpose[{title, Flatten@nominee}], #1[[2]] > #2[[2]] &]
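The pairing and sorting step can be tried on toy data. The sketch below uses invented titles and counts, and also shows SortBy with a negated key as an alternative way to get the same descending order. All names ending in Demo are hypothetical.

```mathematica
(* Invented data standing in for the scraped titles and counts *)
titleDemo = {"Film A", "Film B", "Film C"};
countDemo = {3, 10, 7};

(* Pair each title with its count, then sort by count, largest first *)
pairsDemo = Transpose[{titleDemo, countDemo}];
resultDemo = Sort[pairsDemo, #1[[2]] > #2[[2]] &];

(* Alternative: sort by the negated count instead of a custom comparison *)
altDemo = SortBy[pairsDemo, -Last[#] &];

resultDemo
```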

Let's draw a bar chart of the 10 most-nominated films:

oscar = Import["http://www.oscars.org/awards/academyawards/about/awards/images/side_oscar.jpg"];
BarChart[result[[1 ;; 10, 2]],
 ChartLabels -> Placed[Flatten@result[[1 ;; 10, 1]], After],
 BarOrigin -> Left, Background -> LightBlue,
 ChartElements -> {oscar, {1, 1}},
 Axes -> None, LabelStyle -> {Bold, Darker@Blue, 14}]
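If the image cannot be downloaded, the chart still works with plain bars. This is a minimal sketch on invented data that keeps the other options and simply leaves out ChartElements. The names chartData and chart are hypothetical.

```mathematica
(* Invented {title, count} pairs, already sorted from most to fewest nominations *)
chartData = {{"Film B", 10}, {"Film C", 7}, {"Film A", 3}};

(* Same options as above, minus the picture used to draw the bars *)
chart = BarChart[chartData[[All, 2]],
   ChartLabels -> Placed[chartData[[All, 1]], After],
   BarOrigin -> Left, Background -> LightBlue,
   Axes -> None, LabelStyle -> {Bold, Darker@Blue, 14}]
```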

For this particular example, you can also try to get the same information directly from WolframAlpha.

Related post: A discussion on Mathematica Stack Exchange
