Parsing HTML to get the data we need can be very frustrating. Luckily, Mathematica has a powerful HTML import function that can import raw HTML data into several different formats. In my experience, importing HTML as "XMLObject" is usually the best way to go.
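To make the idea concrete before touching a live page, here is a minimal offline sketch. The HTML fragment, the class names "film", "title" and "count", and the variable names are all made up for illustration. It assumes that ImportString accepts the same "XMLObject" element for HTML that Import does.

```mathematica
(* Hypothetical HTML fragment standing in for a real page *)
html = "<html><body>
  <div class=\"film\"><span class=\"title\">Film A</span>
    <h1 class=\"count\">12 nominations</h1></div>
  <div class=\"film\"><span class=\"title\">Film B</span>
    <h1 class=\"count\">8 nominations</h1></div>
</body></html>";

(* Parse the string as HTML and return it as symbolic XML *)
xml = ImportString[html, {"HTML", "XMLObject"}];

(* Text of every <span class="title">; the ___ on either side of the
   rule lets the pattern match even if the tag has other attributes *)
titles = Cases[xml,
   XMLElement["span", {___, "class" -> "title", ___}, {t_String}] :> t,
   Infinity]

(* First number found in the text of every <h1 class="count"> *)
counts = Cases[xml,
   XMLElement["h1", {___, "class" -> "count", ___}, {s_String}] :>
    First@StringCases[s, n : NumberString :> ToExpression[n]],
   Infinity]
```

Matching the attribute list as {___, "class" -> "title", ___} rather than exactly {"class" -> "title"} is a design choice: the exact form stops matching as soon as the page adds an id or style attribute to the tag.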
Here is an example using the Oscar nominees page:
xml = Import["http://oscar.go.com/nominees", "XMLObject"];
We are interested in the list of nominated films:
body = Cases[xml, XMLElement["div", {"class" -> "nominee-by-film"}, ___], Infinity];
Extract the titles:
title = Cases[body, XMLElement["span", {"class" -> "title"}, value_] :> value, Infinity]
Extract the number of nominations:
nominee =
Cases[body,
XMLElement["h1", {"class" -> "numberOfNominations"}, value_] :>
StringCases[value, x : NumberString :> ToExpression[x]], Infinity] ;
Put these two together, sorted by number of nominations in descending order:
result = Sort[Transpose[{title, Flatten@nominee}], #1[[2]] > #2[[2]] &]
Let's draw a graph to show the top 10 most-nominated films:
oscar = Import["http://www.oscars.org/awards/academyawards/about/awards/images/side_oscar.jpg"];
BarChart[result[[1 ;; 10, 2]],
ChartLabels -> Placed[Flatten@result[[1 ;; 10, 1]], After],
BarOrigin -> Left, Background -> LightBlue, ChartElements -> {oscar, {1, 1}},
Axes -> None, LabelStyle -> {Bold, Darker@Blue, 14}]
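The combine, sort and chart steps can also be tried offline. The sketch below uses made-up titles and counts in the same nested shape that the Cases calls return. It swaps in SortBy for the explicit ordering function and uses Take with UpTo (which assumes Mathematica 10.3 or later), so a list with fewer than 10 films does not trigger a Part error. The name top is introduced here for illustration only.

```mathematica
(* Made-up data, nested the way the Cases calls return it *)
title = {{"Film A"}, {"Film B"}, {"Film C"}};
nominee = {{3}, {10}, {5}};

(* Pair each title with its count, then sort by count, largest first *)
result = SortBy[Transpose[{Flatten@title, Flatten@nominee}], -Last[#] &]
(* {{"Film B", 10}, {"Film C", 5}, {"Film A", 3}} *)

(* Keep at most 10 rows; UpTo avoids an error on short lists *)
top = Take[result, UpTo[10]];

(* Horizontal bar chart with the film titles placed after each bar *)
BarChart[top[[All, 2]],
 ChartLabels -> Placed[top[[All, 1]], After],
 BarOrigin -> Left, Axes -> None]
```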
For this particular example, you can also try to get the same information directly from Wolfram|Alpha.
Related post: A discussion on Mathematica Stack Exchange