Apr 19, 2010

Play with Bibliographic Data

I’ve got the publication list of a research institute. There are more than 1400 entries, exported in XML format from EndNote. As a data mining project, there are plenty of things you can do with this data. Let’s assume we are interested in the relationships among researchers and groups. Can we check this out quickly in Mathematica?

First, extract the data:

In the XML, the author list of each publication is stored as follows:

<authors>
<author>
  <style face="normal" font="default" size="100%">Paul, N.</style>
</author>
<author>
  <style face="normal" font="default" size="100%">Cao, B.</style>
</author>
… </authors>

xml = Import["Bib.xml"];

authors = Cases[xml, XMLElement["authors", _, content_] :> content, Infinity];

names = Flatten@Cases[#, XMLElement["author", _, {___, XMLElement["style", _, name_]}] :> name] & /@ authors;

What we get here is a list of author lists, one per publication.
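To try the extraction without the original file, here is a minimal sketch on an inline sample. The sample string, the element names records and record, and the variable names are made up for illustration; only the authors/author/style nesting comes from the export shown above, and the style attributes are left out to keep the string short.

```mathematica
(* Hypothetical sample: two publications with the same authors/author/style nesting *)
sample = "<records>" <>
   "<record><authors>" <>
   "<author><style>Paul, N.</style></author>" <>
   "<author><style>Cao, B.</style></author>" <>
   "</authors></record>" <>
   "<record><authors>" <>
   "<author><style>Cao, B.</style></author>" <>
   "</authors></record>" <>
   "</records>";

doc = ImportString[sample, "XML"];

(* Content of every <authors> element, one entry per publication *)
blocks = Cases[doc, XMLElement["authors", _, content_] :> content, Infinity];

(* Text of every <style> element found inside each block *)
sampleNames = Cases[#, XMLElement["style", _, {name_String}] :> name, Infinity] & /@ blocks
```

Using :> with a pattern name that differs from the variable being assigned keeps the rule safe to re-evaluate after the variable already holds a value.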

Let’s see how the number of publications varies with the number of authors per publication.

Sort[Tally[Length[#] & /@ names], #1[[1]] < #2[[1]] &]

{{1, 265}, {2, 280}, {3, 320}, {4, 224}, {5, 94}, {6, 85}, {7,
  39}, {8, 22}, {9, 20}, {10, 14}, {11, 11}, {12, 14}, {13, 5}, {14,
  3}, {15, 5}, {18, 1}, {19, 1}, {20, 1}}

[Image: xml]

We can see that most publications (1089 of 1404) have no more than 4 authors.
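The same tally can be turned into cumulative shares, which makes the "no more than 4 authors" observation easy to check for any threshold. This is only a sketch: the tally is copied from the output above, and the names counts and cumulative are my own.

```mathematica
tally = {{1, 265}, {2, 280}, {3, 320}, {4, 224}, {5, 94}, {6, 85}, {7, 39},
   {8, 22}, {9, 20}, {10, 14}, {11, 11}, {12, 14}, {13, 5}, {14, 3},
   {15, 5}, {18, 1}, {19, 1}, {20, 1}};

counts = tally[[All, 2]];
Total[counts]                       (* 1404 publications in total *)

(* Fraction of publications with at most n authors, for each n in the tally *)
cumulative = Transpose[{tally[[All, 1]], N[Accumulate[counts]/Total[counts]]}];
cumulative[[4]]                     (* {4, 0.775641}: about 78% have at most 4 authors *)
```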

Next, we would like to look at the relationships among the authors themselves, so we need to build a network of authors. For example, if a publication has 4 authors {A, B, C, D}, its part of the network is defined as a circle (each author linked to the next, and the last back to the first):

circle[list_] := Rule @@@ Partition[Append[list, First[list]], 2, 1]

circle[{A, B, C, D}]

{A -> B, B -> C, C -> D, D -> A}
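A replacement rule such as /. {x_, y_} :> {x -> y} is a tempting way to turn pairs into edges, but it has a pitfall for publications with exactly two authors: ReplaceAll tries the whole expression before its parts, and a list of two pairs itself matches {x_, y_}. The short sketch below shows the difference; A and B are assumed to be symbols without values.

```mathematica
pairs = Partition[Append[{A, B}, A], 2, 1]   (* {{A, B}, {B, A}} *)

(* The outer list has two elements, so it matches {x_, y_} before its parts do *)
pairs /. {x_, y_} :> {x -> y}                (* {{A, B} -> {B, A}} *)

(* Applying Rule to each pair works for any number of authors *)
Rule @@@ pairs                               (* {A -> B, B -> A} *)
```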

For all the publications, we get the following network:

[Image: xml2]
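The post does not show the code that produced the full network, so the following is only a sketch of one way to do it: build a circle of edges for every publication, flatten them into one edge list, and hand that to GraphPlot. The sample author lists and the helper name circle are stand-ins, and skipping single-author papers (which would only add self-loops) is my own assumption.

```mathematica
(* Hypothetical stand-in for the per-publication author lists *)
sampleNames = {{"A", "B", "C", "D"}, {"A", "B"}, {"C", "E"}, {"F"}};

(* Link each author to the next one, and the last back to the first *)
circle[list_] := Rule @@@ Partition[Append[list, First[list]], 2, 1];

(* Assumption: drop single-author papers, which would only contribute self-loops *)
edges = Flatten[circle /@ Select[sampleNames, Length[#] > 1 &]];

GraphPlot[edges, VertexLabeling -> True]
```

With the real data, sampleNames would be replaced by the list of author lists extracted from the XML.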

The graph shows two large research groups inside this institute. Redrawing it with only the top 10 contributors confirms this.

[Image: xml3]
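How the top contributors were selected is not shown. One plausible reading, sketched here under that assumption, is to rank authors by how many publications they appear on and keep only the edges whose two endpoints are both in the top group. The sample data, the helper circle, and the cutoff variable k are hypothetical; with the real data k would be 10.

```mathematica
(* Hypothetical stand-in for the per-publication author lists *)
sampleNames = {{"A", "B", "C", "D"}, {"A", "B"}, {"C", "E"}, {"A", "C"}, {"F"}};
k = 3;  (* would be 10 for the real data *)

(* Rank authors by number of appearances, most frequent first *)
ranked = SortBy[Tally[Flatten[sampleNames]], -Last[#] &];
top = First /@ Take[ranked, Min[k, Length[ranked]]];

(* Same circle construction as for the full network *)
circle[list_] := Rule @@@ Partition[Append[list, First[list]], 2, 1];
edges = Flatten[circle /@ Select[sampleNames, Length[#] > 1 &]];

(* Keep only edges between two top contributors *)
topEdges = Select[edges, MemberQ[top, First[#]] && MemberQ[top, Last[#]] &];

GraphPlot[topEdges, VertexLabeling -> True]
```

An alternative would be to restrict each author list to the top contributors first and rebuild the circles, which links top authors even when other co-authors sat between them in the byline.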

Not too bad for 10 minutes of coding.

I will not release the notebook this time, since I may not have the right to distribute the data. Sorry about that.