Apr 19, 2010

Play with Bibliographic Data

I’ve got the publication list of a research institute. There are more than 1400 entries, exported in XML format from EndNote. As a data mining project, there is plenty you can do with this data. Let’s assume we are interested in the relationships among researchers and groups. Can we check this out quickly in Mathematica?

First, extract the data.

In the XML, the author list of each publication is stored as follows:

<authors>
<author>
  <style face="normal" font="default" size="100%">Paul, N.</style>
</author>
<author>
  <style face="normal" font="default" size="100%">Cao, B.</style>
</author>
… </authors>

xml = Import["Bib.xml"];

authors = Cases[xml, XMLElement["authors", _, children_] :> children, Infinity];

names = Flatten@Cases[#, XMLElement["author", _, {___, XMLElement["style", _, name_]}] :> name] & /@ authors;

What we get is a list of authors for each publication.
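To try the extraction without the original file, the same idea can be run on a small inline document. This is only a sketch: the XML string is made up to mimic the layout shown above (the `records` and `record` element names and the sample variable names are assumptions), and the pattern binds each name as a string directly, so no `Flatten` is needed.

```mathematica
(* Made-up sample mimicking the layout above; this is not the real data *)
sample = StringJoin[
   "<records>",
   "<record><authors>",
   "<author><style face='normal'>Paul, N.</style></author>",
   "<author><style face='normal'>Cao, B.</style></author>",
   "</authors></record>",
   "<record><authors>",
   "<author><style face='normal'>Cao, B.</style></author>",
   "</authors></record>",
   "</records>"];

doc = ImportString[sample, "XML"];

(* One list of child elements per <authors> block *)
authorBlocks = Cases[doc, XMLElement["authors", _, children_] :> children, Infinity];

(* Pull the name string out of each <author>/<style> pair *)
sampleNames = Cases[#,
     XMLElement["author", _, {___, XMLElement["style", _, {name_String}], ___}] :> name] & /@ authorBlocks
(* {{"Paul, N.", "Cao, B."}, {"Cao, B."}} *)
```

Using `:>` instead of `->` keeps the pattern variables local, so the rules still work if `name` or `children` already has a value in the session.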

Let’s look at the relationship between the number of authors and the number of publications.

Sort[Tally[Length[#] & /@ names], #1[[1]] < #2[[1]] &]

{{1, 265}, {2, 280}, {3, 320}, {4, 224}, {5, 94}, {6, 85}, {7,
  39}, {8, 22}, {9, 20}, {10, 14}, {11, 11}, {12, 14}, {13, 5}, {14,
  3}, {15, 5}, {18, 1}, {19, 1}, {20, 1}}

[Figure: number of publications by number of authors]

We can see that most publications (1089 of 1404, about 78%) have no more than 4 authors.
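The share of publications with at most 4 authors can be checked directly from the tally. The numbers below are copied from the output above; the variable names are only illustrative.

```mathematica
(* Tally copied from the output above: {number of authors, number of publications} *)
tally = {{1, 265}, {2, 280}, {3, 320}, {4, 224}, {5, 94}, {6, 85}, {7, 39},
   {8, 22}, {9, 20}, {10, 14}, {11, 11}, {12, 14}, {13, 5}, {14, 3},
   {15, 5}, {18, 1}, {19, 1}, {20, 1}};

(* Total number of publications *)
total = Total[tally[[All, 2]]]
(* 1404 *)

(* Publications with no more than 4 authors *)
upToFour = Total[Cases[tally, {n_, c_} /; n <= 4 :> c]]
(* 1089 *)

(* Share of the total, roughly 0.776 *)
N[upToFour/total]
```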

Next, we would like to look at the relationships among the authors. For that we need to build an author network. For example, if a publication has 4 authors {A, B, C, D}, its network is defined as a cycle:

pub = {A, B, C, D};
Rule @@@ Partition[Append[pub, First[pub]], 2, 1]

{A -> B, B -> C, C -> D, D -> A}
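The per-publication rule can be wrapped in a small helper that also copes with very short author lists. This is a sketch, not the notebook's code: `ringEdges` is a hypothetical name, and returning no edges for a single author is my assumption about what is wanted.

```mathematica
(* Hypothetical helper: cycle of co-author edges for one publication *)
ringEdges[{} | {_}] := {}  (* fewer than two authors: no co-author edges *)
ringEdges[pub_List] := Rule @@@ Partition[Append[pub, First[pub]], 2, 1]

ringEdges[{"A", "B", "C", "D"}]
(* {"A" -> "B", "B" -> "C", "C" -> "D", "D" -> "A"} *)

ringEdges[{"A", "B"}]
(* {"A" -> "B", "B" -> "A"} *)

ringEdges[{"A"}]
(* {} *)
```

Applying `Rule @@@` to the pairs avoids a corner case of a `/. {x_, y_} :> ...` replacement: for a two-author list the partitioned result `{{A, B}, {B, A}}` is itself a two-element list, so the rule would match the whole list rather than each pair.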

For all the publications, we get the following network:

[Figure: author network for all publications]
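Putting all publications together could look roughly like the following. It is a sketch under assumptions: `toyNames` is made-up data standing in for the real `names`, `ring` is a hypothetical helper, and `Graph` and `WeaklyConnectedComponents` need Mathematica 8 or later, so this is not necessarily how the picture above was produced.

```mathematica
(* Made-up author lists standing in for the real `names` *)
toyNames = {{"A", "B", "C"}, {"B", "C"}, {"D", "E"}, {"E"}, {"A", "C", "B"}};

(* Cycle edges for one publication; fewer than two authors gives no edges *)
ring[pub_List] := If[Length[pub] < 2, {},
   Rule @@@ Partition[Append[pub, First[pub]], 2, 1]]

(* Collect the edges of every publication and drop repeated edges *)
edges = DeleteDuplicates[Flatten[ring /@ toyNames]];

(* Draw the network with author names as vertex labels *)
g = Graph[edges, VertexLabels -> "Name"]

(* Groups of authors linked by any chain of co-authorship *)
groups = WeaklyConnectedComponents[g]
```

On the toy data this yields two groups, one with A, B and C and one with D and E, which is the kind of structure to look for in the plot of the real network.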

The graph shows two large research groups inside this institute. Redrawing it with only the top 10 contributors confirms this.

[Figure: author network of the top 10 contributors]
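One possible way to restrict the network to the top contributors is to rank authors by publication count and keep only the edges between them. The post does not say how the reduced graph was built, so this is just one reading: `toyNames`, `ring`, and `k` are illustrative, and `k = 2` is used here instead of 10 to fit the toy data.

```mathematica
(* Made-up author lists standing in for the real `names` *)
toyNames = {{"A", "B", "C"}, {"B", "C"}, {"D", "E"}, {"E"}, {"A", "C", "B"}, {"C"}};
k = 2;  (* number of top contributors to keep; the post uses 10 *)

(* Count publications per author, most prolific first *)
counts = SortBy[Tally[Flatten[toyNames]], -Last[#] &];
top = Take[counts, Min[k, Length[counts]]][[All, 1]]
(* {"C", "B"} *)

(* Cycle edges for one publication; fewer than two authors gives no edges *)
ring[pub_List] := If[Length[pub] < 2, {},
   Rule @@@ Partition[Append[pub, First[pub]], 2, 1]]
edges = DeleteDuplicates[Flatten[ring /@ toyNames]];

(* Keep only edges whose two endpoints are both top contributors *)
topEdges = Select[edges, MemberQ[top, First[#]] && MemberQ[top, Last[#]] &]
(* {"B" -> "C", "C" -> "B"} *)

Graph[topEdges, VertexLabels -> "Name"]
```

Filtering edges drops links that pass through less prolific co-authors. Filtering each author list first and rebuilding the cycles would keep those links, so the two approaches can give different pictures.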

Not too bad for 10 minutes of coding.

I will not release the notebook this time, since I may not have the right to distribute the data. Sorry about it.

4 comments:

Anonymous said...

What a job! What you did in 10 minutes is amazing. Please post the notebook... This would be of great help for many!

Bryan

Nicholas said...

Excellent Job. It's great to read some more advanced Mathematica usage, I am kind of tired of the educational push (even though I am not against that particular application). Keep up the good work.

Robert Nachbar said...

Very cool indeed.

Is the XML format used by EndNote "standard"?

What about "clustering" the papers by topic, for example, by using keywords?

Bob

Unknown said...

Great tutorial, and thanks for sharing it!

One question though: when constructing the network of authors (e.g. {A,B,C,D}), what's the rationale for not including A->C, A->D, and B->D?

::Y.G::