Automatic parsing of citation text in academic references

Is there any software (or pseudo-code) which can automatically scan a piece of text (either pasted into the tool, or read from a .doc/.pdf) and identify citation data using standard formats? The data would then be split up into its constituent fields and exported in XML, CSV, or some other structured data format. I have looked at cb2Bib but it was only able to extract the year from Harvard-style references, which is insufficient.

Asked by: Guest

Guest [Entry]

"Try a tool such as RegexBuddy or Expresso.

If you're not a programmer, regular expressions may be a bit intimidating, but they're really not that hard, especially with a decent tool like one of the above.

Here's an example of someone using regular expressions to extract citations:

Citation parsing regular expression"
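To make the regular-expression suggestion concrete, here is a minimal sketch in Python. It assumes one specific Harvard-style layout (authors, year in parentheses, title, journal, volume(issue), pages); the pattern, the field names, and the helpers `parse_citation` and `to_csv` are illustrative assumptions, not taken from the linked example. Real reference lists vary a lot, so a single pattern like this will miss many entries.

```python
import csv
import io
import re

# Assumed layout (hypothetical, one of many Harvard variants):
#   Authors (Year) Title. Journal, Volume(Issue), pp. Start-End.
HARVARD = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})[a-z]?\)\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>.+?),\s+"
    r"(?P<volume>\d+)(?:\((?P<issue>\d+)\))?,\s+"
    r"pp?\.\s*(?P<pages>\d+(?:[-–]\d+)?)\.?$"
)

FIELDS = ["authors", "year", "title", "journal", "volume", "issue", "pages"]


def parse_citation(text):
    """Return a dict of fields, or None if the text does not fit the pattern."""
    match = HARVARD.match(text.strip())
    if match is None:
        return None
    # Optional groups (such as the issue) come back as None; store "" instead.
    return {key: (value or "") for key, value in match.groupdict().items()}


def to_csv(citations):
    """Parse each citation string and return the matches as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for citation in citations:
        record = parse_citation(citation)
        if record is not None:  # silently skip lines that do not match
            writer.writerow(record)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = (
        "Smith, J. and Jones, A. (2004) Parsing references automatically. "
        "Journal of Examples, 12(3), pp. 45-67."
    )
    print(to_csv([sample]))
```

In practice you would probably need one pattern per citation style and a fallback for entries that match none of them.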

Guest [Entry]

"Try
http://www.crossref.org/guestquery/#stqsearch

It can automatically parse your reference text and offers a link to the online article."

Guest [Entry]

Zotero is a plugin for Firefox which does this for web content. I'm not sure whether there is a similar tool for documents or PDFs.

Guest [Entry]

"This probably belongs more as a comment to @Abhinav, but Zotero definitely only handles structured data, as described here:

http://www.zotero.org/support/getting_stuff_into_your_library#importing_records_from_other_reference_tools

An interesting hack might be to write a program that uses each citation as a search query in your favorite database, then uses something like Zotero to generate the reference information. You could also download structured information from services like CiteULike. Let me know if you end up doing something like that! (Put it up on GitHub if you do ;)."
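The "citation as search query" idea could be sketched roughly as follows, assuming the Crossref REST API as the database (the answer does not name one) and its `query.bibliographic` parameter. The helper names `build_query_url`, `extract_fields`, and `lookup` are hypothetical, and the top hit is not guaranteed to be the right paper, so results should be checked by hand.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Crossref's public works endpoint is the "favorite database".
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation, rows=1):
    """Build a search URL that sends the raw citation text as the query."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def extract_fields(response):
    """Pull a few structured fields out of a decoded Crossref JSON response."""
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    top = items[0]  # best-scoring match; may still be the wrong work
    date_parts = top.get("issued", {}).get("date-parts", [[None]])
    return {
        "doi": top.get("DOI", ""),
        "title": (top.get("title") or [""])[0],
        "journal": (top.get("container-title") or [""])[0],
        "year": date_parts[0][0] if date_parts and date_parts[0] else None,
        "authors": [
            " ".join(part for part in (a.get("given"), a.get("family")) if part)
            for a in top.get("author", [])
        ],
    }


def lookup(citation):
    """Query the service over the network and return the extracted fields."""
    with urllib.request.urlopen(build_query_url(citation), timeout=30) as resp:
        return extract_fields(json.load(resp))
```

The returned DOI could then be handed to a reference manager to fetch the full record, which keeps the fragile free-text matching in one place.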