

Querying PubMed via the easyPubMed package in R


January 5, 2016

PubMed (NCBI Entrez) is an online database of citations for biomedical literature, available at the following URL: http://www.ncbi.nlm.nih.gov/pubmed. Data can also be retrieved from PubMed in an automated way via the NCBI Entrez E-utilities. A description of how the NCBI E-utilities work is available at the following URL: http://www.ncbi.nlm.nih.gov/books/NBK25501/.
easyPubMed is an R package I wrote that makes it easy to download content from PubMed in XML format. easyPubMed includes three functions: get_pubmed_ids(), fetch_pubmed_data(), and batch_pubmed_download().

Latest easyPubMed version

You can download the latest (dev) version of easyPubMed from GitHub. A convenient way to install a package from GitHub is to use devtools.

library(devtools)
devtools::install_github(repo = "dami82/easyPubMed", force = TRUE, build_opts = NULL)
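
If you prefer the stable release, easyPubMed is also available on CRAN and can be installed in the usual way:

```r
# install the stable release from CRAN
install.packages("easyPubMed")
library(easyPubMed)
```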

get_pubmed_ids()
get_pubmed_ids() takes a character string as its argument: the text of the PubMed query we want to perform, using the same syntax you would use for a regular PubMed query. This function queries PubMed via the NCBI eSearch utility, and the result of the query is saved on the PubMed History Server. It returns a list that includes information about the query, the PubMed IDs of the first 20 results, a WebEnv string, and a QueryKey string. The latter two strings are required by the fetch_pubmed_data() function, as they are used to access the results stored on the PubMed History Server. The returned list also contains a Count value (note that Count is a character) reporting the total number of results the query produced.

Example. To retrieve citations about "p53" published by laboratories located in Chicago, IL in 2019, we can run the following lines of code.

library(easyPubMed)
library(httr)

myQuery <- "p53 AND Chicago[Affiliation] AND 2019[PDAT]"
myIdList <- get_pubmed_ids(myQuery)

# the query produced the following number of results
as.integer(as.character(myIdList$Count))

# this is the unique WebEnv string
myIdList$WebEnv

# the PubMed ID of the first record produced by this query
myIdList$IdList[[1]]

# open the PubMed abstract corresponding to the latest record in a new browser window
httr::BROWSE(paste("http://www.ncbi.nlm.nih.gov/pubmed/", myIdList$IdList[[1]], sep = ""))

fetch_pubmed_data() and table_articles_byAuth()
fetch_pubmed_data() retrieves data from the PubMed History Server via the eFetch utility. Its only required argument is a list containing a QueryKey value and a WebEnv value; typically, this is the list returned by a get_pubmed_ids() call. By default, fetch_pubmed_data() returns the first 500 records produced by the PubMed query. To retrieve a different set of records, we need to specify the following optional arguments:

  • retstart, an integer defining the position of the first record to be retrieved by the fetch_pubmed_data() function.
  • retmax, an integer defining the total number of records to be fetched.
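
For instance, to fetch only a specific slice of a large result set, we can pass retstart and retmax directly. The sketch below assumes myIdList is the object returned by the get_pubmed_ids() call above:

```r
# fetch the second batch of 500 records stored on the History Server
nextBatch <- fetch_pubmed_data(myIdList, retstart = 501, retmax = 500)
```

This pattern can be repeated to page through large result sets, as shown in the real-world example later in this post.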

Even when a large number of records has to be fetched, it is recommended to download them in batches of 500 to 1000 records at a time. The maximum number of records that can be fetched in a single request is 5000. Note: easyPubMed DOES NOT return results as XMLInternalDocument-class objects anymore. Results are always returned as one or more character strings. We can fetch records in XML (text including XML tags), plain TXT, and other PubMed-supported formats. To extract specific XML fields of interest, we can rely on the custom_grep() function. To extract multiple fields at once, we can use table_articles_byAuth(), which casts the fields as a data.frame with one row per author. We can extract the first author, the last author, or all authors of each record. All other fields (DOI, Title, Journal, ...) will be recycled for each author.

# fetch PubMed records
topRecords <- fetch_pubmed_data(myIdList)

# class is 'character'; length is '1'
class(topRecords)
length(topRecords)

# fetch the first 20 PubMed records
top20records <- fetch_pubmed_data(myIdList, retstart = 1, retmax = 20)

# extract titles
myTitles <- custom_grep(xml_data = top20records, tag = "ArticleTitle", format = "char")
head(myTitles)

# extract multiple fields from each PubMed record
allFields <- table_articles_byAuth(top20records, included_authors = "last", getKeywords = TRUE)
head(allFields)

In the following real-world example, we are going to fetch all papers about "p53" published by laboratories located in Chicago between 2010 and 2019. For each record, the PubMed ID, DOI, journal abbreviation, first author's last name, and keywords will be extracted. Results are then saved as a csv file. All this can be accomplished with a few lines of code.

library(easyPubMed)

myQuery <- 'p53 AND Chicago[Affiliation] AND ("2010/01/01"[PDAT] : "2019/12/31"[PDAT])'
myIdList <- get_pubmed_ids(myQuery)

# fetch and process records in batches of 50
all_steps <- seq(1, as.integer(myIdList$Count), by = 50)
results <- lapply(all_steps, function(i) {
  y <- fetch_pubmed_data(pubmed_id_list = myIdList, retmax = 50, retstart = i)
  yy <- table_articles_byAuth(y, included_authors = "first", getKeywords = TRUE)
  yy[, c("pmid", "doi", "jabbrv", "lastname", "keywords")]
})
results <- do.call(rbind, results)
nrow(results)
head(results)

# save results as a csv file (the file name is arbitrary)
write.csv(results, "p53_chicago_2010-2019.csv", row.names = FALSE)

Download and extract fields
We can also save PubMed data locally before extracting fields from each record. An example is shown below.

myQuery <- 'p53 AND Chicago[Affiliation] AND ("2010/01/01"[PDAT] : "2019/12/31"[PDAT])'

fdt_files <- batch_pubmed_download(pubmed_query_string = myQuery,
                                   format = "xml",
                                   batch_size = 50,
                                   dest_file_prefix = "fdt",
                                   encoding = "UTF-8")

# file names
head(fdt_files)

# read files, extract fields, and then cast as data.frames
fdt_list <- lapply(fdt_files, table_articles_byAuth,
                   included_authors = "last", getKeywords = TRUE)

class(fdt_list)
sapply(fdt_list, class)

# aggregate
results <- do.call(rbind, fdt_list)
head(results)

These are some simple examples to help you get started with easyPubMed. Don't hesitate to post comments or email me at damiano DOT fantini AT gmail DOT com with questions, concerns, and suggestions. Thanks.

About Author

Damiano
Postdoc Research Fellow at Northwestern University (Chicago)


Source: http://www.biotechworld.it/bioinf/2016/01/05/querying-pubmed-via-the-easypubmed-package-in-r/