Getting taxonomy from NCBI with R & rentrez

Recently, I happened to work with Blast a lot. At some point we wanted to get some high-level taxonomy for our dataset. I’ve never done this, but I know the NCBI provides a query interface called Entrez, which can be accessed from command-line tools as well as from several programming languages. In R, you can query the Entrez system with the rentrez package. Here I describe the basics of getting taxonomy information with this package.

Let’s load our packages - rentrez and also magrittr (to get pipes for easier coding):

options(repr.plot.width=7, repr.plot.height=7, jupyter.plot_scale = 1)

library(magrittr)
library(rentrez)

Here is the list of available databases:

entrez_dbs() %>% print

 [1] "pubmed"          "protein"         "nuccore"         "ipg"            
 [5] "nucleotide"      "structure"       "genome"          "annotinfo"      
 [9] "assembly"        "bioproject"      "biosample"       "blastdbinfo"    
[13] "books"           "cdd"             "clinvar"         "gap"            
[17] "gapplus"         "grasp"           "dbvar"           "gene"           
[21] "gds"             "geoprofiles"     "homologene"      "medgen"         
[25] "mesh"            "nlmcatalog"      "omim"            "orgtrack"       
[29] "pmc"             "popset"          "proteinclusters" "pcassay"        
[33] "protfam"         "pccompound"      "pcsubstance"     "seqannot"       
[37] "snp"             "sra"             "taxonomy"        "biocollections" 
[41] "gtr"            

The basic command to get taxonomy for an ID:

taxo <- entrez_summary(db="taxonomy", id="1219383")
str(taxo)

List of 13
 $ uid             : chr "1219383"
 $ status          : chr "active"
 $ rank            : chr "species"
 $ division        : chr "g-proteobacteria"
 $ scientificname  : chr "Acinetobacter boissieri"
 $ commonname      : chr ""
 $ taxid           : int 1219383
 $ akataxid        : chr ""
 $ genus           : chr "Acinetobacter"
 $ species         : chr "boissieri"
 $ subsp           : chr ""
 $ modificationdate: chr "2015/07/31 00:00"
 $ genbankdivision : chr "Bacteria"
 - attr(*, "class")= chr [1:2] "esummary" "list"

You can extract just the part you want with $:

entrez_summary(db="taxonomy", id="1219383")$division

‘g-proteobacteria’

Search & links

But suppose for a moment that we do not have the staxids - can we still get this information with e.g. accession number (saccver or similar)?

Of course we can!

Let’s say we want to search the taxonomy for accession number WP_092748568.1. We start by searching the protein database, where the accession number comes from:

entrez_search(db="protein", term="WP_092748568.1")

Entrez search result with 1 hits (object contains 1 IDs and no web_history object)
 Search term (as translated):   

We see there is 1 hit, but it doesn’t show any information about it - yet. You could save the result of the query into an object and expect it with str() - or we can use magrittr pipe directly on the query:

entrez_search(db="protein", term="WP_092748568.1") %>% str

List of 5
 $ ids             : chr "1224730548"
 $ count           : int 1
 $ retmax          : int 1
 $ QueryTranslation: chr ""
 $ file            :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr> 
 - attr(*, "class")= chr [1:2] "esearch" "list"

Now we can see the internal NCBI ID of this item, under the $ids slot. We can now look up what other information is linked to this ID:

ncbi_id <- entrez_search(db="protein", term="WP_092748568.1")$ids

entrez_link(dbfrom="protein", id=ncbi_id, db="all") %>% str

List of 2
 $ links:List of 16
  ..$ protein_cdd                    : chr [1:16] "415644" "399551" "395146" "236886" ...
  ..$ protein_cdd_concise_2          : chr "163712"
  ..$ protein_cdd_specific_2         : chr "163712"
  ..$ protein_cdd_summary_nonpublic  : chr [1:13] "395146" "223824" "215328" "215003" ...
  ..$ protein_cdd_superfamily_2      : chr "415644"
  ..$ protein_nuccore                : chr "1224732327"
  ..$ protein_protein_cdart_summary  : chr "1224730548"
  ..$ protein_sparcle                : chr "10169238"
  ..$ protein_taxonomy               : chr "1219383"
  ..$ protein_bioproject             : chr "224116"
  ..$ protein_cdd_summary            : chr [1:16] "415644" "399551" "395146" "236886" ...
  ..$ protein_nuccore_wp             : chr "1224732327"
  ..$ protein_protein_cdart          : chr "1224730548"
  ..$ protein_protein_cdart_summary_2: chr "1224730548"
  ..$ protein_protfam                : chr [1:7] "326225" "315947" "246230" "246128" ...
  ..$ protein_taxonomy_wp2species    : chr "1219383"
  ..- attr(*, "class")= chr [1:2] "elink_classic" "list"
 $ file :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr> 
 - attr(*, "content")= chr " $links: IDs for linked records from NCBI\n "
 - attr(*, "class")= chr [1:2] "elink" "list"

As we can see, there is a link to the taxonomy database, showing us an id “1219382”. Now we can reuse this ID in our query:

our_id <- entrez_link(dbfrom="protein", id=ncbi_id, db="all")$links$protein_taxonomy

entrez_summary(db="taxonomy", id=our_id) %>% str

List of 13
 $ uid             : chr "1219383"
 $ status          : chr "active"
 $ rank            : chr "species"
 $ division        : chr "g-proteobacteria"
 $ scientificname  : chr "Acinetobacter boissieri"
 $ commonname      : chr ""
 $ taxid           : int 1219383
 $ akataxid        : chr ""
 $ genus           : chr "Acinetobacter"
 $ species         : chr "boissieri"
 $ subsp           : chr ""
 $ modificationdate: chr "2015/07/31 00:00"
 $ genbankdivision : chr "Bacteria"
 - attr(*, "class")= chr [1:2] "esummary" "list"

Cool. Now we can get taxonomic information for one sample. But what if we have 5 - or 5000?

Yes we can! In particular, we can supply most of the functions we used with a vector of IDs instead of just one ID. But let’s get some data first.

Real data - Blast output

I started with a blastp output in tabular format with few extra columns - in particular the accession number and Tax ID, so you need at least this: -outfmt '7 saccver staxids' to get taxonomy IDs for your blast hits (note that saccver is part of standard blast output, which can be specified with the std format keyword).

Note: You can also ask blast to provide e.g. scientific names, however so many of them were missing in my queries that I just gave up on using that.

Since staxids can include multiple IDs separated by ;, I had to process the data slightly with awk before they were ready:

# take columns 2 (saccver) and 13 (staxids) from blast output table (skipping header)
awk -F"\t" '!/#/ {print $2, $13}' blastp_outfmt7_std_staxids.tsv > blastp_saccver_staxids.txt

# expand items with multiple staxids row-wise by replacing ";" with "newline+saccver+space"
awk '{gsub(/;/, "\n"$1" ", $2); print $0}' blastp_saccver_staxids.txt > blastp_saccver_staxids_expanded.txt

Now we are ready to load the data into R:

taxa <- read.table("blastp_saccver_staxids_expanded.txt", h=F, sep=" ") #, stringsAsFactors = T)
colnames(taxa) <- c("acc", "id")
str(taxa)

'data.frame':	13842 obs. of  2 variables:
 $ acc: chr  "WP_092748568.1" "WP_023273437.1" "WP_023273437.1" "WP_088823490.1" ...
 $ id : int  1219383 1219382 1392540 1229165 29545 2691581 1907941 86304 203122 2675457 ...

Test data (10 ids)

So far we used the function str() to see results of our queries and to figure out which data to extract with $. Of course this is not practical for bulk queries with multiple IDs. For this reason there is a helper function - extract_from_esummary(). It accepts a list of results from your query and a name of the field you want to extract. We can pipe into it as usual:

entrez_summary(db="taxonomy", id=taxa$id[1:10]) %>% extract_from_esummary("division") %>% print

           1219383            1219382            1392540            1229165 
"g-proteobacteria" "g-proteobacteria" "g-proteobacteria" "g-proteobacteria" 
             29545            2691581            1907941              86304 
"b-proteobacteria" "b-proteobacteria" "g-proteobacteria" "g-proteobacteria" 
            203122            2675457 
"g-proteobacteria" "b-proteobacteria" 

Full data (~90k ids)

The basic command to extract taxonomy through Entrez looks like this:

entrez_summary(db="taxonomy", id=taxa$id) %>% extract_from_esummary("division")

I cannot actually ask Entrez to give me all data at once - they limit queries to certain (unknown) size. After a bit of experimenting, I found I can get 400 items at a time. So in reality, we need to divide the data into chunks and loop over the chunks to get the full dataset.

Note: the rentrez tutorial says I should use web history feature for large queries, however this option has the same limit on items as getting them directly. So I skipped web history and just got the stuff I needed directly. ¯\_(ツ)_/¯

nrow(taxa)
(nrow(taxa)/400) %>% round

13842

Divide data into chunks of 400 IDs and store them in a list:

id_list <- (seq_along(taxa$id)/400) %>% ceiling %>% split(taxa$id,.)

Use lapply to loop over the list using our extraction command:

all_division <- lapply(id_list, function(x) {entrez_summary(db="taxonomy", id=x) %>%
                             extract_from_esummary(., "division")})

Warning message:
“ID 332056 produced error 'cannot get document summary'”
Warning message:
“ID 2044510 produced error 'cannot get document summary'”
Warning message:
“ID 2694937 produced error 'cannot get document summary'”
Warning message:
“ID 2024862 produced error 'cannot get document summary'”
Warning message:
“ID 2660640 produced error 'cannot get document summary'”
Warning message:
“ID 2660639 produced error 'cannot get document summary'”
Warning message:
“ID 2731243 produced error 'cannot get document summary'”
Warning message:
“ID 1330039 produced error 'cannot get document summary'”
Warning message:
“ID 262776 produced error 'cannot get document summary'”
Warning message:
“ID 2529841 produced error 'cannot get document summary'”
Warning message:
“ID 2731248 produced error 'cannot get document summary'”
Warning message:
“ID 2731248 produced error 'cannot get document summary'”
Warning message:
“ID 2529844 produced error 'cannot get document summary'”
Warning message:
“ID 332056 produced error 'cannot get document summary'”
Warning message:
“ID 2529844 produced error 'cannot get document summary'”
Warning message:
“ID 2027912 produced error 'cannot get document summary'”
Warning message:
“ID 2660639 produced error 'cannot get document summary'”
Warning message:
“ID 2660640 produced error 'cannot get document summary'”
Warning message:
“ID 262776 produced error 'cannot get document summary'”
Warning message:
“ID 262776 produced error 'cannot get document summary'”

Now we can see the taxonomic representation in our dataset.

Note the magrittr “T-pipe” %T>% in the following command - it alows side-piping data into a plot, but also reusing it (here for printing the output of table). The extra pipes are the main reason I use the original magrittr package instead of e.g. dplyr.

all_division %>% unlist %>% table %T>% pie %>% sort(decreasing=T)

.
     g-proteobacteria      e-proteobacteria      b-proteobacteria 
                 4419                  2066                  1749 
           firmicutes      a-proteobacteria    CFB group bacteria 
                 1671                  1418                   867 
             bacteria         high GC Gram+        enterobacteria 
                  483                   302                   193 
     d-proteobacteria         cyanobacteria                       
                  176                   140                    65 
       proteobacteria          GNS bacteria        euryarchaeotes 
                   52                    51                    44 
      verrucomicrobia               archaea        planctomycetes 
                   43                    20                    18 
          spirochetes        actinobacteria        kinetoplastids 
                   18                     7                     6 
green sulfur bacteria        crenarchaeotes         thermotogales 
                    5                     3                     2 
          ascomycetes               beetles            chlamydias 
                    1                     1                     1 
    choanoflagellates 
                    1 

png

Wow, pretty… ugly. Let’s see what ggplot2 can do..

ggplot, just for fun

library(ggplot2)

df <- all_division %>% unlist %>% table %>% as.data.frame

colnames(df) <- c("Taxon", "Freq")
head(df)

A data.frame: 6 × 2
	Taxon	Freq
	<fct>	<int>
1		65
2	a-proteobacteria	1418
3	actinobacteria	7
4	archaea	20
5	ascomycetes	1
6	b-proteobacteria	1749

ggplot(df, aes(x = "", y = Freq, fill = Taxon)) +
  geom_col() +
  coord_polar(theta = "y")

png

It is much nicer looking, but I actually have a problem distinguishing the different colours in the plot. I am colourblind, sure, but I think there is just too many colours for anyone to handle.

df$Taxon %>% unique %>% length

28 categories - that’s a lot! Well I think for this amount of data the base R is better - by default. There might be a way to improve both plots, but this is not the time nor place. See you next time 😉️