DaTo: an atlas of biological databases and tools

Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).

Qilin Li¹, Yincong Zhou¹, Yingmin Jiao¹, Zhao Zhang¹, Lin Bai¹, Li Tong¹, Xiong Yang¹, Björn Sommer²,³, Ralf Hofestädt⁴, Ming Chen¹,*

1 Department of Bioinformatics, College of Life Sciences, Zhejiang University, 310058 Hangzhou, China

2 Computational Life Sciences, Department of Computer and Information Science, University of Konstanz, 78457 Konstanz, Germany

3 Faculty of Information Technology, Monash University, 3800 Melbourne, Australia

4 Department of Bioinformatics and Medical Informatics, Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany

Summary

This work presents DaTo, a semi-automatically generated world atlas of biological databases and tools. It extracts raw information from all PubMed articles which contain exact URLs in their abstract section, followed by a manual curation of the abstract and the URL accessibility. DaTo features a user-friendly query interface, providing extensible URL-related annotations, such as the status, the location and the country of the URL. A graphical interaction network browser has also been integrated into the DaTo web interface to facilitate exploration of the relationship between different tools and databases with respect to their ontology-based semantic similarity. Using DaTo, the geographical locations, the health statuses, as well as the journal associations were evaluated with respect to the historical development of bioinformatics tools and databases over the last 20 years.

We hope it will inspire the biological community to gain a systematic insight into bioinformatics resources. DaTo is accessible via http://bis.zju.edu.cn/DaTo/.

1 Introduction

In the past decades, a variety of publicly available data repositories and resources have been developed for many medical and biology-related applications, and many online databases and data analysis tools now support life science research. Although some of these resources have been collected in special journal issues, such as the Nucleic Acids Research (NAR) Database Issues [1] or Web Server Issues [2], others remain scattered throughout the Internet and within a vast amount of literature. Web searches via Google®, Yahoo® and similar general-purpose search engines do not exclusively retrieve online resources, which makes the extraction of useful information quite difficult. The huge number of resources and the lack of a complete list of them make it difficult for researchers to quickly find the appropriate tools or databases for their specific purpose [3].

To tackle these problems, we developed DaTo (Databases and Tools), a semi-automatic approach to collect, curate and navigate through a large list of different tools and databases.

* To whom correspondence should be addressed. Email: mchen@zju.edu.cn

Journal of Integrative Bioinformatics, 13(4):297, 2016 http://journal.imbio.de/

doi:10.2390/biecoll-jib-2016-297

https://dx.doi.org/10.1515/jib-2016-297

Moreover, DaTo adds a completely new dimension to the analysis and visualization of important tools and databases. By using a Google Maps™-based approach, it is possible to localize online resources to specific countries, states and cities. Users can therefore find out which research-related tools and databases are geographically close to their home university, which online resources are well developed at a certain university, and which ones might have to be extended in the future. It is also possible to locate cooperation partners in the neighborhood that provide services required for local research. Depending on the research area, it might be important to cooperate with nearby institutions.

In addition, we have analyzed the health status of the relevant web links, as well as the distribution of the respective publications across countries, journals and years.

2 Related Work

To address the issues discussed above, many groups have collected life-science-related online resources and made these collections searchable via the Internet.

For instance, the Neuroscience Information Framework (NIF) is a dynamic inventory of web-based neuroscience resources: data, materials, and tools accessible via any computer connected to the Internet [4]. However, NIF does not exclusively return biomedical tools or database resources, but also literature describing the application of these resources. This mixed result page makes it harder for researchers to find the appropriate resources.

Other examples include the Bioinformatics Links Directory (BLD) [5], OReFiL [6] (using URLs as a ‘proxy’ to extract items from BioMed Central papers), BIRI [7], JIBtools [8] and bio.tools [9].

BLD is a catalogue of links to bioinformatics resources, tools and databases based on recommendations from experts in the field. BIRI utilizes keywords and sentence structures to identify relevant terms through custom patterns. BLD and BIRI subdivide the resources into subclasses based on the research topic, which prevents them from returning resources that correspond to a specific search term. For example, a search for the word ‘miRNA’ in BLD as well as in BIRI returns no results. In addition, BIRI is no longer updated.

OReFiL is the only website collection which returns up-to-date and query-relevant online resources based on peer-reviewed publications. However, OReFiL does not focus on the collection of tools; it includes all online resources relevant to a specific keyword. In addition, some of the returned resources lack an accurate title or description, or contain unrelated items. Another problem is that OReFiL cannot return all the resources related to a search term. Taking ‘miRNA’ as an example and restricting the search results to 500, many miRNA-unrelated items are returned, some of which even link to PNG images without any relation to miRNA.

JIBtools is a collection of tool lists, each curated by an editor who is responsible for a specific field of expertise. This approach depends heavily on the motivation of individual persons to provide and update a field-appropriate list of tools.

Bio.tools also provides a manually curated list of tools. Any researcher is allowed to register with the system and to add entries. However, this process is not automated, and registering a tool is quite complex, as many different key terms have to be entered into the system.

Other large biological data repositories for researchers that should be mentioned are Re3data and Biosharing. Re3data (http://re3data.org, last updated April 13 2016) [10] is a comprehensive registry of biological data repositories available on the web, which lists over 1,500 research data repositories. It also supports browsing by subject, content type and country, and offers an API for researchers. However, the website is unstable and tends to freeze when users search for biological resources. Biosharing (https://biosharing.org, last updated Dec 7 2016) [11] is another resource on inter-related data standards, databases, and policies. It consists of 671 data standards, 831 databases and 85 policies, and displays them with different tags. However, Biosharing only shows these data in static pages, which makes it hard for users to analyze trends in biological data sources because temporally and spatially dynamic modules are lacking.

Moreover, none of these tools provide the opportunity to visualize the geographical locations of the universities that publish the databases or tools. Nowadays, however, a large number of Google Maps™-related approaches exist in other research fields, which make use of the underlying Google Maps™ JavaScript API to visualize specific scientific aspects in relation to their location. To give only one recent example: it was used to provide geolocational information concerning the health status of the Great Barrier Reef [12].

3 Architecture/Implementation

3.1 PubMed data and DaTo

PubMed currently includes citations and abstracts from over 5,000 life science journals for biomedical articles reaching back to September 1, 1994. Since its inception, PubMed has served as the primary tool for electronically searching and retrieving biomedical literature. However, like NIF, PubMed also contains a large amount of literature which is not related to databases or tools and its search results are too broad for users who want to find online resources.

Recognizing the limitations of the related approaches discussed above, we developed DaTo (Databases and Tools), which comprises 21,159 bioinformatics resources and is currently the most comprehensive and up-to-date database of its kind.

DaTo extracts the raw information from all the PubMed articles which contain exact URLs in their abstract section, followed by a manual check of the URL accessibility. DaTo features a user-friendly query interface, providing comprehensive annotations for each result, such as the description of the resource, the abstract of the original publication, and links to the corresponding PubMed entry and web page. A graphical interaction network browser has also been integrated into the DaTo web interface, enabling the exploration of the relationships between the tools and databases based on the similarity of their MeSH terms. MeSH (Medical Subject Headings) is a comprehensive controlled vocabulary for indexing journal articles and books in the life sciences, and it serves as a thesaurus that facilitates searching.
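The text does not state which similarity measure is applied to the MeSH terms. As a minimal sketch only, the following Python example uses the Jaccard index on sets of MeSH terms as a stand-in and derives network edges from it. The function names, the threshold value and the data layout are assumptions made for illustration and are not taken from DaTo.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index of two term sets (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(mesh_terms: dict[str, set[str]], threshold: float = 0.3):
    """Return (resource_a, resource_b, score) for all pairs scoring >= threshold.

    mesh_terms maps a resource name to the MeSH terms of its publication.
    The threshold is an arbitrary illustrative value, not a DaTo setting.
    """
    edges = []
    for name_a, name_b in combinations(sorted(mesh_terms), 2):
        score = jaccard(mesh_terms[name_a], mesh_terms[name_b])
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges
```

Two resources that share one of three distinct MeSH terms would receive a score of 1/3 and be connected at the default threshold; an ontology-aware measure that exploits the MeSH hierarchy could be substituted for `jaccard` without changing the edge construction.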

3.2 Data collection and management

DaTo aims to be the most comprehensive repository of databases and tools extracted from PubMed. Any PubMed article that provides a URL matching an “http” or “ftp” pattern in its title and/or abstract section is downloaded, and the resource name and URL are computationally extracted. The workflow is shown in Figure 1. The detailed information, such as the name and the URL, is then checked and corrected manually. The short descriptions mined for all records are also reviewed carefully, so that users can save time and reach more reliable conclusions by choosing the most appropriate analysis tools. As a result, the database provides records of databases and tools that are not only abundant but also accurate. DaTo is updated automatically from PubMed every two weeks, and the raw information is then manually curated.
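The URL extraction step described above can be illustrated with a short Python sketch. It is not the DaTo implementation: the regular expression, the handling of trailing punctuation and the function name are assumptions chosen for illustration.

```python
import re

# Matches http://, https:// or ftp:// followed by characters that are
# neither whitespace nor angle brackets (an assumed, simplified pattern).
URL_PATTERN = re.compile(r"(?:https?|ftp)://[^\s<>]+", re.IGNORECASE)


def extract_urls(text: str) -> list[str]:
    """Return the URLs found in text, in order of appearance, without duplicates."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # Abstracts often end a sentence right after the URL, so strip
        # trailing punctuation and closing brackets.
        url = match.rstrip(".,;:)]}")
        if url and url not in urls:
            urls.append(url)
    return urls
```

Applied to the sentence "DaTo is accessible via http://bis.zju.edu.cn/DaTo/.", the function returns the URL without the final full stop. Cases that such a pattern cannot resolve, for example URLs broken by spaces, are the kind of records that the manual curation step has to catch.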

The search result page is organized in six columns: “Local ID”, “PubMed ID”, “Database/Software Name”, “Description”, “URL Health” and “URL Country”. The advanced search option incorporates additional fields, such as the abstract, publication date and journal. Users can also combine search terms with the logical operators “and”, “or”, and “not” to find suitable biology-related software.

To trace the URL health, we developed a Perl script that checks whether a website is still available. Because every request to a website returns a response code, we can detect the status of the website. For example, a code starting with 2 means that the request was successful, a code starting with 3 means that the site has been redirected to another location, and a code starting with 4 or 5 means that an error occurred on the client or server side, respectively.

In addition, the URL health information is refreshed automatically whenever a user submits a search query.
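The response-code logic can be sketched as follows. The original check is a Perl script that is not reproduced here; this Python version is an illustrative equivalent under stated assumptions, and the labels and function names are hypothetical.

```python
import urllib.error
import urllib.request


def classify_status(code: int) -> str:
    """Map an HTTP status code to a coarse health label."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirected"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unknown"


def check_url(url: str, timeout: float = 10.0) -> str:
    """Request an HTTP(S) URL and return its health label.

    Returns 'unreachable' when no HTTP response is obtained at all.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # urlopen follows redirects itself, so this is the final status.
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError carrying the code.
        return classify_status(err.code)
    except (urllib.error.URLError, OSError):
        return "unreachable"
```

Because `urllib` follows redirects by default, `check_url` reports the status of the final location; a checker that should record the redirect itself would have to disable that behaviour.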

Figure 1: DaTo workflow.

3.3 Geographic location parsing and visualization

In order to facilitate an efficient investigation of the international geographic distribution of the hosts and to give the user a straightforward and accurate impression of the geographic location, we have traced all first authors’ affiliations and website IP addresses to reveal their location: region code, city, latitude, longitude, ISP, and organization as well as country code and country name. DaTo adopts MaxMind® GeoIP for the geographic location of the IP addresses and the Google Maps™ JavaScript API to display the atlas.
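The chain from URL to host to IP address to location record can be sketched in Python as shown below. This is an illustration only: `lookup` is a placeholder for a query against a GeoIP database and does not reproduce the MaxMind API, and all function names are assumptions.

```python
import socket
from typing import Callable, Optional
from urllib.parse import urlsplit


def host_of(url: str) -> Optional[str]:
    """Return the lowercase hostname of a URL, or None if it has none."""
    return urlsplit(url).hostname


def locate(
    url: str,
    lookup: Callable[[str], Optional[dict]],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Optional[dict]:
    """Resolve the host of url to an IP address and attach a location record.

    lookup stands in for a GeoIP database query (hypothetical interface):
    it takes an IP address and returns a dict of location fields or None.
    """
    host = host_of(url)
    if host is None:
        return None
    try:
        ip = resolve(host)
    except OSError:
        # DNS failures (socket.gaierror) are subclasses of OSError.
        return None
    record = lookup(ip)
    if record is None:
        return None
    return {"host": host, "ip": ip, **record}
```

Passing the resolver and the lookup as parameters keeps the sketch independent of any particular GeoIP library and allows it to be exercised with in-memory tables instead of network access.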

4 Application

4.1 Geographic location

A map of all the geographic locations of the hosts is shown in Figure 2. In addition, the distribution over time was computed, which can help researchers to analyze the historical development of tools and databases from a geographical perspective. The first biological host appeared in Italy on September 1, 1994 [13]. The number of biological hosts increased to over 800 by the year 2000, and the hosts spread over North America, Europe and East Asia. In the new century, biological databases and tools underwent a rapid development, especially in emerging countries such as China, Russia, Brazil, India and South Africa. Currently, hosts of biological resources exist on six continents, and most of them are located in the United States and the European Union.

Figure 2: Geographic locations of biological databases and tools (over the years 1994-2013).

Figure 3: San Francisco Bay Area geographic locations of biological databases and tools: a) San Francisco Bay plus neighborhood and Los Angeles plus neighborhood; b) San Francisco neighborhood only; c) San Francisco Bay Area.

Figure 3 shows the San Francisco Bay Area as an example at different levels of granularity. Figure 3a shows the number of databases and tools in San Francisco and Los Angeles and their respective neighborhoods. The level of detail depends on the zoom level, as is familiar from Google Maps™. Figure 3b shows San Francisco and its neighborhood in more detail. The 205 entries are distributed over the San Francisco Bay Area.

One of the locations in the top-left corner of Figure 3b shows a green “G”, indicating that the website has a good health status. Finally, Figure 3c shows the San Francisco Bay Area, which contains a number of red “B” markers, indicating that the corresponding web resources are offline. The 205 databases and tools from Figure 3b are distributed over San Francisco, South San Francisco, and Berkeley and its neighborhood. The data were collected over the years 1994 to 2013.

4.2 Country, Journal and Year Associations

Figure 4: DaTo status statistics of countries (A/top), journals (B/center) and years (C/bottom).

The status statistics show the top ten countries, the top ten journals and the top ten years (Figure 4A-4C). Figure 4A/top shows that the USA has the highest number of resources based on the number of publications. We also ranked the journals by their number of publications. The top three journals are Bioinformatics (Oxford), Nucleic Acids Research, and BMC Bioinformatics. These three journals account for more than half of all database/tool-related publications; no other journal contributes a comparable number of publications in this area. Nucleic Acids Research and Bioinformatics therefore have the greatest reach among developers of bioinformatics databases and tools (Figure 4B/center). Focusing on recent years (starting from 2007), the publication of new tools peaked in 2013 at 10%, whereas 2014 and 2015 showed only 8%. The remaining 31% are spread over the other years, going back to 1994, each of which accounts for at most 5% of all relevant articles.

5 Discussion

Few databases exist that catalogue bioinformatics resources themselves. As a manually curated database that focuses not only on the collection but also on the systematic analysis of bioinformatics resources, DaTo is of value to both experimental and computational biologists.

In future developments, we will incorporate a larger set of resources and improve the accuracy of the detailed information. While many publications provide a URL in the abstract, there are also exceptions. We will continue to mine these large data sets, although selecting records manually to ensure their high accuracy will take a long time.

To better understand the maintenance of biomedical resources, a systematic study of the status of URL links is necessary to reveal the qualitative and quantitative health status of these resources. Because maintenance work, such as updates or even a redesign, might temporarily change the availability of a website, DaTo must check resource availability frequently. Without such checks, the related information on a resource would be lost from DaTo. The statistics of DaTo reveal a worrying situation: a considerable number of the included resources have become unavailable, and this is a common problem regardless of the corresponding country or journal. More attention has to be paid to this issue, and mechanisms should be established to monitor resource availability.

As DaTo shows, some developing countries have made fast progress in the area of biology-related resources within the last few years. In comparison to the USA, however, most of these countries still have a long way to go and will have to invest more resources in bioinformatics research. Some countries started their research in bioinformatics relatively late, but most of them have progressed quickly in recent years.

The underlying algorithm for identifying the geographical location requires that the URL is clearly stated in the abstract. In the future, it might be possible to mine complete articles; however, this will be a very complex task. Also, the health status is only checked automatically. If a URL has changed over the years, for example because a tool has moved to a different hosting institution, DaTo will not find it, and manual curation is again required. Furthermore, DaTo does not take versioning into account: if a tool is published in two separately published versions, DaTo will interpret them as two different tools. Moreover, a project might be developed and hosted at several different universities; DaTo does not yet handle this case optimally either.

As previously mentioned, DaTo relies on URLs that are stated in the abstract. We therefore strongly encourage the authors of all publications within the scope of DaTo to include the URLs of their published services or tools in the corresponding abstract.

Nowadays, biological sciences are generating more data than ever. It is therefore necessary to organize, catalogue and rate these resources, so that the information they contain can be explored most effectively [14]. The next steps in the development of DaTo will thus be the integration of a wiki-like concept and of categories similar to the ones used by MetaBase [15]. Moreover, the visualization of the interconnections and the evolution of databases and tools has to be improved.

The rapid growth of available bioinformatics databases and tools requires new approaches to organize and mine them. DaTo is a comprehensive atlas of biological databases and tools. It is based on the biological resources collection from PubMed and facilitates the investigation of the semantic relationship between different databases/tools based on the involved MeSH terms.

DaTo is a well-organized database with high accuracy. It provides comprehensive analysis tools including geographical location, URL health status and semantic similarity visualization. We hope it will inspire the biological community to systematically gain an insight into bioinformatics resources.

Compared with biological data repositories such as Re3data and Biosharing, DaTo includes over 17,000 records of bioinformatics tools and databases based on the first authors’ affiliations. In addition, DaTo supports a dynamic display with Google Maps™, helping users to explore trends in bioinformatics since 1993. Currently, we are developing a new version of DaTo called DaTo2, which will display biological data resources in a more concise format.

We believe that DaTo will be an important contribution enabling the systematic mining of bioinformatics resources for users as well as developers. DaTo is accessible via http://bis.zju.edu.cn/DaTo/.

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China [No. 30971743, 31371328, 31450110068, 31571366], and CSC & DAAD (PPP program No. 57136444).

References

[1] X.M. Fernández-Suárez, D.J. Rigden, M.Y. Galperin, The 2014 nucleic acids research database issue and an updated NAR online molecular biology database collection. Nucleic acids research, 42(D1):D1-D6, 2014.

[2] S. Hamperl, G. Benson, C. Brown, et al., Editorial: Nucleic Acids Research annual Web Server Issue in 2014. Nucleic acids research, 42(1):W1-W2, 2014.

[3] Y.-B. Chen, A. Chattopadhyay, P. Bergen, C. Gadd, N. Tannery, The Online Bioinformatics Resources Collection at the University of Pittsburgh Health Sciences Library System—a one-stop gateway to online bioinformatics databases and software tools. Nucleic acids research, 35(suppl 1):D780-D785, 2007.

[4] D. Gardner, H. Akil, G.A. Ascoli, et al., The neuroscience information framework: a data and knowledge environment for neuroscience. Neuroinformatics, 6(3):149-160, 2008.

[5] M.D. Brazas, D. Yim, W. Yeung, B.F. Ouellette, A decade of web server updates at the bioinformatics links directory: 2003–2012. Nucleic acids research, gks632, 2012.

[6] Y. Yamamoto, T. Takagi, OReFiL: an online resource finder for life sciences. BMC bioinformatics, 8(1):1, 2007.

[7] G. de la Calle, M. García-Remesal, S. Chiesa, D. de la Iglesia, V. Maojo, BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature. BMC bioinformatics, 10(1):1, 2009.

[8] R. Hofestädt, B. Kormeier, M. Lange, et al., JIBtools: a Strategy to Reduce the Bioinformatics Analysis Gap. Journal of Integrative Bioinformatics, 10(1):226, 2013.

[9] J. Ison, K. Rapacki, H. Ménager, et al., Tools and data services registry: a community effort to document bioinformatics resources. Nucleic acids research, gkv1116, 2015.

[10] H. Pampel, P. Vierkant, F. Scholze, et al., Making research data repositories visible: the re3data.org registry. PloS one, 8(11):e78080, 2013.

[11] P. McQuilton, A. Gonzalez-Beltran, P. Rocca-Serra, et al., BioSharing: curated and crowd-sourced metadata standards, databases and data policies in the life sciences. Database, 2016:baw075, 2016.

[12] H. Nim, T. Done, F. Schreiber, S. Boyd, Interactive geolocational and coral compositional visualisation of Great Barrier Reef heat stress data, in: Big Data Visual Analytics (BDVA), 2015, IEEE, 2015, pp. 1-7.

[13] S. Pongor, Z. Hátsági, K. Degtyarenko, et al., The SBASE protein domain library, release 3.0: a collection of annotated protein sequence segments. Nucleic acids research, 22(17):3610, 1994.

[14] J. Huang, B. Ru, P. Zhu, et al., MimoDB 2.0: a mimotope database and beyond. Nucleic acids research, gkr922, 2011.

[15] D.M. Bolser, P.-Y. Chibon, N. Palopoli, et al., MetaBase—the wiki-database of biological databases. Nucleic acids research, 40(D1):D1250-D1254, 2012.