We are living through a rapidly evolving life sciences revolution. It is based on the
ability to identify, read, understand, and manipulate the four nucleotides that
code for all life forms on the planet. These four bases make up deoxyribonucleic
acid (DNA). Over the past decade an increasing number of scientists, labs, and
computer centers throughout the world have chosen to produce, store, and use
biodata. These data can take the form of full genomes, specific genes, parts of genes,
single-letter variations in the genetic code (SNPs), proteins, or a variety of other
kinds of organic molecule data.
Bio-literacy is an essential first step in building a bio-based economy (biotechonomy).
So far most academic research has focused on sequencing, understanding, and
annotating genomes or parts thereof. There has been little focus on the customer.
This leaves open a series of interesting questions, such as: Who is accessing and
reading these tidal waves of data? What are they being used for? How might this
usage pattern change industrial structures and national competitiveness? The
Life Sciences Project at HBS has drafted a first, and quite rough, map of who
is producing, storing, and using public biodata. We hope this draft will improve
and become far more complete as the project evolves. As the project moves forward,
we intend to add more data, include key private data providers, and expand
the time periods analyzed.
Because just a few companies make the equipment required to produce biosequence
data, we analyzed the sequencer market to build a proxy for the
world’s DNA sequencing capacity. This gave us a sense of how much data is being
generated, how much is public and how much is private, and what the growth trends
are. We then tried to understand who is accessing this data and for what purpose.
Some are carrying out strictly academic research, others are downloading data
in an attempt to package and sell results, still others are attempting to patent
and commercialize products derived from the data. To get a sense of these patterns,
we analyzed the server logs of the three key public biodatabases. Millions of
data points give us an initial glimpse of how the biotechonomy is evolving in
the academic, non-profit, and private spheres.
To protect privacy, no individual user is identified; instead, we aggregated
usage patterns by country, domain, and, in the case of GenBank in the US,
by organism and format. We also created a proxy variable to identify dispersion
or concentration of downloads from the European database.
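
The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual method: the record layout, the sample values, and the function names are all hypothetical, and the Herfindahl-Hirschman index is used here only as one plausible stand-in for the dispersion/concentration proxy, whose exact definition is not given in this overview.

```python
from collections import Counter

# Hypothetical, already-anonymized log records: (country_code, domain).
# These values are illustrative only and are not taken from the paper's data.
records = [
    ("US", "edu"),
    ("US", "com"),
    ("JP", "jp"),
    ("DE", "de"),
    ("US", "edu"),
    ("GB", "uk"),
]


def aggregate(records, field):
    """Count downloads per country (field=0) or per domain (field=1)."""
    return Counter(record[field] for record in records)


def concentration(counts):
    """Herfindahl-Hirschman index: the sum of squared download shares.

    Assumed stand-in for the paper's proxy variable. It ranges from 1/n
    (downloads spread evenly over n groups) to 1.0 (a single group).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no downloads to summarize")
    return sum((count / total) ** 2 for count in counts.values())


by_country = aggregate(records, 0)
by_domain = aggregate(records, 1)
hhi_country = concentration(by_country)
```

Because only group counts are retained, no individual user can be recovered from the aggregated output; a value of `hhi_country` near 1.0 would indicate downloads concentrated in a few countries, while a value near 1/n would indicate wide dispersion.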
This paper provides a brief overview of the initial research. We highlight eight key results and note what surprised us within each of them.
Unaffiliated
43 pages
2002-2003 Working Papers. Copyright © President and Fellows of Harvard College