Online stores of current data allow scientists instantaneous access to the information they need
Credit for the rise of databases (collections of data organized for rapid search and retrieval) rightly goes to the advent of computers. But access to those databases was limited and slow in the early years, requiring researchers to wait for punch cards or magnetic tape to arrive in the mail. The Internet changed that, allowing thousands of users to access up-to-the-minute information instantaneously.
Over the past 50 years, “I can’t think of a single change that’s been as dramatic as the Internet in enabling the scientific community to share its data,” says Paul Davie, general manager of the U.S. operations of the Cambridge Crystallographic Data Centre, which manages the Cambridge Structural Database (CSD).
One example is the evolution of how chemists have used Chemical Abstracts Service (CAS), founded in 1907. For many decades, CAS analysts would read print journal articles and create abstracts and index information for each paper. The information would include author and subject, as well as formulas and ring information for compounds described in the paper. The abstracts were compiled into monthly mailings. Indexes came out at six-month intervals, then were gathered together into 5- or 10-year compilations.
For a literature search, a researcher would start with a keyword or formula, checking recent six-month indexes and then the larger volumes for leads to particular abstracts. Next, the researcher would go searching in the library stacks for the referenced journal articles. The researcher would repeat the search for different keywords or structure permutations.
“It was really a slow, tedious process,” says Dana Roth, chemistry librarian at California Institute of Technology.
It didn’t help matters when several—or several hundred—people needed access to resources at the same time. At Stanford University, a sophomore organic chemistry laboratory class required students to use the literature to help identify unknown compounds they characterized in the lab. In the days of print, “some students would have the indexes and some the abstracts, so someone could search an index and then have to walk all over the library to find the abstract book,” says chemistry librarian Grace Baysinger.
Over time, CAS started feeding information into a computer database and shipping microform and tape products as well as print. The Internet enabled direct access to the database in the 1980s, albeit through a command-line interface plus numeric codes to describe structural features. In 1995, CAS introduced SciFinder and a graphical interface.
This year CAS registered its 100 millionth substance, a silicon-based compound created by drug discovery company Coferon to treat acute myeloid leukemia. The compound’s chemical name is (4S)-6-(4-chlorophenyl)-N-ethyl-8-[2-[[4-[(hydroxydimethylsilyl)methyl]benzoyl]amino]ethoxy]-1-methyl-4H-[1,2,4]triazolo[4,3-a][1,4]benzodiazepine-4-acetamide.
A print search in 1960 for the compound, starting from triazolo benzodiazepine-4-acetamide, would have taken an experienced searcher at least five hours, estimates Matthew J. Toussant, who started at CAS in 1980 as an analyst and is now senior vice president for product and content operations. A command-line search in 1980 would have taken at least an hour, again for someone familiar with the necessary structure codes. Today, the search takes only as long as it takes to type in the name or draw the structure in SciFinder or ChemDraw, and thousands of users can successfully search for it simultaneously.
In addition to promoting easy access to database users on the front end, the Internet now plays a key role in funneling information into databases on the back end. Like CAS, other database resources started with people manually collecting information.
The U.S. Protein Data Bank (PDB) decided to start having researchers deposit structures electronically via the Web in the 1990s. “At the time there was nothing like it,” says Joel Sussman, who led PDB from 1994 to 1999 and is now director of the Israel Structural Proteomics Center at the Weizmann Institute of Science. The software PDB created incorporated validation algorithms to ensure researchers correctly filled in required fields. “We checked it and checked it and checked it,” Sussman adds. “I was afraid that if we released it prematurely and it didn’t work, no one would look at it again.”
Similar electronic entry tools applied at CAS, CSD, and elsewhere have sped up data entry, aided quality control, and lowered costs even as the amount of scientific information has exploded.
Meanwhile, organizations are forging ahead to link information repositories in ways that couldn’t be imagined without the Internet. CAS, for example, includes hyperlinks to journal articles and patents. PDB and CSD managers are now collaborating to match small molecules included in protein structures with corresponding CSD entries.
And new fields of cheminformatics and bioinformatics have developed to exploit the collective information contained in these and other scientific databases. “There are whole groups of computational biologists who rely on being able to have PDB data available to design new kinds of proteins synthetically,” adds Helen Berman, a chemistry and chemical biology professor at Rutgers University and associate director of the Research Collaboratory for Structural Bioinformatics, which now manages PDB.
Chemical safety information is also much more widely accessible, at places such as the National Library of Medicine’s Hazardous Substances Data Bank and Toxline, both of which link to literature references. Those involved in the regulatory side of the chemistry enterprise can readily access Occupational Safety & Health Administration and Environmental Protection Agency regulations and guidance online. “Through the 1980s, I subscribed to and was delivered the Federal Register every day” to stay informed about new rules and public notices, says Neal Langerman, a chemist who held faculty positions at Tufts University and Utah State University before starting a consulting business, Advanced Chemical Safety. “Now, I get the Electronic Code of Federal Regulations to deliver to me exactly what I want.”
The ease of accessing so much information does concern some. If an Internet search fails, “How often do our young scientists ask, ‘Well, maybe the data are there, but they haven’t been digitized?’ ” asks Langerman. For a young researcher used to searching through Google, turning to a library to find information that predates the digital era may be a high hurdle.
Also, the work that goes into maintaining access to computerized data is considerable. Think about what it means to upgrade personal computer software and retain access to 10-year-old documents or data files. Then consider what it must take to ensure data integrity and instantaneous worldwide access to 100 million chemical substances or 103,000 protein structures for 50 or more years. Additionally, for open-access databases, funding can be a challenge. “There seems to be a lot of appetite within funding agencies to start resources, but later there seems to be a reluctance to continue funding them,” says Stephen Burley, director of the Research Collaboratory for Structural Bioinformatics and the Center for Integrative Proteomics Research at Rutgers.
All told, databases accessed through the Internet have clearly eased scientists’ work. Says CAS’s Toussant, “What we do today is quicken the pace of all of the work that people do.”