Data Organization -- Chromatograms and Sequence Data

Millions of raw data files plus derived sequence/qual score files
Different species, tissues, processes, runs on the same original sample
Production doubled every year

Cluster of multi-CPU Solaris servers, 4 GB RAM each, sharing 6 TB of disk (mirrored or RAID 5)
Backup/restore issues. Company's "crown jewels"

Originally, a suite of Perl scripts to organize the data, calling in-house and public-domain software
Later, an Oracle DB to organize everything, with Pro*C programs to enter/extract data

Data integrity issues - silent corruption experienced. Hence an integrity checker, run in batch mode weekly (DB vs file system vs ASCII index files)
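
A three-way check of this kind might look roughly like the sketch below. It is an illustration only, not the original checker (which was built around Perl and Pro*C): the function name `cross_check`, the source labels, and the use of per-file checksums are all assumptions made for the example.

```python
def cross_check(sources):
    """Compare several catalogs of the same files.

    `sources` maps a source label (hypothetical: "db", "fs", "index")
    to a dict of file name -> checksum string.
    Returns a list of (file name, problem description) tuples.
    """
    all_names = set()
    for entries in sources.values():
        all_names.update(entries)

    problems = []
    for name in sorted(all_names):
        # A file known to one source should be known to all of them.
        missing = [label for label, entries in sources.items() if name not in entries]
        if missing:
            problems.append((name, "missing from " + ", ".join(missing)))
        # Where the file is present, every source should agree on its checksum.
        checksums = {entries[name] for entries in sources.values() if name in entries}
        if len(checksums) > 1:
            problems.append((name, "checksum mismatch"))
    return problems


# Toy data: b.scf is absent from the index and differs between DB and disk.
problems = cross_check({
    "db": {"a.scf": "111", "b.scf": "222"},
    "fs": {"a.scf": "111", "b.scf": "999"},
    "index": {"a.scf": "111"},
})
for name, problem in problems:
    print(name, problem)
```

Comparing every source against the union of all names means a file lost from any one of the three places gets reported, whichever place lost it.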

Filters out unreadable files, sequences that are too short, names that violate conventions (==> lab screw-up), etc.
Extracts sequence and qual scores (statistical measure of how good each base call is)
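
The filtering rules above can be sketched as follows. The naming pattern, the 50-base minimum, and the convention that an unreadable file yields `None` are all hypothetical stand-ins; the slide does not give the actual conventions or thresholds.

```python
import re

# Hypothetical naming convention: two letters, six digits, a dot,
# then a lowercase letter and a digit (e.g. "AB000123.x1").
NAME_PATTERN = re.compile(r"^[A-Z]{2}\d{6}\.[a-z]\d$")

# Hypothetical minimum usable read length, in bases.
MIN_LENGTH = 50


def reject_reason(name, sequence):
    """Return why a read is rejected, or None if it passes all filters.

    `sequence` is None when the chromatogram file could not be read.
    """
    if sequence is None:
        return "unreadable file"
    if not NAME_PATTERN.match(name):
        return "name violates convention"
    if len(sequence) < MIN_LENGTH:
        return "sequence too short"
    return None


reads = [
    ("AB000123.x1", "ACGT" * 20),  # 80 bases, valid name
    ("AB000124.x1", "ACGT" * 5),   # only 20 bases
    ("bad name", "ACGT" * 20),     # breaks the naming pattern
    ("AB000125.x1", None),         # file could not be read
]
results = {name: reject_reason(name, seq) for name, seq in reads}
```

Returning a reason rather than a bare true/false makes it easier to report naming violations back to the lab separately from short or unreadable reads.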

In early 2002, about 14,000,000 chromatograms in the archive and DB; DB was ~100 GB, biggest table had 19,000,000 rows

© Copyright 2003 - 2009 Cohen Software Consulting, Inc