Background
CGVdb obtains its source data by searching the PubMed database with a query string composed according to expert knowledge. The preliminary document set retrieved in this first step then undergoes manual verification to filter out false positives, so that only papers that actually concern gene variations in the Chinese population are retained. The basic criteria for data selection are:
- The reported cases with gene variation must come from the Chinese population or one of its sub-ethnic groups;
- The paper must be about "inheritable" gene sequence variations;
- SNP data obtained through large-scale direct sequencing that lack a known relation to phenotype are excluded;
- Haplotype or allele frequencies estimated by STR markers are excluded;
- Somatic variations and artificial variations are excluded;
- Long fragment losses, such as cytogenetic cases, trisomy, translocation, chromosome deletion, or deletions without a validated sequence position, are excluded;
- Results obtained merely from cell lines or animal models are excluded.
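The criteria above are applied manually by curators, but their combined logic can be written down as a single predicate. The following is only an illustrative sketch: the `PaperAnnotation` class, its field names, and `passes_selection` are hypothetical and are not part of CGVdb itself.

```python
from dataclasses import dataclass


@dataclass
class PaperAnnotation:
    """Flags a curator might record for one candidate paper (hypothetical fields)."""
    chinese_population: bool           # cases come from Chinese or a sub-ethnic group
    heritable_variation: bool          # paper reports inheritable sequence variation
    unlinked_large_scale_snp: bool     # large-scale SNP data with no known phenotype relation
    str_marker_estimate: bool          # haplotype/allele frequency estimated by STR markers
    somatic_or_artificial: bool        # somatic or artificial variation
    sequence_position_validated: bool  # variation has a validated sequence position
    human_subject_data: bool           # not merely cell-line or animal-model results


def passes_selection(p: PaperAnnotation) -> bool:
    """Return True only if the paper meets every inclusion rule and no exclusion rule."""
    return (
        p.chinese_population
        and p.heritable_variation
        and not p.unlinked_large_scale_snp
        and not p.str_marker_estimate
        and not p.somatic_or_artificial
        and p.sequence_position_validated
        and p.human_subject_data
    )
```

A paper is retained only when every inclusion flag is true and every exclusion flag is false, which mirrors reading the list from top to bottom.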
To extract useful information and store it in the respective database fields, we designed a Text Analysis Standard Operating Procedure. First, domain experts inspect the verified papers to extract the required fields: Gene Name, Gene Symbol, DNA Variation, Amino Acid Variation, Phenotype, Ethnic Group, Geographic Location, etc. The Gene Name and Gene Symbol then go through another verification step, which standardizes our notation to the active gene names and gene symbols approved by the HUGO Gene Nomenclature Committee (HGNC). An internal web interface was created to integrate supporting resources such as the Human Gene Nomenclature Database, OMIM, and PubMed, which has facilitated our gene verification process. The DNA Variation and Amino Acid Variation are checked for standard notation according to the Human Genome Variation Society (HGVS); their correctness is checked with EBI's DNA Mutation Checker 2 and by looking up the SwissProt database, respectively. Finally, a phenotype verification step looks up the OMIM database to find proper phenotypic descriptions for each variation. After all of the above verifications are finished, each CGVdb record must still be proofread thoroughly by a domain expert.
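As a rough illustration of what a notation check on the DNA Variation field involves, the sketch below recognizes only simple coding-DNA substitutions written in the HGVS style, such as `c.76A>T`. It is an assumption-laden example: the function name is hypothetical, the pattern covers a very small subset of HGVS, and it does not reproduce what EBI's DNA Mutation Checker 2 does (that tool also compares the variation against the reference sequence).

```python
import re

# Matches only simple coding-DNA substitutions, e.g. "c.76A>T".
# Deletions, insertions, intronic positions and other HGVS forms are not covered.
DNA_SUBSTITUTION = re.compile(r"c\.(\d+)([ACGT])>([ACGT])")


def parse_dna_substitution(text):
    """Return (position, reference_base, variant_base), or None if not recognized."""
    match = DNA_SUBSTITUTION.fullmatch(text.strip())
    if match is None:
        return None
    position, ref, alt = match.groups()
    if ref == alt:
        # A "substitution" that does not change the base is not a variation.
        return None
    return int(position), ref, alt
```

Entries that return `None` would be passed back to a curator for manual correction rather than rejected automatically.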
Duplicate reports about the same variation of the same gene are merged. External value-adding resources currently hyperlinked from CGVdb include LocusLink (Pruitt & Maglott, 2001), OMIM, HGMD (Krawczak & Cooper, 1997), RefSeq (Pruitt & Maglott), GeneLynx (Lenhard et al., 2001), etc.
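Merging duplicate reports amounts to grouping them by gene and variation. A minimal sketch follows, assuming each report is reduced to a `(gene_symbol, dna_variation, pubmed_id)` tuple; this tuple layout and the function name `merge_reports` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def merge_reports(reports):
    """Group reports by (gene symbol, DNA variation) and collect their PubMed IDs.

    `reports` is an iterable of (gene_symbol, dna_variation, pubmed_id) tuples.
    Returns a dict mapping each (gene_symbol, dna_variation) key to a list of
    distinct PubMed IDs in first-seen order.
    """
    merged = defaultdict(list)
    for gene_symbol, dna_variation, pubmed_id in reports:
        # Gene symbols are compared case-insensitively; variations are compared as written.
        key = (gene_symbol.upper(), dna_variation)
        if pubmed_id not in merged[key]:
            merged[key].append(pubmed_id)
    return dict(merged)
```

This only works after the gene symbols and variation notation have been standardized, since two reports of the same variation written in different notations would otherwise fall under different keys.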
Why Chinese Gene Variation Database?
Genetic variation is a natural feature of living organisms, and data on it are collected for a variety of purposes, such as understanding the genetic basis of particular diseases, correlating changes in sequence with changes in function, and studying the origin and dispersal of human populations (Cooper DN, Krawczak M. Human Gene Mutation. BIOS Scientific Publishers, Oxford, 1993). The amount of data on genetic variation is expected to increase once the goals of the Human Genome Project have been achieved. Medical genetics research is advancing actively worldwide, including in Taiwan, Hong Kong, Singapore, and Mainland China, so data on sequence variation in Chinese populations will increase dramatically in the near future. Moreover, because some variation data may be published in Chinese-language literature, research communities elsewhere in the world may be unable to read it, even though those data may be valuable to them. As the need for and the amount of such data grow, there is a need to catalogue the data and make them easily accessible to the public via the Internet. To achieve these goals, we have created this database.