[Bmi776] yeast text corpus

Mark Craven craven at biostat.wisc.edu
Mon Mar 27 10:38:47 CST 2006


Several of you doing text-related projects were interested
in having access to a corpus of text related to yeast genes.
I have put this data in /u/medinfo/bmi776/text-data/ in
the biostat file space.

The directory called "abstracts" contains the abstracts from
more than 40,000 journal articles.  There is one abstract
per file, and the filename is the PubMed identifier for the
corresponding article.  Nearly all of these abstracts mention
one or more yeast genes.

The file "gene-abstracts" represents a mapping from yeast gene
identifiers to abstracts that mention the gene.  This file
consists of lines that list a canonical name for a yeast gene,
appended with a colon, followed by one or more lines listing
PubMed identifiers for abstracts that mention the gene.
For example, the following lines indicate that the yeast gene
"YGR145W" is referenced in the abstracts for the articles
with PubMed identifiers 14690591 and 15590835.

YGR145W:
        14690591
        15590835
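
If it helps, here is a minimal Python sketch for reading this format.
It assumes that every gene line ends with a colon and that every other
non-blank line is a PubMed identifier belonging to the most recent gene,
as in the example above.  The function name parse_gene_abstracts is
just an illustrative choice, not something provided with the data.

```python
from collections import defaultdict


def parse_gene_abstracts(lines):
    """Map each gene name to the list of PubMed IDs listed under it.

    Assumption: gene lines end with ':' and identifier lines follow them.
    """
    mapping = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(":"):
            current = line[:-1]
            mapping[current]  # register the gene even if no IDs follow
        elif current is not None:
            mapping[current].append(line)
    return dict(mapping)


# The example from this message.
example = ["YGR145W:", "        14690591", "        15590835"]
print(parse_gene_abstracts(example))
# prints {'YGR145W': ['14690591', '15590835']}
```

To run it on the real file, pass an open file handle for
/u/medinfo/bmi776/text-data/gene-abstracts instead of the example list.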

The file "gene-aliases" lists various aliases for each yeast gene.
Each line in the file describes a single gene.  It first lists
the canonical name and then other names that are used for the gene.
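
A rough Python sketch for loading the aliases follows.  It assumes the
names on each line are separated by whitespace, which you should check
against the actual file before relying on it.  The function names and
the gene names in the toy input are made up for illustration.

```python
def parse_gene_aliases(lines):
    """Map each canonical gene name to its list of other names.

    Assumption: names on a line are whitespace-separated, canonical first.
    """
    aliases = {}
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        aliases[fields[0]] = fields[1:]
    return aliases


def canonical_lookup(aliases):
    """Build a reverse table from any name (canonical or alias) to the
    canonical name.  If two genes share an alias, the later one wins."""
    lookup = {}
    for canonical, others in aliases.items():
        lookup[canonical] = canonical
        for name in others:
            lookup[name] = canonical
    return lookup


# Made-up input, not taken from the real file.
toy = ["GENE1 ALIAS_A ALIAS_B", "GENE2"]
table = parse_gene_aliases(toy)
print(canonical_lookup(table)["ALIAS_B"])
# prints GENE1
```

The reverse table is handy when you find a gene name in an abstract and
want to map it back to the canonical name used in "gene-abstracts".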

This data set may be rather large for some of your projects.
If so, you can extract a sub-sample of the genes and their associated
abstracts.
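
One possible way to draw such a sub-sample in Python is sketched below.
It assumes you already have a dictionary from gene names to lists of
PubMed identifiers (for example, built from "gene-abstracts").  The
function name subsample, the seed parameter, and the toy data are all
made up for illustration.

```python
import random


def subsample(gene_to_pmids, n_genes, seed=0):
    """Pick n_genes genes at random and collect the abstracts they need.

    Returns the reduced gene -> PubMed ID mapping and the set of all
    PubMed IDs mentioned by the chosen genes.
    """
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    genes = rng.sample(sorted(gene_to_pmids), n_genes)
    subset = {g: gene_to_pmids[g] for g in genes}
    pmids = {p for ids in subset.values() for p in ids}
    return subset, pmids


# Made-up data, not taken from the real corpus.
toy = {"GENE_A": ["1", "2"], "GENE_B": ["2", "3"], "GENE_C": ["4"]}
subset, pmids = subsample(toy, 2)
print(len(subset), sorted(pmids))
```

Since each file in "abstracts" is named by its PubMed identifier, the
returned set of identifiers is also the list of files you would copy.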

Have fun,
Mark






