Table 1 Description of the datasets in the genomic benchmark package. The following information is provided for each dataset: a) Name is the unique identifier of the dataset in the genomic benchmark package; b) # of sequences is the combined count of sequences across all classes; c) # of classes is the number of classes in the dataset; d) Class ratio is the ratio of the number of sequences in the largest class to the number of sequences in the smallest class; e) Median length is computed over all sequences from all classes in the dataset; f) Standard deviation is the standard deviation of sequence length, also computed over all sequences from all classes in the dataset.

From: Genomic benchmarks: a collection of datasets for genomic sequence classification

Name                              # of sequences  # of classes  Class ratio  Median length  Standard deviation
dummy_mouse_enhancers_ensembl               1210             2          1.0           2381               984.4
demo_coding_vs_intergenomic_seqs          100000             2          1.0            200                 0.0
demo_human_or_worm                        100000             2          1.0            200                 0.0
drosophila_enhancers_stark                  6914             2          1.0           2142               285.5
human_enhancers_cohn                       27791             2          1.0            500                 0.0
human_enhancers_ensembl                   154842             2          1.0            269               122.6
human_ensembl_regulatory                  289061             3          1.2            401               184.3
human_nontata_promoters                    36131             2          1.2            251                 0.0
human_ocr_ensembl                         174756             2          1.0            315               108.1
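
The column definitions in the caption can be illustrated with a minimal Python sketch. It does not use the genomic benchmark package itself: the `summarize` function, the dictionary layout (class label mapped to a list of sequences), and the toy sequences are all hypothetical and chosen only for illustration. The caption does not say whether the standard deviation is the population or the sample form, so the population form is assumed here.

```python
from statistics import median, pstdev


def summarize(dataset):
    """Compute Table 1 style summary statistics for one dataset.

    `dataset` is a hypothetical in-memory layout: a dict mapping each
    class label to the list of sequences (strings) in that class.
    Every class is expected to contain at least one sequence.
    """
    class_sizes = [len(seqs) for seqs in dataset.values()]
    lengths = [len(seq) for seqs in dataset.values() for seq in seqs]
    return {
        # b) combined count of sequences across all classes
        "n_sequences": sum(class_sizes),
        # c) number of classes
        "n_classes": len(class_sizes),
        # d) largest class size divided by smallest class size
        "class_ratio": max(class_sizes) / min(class_sizes),
        # e) median sequence length over all classes
        "median_length": median(lengths),
        # f) assumption: population standard deviation of sequence length
        "std_length": pstdev(lengths),
    }


# Toy data, not taken from any dataset in the table.
toy = {
    "positive": ["ACGT", "ACGTAC", "AC"],
    "negative": ["ACGTACGT", "ACGT"],
}
stats = summarize(toy)
print(stats)
```

Under these definitions, a standard deviation of 0.0 in the table (for example `human_enhancers_cohn`) means that every sequence in that dataset has the same length, equal to the reported median.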