Genomic benchmarks: a collection of datasets for genomic sequence classification

BMC Genomic Data

Table 1 Description of the datasets in the genomic benchmark package. The following information is provided for each dataset: a) Name is the unique identifier of the dataset in the genomic benchmark package; b) # of sequences is the combined count of sequences across all classes; c) # of classes is the number of classes in the dataset; d) Class ratio is the ratio of the number of sequences in the largest class to the number of sequences in the smallest class; e) Median length is the median sequence length, computed over all sequences from all classes in the dataset; f) Standard deviation is the standard deviation of sequence length, also computed over all sequences from all classes in the dataset

Name	# of sequences	# of classes	Class ratio	Median length	Standard deviation
dummy_mouse_enhancers_ensembl	1210	2	1.0	2381	984.4
demo_coding_vs_intergenomic_seqs	100000	2	1.0	200	0.0
demo_human_or_worm	100000	2	1.0	200	0.0
drosophila_enhancers_stark	6914	2	1.0	2142	285.5
human_enhancers_cohn	27791	2	1.0	500	0.0
human_enhancers_ensembl	154842	2	1.0	269	122.6
human_ensembl_regulatory	289061	3	1.2	401	184.3
human_nontata_promoters	36131	2	1.2	251	0.0
human_ocr_ensembl	174756	2	1.0	315	108.1
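
The columns of Table 1 can be reproduced from raw sequences with a few lines of Python. The following is a minimal sketch, not the package's own code: the function name `describe_dataset`, the input layout (a dictionary mapping each class label to a list of sequence strings), and the toy data are hypothetical. It also assumes the population standard deviation and rounding to one decimal place; the source does not state which estimator was used.

```python
from statistics import median, pstdev


def describe_dataset(sequences_by_class):
    """Summarize a dataset given as {class_label: [sequence, ...]}.

    Hypothetical helper for illustration; assumes every class is non-empty.
    """
    # Number of sequences in each class.
    class_sizes = [len(seqs) for seqs in sequences_by_class.values()]
    # Lengths of all sequences, pooled across all classes.
    lengths = [len(seq) for seqs in sequences_by_class.values() for seq in seqs]
    return {
        "n_sequences": sum(class_sizes),
        "n_classes": len(class_sizes),
        # Largest class size divided by smallest class size.
        "class_ratio": round(max(class_sizes) / min(class_sizes), 1),
        "median_length": median(lengths),
        # Assumption: population standard deviation of sequence length.
        "std_length": round(pstdev(lengths), 1),
    }


# Toy data, invented for this example only.
toy = {
    "positive": ["ACGT", "ACGTAC", "AC"],
    "negative": ["ACGT", "ACGT"],
}
summary = describe_dataset(toy)
print(summary)
```

Under this definition, a dataset in which every sequence has the same length yields a standard deviation of 0.0, which matches the fixed-length datasets in the table.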

ISSN: 2730-6844