Usage examples¶
Configuration file¶
First you need to define your datasets
[cadd]
type = tabix
file = cadd_score.tsv.gz
header = CHR, POS, REF, ALT, RAW, PHRED
ctypes = str, int, str, str, float, float
sequence= CHR
begin = POS
end = POS
[exons]
type = tabix
file = cds_exons.tsv.gz
header = CHR, START, STOP, GENE
ctypes = str, int, int, str
sequence = CHR
begin = START
end = STOP
indices = GENE,
Group by¶
You can only use indexes fields as a groupby key. The indexed fields are the ones defined in the indices section of a dataset definition.
gd = Gendas('data/gendas.conf')
# Total genes
genes_size = len(set(gd['exons']['GENE']))
# Group by gene column and compute the mean, maximum and minimum score
bkg_cadd_mean_by_gene = gd.groupby(gd['exons']['GENE']).aggregate({
'MEAN': lambda gd: mean(gd['cadd']['PHRED']),
'MAX': lambda gd: max(gd['cadd']['PHRED']),
'MIN': lambda gd: min(gd['cadd']['PHRED'])
})