Snakemake study notes 007 ~ submitting jobs to a SLURM-managed cluster
Main reference:
https://eriqande.github.io/eca-bioinf-handbook/snakemake-chap.html
This note covers filtering raw sequencing data with fastp.
Contents of the Snakemake file:
input_folder = "/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/samples/"
output_folder = "/home/myan/scratch/private/practice_data/RNAseq/20220511/"

SRR, FRR = glob_wildcards(input_folder + "{srr}_chrX_{frr}.fastq.gz")

rule all:
    input:
        expand(output_folder + "outputfastq/{srr}_chrX_{frr}.fastq", srr=SRR, frr=FRR)

rule first:
    input:
        read01 = input_folder + "{srr}_chrX_1.fastq.gz",
        read02 = input_folder + "{srr}_chrX_2.fastq.gz"
    output:
        read01 = output_folder + "outputfastq/{srr}_chrX_1.fastq",
        read02 = output_folder + "outputfastq/{srr}_chrX_2.fastq",
        json = output_folder + "fastpreport/{srr}.json",
        html = output_folder + "fastpreport/{srr}.html"
    threads:
        8
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} \
            --thread {threads} --html {output.html} --json {output.json}
        """
The command to run it:
snakemake --cluster 'sbatch --cpus-per-task={threads}' --jobs 12 -s snakemake_hpc.py
It finished in no time.
I then tried a longer command:
snakemake --cluster 'sbatch --cpus-per-task={threads} -o slurm_outputs/{rule}_{wildcards}_%j.out -e logs_errors/{rule}/{rule}_{wildcards}_%j.err --mail-type=ALL --mail-user=mingyan24@126.com' --jobs 4 -s snakemake_hpc.py
This command never worked for me.
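I have not tracked down the cause, but my guess is the log paths: sbatch does not create the directories passed to -o and -e, so if slurm_outputs/ and logs_errors/<rulename>/ do not already exist, the submitted jobs fail right away (or their output is silently lost). Creating one subdirectory per submitted rule before calling snakemake should, I think, make the longer command usable; for this Snakefile that would be roughly:

mkdir -p slurm_outputs logs_errors/first logs_errors/all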
The command below does work; it adds email notification:
snakemake --cluster 'sbatch --cpus-per-task={threads} --mail-type=ALL --mail-user=mingyan24@126.com' --jobs 4 -s snakemake_hpc.py
I did not hit any out-of-memory problems with this test data, but I did once I ran the workflow on my real data (see below).

[Directory tree figure: the raw data under 00.raw.fastq/ are organized as one subfolder per experiment, each holding that experiment's paired FASTQ files]

My files are stored in the hierarchy shown above: one folder per experiment, each holding that experiment's paired FASTQ files. With the wildcard approach from the earlier note, expand() takes the Cartesian product of the two wildcard lists, so it would also ask for combinations that do not exist on disk, such as
PRJNA001/SRR0002_1.fastq.gz
The question here is how to control which combinations expand() generates.
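The trick used in the Snakefile below is the zip mode of expand(): passing zip as the second argument pairs the wildcard lists element by element instead of taking every combination. A tiny illustration (the accession names here are made up):

# default: Cartesian product, 2 x 2 = 4 targets, including pairs that do not exist on disk
expand("{exper}/{srr}.html", exper=["PRJNA001", "PRJNA002"], srr=["SRR0001", "SRR0002"])
# zip: element-wise pairing, only 2 targets
expand("{exper}/{srr}.html", zip, exper=["PRJNA001", "PRJNA002"], srr=["SRR0001", "SRR0002"])
# -> ['PRJNA001/SRR0001.html', 'PRJNA002/SRR0002.html']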
The workflow still does the same thing: filtering raw sequencing data with fastp.
import os
import glob

raw_fastq_folder = "/mnt/sdc/xiaoming/MingYan/snakemake_20220513/00.raw.fastq/"
output_folder = "/mnt/sdc/xiaoming/MingYan/snakemake_20220513/"

fq_list = {}
print(os.listdir(raw_fastq_folder))
experiment = os.listdir(raw_fastq_folder)
for i in experiment:
    fq_list[i] = [fq.split("_")[0] for fq in os.listdir(os.path.join(raw_fastq_folder, i))]
print(fq_list)

inputs = [(dir, file) for dir, files in fq_list.items() for file in files]
# glob_wildcards(raw_fastq_folder + "{exper}/{srr}_{frr}.fastq")

rule all:
    input:
        expand(output_folder + "01.fastp.report/" + "{exper}/{srr}.html", zip,
               exper=[row[0] for row in inputs], srr=[row[1] for row in inputs])

rule firstrule:
    input:
        read01 = raw_fastq_folder + "{exper}/{srr}_1.fastq.gz",
        read02 = raw_fastq_folder + "{exper}/{srr}_2.fastq.gz"
    output:
        read01 = output_folder + "01.fastp.filter/" + "{exper}/{srr}_clean_1.fastq.gz",
        read02 = output_folder + "01.fastp.filter/" + "{exper}/{srr}_clean_2.fastq.gz",
        html = output_folder + "01.fastp.report/" + "{exper}/{srr}.html",
        json = output_folder + "01.fastp.report/" + "{exper}/{srr}.json"
    threads:
        2
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} --json {output.json} --html {output.html} -w {threads}
        """
The code above for pairing each folder with its files is still a bit long-winded; I do not know whether there is a simpler way (a possible shortcut is sketched below).
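One shorter route, I think, is to let glob_wildcards() match the two-level pattern directly: it returns the values of each wildcard as parallel lists in matching order, so they can be fed straight into the zip form of expand(). A sketch under that assumption (each sample is matched twice, once for _1 and once for _2, but the duplicated targets are harmless):

EXPER, SRR, FRR = glob_wildcards(raw_fastq_folder + "{exper}/{srr}_{frr}.fastq.gz")

rule all:
    input:
        expand(output_folder + "01.fastp.report/{exper}/{srr}.html", zip, exper=EXPER, srr=SRR)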
Some of the solutions I have seen also use lambda functions, so I need to take a closer look at how lambdas are used in this context.
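For reference, the lambda-based solutions rely on Snakemake input functions: an input item can be a callable that takes the wildcards object and returns a path. A minimal sketch, equivalent to the literal patterns used above:

rule firstrule:
    input:
        read01 = lambda wildcards: os.path.join(raw_fastq_folder, wildcards.exper, wildcards.srr + "_1.fastq.gz"),
        read02 = lambda wildcards: os.path.join(raw_fastq_folder, wildcards.exper, wildcards.srr + "_2.fastq.gz")
    # output / threads / shell unchanged from above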
After switching to my real dataset I started running out of memory, so resources has to be declared in the Snakefile.
How much memory does a job get by default? I still need to read the Snakemake documentation more carefully.
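My current understanding is that Snakemake itself does not request any memory: whatever ends up after --mem is just the value of the resource declared in the rule, and a rule without it falls back to the partition's default allocation (or simply breaks the --cluster string substitution). Snakemake's --default-resources option can supply a fallback value (conventionally named mem_mb) for rules that do not declare one; if both the rules and the --cluster string use that name, the call would look roughly like this:

snakemake --default-resources "mem_mb=4000" \
    --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem_mb}' \
    --jobs 8 -s pomeRTD_snakemake_v01.py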
The Snakefile for my real data:
import os

raw_fastq_folder = "/mnt/shared/scratch/myan/private/pomeRTD/00.raw.fastq/"
output_folder = "/home/myan/scratch/private/pomeRTD/"

# Folder, SRR, FRR = glob_wildcards(raw_fastq_folder + "{folder}/{srr}_{frr}.fq.gz")
# print(Folder)
# experiment = os.listdir(raw_fastq_folder)

list_fastq = {}
for experiment in os.listdir(raw_fastq_folder):
    list_fastq[experiment] = [x.split("_")[0] for x in os.listdir(raw_fastq_folder + experiment)]
print(list_fastq)

inputs = [(dir, file) for dir, files in list_fastq.items() for file in files]
# glob_wildcards(raw_fastq_folder + "{exper}/{srr}_{frr}.fastq")

rule all:
    input:
        expand(output_folder + "01.fastp.report/" + "{exper}/{srr}.html", zip,
               exper=[row[0] for row in inputs], srr=[row[1] for row in inputs])

rule runfastp:
    input:
        read01 = os.path.join(raw_fastq_folder, "{exper}", "{srr}_1.fq.gz"),
        read02 = os.path.join(raw_fastq_folder, "{exper}", "{srr}_2.fq.gz")
    output:
        read01 = output_folder + "01.fastp.filtered.reads/{exper}/{srr}_clean_1.fq.gz",
        read02 = output_folder + "01.fastp.filtered.reads/{exper}/{srr}_clean_2.fq.gz",
        html = output_folder + "01.fastp.report/{exper}/{srr}.html",
        json = output_folder + "01.fastp.report/{exper}/{srr}.json"
    threads:
        8
    resources:
        mem = 8000
    params:
        "-q 20 --cut_front --cut_tail -l 30"
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} \
            -w {threads} -h {output.html} -j {output.json} {params}
        """
The 8000 here is in MB; I do not know yet how to write the value in GB.
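Since the value of resources.mem is pasted verbatim after --mem, the simplest route is to keep it in MB and do the conversion yourself (sbatch treats a bare number as megabytes, so 8 GB is 8192). I believe newer Snakemake versions also accept string-valued resources such as "8G", which sbatch understands thanks to the K/M/G/T suffixes, but I have not checked which version introduced that. Sticking with integers:

    resources:
        mem = 8192   # 8 GB written as MB; becomes sbatch --mem=8192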
The command used to run this Snakefile:
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py
Run this way, a large number of SLURM submission log files (slurm-<jobid>.out) pile up in the current directory. How can they be written to a dedicated folder instead?
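Those slurm-<jobid>.out files are simply sbatch's default for -o/-e. What should work (and is what the failed attempt earlier was aiming at) is to create a log directory beforehand and point -o and -e into it; %x is the job name and %j the job id:

mkdir -p slurm_logs
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} -o slurm_logs/%x_%j.out -e slurm_logs/%x_%j.err --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py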

One more question: on a SLURM-managed HPC you normally submit work with sbatch scripts.sh. Can the command
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py
itself be written into a .sh file and submitted with sbatch? Worth trying.
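This should work on most clusters, provided the compute nodes are allowed to call sbatch themselves: the snakemake process then runs as a long-lived "master" job and submits the per-rule jobs from inside it. A rough sketch of such a wrapper script (job name and time limit are placeholders to adapt to your cluster):

#!/bin/bash
#SBATCH --job-name=snakemake_master
#SBATCH --cpus-per-task=1
#SBATCH --time=2-00:00:00

snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' \
    --jobs 8 -s pomeRTD_snakemake_v01.py

Saving this as, say, run_snakemake.sh and submitting it with sbatch run_snakemake.sh would then drive the whole workflow.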
If I am not on a compute cluster, is there still a way to set the number of jobs?
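On a single machine (no --cluster), the same -j/--cores flag does this: it caps the total number of CPU cores Snakemake may use at once, and rules run in parallel locally within that budget, for example:

snakemake --cores 16 -s snakemake_hpc.py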
There are still a lot of basics I need to go through.