Data Submission Instructions
This page describes the process and documentation required to submit data to NIAGADS. Depending on the data size, a member of the NIAGADS team will work with you on data transfer. Contact niagads@pennmedicine.upenn.edu to deposit data or if you have any questions.
Required Policy Documents
Please email the following required documents to niagads@pennmedicine.upenn.edu in order to deposit and share your data:
- Institutional Certification for ADRD Studies that covers all subjects in your study. Multiple certifications may be required.
- Signed copy of the NIA AD Genomics Sharing Plan.
- Data Registration Template
NOTE: All documents related to the application should be provided in English. For institutions where English is not the primary language, please provide translations of documents along with the original document. Translated documents should be signed by the institutional signing official.
Data Submission Checklist
Genotype data
- Phenotype Data File in tab delimited format (including pedigree structures if applicable) and a data dictionary
- README file (see below for suggested file contents)
- APOE Genotypes (if applicable)
- Genotypes in PLINK or VCF file format (preferred)
- Consent level as specified in the Institutional Certification form for each subject
- List of cohorts included and a description for each
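The checklist asks for a tab-delimited phenotype file accompanied by a data dictionary. As a minimal sketch, the following checks that every phenotype column has a dictionary entry before submission. It assumes the data dictionary is itself a tab-delimited file whose `variable` column holds the column names; that layout and the function names are illustrative assumptions, not a NIAGADS requirement.

```python
import csv


def read_header(path):
    """Return the column names from the first line of a tab-delimited file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"))


def undocumented_columns(phenotype_path, dictionary_path, name_column="variable"):
    """List phenotype columns that have no entry in the data dictionary.

    name_column is the (assumed) dictionary column holding variable names.
    """
    with open(dictionary_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        documented = {row[name_column] for row in reader}
    return [col for col in read_header(phenotype_path) if col not in documented]
```

An empty list from `undocumented_columns("phenotypes.tsv", "data_dictionary.tsv")` means every phenotype column is described in the dictionary.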
Summary statistics / Association results
- Results files in .txt format
- README (see below for suggested file contents)
Whole genome/exome or long-read sequencing
- Sequencing read data can be submitted in any of the following formats:
- FASTQ: please save all reads, including those that could not be mapped to the reference genome.
- BAM: please save all reads, including those that could not be mapped to the reference genome.
- CRAM: please save all reads, including those that could not be mapped to the reference genome.
- VCF: standard VCF 4.2 format (we recommend splitting the files by chromosome and compressing them with gzip)
- Provide any relevant sequencing information, including the following:
- Sequencing Center
- Sequencer Machine
- Read Length
- PCR Free or PCR Amplified?
- Kit Name/Version
- Copy of the WES target regions if applicable
- Sequencing quality control metrics
- Phenotype Data File in tab delimited format (including pedigree structures if applicable) and a data dictionary
- APOE Genotypes (if applicable)
- Genotypes in PLINK or VCF file format (preferred)
- Consent level as specified in the Institutional Certification form for each subject
- List of cohorts included and a description for each
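The recommendation above to split VCF files by chromosome and gzip them can be sketched with the Python standard library alone. This is an illustrative sketch, not a NIAGADS tool: the function name and the output naming scheme (`<prefix>.<chrom>.vcf.gz`) are assumptions. It writes plain gzip; if indexed access (for example with tabix) is needed, bgzip from htslib is the usual choice and is outside this sketch.

```python
import gzip


def split_vcf_by_chromosome(vcf_path, out_prefix):
    """Write one gzip-compressed VCF per chromosome and return {chrom: path}.

    Every output file starts with a copy of the full header of the input.
    """
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    header = []
    handles = {}
    paths = {}
    try:
        with opener(vcf_path, "rt") as source:
            for line in source:
                if line.startswith("#"):
                    # Meta-information and column header lines
                    header.append(line)
                    continue
                chrom = line.split("\t", 1)[0]
                if chrom not in handles:
                    paths[chrom] = f"{out_prefix}.{chrom}.vcf.gz"
                    handles[chrom] = gzip.open(paths[chrom], "wt")
                    handles[chrom].writelines(header)
                handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return paths
```

For example, `split_vcf_by_chromosome("cohort.vcf.gz", "cohort")` would produce files such as `cohort.chr1.vcf.gz`, one per chromosome name found in the input.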
RNA-seq or microarray data
- Sequencing Read Data
- Required Information:
- Sequencing read data can be submitted in either of the following formats: FASTQ or BAM. The BAM file should contain all reads, including those that could not be mapped to the reference genome.
- Phenotype Data File in tab delimited format (including pedigree structures if applicable) and a data dictionary
- README:
- Sample source (e.g., cell line, tissue, cell types) and organism; provide protocol details if iPSCs/Single cells
- RNA extraction protocol (e.g. Trizol/chloroform extraction, Qiagen RNeasy kit)
- RNA integrity (RIN number) per sample
- Library preparation protocol (e.g., polyA capture, adapters used for ligation, read length and sequencing machine, single cell platform)
- Contributor contact information
- Dataset Reference Genome Build
- Consent level as specified in the Institutional Certification form for each subject
- List of cohorts included and a description for each
- Optional Information:
- QC report per sample (e.g., library characteristics (total number of reads, sequencing read length), GC content, % of rRNAs, % of aligned reads, coverage, insert size)
- Summary Data
- Required Information:
- Read abundance files can be submitted as summaries in tab-separated file format with explanations.
- Phenotype Data File in tab delimited format (including pedigree structures if applicable) and a data dictionary
- README:
- Sample source and organism; provide protocol details if iPSCs
- How the raw data was generated and processed (steps needed, e.g., how mapping was done, how multi-mapping was handled)
- Raw data and library preparation protocol information (e.g., polyA capture, sequencing machine)
- Unit of quantification in these summary files (e.g., genes, exons, etc.)
- Annotation source and version (e.g., ENSEMBL version 94)
- Unit of counts (e.g., raw counts, RPKM values, UMI counts). Please provide details if normalization was performed or if technical variation / batch effects were accounted for.
- Software name and version used to generate those counts.
- Contributor contact information
- Dataset Reference Genome Build
- Optional Information:
- Any publication that describes the data or findings from the data
- QC report per sample (e.g., library information (total number of reads, sequencing read length), GC content, % of rRNAs, % of aligned reads, coverage, insert size)
- We highly recommend sharing the workflow via a code repository (e.g., GitHub, Bitbucket).
Epigenetics studies (e.g., ChIP-seq, ATAC-seq)
- Sequencing Read Data
- Required Information:
- Sequencing read data can be submitted in either of the following formats: FASTQ or BAM. Save all reads, including those that could not be mapped to the reference genome. Background samples (input or mock IP samples) must also be included.
- Phenotype Data File in tab delimited format (including pedigree structures if applicable) and a data dictionary
- README:
- Sample source and organism; provide protocol details if iPSCs
- Library preparation protocol (e.g., adapters used for ligation, read length and sequencing machine)
- Contributor contact information
- Dataset Reference Genome Build
- Consent level as specified in the Institutional Certification form for each subject
- List of cohorts included and a description for each
- Optional Information:
- QC report per sample (e.g., library size (total number of reads), GC content, % of aligned reads, coverage, insert size)
- Summary Data
- Required Information:
- Processed peak files can be submitted in BED format with explanations (including significance of called peaks).
- Phenotype Data File in tab delimited format (including pedigree structures if applicable) and a data dictionary
- README:
- Sample source and organism; provide protocol details if iPSCs
- Description of all the BED columns
- Software name and version used to generate those values, with processing details (e.g., how reads were filtered before calling peaks, whether narrow or broad peaks were called, how the p-value was corrected, if at all).
- Contributor contact information
- Dataset Reference Genome Build
- Optional Information:
- Any publication that describes the data or findings from the data
- QC report per sample (e.g., library size (total number of reads), GC content, % of aligned reads, coverage, insert size)
- We highly recommend sharing the workflow via a code repository (e.g., GitHub, Bitbucket).
Quantitative trait locus (QTL) analysis summary stats
- Required Information:
- Variant position: chr, start, end
- Allele information: ref, alt, a1, a2
- Feature name (e.g. gene name, protein name)
- P-value and/or Q-value
- Effect size (Beta and Beta SE), or Spearman correlation p value
- Optional Information:
- Allele frequency or allele count
- Feature location: chr, start, end
- Cis/trans
- README:
- Detailed sample source, molecular trait and organism; provide protocol details if iPSCs
- Description of all the columns
- Software name and version used to perform the analyses
- Contributor contact information
- Dataset Reference Genome Build and annotation resource information (e.g., Ensembl version, dbSNP version)
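The required QTL fields above lend themselves to a quick header check before submission. The sketch below assumes a tab-delimited summary statistics file with a header row. The column names (`chr`, `feature`, `pvalue`, `beta_se`, `spearman_pvalue`, and so on) are hypothetical stand-ins for whatever names your files use; NIAGADS does not prescribe them here.

```python
import csv

# Hypothetical column names for the required fields listed above
REQUIRED = ["chr", "start", "end", "ref", "alt", "a1", "a2", "feature"]

# For each group, at least one of the listed column sets must be present
ALTERNATIVES = [
    [{"pvalue"}, {"qvalue"}],                    # P-value and/or Q-value
    [{"beta", "beta_se"}, {"spearman_pvalue"}],  # effect size or Spearman p value
]


def missing_qtl_fields(path):
    """Return a list describing required fields absent from the file header."""
    with open(path, newline="") as handle:
        header = set(next(csv.reader(handle, delimiter="\t")))
    problems = [name for name in REQUIRED if name not in header]
    for group in ALTERNATIVES:
        if not any(option <= header for option in group):
            problems.append(" or ".join("+".join(sorted(o)) for o in group))
    return problems
```

An empty list means the header covers every required field under these naming assumptions; the README should still describe each column.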
For RNA-seq or microarray data (including single-cell data)
- Sequencing Read Data
- Essential information:
- Sequencing read data can be submitted in either of the following formats: FASTQ or BAM. If submitting BAM, save all reads, including those that could not be mapped to the reference genome.
- Phenotype Data File in tab-delimited format (including pedigree structures if applicable) and a data dictionary.
- README file containing:
- Sample source (e.g., cell line, tissue, cell types) and organism; provide protocol details if iPSCs/Single cells
- RNA extraction protocol (e.g. Trizol/chloroform extraction, Qiagen RNeasy kit)
- RNA integrity (RIN number) per sample
- Library preparation protocol (e.g., polyA capture, adapters used for ligation, read length and sequencing machine, single cell platform)
- List of included files, formats and any necessary explanations
- Dataset Reference Genome Build
- Contributor contact information
- Optional Information:
- If submitting BAM files, how the data was processed (e.g., how mapping was done, how multi-mapping was handled)
- QC report per sample (e.g., library characteristics (total number of reads, sequencing read length), GC content, % of rRNAs, % of aligned reads, coverage, insert size)
- Any publication(s) describing the data or findings from the data
- Highly recommended to send the workflow via some code repository (e.g., GitHub, Bitbucket).
- Summary/Processed data
- Essential information:
- Read abundance files can be submitted as summaries in tab-separated file format with explanations.
- Phenotype Data File in tab delimited format (including pedigree structures if applicable) and a data dictionary
- README file:
- Sample source (e.g., cell line, tissue, cell types) and organism; provide protocol details if iPSCs/Single cells
- How the raw data was generated and processed (steps needed, e.g., how mapping was done, how multi-mapping was handled)
- Raw data and library preparation protocol information (e.g., polyA capture, sequencing machine, single cell platform)
- Unit of quantification in these summary files (e.g., genes, exons, etc.)
- Annotation source and version (e.g., ENSEMBL version 94)
- Unit of counts (e.g., raw counts, RPKM values, UMI counts). Provide details if normalization was performed or if technical variation / batch effects were accounted for.
- Software name and version used to generate those counts.
- List of included files, formats and any necessary explanations
- Dataset Reference Genome Build
- Contributor contact information
- Optional Information:
- Any publication that describes the data or findings from the data
- QC report per sample (e.g., library information (total number of reads, sequencing read length), GC content, % of rRNAs, % of aligned reads, coverage, insert size)
- Highly recommended to send the workflow via some code repository (e.g., GitHub, Bitbucket).
For epigenetics studies (e.g., ChIP-seq, ATAC-seq) (including single-cell data)
- Sequencing Read Data
- Essential information:
- Sequencing read data can be submitted in either of the following formats: FASTQ or BAM. If submitting BAM, save all reads, including those that could not be mapped to the reference genome. Background samples (input or mock IP samples) must also be included.
- Phenotype Data File in TSV (tab delimited) format (including pedigree structures if applicable) and a data dictionary
- README file:
- Sample source (e.g., cell line, tissue, cell types) and organism; provide protocol details if iPSCs/Single cells
- Library preparation protocol (e.g., adapters used for ligation, read length and sequencing machine)
- Platform or array information
- List of included files, formats and any necessary explanations
- Dataset Reference Genome Build
- Contributor contact information
- Optional Information:
- QC report per sample (e.g., library size (total number of reads), GC content, % of uniquely aligned reads, coverage, insert size)
- Any publication(s) describing the data or findings from the data
- Highly recommended to send the workflow via some code repository (e.g., GitHub, Bitbucket).
- Summary/Processed data
- Essential information:
- Processed peak call files can be submitted in BED format with explanations (including significance of called peaks).
- For ATAC-seq or similar protocols, fragment files (in BED format) can be submitted
- Phenotype Data File in TSV (tab delimited) format (including pedigree structures if applicable) and a data dictionary
- README file:
- Sample source (e.g., cell line, tissue, cell types) and organism; provide protocol details if iPSCs/Single cells
- Description of all the BED columns
- Software name and version used to generate those values, with processing details (e.g., how reads were filtered before calling peaks, whether narrow or broad peaks were called, how the p-value was corrected).
- List of included files, formats and any necessary explanations
- Contributor contact information
- Dataset Reference Genome Build
- Optional Information:
- Any publication(s) describing the data or findings from the data
- QC report per sample (e.g., library size (total number of reads), GC content, % of uniquely aligned reads, coverage, insert size)
- Highly recommended to send the workflow via some code repository (e.g., GitHub, Bitbucket).
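Peak call and fragment files are requested in BED format, so a basic structural check can catch problems before upload. The sketch below relies only on the general BED convention that the first three tab-separated columns are chromosome, start, and end, with a non-negative start that does not exceed the end. The function name and the report limit are illustrative assumptions, and the check says nothing about the extra columns your README describes.

```python
def bed_problems(path, max_report=10):
    """Return up to max_report (line_number, message) pairs for malformed lines."""
    problems = []
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            # Skip blank lines, comments, and track/browser header lines
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                problems.append((number, "fewer than 3 columns"))
            else:
                try:
                    start, end = int(fields[1]), int(fields[2])
                except ValueError:
                    problems.append((number, "start/end not integers"))
                else:
                    if start < 0 or end < start:
                        problems.append((number, "invalid interval"))
            if len(problems) >= max_report:
                break
    return problems
```

An empty list from `bed_problems("peaks.bed")` means no structural problems were found in the first three columns.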
For Methylation data (e.g., methylation array, bisulfite sequencing)
- Sequencing Read Data / Raw Methylation Data
- Essential information:
- Sequencing read data can be submitted in either of the following formats: FASTQ or BAM. If submitting BAM, save all reads, including those that could not be mapped to the reference genome.
- Phenotype Data File in TSV (tab delimited) format (including pedigree structures if applicable) and a data dictionary
- README file:
- Sample source (e.g., cell line, tissue, cell types) and organism; provide protocol details if iPSCs/Single cells
- Library preparation protocol (e.g., adapters used for ligation, read length and sequencing machine)
- Platform or array information
- List of included files, formats and any necessary explanations
- Dataset Reference Genome Build
- Contributor contact information
- Optional Information:
- QC report per sample (e.g., library size (total number of reads), % of uniquely aligned reads, coverage)
- Any publication(s) describing the data or findings from the data
- Highly recommended to send the workflow via some code repository (e.g., GitHub, Bitbucket).
- Summary/Processed data
- Essential information:
- Processed methylation sites / peak call files can be submitted in BED format with explanations (including significance of called peaks).
- Phenotype Data File in TSV (tab delimited) format (including pedigree structures if applicable) and a data dictionary
- README file:
- Sample source (e.g., cell line, tissue, cell types) and organism; provide protocol details if iPSCs/Single cells
- Description of all the BED columns
- Software name and version used to generate those values, with processing details (e.g., how reads were filtered before calling peaks, whether narrow or broad peaks were called, how the p-value was corrected).
- List of included files, formats and any necessary explanations
- Contributor contact information
- Dataset Reference Genome Build
- Optional Information:
- Any publication(s) describing the data or findings from the data
- QC report per sample (e.g., library size (total number of reads), GC content, % of uniquely aligned reads, coverage, insert size)
- Highly recommended to send the workflow via some code repository (e.g., GitHub, Bitbucket).
For Proteomics data
- Mass Spec Related
- Essential information:
- Files in one of the standard mass spectrometer output file formats (e.g., mzML, mzXML)
- A matrix of samples against peptide/protein information in txt format.
- Phenotype Data File in TSV (tab delimited) format (including pedigree structures if applicable) and a data dictionary
- README file:
- Sample source (e.g., cell line, tissue, cell types) and organism; provide protocol details if iPSCs/Single cells
- Quantification method (e.g. Label-free: intensity, TMT quantitation analysis)
- Digestion Method (e.g. In-solution digestion, on-bead digestion)
- Online LC system (e.g. Agilent 1100- nano LC system, Agilent HPLC 1200 system, Dionex UltiMate 3000)
- Mass Spectrometer (e.g. LTQ Orbitrap, LTQ Orbitrap Velos, Q Exactive HF)
- Protease (e.g. Trypsin)
- Fragmentation method (e.g., CID resonance-type, CID beam-type, high-energy collision-induced dissociation)
- Peptide identification and annotation; protein annotation information
- QC/normalization details and steps involved (including outlier detection)
- List of included files, formats and any necessary explanations
- Dataset Reference Genome Build
- Contributor contact information
- Optional Information:
- QC report per sample
- Protein Array Data
- Essential information:
- Read abundance files can be submitted as summaries in tab-separated file format with explanations.
- Provide the UniProt ID and name of each target protein measured.
- For SOMAscan, provide SOMAscan RFU values (both raw and processed are recommended)
- For Olink, provide NPX values (both raw and processed are recommended)
- Phenotype Data File in tab delimited format (including pedigree structures if applicable) and a data dictionary
- README file:
- Sample source (e.g., cell line, tissue, cell types) and organism; provide protocol details if iPSCs/Single cells
- How the RAW data was generated (protein array platform, chip version)
- Unit of quantification in these summary files (e.g., proteins)
- Annotation source and version (e.g., UniProt version xx)
- Unit of counts (e.g., raw counts, RPKM values, UMI counts). Please provide details if normalization was performed or if technical variation / batch effects were accounted for.
- Software name and version used to generate those counts.
- Contributor contact information
- Dataset Reference Genome Build
- Optional Information:
- Any publication that describes the data or findings from the data
- QC report per sample
- We highly recommend sharing the workflow via a code repository (e.g., GitHub, Bitbucket).
NOTE: Please provide an MD5 checksum for every submitted data file so that the integrity of each transferred file can be verified.
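One way to produce these checksums is with Python's standard `hashlib` module. The sketch below computes an MD5 digest per file and writes a manifest in the `<digest>  <path>` layout that the common `md5sum` utility also uses. The manifest file name `md5sums.txt` and the function names are assumptions for illustration; confirm the preferred delivery format with NIAGADS.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir, manifest_name="md5sums.txt"):
    """Write '<digest>  <relative path>' lines for every file under data_dir."""
    data_dir = Path(data_dir)
    manifest = data_dir / manifest_name
    lines = []
    for path in sorted(data_dir.rglob("*")):
        # Skip directories and the manifest itself
        if path.is_file() and path != manifest:
            relative = path.relative_to(data_dir).as_posix()
            lines.append(f"{md5_of_file(path)}  {relative}")
    manifest.write_text("\n".join(lines) + "\n")
    return manifest
```

Reading in chunks keeps memory use flat for large BAM or FASTQ files. Calling `write_manifest("submission/")` would write `submission/md5sums.txt` covering every file in that directory tree.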
README Description
A README should include the following information (please use plain text (.txt), PDF (.pdf), or Microsoft Word (.doc or .docx)):
- Description of the dataset and concise description of the study design
- Platform or array
- Any version information
- List of included files and formats
- Contributor contact information
- Dataset Reference Genome Build
- Publications
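To make sure no item in the list above is forgotten, a plain-text README skeleton can be generated and then filled in by hand. This is only a convenience sketch: the section wording is taken from the list above, while the function name, the layout, and the `TODO` placeholders are assumptions.

```python
README_SECTIONS = [
    "Description of the dataset and study design",
    "Platform or array",
    "Version information",
    "List of included files and formats",
    "Contributor contact information",
    "Dataset Reference Genome Build",
    "Publications",
]


def readme_skeleton(title, sections=README_SECTIONS):
    """Return plain-text README content with one placeholder per section."""
    lines = [title, "=" * len(title), ""]
    for section in sections:
        lines += [section + ":", "  TODO", ""]
    return "\n".join(lines)
```

The returned string can be written to a `.txt` file, for example with `open("README.txt", "w").write(readme_skeleton("My dataset"))`, and each `TODO` replaced with the dataset's details.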