PSSD Genome Project [Megathread]
- Posts: 283
- Joined: Tue Oct 03, 2017 12:04 pm
Re: PSSD Genome Project [Megathread]
it looks like you are looking at single SNP-level data rather than gene-level. If you are familiar with R, you can get the minor allele frequencies for all of the SNPs in the 1000 Genomes project with the package "MafDb.1Kgenomes.phase3.hs37d5".
Re: PSSD Genome Project [Megathread]
Good advice. I think they have good resources.
R scares me but I have taken a class in it before.
The 1000 Genomes data looks to be the best option; I just need to figure out how to clean it into the format I want.
- Medical Student & Friendly poltergeist - Lexapro Sept '14. [Hx] [PSSD Lab] [r/PSSD] [Treatment Plan] - Add "Ghost" in replies so I see it 

Re: PSSD Genome Project [Megathread]
I can send you a script to fetch that data and write it to text, so you can import it into Python. When you say "clean the data to be what you want", can you be more specific?
Re: PSSD Genome Project [Megathread]
That would be great. So far I've been messing with scikit-allel.
It works, in that it prints out the dataframe shown in the code/output below. However, I can't get it all into a text file - it keeps getting cut short.
Any ideas you have would be appreciated.
By "cleaning" I just mean getting it into a format that we can use to compare against what we already have for the PSSD user data. This is hard because I don't have much of an idea of what I'm looking for beyond "increase the control sample size and turn it into a text file".
@Cipro really helped with the original version of the program. Basically, we need to determine how common an allele is in the PSSD group and compare that to how common it is in this huge control group.
"Cleaning" would also mean reducing the number of SNPs we're looking at - most SNPs are useless to us because they have no research behind them, or they don't overlap between the databases. There is a file called "snp.txt": a list of all the SNPs I'm interested in or that SNPedia has pages on. Last year I went through 500+ pages and copied all of that into a .txt file. We don't strictly need much "cleaning" here, because the program checks whether a SNP is in that file before doing the frequency calculations, but we still waste time making it scan a list of mostly useless SNPs on every pass. The ideal program would take the document created from the 1000 Genomes data and cut it down to only the SNPs we care about. Ditto for the PSSD genomes.
So overview:
- Take all data from either format (23andMe or VCF/1000 Genomes) and turn it into one type of .txt file with the following information: rs number, allele frequency for each of A/G/C/T, and sample size (set when the program is launched - we don't want data from SNPs we only have one observation of, since that produces a lot of "100% A" results and leads to false positives).
- We then have two text files, PSSD and Control. Each line has six data points (rs number, the four A/G/C/T frequencies, and sample size). They contain more SNPs than we care about, because not all of them are useful. So we run each file against the 'snp.txt' list and cut the SNPs that aren't on it. Now each list holds only SNPs that are known, or at least in the SNPedia database.
- Cut the SNPs with small sample sizes. If only one PSSD patient was tested for rs1234, we don't care about it; the cutoff should be something like 20+, settable at launch as in the original program.
- Now we finally have two lists of SNPs we care about, trimmed to those with sizeable sample sizes. At that point we can run the comparison between them, printing results as they come.
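The steps above, as a rough Python sketch - all of the records, the snp.txt contents, and the cutoff here are made-up placeholders, and the actual file parsing is left out:

Code: Select all

```python
KEEP = {"rs1234", "rs5678"}   # stand-in for the snp.txt whitelist
MIN_N = 20                    # minimum sample size, set at launch

# each record: rs -> (freq_A, freq_G, freq_C, freq_T, sample_size)
pssd = {
    "rs1234": (0.90, 0.10, 0.0, 0.0, 25),
    "rs5678": (0.50, 0.50, 0.0, 0.0, 3),   # too few samples
    "rs9999": (1.00, 0.00, 0.0, 0.0, 25),  # not in snp.txt
}
control = {
    "rs1234": (0.60, 0.40, 0.0, 0.0, 2504),
    "rs5678": (0.55, 0.45, 0.0, 0.0, 2504),
}

def clean(table):
    """Steps 2 and 3: keep only whitelisted SNPs with enough samples."""
    return {rs: rec for rs, rec in table.items()
            if rs in KEEP and rec[4] >= MIN_N}

def compare(cases, controls):
    """Step 4: report the largest per-allele frequency difference."""
    out = {}
    for rs in cases.keys() & controls.keys():
        diffs = [abs(a - b) for a, b in zip(cases[rs][:4], controls[rs][:4])]
        out[rs] = max(diffs)
    return out

result = compare(clean(pssd), clean(control))
print(result)   # only rs1234 survives both filters
```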
---
I have some old scripts on my previous computer that turn the results into visuals. I'll fish those out, and eventually I'll add them at the end so that each SNP we flag gets a printed report with graphs. At this time I don't think I need help on that part.
Phew...Sorry that was a lot.
Code: Select all
# scikit-allel
import allel
print(allel.__version__)   # check which version is installed

import os
os.chdir(r"C:\Users\Ghost\Desktop\Genome")

# read the 1000 Genomes chrY VCF into a pandas DataFrame
df = allel.vcf_to_dataframe('ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf',
                            fields=['ID', 'ALT_1', 'ALT_2', 'AF_1', 'AF_2'],
                            alt_number=2)

results = open('results1000.txt', 'w')
results.write(str(df))
results.close()

print(df)
Code: Select all
ID ALT_1 ALT_2 AF_1 AF_2
0 rs11575897 NaN NaN NaN NaN
1 . NaN NaN NaN NaN
2 . NaN NaN NaN NaN
3 . NaN NaN NaN NaN
4 . NaN NaN NaN NaN
... ... ... ... ... ...
62037 . NaN NaN NaN NaN
62038 . NaN NaN NaN NaN
62039 . NaN NaN NaN NaN
62040 . NaN NaN NaN NaN
62041 . NaN NaN NaN NaN
[62042 rows x 5 columns]
The data from 23andMe looks like this:
Code: Select all
# rsid chromosome position genotype
rs4477212 1 72017 AA
rs3094315 1 742429 AG
rs3131972 1 742584 AG
rs12124819 1 766409 AA
rs11240777 1 788822 GG
rs6681049 1 789870 CC
rs4970383 1 828418 AA
rs4475691 1 836671 TT
Re: PSSD Genome Project [Megathread]
you can convert 23andMe to vcf format using bcftools: https://www.biostars.org/p/374149/
vcftools will allow you to extract a list of snps very efficiently.
ultimately, you want a matrix with subjects as rows, snps as columns, and cells as genotypes. genotypes are coded as 0 (homozygous for the major allele), 1 (heterozygous), or 2 (homozygous for minor allele).
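to illustrate that coding: the cell value is just a count of minor alleles in the two-letter genotype. a toy example (the alleles are made up):

Code: Select all

```python
def additive_code(genotype, minor_allele):
    """Code a two-letter genotype as the number of minor alleles:
    0 = homozygous major, 1 = heterozygous, 2 = homozygous minor."""
    return sum(1 for allele in genotype if allele == minor_allele)

# for a SNP whose minor allele is G:
print(additive_code("AA", "G"))  # 0 (homozygous for the major allele)
print(additive_code("AG", "G"))  # 1 (heterozygous)
print(additive_code("GG", "G"))  # 2 (homozygous for the minor allele)
```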
data QC depends on how many subjects you have. you are correct that you should exclude subjects or snps with too few observations.
I'm not sure why the text gets cut short - maybe the file is too big? you could split by chromosome.
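one likely culprit: writing str(df) saves the truncated preview (the "..." row in the printout), not the data. writing the rows out one by one avoids that - with pandas that is just df.to_csv. a stdlib sketch of the same idea, with made-up rows standing in for the VCF data:

Code: Select all

```python
import csv

# made-up stand-in for the parsed VCF rows
rows = [{"ID": f"rs{i}", "AF_1": 0.01 * (i % 50)} for i in range(100000)]

with open("results1000.txt", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["ID", "AF_1"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)   # every row is written, no "..." elision

# with pandas this is: df.to_csv("results1000.txt", sep="\t", index=False)
```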
do you have actual control subjects, or are you comparing against the population MAF as estimated from 1000 genomes? because that determines what kind of statistical testing you can do.
Re: PSSD Genome Project [Megathread]
I fear you may pass up a "PSSD SNP" if you go about the project this way. Even if there is no research on a particular SNP, it could be associated with other, possibly related genes through its location. Not sure if this is the correct term, but there may be linkage disequilibrium between that SNP and other SNPs that are often inherited with it as a set.
To save time and frustration, the PLINK documentation recommends pre-processing to remove SNPs with only common variants. Basically, many SNPs don't say much more than "I am a human," and those won't be of much value when researching a rare disease.
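For illustration: that pre-processing amounts to dropping SNPs whose minor allele is (almost) never observed, since they can't distinguish cases from controls - in PLINK this is the --maf threshold. A minimal sketch with invented frequencies:

Code: Select all

```python
# rs -> minor allele frequency (invented numbers)
mafs = {"rs111": 0.0004, "rs222": 0.12, "rs333": 0.0, "rs444": 0.31}

def informative(mafs, min_maf=0.01):
    """Drop SNPs whose minor allele is (almost) never seen --
    they say little more than "I am a human"."""
    return {rs: f for rs, f in mafs.items() if f >= min_maf}

print(sorted(informative(mafs)))  # ['rs222', 'rs444']
```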
ps- Good to hear this is still going forward.
Re: PSSD Genome Project [Megathread]
I think it has to do with the different data structures. The way I understand it, NumPy and pandas format their arrays differently, and scikit-allel uses NumPy, I think. There is a setting somewhere to set the truncation level, but I don't know how to do that.
sovietxrobot wrote: do you have actual control subjects, or are you comparing against the population MAF as estimated from 1000 genomes? because that determines what kind of statistical testing you can do.
I'm comparing against the population MAF from 1000 Genomes.
Re: PSSD Genome Project [Megathread]
Yea, that's a worry I do have. I just know from experience that I've struggled before with finding SNPs and then having no idea what to do with them, because nothing is known.
It might be worth using PLINK or other GWAS tools for this work; it would definitely let me do more. It just gets confusing at some point, as my genetics knowledge is limited compared to the PhDs doing most of the work in this field.
Are you aware if a list of those SNPs exists for download? Especially one that I could use to amend my list of those I'm interested in?
Re: PSSD Genome Project [Megathread]
I am not sure what is meant by removing common variants - common variants can contribute to disease risk. This isn't a step I have ever taken, but it could be that the genotyping platform already ignores these variants.
The typical process for genomics research is to get a sample set genotyped on some platform (usually ~300k SNPs). These data are reduced to only well-sampled variants and then imputed against a reference panel. Essentially, you use reference data and the properties of genetics to interpolate SNPs you haven't directly observed. After QC of the imputed data, you are left with something in the ballpark of millions of SNPs.
Therein lies the problem - the sample size for this project is going to be tiny. A genome-wide approach looking to discover new risk SNPs is not realistic. Analyzing a smaller set of SNPs that you already have a hypothesis about is much more tractable.
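for the MAF-comparison setup, the per-snp test essentially asks: given n alleles sampled from the PSSD group and a population frequency p, how surprising is the observed minor-allele count? that is a one-sided binomial test. a self-contained sketch (the counts are invented, and you would still need multiple-testing correction across snps):

Code: Select all

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value for
    seeing at least k minor alleles among n sampled alleles."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# invented example: 8 minor alleles out of 10, population MAF of 0.5
print(binom_sf(8, 10, 0.5))  # 0.0546875
```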
Re: PSSD Genome Project [Megathread]
Good points.