That would be great. So far I've been messing with scikit-allel.
Code: Select all
# import scikit-allel and check which version is installed
import allel
import os

print(allel.__version__)

# work from the folder that holds the 1000 Genomes chrY VCF
os.chdir(r"C:\Users\Ghost\Desktop\Genome")

# pull a few fields out of the VCF into a pandas DataFrame
df = allel.vcf_to_dataframe(
    'ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf',
    fields=['ID', 'ALT_1', 'ALT_2', 'AF_1', 'AF_2'],
    alt_number=2,
)

# write the DataFrame to a text file and echo it to the console
with open('results1000.txt', 'w') as results:
    results.write(str(df))

print(df)
It works, in the sense that it prints out the following:
Code: Select all
                ID ALT_1 ALT_2 AF_1 AF_2
0       rs11575897   NaN   NaN  NaN  NaN
1                .   NaN   NaN  NaN  NaN
2                .   NaN   NaN  NaN  NaN
3                .   NaN   NaN  NaN  NaN
4                .   NaN   NaN  NaN  NaN
...            ...   ...   ...  ...  ...
62037            .   NaN   NaN  NaN  NaN
62038            .   NaN   NaN  NaN  NaN
62039            .   NaN   NaN  NaN  NaN
62040            .   NaN   NaN  NaN  NaN
62041            .   NaN   NaN  NaN  NaN

[62042 rows x 5 columns]
However, I can't get the whole thing into a text file - the output keeps getting cut short.
Any ideas you have would be appreciated.
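For reference, I suspect the cutoff happens because str(df) only gives pandas' abbreviated preview (the same "..." rows shown above), so writing that string truncates the table. Something like to_csv() should dump every row - rough sketch below, with 'results1000_full.txt' just a placeholder name:
Code: Select all
# str(df) is only pandas' abbreviated preview, so writing it truncates the table;
# to_csv() writes every row ('results1000_full.txt' is a placeholder filename)
df.to_csv('results1000_full.txt', sep='\t', index=False)
# df.to_string() would also give an untruncated plain-text dump if tabs aren't wanted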
By "Cleaning" I just mean getting it into a format that we can use to compare it against what we have already for the PSSD user data. This is hard because I really don't have much idea of what I'm looking for except "increase control sample size and turn it into a text file". The data from 23andme looks like this:
Code: Select all
# rsid       chromosome   position   genotype
rs4477212    1            72017      AA
rs3094315    1            742429     AG
rs3131972    1            742584     AG
rs12124819   1            766409     AA
rs11240777   1            788822     GG
rs6681049    1            789870     CC
rs4970383    1            828418     AA
rs4475691    1            836671     TT
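In case it helps to see it concretely, this is roughly how I'd expect one of those 23andMe raw files to load with pandas. The filename 'genome_23andme.txt' is just a stand-in, and I'm assuming the usual layout of tab-separated columns with '#' comment lines at the top:
Code: Select all
import pandas as pd

# 23andMe raw files are tab-separated with '#' comment lines at the top;
# 'genome_23andme.txt' is a placeholder filename
cols = ['rsid', 'chromosome', 'position', 'genotype']
geno = pd.read_csv('genome_23andme.txt', sep='\t', comment='#', names=cols)

print(geno.head())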
@Cipro really helped with the original version of the program, but basically we need to determine how common an allele is in the PSSD group and compare that to how common it is in this huge control group. "Cleaning" would also mean reducing the number of SNPs we're looking at. For example, most SNPs are useless because they have no research behind them, or they don't overlap between the databases.

Basically, there is a file called "snp.txt" - a list of all the SNPs I'm interested in, or that SNPedia has pages on. Last year I went through 500+ pages and copied/pasted all that data into a .txt file. I guess we don't really need much "cleaning" here, because the program already checks whether the file has the SNP before doing the frequency calculations, but I still think we waste time by making it look through a list of mostly useless SNPs every pass. The ideal program would take the document created from the 1000 Genomes Project and cut it down to only the data for the SNPs we care about (a rough sketch of that cut is below). Ditto for the PSSD genomes.
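Something like this is what I have in mind for that cut, assuming snp.txt is just one rs# per line (which is how I saved it) - the output filename is a placeholder:
Code: Select all
# keep only the SNPs that appear in snp.txt (assumed to be one rs# per line)
with open('snp.txt') as f:
    wanted = {line.strip() for line in f if line.strip()}

# df is the 1000 Genomes DataFrame from the scikit-allel snippet above;
# the same .isin() filter would work on the 23andMe table's 'rsid' column
df_small = df[df['ID'].isin(wanted)]
df_small.to_csv('results1000_trimmed.txt', sep='\t', index=False)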
So overview:
- Takes all the data from either format (23andMe or VCF/1000 Genomes) and turns it into one type of .txt file with the following information: rs#, allele frequency for each of A/G/C/T, and sample size. The minimum sample size is determined when the program is launched - we don't want data from SNPs that only 1 person was tested for, because that produces a lot of "100% A"-type results and leads to false positives.
- Then we have 2 text files, PSSD and Control. Each line has 6 points of data (rs#, the A/G/C/T frequencies, and sample size). They contain more SNPs than we care about, because not all of them are useful, so we run both files against the 'snp.txt' list and cut out everything that isn't on it. Now each list only has SNPs that are known, or at least in the SNPedia database.
- Cut out the SNPs with small sample sizes. If only 1 PSSD patient got tested for rs1234, then we don't care about it - it needs to be something like more than 20. But we can set that number at the start, like we did in the original program.
- Now we finally have 2 lists of SNPs that we care about, trimmed so only those with sizeable sample sizes remain. NOW we can run the comparison between them, printing the results as they come (see the rough sketch right after this list).
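To make sure I'm describing the same thing, here is a very rough sketch of that whole pipeline. The filenames, the six-column layout (rs#, A, G, C, T frequencies, sample size), and the cutoff of 20 are all assumptions/placeholders:
Code: Select all
import pandas as pd

MIN_SAMPLES = 20  # set at launch, like in the original program
COLS = ['rsid', 'A', 'G', 'C', 'T', 'n']  # assumed 6-column layout

def load_and_trim(path, snp_list='snp.txt', min_n=MIN_SAMPLES):
    """Load one frequency table and cut it down to usable SNPs."""
    freqs = pd.read_csv(path, sep='\t', names=COLS, comment='#')
    with open(snp_list) as f:
        wanted = {line.strip() for line in f if line.strip()}
    freqs = freqs[freqs['rsid'].isin(wanted)]   # only SNPs SNPedia knows about
    return freqs[freqs['n'] >= min_n]           # drop tiny sample sizes

pssd = load_and_trim('pssd.txt')        # placeholder filenames
control = load_and_trim('control.txt')

# compare allele frequencies for SNPs present in both lists
merged = pssd.merge(control, on='rsid', suffixes=('_pssd', '_ctrl'))
for base in ['A', 'G', 'C', 'T']:
    merged['diff_' + base] = merged[base + '_pssd'] - merged[base + '_ctrl']

# print each SNP's frequency differences as they come
for _, row in merged.iterrows():
    diffs = {b: round(row['diff_' + b], 3) for b in ['A', 'G', 'C', 'T']}
    print(row['rsid'], diffs)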
---
I have some lost scripts on my old computer that turn the results into visuals. I will fish those out, and eventually I'll add them to the end so that for each SNP we find there's basically a printed report generated with graphs. At this point I don't think I need help with that part.
Phew... sorry, that was a lot.