Genetic genealogy raw data and Alzheimer's research ...

Alzheimer's, cardiovascular, and other chronic diseases; biomarkers, lifestyle, supplements, drugs, and health care.
J11
Contributor
Posts: 3351
Joined: Sat May 17, 2014 4:04 pm

Re: Genetic genealogy raw data and Alzheimer's research ...

Post by J11 »

Would an easier way of determining the variant of interest be to simply take a sample of our loved one's DNA and have a lab find out which chromosome carries the defect by in vitro screening? That would be so much easier. They could start with our loved one's 46 chromosomes, remove 23 of them, and replace those 23 with 23 normal chromosomes. If the disease-in-a-dish procedure found that AD was not present, then we would know that our loved one's variant must be in one of the 23 chromosomes that were replaced. If the procedure found AD present, then the variant must be in one of the chromosomes that were not replaced. The procedure could continue to narrow down the identity of the variant-containing chromosome from 23 to 12 to 6 to 3 to 1. Perhaps the same halving could then be applied to portions of a single chromosome.
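The narrowing scheme above is essentially a binary search over the 23 chromosome pairs, so about five swap-and-test rounds would suffice. A minimal sketch, with a hypothetical `has_ad` function standing in for the disease-in-a-dish assay:

```python
def locate_chromosome(has_ad, candidates):
    """Binary search: repeatedly replace half of the remaining candidate
    chromosomes with normal copies and keep whichever half must still
    carry the defect according to the assay."""
    rounds = 0
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]   # chromosomes swapped out
        rest = candidates[len(candidates) // 2:]   # chromosomes kept in the cell
        # If AD persists with only `rest` kept, the variant is in `rest`;
        # if AD disappears, it was in the replaced `half`.
        candidates = rest if has_ad(rest) else half
        rounds += 1
    return candidates[0], rounds

# Toy assay: pretend the defect sits on chromosome 17.
assay = lambda kept: 17 in kept
chrom, rounds = locate_chromosome(assay, list(range(1, 24)))
```

Here five rounds (23 → 12 → 6 → 3 → 2 → 1) pin down the chromosome; in principle the same halving could then continue within a single chromosome.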
Nancy
Contributor
Posts: 460
Joined: Wed Jun 08, 2016 1:30 pm

Re: Genetic genealogy raw data and Alzheimer's research ...

Post by Nancy »

:o Wow! This was all way over my head but sounds very impressive! You must have the "very high IQ" gene! Anyway, hope you're right.
3,4
J11
Contributor

Re: Genetic genealogy raw data and Alzheimer's research ...

Post by J11 »

A new analysis of the IGAP results:
the genes PLCG2 and ABI3 are now genome-wide significant.

a protective variant in PLCG2 (rs72824905: p.Pro522Arg, P = 5.38 × 10⁻¹⁰, odds ratio (OR) = 0.68, MAF_cases = 0.0059, MAF_controls = 0.0093),

a risk variant in ABI3 (rs616338: p.Ser209Phe, P = 4.56 × 10⁻¹⁰, OR = 1.43, MAF_cases = 0.011, MAF_controls = 0.008),

and a new genome-wide significant variant in TREM2

https://www.ncbi.nlm.nih.gov/pubmed/28714976
Last edited by J11 on Fri Aug 04, 2017 9:17 pm, edited 3 times in total.
J11
Contributor

Re: Genetic genealogy raw data and Alzheimer's research ...

Post by J11 »

This is what we have been waiting for!

"Here we apply the same analytic approaches to a pathological case control series and show a predictive AUC of 84%. We suggest that this analysis has clinical utility and that there is limited room for further improvement using genetic data."

"....Cytox who are developing an Affymetrix based genetic testing array for Alzheimer’s disease "

Wonder what the SNPs are?

https://www.ncbi.nlm.nih.gov/pubmed/28727176
J11
Contributor

Re: Genetic genealogy raw data and Alzheimer's research ...

Post by J11 »

Very interesting!

Research indicates that most of the genetic basis of AD is now known.
If we are so close to the finish line, then there are probably quite a few companies that want to sell you an AD gene chip to help you find your risk.

This might be a great time for consumer science to help carry it the last mile. That is, some people would receive a low-risk genetic result even though they have a strong clinical history of AD; the genetic result would be discordant with the clinical picture. Perhaps a GWAS of such discordant cases could help fill in some of the remaining gaps in the research.
J11
Contributor

Re: Genetic genealogy raw data and Alzheimer's research ...

Post by J11 »

Very exciting!

The Alzheimer's Disease Sequencing Project has finished sequencing the full genomes and exomes of a large number of
multiplex Alzheimer's families, many without an APOE ε4 allele. I have been worried for some time that victory has been
declared in AD genetics even though the job is not yet done. This project should greatly help in uncovering many of
the AD genes that have gone undetected to date.

One can only hope that others will follow the lead and do likewise. With a project such as this, the research is much easier to scale.
In a huge GWAS, any one group would not be able to identify its particular contribution to the overall result. Yet with a multiplex
AD project, you probably could find the causal variants (if present) with even a modest number of affecteds per family. This is money very wisely invested.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5646177/
J11
Contributor

Re: Genetic genealogy raw data and Alzheimer's research ...

Post by J11 »

Even more excitement!

Genetic genealogy is going mainstream!
It is almost impossible to imagine: 1.7 million Americans genotyped with Ancestry.com over the Thanksgiving weekend?
Ancestry.com is now on track to surpass 10 million people in its DNA database within the next year.
23andme is now at 3 million and has a goal of also hitting 10 million!

https://www.ancestry.com/corporate/news ... onday-more

https://cruwys.blogspot.ca/2017/11/23an ... stone.html

What a large step forward! Everyone should be genotyped! All of these pieces of genetic information
help to fill in the puzzle of the human genome. Even those who do not have a trait provide
useful information. This is such a fantastic way to help others, and more particularly one's own extended
family.

I am looking forward to the time, in the not too distant future, when one's entire family tree can be computed
on the basis of DNA. When using only presumed family relationships, errors and fading memories
lead to family trees of dubious quality. With tens of millions of people in DNA databases we should soon have
very accurate and comprehensive genealogies. All one might have to do is upload a gene chip file and then
be shown one's family tree. This should make unraveling dementia genetics much easier.

I have not yet seen any AD-related research with these databases, though one hopes that such research
will soon be done. A 10 million person DNA database would cost $1 billion in genotyping charges alone!
Leveraging this dataset for the good of all is simply too good an opportunity to pass by. Perhaps an online
social fundraising campaign should begin if no other entity shows initiative.

This will be such an enormously useful resource for researching AD risk!
The IGAP GWAS had only 40,000 people with AD; what could be learned from a database of tens of millions?
Hopefully, phenotype information will be collected on some platform and the GWAS analysis conducted.
This would be such an easy and cheap win for those with dementia and any other trait or illness.
J11
Contributor

Re: Genetic genealogy raw data and Alzheimer's research ...

Post by J11 »

There has been a large development in the search for the variant that I wanted to relate to the thread.

Until recently I had been trying to search for the cause of the AD in my family with the help of 23andme, GEDmatch, and Geni.
This has not been an overly effective strategy, as almost none of the DNA relatives that were found could be connected to our family tree. I had no idea who these relatives were nor how they were related to me. Most of them had a single DNA match, usually of 10 cM or less. It had seemed a hopelessly difficult task: I was researching with one of the largest DNA databases in the world (23andme), yet why were the matches so tiny?

Recently, I logged into my MyHeritage account, and what I found there was simply startling!
MyHeritage had at least 10 very high quality DNA relatives with 4 or more matching DNA segments each;
none of the 23andme matches had 4 or more segments.
Amazingly, on MyHeritage there was even a match with 15 segments and another with 7!

This is a very large step forward!

With all these nearish relatives it should be relatively straightforward to "bin" the chromosome strands: one
maternal, the other paternal. This would start to make the search at least plausible. Knowing which strands were paternal would immediately allow me to eliminate half the variants. I would then be able to eliminate another half by identifying the family branch known to be free of dementia.

It is all the more exciting because there has been a near panic to get on board the gene chip bandwagon lately. The DNA datasets are growing by many millions of samples every year. When only 1% of the population had been genotyped, the typical person did not have close matches in the datasets. Yet as the mainstream population embraces genotyping, most people should soon expect to have a very large number of near DNA relatives. I expect that many thousands of very high quality matches could become available to me.

This should dramatically change the difficulty of the search. With high quality matches that can be assigned a place on the family tree, I would be able to develop a comprehensive understanding of where the AD risk is located. A single high quality match might allow me to eliminate perhaps 3% of the genome! I might need only roughly 50 such matches to cover the entire genome with high certainty. After finding the region of interest it would be extremely easy to find large numbers of more distant matches who could help definitively confirm the AD variant.

This observation should have very considerable implications for AD genetic research. When AD genetic sleuthing is as easy as I expect it will be, almost any family with strongly dominant dementia should be able to identify the genetic source of its
illness. Having this information will be of great importance in developing an in-depth understanding of dementia and effective treatments for it.

This is an extremely significant development!

About half of Alzheimer's is of unknown genetic causation.
Doubling the number of people as motivated as those on this forum to do something to stop the epidemic would greatly accelerate the arrival of an effective treatment.


Funding agencies have not seemed enthusiastic nor particularly motivated to spend the resources needed to reach the last mile.
Yet considering how easy unlocking the Alzome should be, even for novices, it should be anticipated that we will finally know what is causing the devastation in our families.

Without question, knowing what the problem is has to be the first step in solving it.
J11
Contributor

Re: Genetic genealogy raw data and Alzheimer's research ...

Post by J11 »

I have let this thread drift for over a year, though I have continued to keep an eye on AD genetics research.

After all of these years I might have found it!

I checked into a possible gene last night (or possibly early this morning): this might be the one.
Research has suggested a connection to Alzheimer's.
I ran it through MutationTaster: verdict, disease causing.

This mutation is RARE: its frequency is in the range of less than 1 in 100,000.
I had not thought it would be this rare.
My hunch had been a frequency of perhaps 1 in 10,000.

If this variant had been in the family for, let's say, the last 10-20 generations,
there should probably be a large number of these alleles out there.
At a scale of 10 per million, this variant is essentially our own private mutation.
It would probably never have shown up in a typical GWAS study.
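As a rough sanity check on those numbers (assuming, purely for illustration, a carrier frequency of 1 in 100,000 and a US-scale population):

```python
# Back-of-the-envelope carrier counts at an assumed 1-in-100,000 frequency.
us_population = 330_000_000        # approximate
carrier_frequency = 1 / 100_000    # assumed, per the estimate above
expected_carriers = int(us_population * carrier_frequency)   # ~3,300 people

# In a GWAS of ~40,000 cases one would expect well under a single carrier,
# which is why a variant this rare never reaches significance there.
gwas_cases = 40_000
expected_in_gwas = gwas_cases * carrier_frequency            # 0.4 carriers
```

A few thousand carriers nationwide, but a fraction of one expected carrier in even a large GWAS: effectively a private family mutation as far as association studies are concerned.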

I then went to the CADD site. There had been some confusion in my mind over whether CADD stood for
"computational prediction of deleteriousness" or "Combined Annotation Dependent Depletion";
it is the latter, Combined Annotation Dependent Depletion, a tool for computationally
predicting the deleteriousness of variants.

I chose CADD (or it chose me) as it had an online web app. I believe there
might be more of these SNP pathogenicity tools online, though
I stayed with CADD.

{This tool would be extremely helpful for others
wanting to search for rare variants. CADD is a computational tool
that can score the variants in largish vcf files.

This could allow the age of genetics to finally truly begin!
As readers of this thread are well aware, it has taken me years
to look through thousands of variants, and I am still unclear
whether or not I have found the right one.

With a computer program, it might take only a few minutes
to rapidly sift through all the variants. Such technology
could dramatically reshape our society in a way that
has not previously been possible.}


I first tried the vcf file upload approach.
{Actually, I first uploaded the entire file, and it reported
that the file was too large. I then uploaded only a handful of
variants (including the highly suspicious one).}

The vcf approach took some time; in the meantime I found a simple SNP lookup web app on the
site. The web app showed the effect of three variants, each a single-letter substitution
at the site of our variant. All of these reported a phred score
over 20, which is considered deleterious. At this genetic location, do not mess with the DNA, or else
(where "or else" implies some bad outcome).

On closer inspection I noticed that none of these was actually the variant of interest;
the variant in our file was a frameshift insertion. That does seem scary, especially given
the other results: the entire downstream amino acid code is scrambled by a variant at
this location. It is not difficult to imagine that this would be a superbad variant to have.
The precompiled web app was unable to clarify this point further, as it did not appear
to allow for insertions. The web app covered roughly 8 billion variants, though not the one
that I was interested in.


I then checked back on the uploaded file; it had reported back. This time, reading
directly from the vcf file, CADD was able to predict the harmfulness of our insertion;
the variant was indeed quite bad as measured by phred score.
Some researchers have used a phred cutoff of >20 to define deleteriousness.
Our variant was well past 20.
If this phred score uses the same log scale as other phred measures, then
our variant is far, far past 20 (it is logarithmically worse than 20).
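For reference, a phred-style scale maps a score s to a rank fraction of 10^(−s/10), so a CADD phred of 20 places a variant in the top 1% of all possible variants and 30 in the top 0.1%. A small sketch of that relationship:

```python
def phred_to_rank_fraction(phred):
    """Fraction of all possible variants ranked at least this deleterious,
    under the phred-style scale (score = -10 * log10(fraction))."""
    return 10 ** (-phred / 10)

top_1_percent = phred_to_rank_fraction(20)    # 0.01  -> top 1%
top_01_percent = phred_to_rank_fraction(30)   # 0.001 -> top 0.1%
```

This is why "well past 20" matters: each 10 points of phred is another factor of ten in rank.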

This is super exciting!
I might have lucked onto this after all this time.

I will be more cautious than usual because, as I have demonstrated repeatedly on this
thread, it is very easy to get it wrong. I have been hitting the wrong variants for
years now. My Bayesian prior for success is now reasonably about 1 in 1,000.
Hope springs eternal.

Luckily, I have an extremely powerful approach to verifying my hunch: I can simply
contact relatives who share this stretch of DNA. For some reason researchers seem
to avoid this strategy, though if there are others who carry this variant and also have
a strong dementia history, that should provide very strong evidence that we have
finally found it. It will not require many confirmations to convince me.
J11
Contributor

Re: Genetic genealogy raw data and Alzheimer's research ...

Post by J11 »

This is an overwhelmingly exciting time in my genetic search!

I returned to the CADD site and uploaded our family member's full exome sequence for analysis.
The variant of interest went almost right to the top of the list, right out of the box!
Is it really that easy?

I have spent years and years of sleepless nights searching for this variant, toiling over endless stretches of DNA,
and now all you need to do is press a button and the answer pops out right away?
Where is the fun or adventure in that?

Not so long ago the medical community would not even touch a question like this. From my understanding, even an offer of money
was not persuasive enough to induce interest in joining what was then thought of as a largely futile adventure. It must be a fairly difficult problem if money would not help. Yet now it is basically: upload the DNA sequence, run the code, go for a coffee or a walk in the park, and you have the answer.

It was not exactly that easy. I did, of course, already know what answer I was expecting, so there is a large amount of confirmation bias.
In a file of over 60,000 genetic variants it really cannot be that easy yet (it wasn't), though it was close.

In the actual CADD run there were a large number of variants over the phred cutoff of 20, the presumed boundary of pathogenicity; I think it was ~3,000. How could we have 3,000 different diseases? What I find surprising is how extremely healthy our family has typically been. Narrowing somewhat, there were ~300 variants over 30 phred. Good, we're down to only having to worry about 300 possible diseases. At the top of the total list there were some improbably bad phred scores of ~60.
These variants, while they appear truly scary, are probably simply too bad to be relevant: if they were as harmful as the model
suggested, the carrier would probably not be medically viable.

The variant of interest for us was actually quite a bit further down the list, at roughly position 200. When I said it was at the top of the list, I did not mean the total 60,000-variant list, but the list of frameshift variants. Most of the scariest variants were stop-gained variants, where the entire protein stops being made at a certain point. Those are quite scary, in theory at least. The next set of scary variants were the frameshifts, which tended to appear somewhat further down the total list; this is where we found our variant.

So it might not be quite as simple as pushing a button and going for a coffee, yet it is no longer difficult to imagine that this is on the horizon. Many of the genes high up on the list did not appear to be plausible Alzheimer's candidates. One gene near the top is involved in skin formation and would probably have nothing to do with a dementing illness. If one could automatically start eliminating such variants, things would rapidly become clearer.

Remove variants from genes not related to AD, remove variants of low sequencing quality, remove common variants, remove variants known to be carried by others without noted problems ... . Before long the list becomes greatly reduced.
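That winnowing can be sketched as a simple filter chain. The field names and thresholds here are hypothetical, standing in for values merged from the vcf and the CADD output:

```python
# Sketch of the winnowing described above; `variants` is a list of dicts
# with hypothetical keys merged from the vcf and the CADD output.
def shortlist(variants, ad_genes, known_benign):
    keep = []
    for v in variants:
        if v["phred"] < 20:          # below the usual deleteriousness cutoff
            continue
        if v["qual"] < 30:           # likely a sequencing artifact
            continue
        if v["maf"] > 0.001:         # too common for a rare dominant disease
            continue
        if v["gene"] not in ad_genes:
            continue
        if v["id"] in known_benign:  # carried by unaffected people
            continue
        keep.append(v)
    return keep

demo = [
    {"id": "v1", "gene": "TREM2", "phred": 32, "qual": 60, "maf": 0.00001},
    {"id": "v2", "gene": "KRT5",  "phred": 45, "qual": 60, "maf": 0.00001},  # skin gene
    {"id": "v3", "gene": "TREM2", "phred": 25, "qual": 5,  "maf": 0.00001},  # low quality
]
survivors = shortlist(demo, ad_genes={"TREM2", "PLCG2", "ABI3"}, known_benign=set())
```

Each filter is cheap on its own; together they cut thousands of candidates down to a handful worth investigating by hand.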

Yet the CADD software did not actually read in the full vcf file; it only read chromosome, position, and genotype, and therefore supplies a generic output. This is a big problem for those who want to do some sleuthing, because other features matter,
for example minor allele frequencies. It makes a great deal of difference whether the carrier frequency of a variant is
1 in ten or 1 in 1 million. For families with unexplained dominant Alzheimer's, 1 in 1 million is often the best place to look.

The CADD output does not appear to supply this information. It supplies dozens of other measures, though arguably one of the most important for many searching for genetic answers is absent. I am still not entirely sure how to work around this problem. I have been able to retrieve an access key from the NIH for use with R, though that is as far as I have gotten. Roughly, I should be able to make a request through R for my list of rs numbers and then be set. This used to be very simple a few years ago: all you needed to do was go to dbSNP and batch-submit your rs numbers. I did this for the entire 23andme gene chip with no problems. For some unknown reason this task is now becoming more technically difficult.

This seems surprising, as giving people the tools they need to solve their own problems would intuitively be the most sensible course. Genetic problems cost us a great deal of money. If people want to save us a great deal of money by researching them, I say go right ahead. I am not going to stop people who are working hard to make me richer. It is somewhat surprising that this clear logic is not being allowed to be fully realized.

Not supplying the additional information in the CADD output is not a good use of overall government hardware resources. It is the old silo problem: doing what appears best for one's own silo causes problems for others in theirs. Not inserting the line or two of code in the CADD result pushes people like me into downloading the entire rs vcf for their exome (or possibly even genome). That is a substantial computational burden that could easily have been avoided.

The exome file also had other features that need workarounds, for example quality scores. Many of the most interesting variants that I have found in our exome files have very low quality scores; these are usually just errors in sequencing. With 60 million base pairs in the exome, it is amazing to see all the very scary diseases you might have had if the exome file reported your actual sequence rather than sequencing artifacts. The problem is that the CADD software does not report back quality scores; it does not even consider quality scores if they were provided in the vcf. It can take at least some effort using the INDEX and MATCH functions in a spreadsheet program to work through this. Roughly, you want to merge the phred and other scores from the CADD results back into your vcf file. For many this would be somewhat intense: it is no longer "upload vcf and take a walk in the park". It is becoming slightly trickier.

For those who want to avoid a serious migraine, here is the rough solution in Libre Office Calc for retrieving columns of interest (e.g., SNP quality, frequency, rs number, etc.). The assumption here is that you have a spreadsheet of the CADD output
with 5,000 variants ordered by phred score. The phred column is the last column, perhaps AZ. Go there and use the
Z-to-A sort button (near the top right of the Libre Office Calc window) to reverse-order the list so that the highest phred score is first. Copy and paste the first 5,000 variants into another sheet and then run the formula below. This assumes A1 is the cell with the first variant; there might be filler text first, so just adjust the A1 below to the cell with the first SNP entry.

Also check exactly which column is the last and replace the "AZ" used below (it is just a placeholder). Open your vcf file in Libre Office Calc, copy the entire column of variant positions, and paste it starting at cell A5101 of the CADD spreadsheet. It is good to give yourself a little extra room rather than starting right at row 5001. These variant positions are all you need for now; all the merged information can later be pasted back into your vcf exome file. Mainly you are interested in the phred score from CADD, though there are other deleteriousness measures you might also want to extract.

The A1 reference might also need to be adjusted. What you are trying to do is find the column in the CADD sheet that has the position and then find the corresponding column in your vcf file, so the A1 (position column in CADD) needs to line up with the A5101 (position column from the vcf) in the INDEX-MATCH formula. The $ signs below lock the references to their columns and rows; if you do not insert them correctly, then as you drag down to fill in the results, the columns and rows start drifting and you do not get the correct answer. The final 0 tells MATCH that you want an exact match; without it, the software might mistake 8115 for 83. You need the exact position match: type in a 0.

One other trick I did not mention is filling in the matrix. If you just drag down the cell handle at the bottom right, this can take a while (perhaps an hour), and that is not a walk in the park either. How can you avoid it? LO has a fill-cells feature accessible from one of the top menus: click the correct drop-down, choose the correct option, and you're set. It does put your computer under a great deal of strain, though: it is searching ~5,000 variants against 60,000 variants, about 300 million comparisons. Today's computer technology is great, but 300 million comparisons still require some effort; it might take about half an hour. After you tell the computer to fill in the cells, though, you can take your coffee break. Without the trick, you would need to hold down the drag handle manually for about an hour.

Now find out which column you want to retrieve from the CADD spreadsheet. This is somewhat tricky in Libre Office: you need to count the number of columns from the start of the data array (counting from left to right). Below, the data array is defined as $A$1:$AZ$5000, so column D would be column 4. Enter 4 in place of the X at the end of the formula below. It would have been a great deal simpler if you could just enter the column letter D.

=INDEX($A$1:$AZ$5000, MATCH($A5101, $A$1:$A$5000, 0), X)

When you have extracted the information you want, you can paste it back into your vcf file and it should all be properly
lined up. Of course, it is best to double-check that everything went back into its proper place.

One problem that I will leave as an exercise for those interested is how to make sure that positions on different chromosomes are not confused in the search.
For example, a variant at position 1234 on chromosome 4 could be confused with a variant at position 1234 on chromosome 5.
Libre Office does have IF conditionals, though I am not entirely sure how they apply in this particular circumstance.
Basically, you want the variant in the CADD file at position 1234 on chromosome 4 to match the variant in your vcf file at position 1234 on chromosome 4. Problems that are easy to explain are not necessarily easy to code. That leaves the somewhat unappealing option of toiling away late at night for the next several years and solving it by hand. To avoid such a fate, one easy check is to sort the position column in the CADD file
(and in the vcf file as well) and then perform pairwise subtraction down the entire column. You could use an IF formula in LO returning, let us say, 1000 where the difference is zero and 0 otherwise, then sum the column. If the sum is greater than 0, then you have duplicated positions that could be confused across chromosomes, and the task is reduced to scrolling through the flagged rows. There might even be some way of asking LO to report the locations of the non-zero entries.
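Outside the spreadsheet, the ambiguity disappears entirely if you merge on a composite (chromosome, position) key rather than on position alone; a minimal sketch:

```python
# Merge CADD scores back into vcf rows using a (chrom, pos) composite key,
# so position 1234 on chromosome 4 can never collide with 1234 on chromosome 5.
def merge_by_locus(vcf_rows, cadd_rows):
    scores = {(r["chrom"], r["pos"]): r["phred"] for r in cadd_rows}
    for row in vcf_rows:
        row["phred"] = scores.get((row["chrom"], row["pos"]))
    return vcf_rows

vcf = [{"chrom": "4", "pos": 1234}, {"chrom": "5", "pos": 1234}]
cadd = [{"chrom": "4", "pos": 1234, "phred": 33.0},
        {"chrom": "5", "pos": 1234, "phred": 2.1}]
merged = merge_by_locus(vcf, cadd)
```

The same idea works in Calc: add a helper column concatenating chromosome and position (e.g., =B2&"_"&C2) in both sheets and MATCH on that column instead of the bare position.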

One final comment: the frameshift variant that I am interested in has a particularly interesting feature. It does not appear to have reached dbSNP yet; it is not even in the global genome database! Instead of an rs number, the vcf file had a period.
Along with a good quality readout, a high phred score, and a low allele frequency, that is clearly something worth further investigation. We have still not reached the time when all human genetic variation has been captured by sequencing.

This took at least some effort to work out, so I hope it is helpful for others; it will allow you to effortlessly extract information from even large spreadsheets. We should all feel privileged to live in a time when one truly can press a button and have enormous
computational horsepower at one's disposal. When I asked my computer to search through thousands of variants, even it nearly buckled under the demand. If I had been required to do this by hand, it would probably have taken years of toil; with the computer, the job was done in about half an hour.


What does all of the above actually mean?
What's the bottom line here?

Apparently we have now entered the era of genetics.
Our lives are about to undergo truly profound change.
Those with serious genetic disorders will easily be able to select against
variants of large effect, something that until now has mostly been beyond the reach of our technology.

Until now, genetics has for the most part not been usable as an effective tool to improve our lives.
Now it can be.

Software can effortlessly find variants that matter.
Finding a rare, large-effect variant (1 in a million) with today's software should be super easy,
much easier than finding a common variant (1 in 10) of minimal effect.

I am already receiving feedback from relatives that suggests I might have found the causative variant.