Thursday, October 6, 2016
A golden future awaits Switzerland: the upcoming Swiss Personalized Health Network (SPHN) initiative aims at harmonizing data formats and IT systems to allow for seamless data exchange between ETH, the universities and the university hospitals. Starting by funding infrastructure projects, the initiative will subsequently support research projects that use the implemented standards. For an executive summary I refer to the «Step#1» newsletter of the SAMW.
One of the major obstacles to bringing cutting-edge biomedical research into Swiss clinics is their often outdated IT infrastructure - on both sides, software and hardware. It remains to be seen whether the SPHN will have a favorable effect on this predicament - while at the same time not consuming all of the national funding for filling these gaps.
Recently, the University Hospital Zurich (USZ) has invested in an integrated data store to house - eventually - all data, or at least to connect the gazillion other IT systems each clinic seems to have. This is not limited to genomic data or patient anamneses - we are talking about X-rays, histopathological slides, metadata from biobanks (of which there are also several dozen at USZ), test results, etc. A cunning salesperson convinced USZ to go for the Oracle Translational Research Center (TRC) platform, which is now operational and looking for projects to be filled with. I am not going to comment on the choice of provider nor on the system as such; nevertheless, it is important to acknowledge the effort to have an integrative solution aiming at making patient data holistically available.
As a preparatory step towards the SPHN, ETH has launched, together with the University of Zurich (UZH) and USZ, a joint alliance to concentrate the efforts in Zurich on formulating strong proposals. At a later stage this will happen together with Basel, which is already evaluating projects. The Personalized Health Alliance Zurich was initiated on September 20 with an info lunch at the ETH main building (not the Sternwarte) and is requesting project ideas by the end of October. The importance of this alliance cannot be overstated; only by uniting researchers and clinicians will it be possible to get strong proposals with a true chance of translation and, ultimately, benefit for health care. Furthermore, this top-down approach, initiated by members of the executive boards of ETH, UZH and USZ, has the chance to scale projects in an appropriate way - the scale that is generally needed for genomics and especially for personalized cancer care. However, there will also be a need for a bottom-up process, initiated by these researchers and clinicians, to get ground-breaking - yet feasible - project ideas. Let's not beat around the bush: many projects and initiatives have been launched in the last two years, but Zurich as a research location is fragmented, and there is a risk of reinvention, redundancy and unfruitful lethargy. A blend of top-down and bottom-up will enable Zurich and Basel to produce powerful and cutting-edge proposals for personalized health research in the coming years.
In light of these developments, it is almost inconceivable that rumors about a deal between Foundation Medicine and USZ regarding molecular cancer diagnostics are spreading. Amongst other preposterous things, it seems that FM has defined a «sample in, report out» process with no access to the data. Not only is this diametrically opposed to the SPHN and the Oracle TRC, it is also hard to reconcile with the first letter in USZ - university.
What do the British say; keep calm and carry on... or...
Friday, March 11, 2016
Tread carefully on the path from tumor mutations to therapy suggestions
Somatic mutation calling on tumor biopsies from cancer patients is gaining a foothold in clinics all over Switzerland. These tests aim at rationalizing the use of certain cancer drugs, e.g., giving Vemurafenib in the case of a V600E mutation in melanoma, or, in some cases, at indicating a lack of benefit from a drug, e.g., an NRAS mutation is predictive of colon cancer being resistant to Cetuximab. While these examples are well studied and the causal link between the mutations and the effect of their respective drugs is thus identified, the process of inferring the use of a certain drug from a given mutation within the genomic landscape of a tumor remains very complicated.
Identifying the presence of a mutation in the genomic landscape of a tumor is also a complicated task, but in contrast to finding the right drug for a given genomic variant, it can be done using algorithms, enabling the technical personnel involved to focus their time on quality control. When a list of mutations has been put together, an interdisciplinary team of bioinformaticians, pharmacogeneticists, molecular biologists and oncologists still has all its work ahead of it. They will spend many hours on finding useful drug-gene interactions, assessing the scientific literature for its level of evidence, classifying drugs as potentially beneficial or ineffective, and finally also cross-linking study populations with the clinical data presented by the individual patient. Only then is it possible to make an informed suggestion based on molecular evidence.
Or isn't it...
A lot of this team's time will be spent on manually searching databases of drug-gene interactions for specific mutations, typing names of mutated genes and drugs into PubMed search boxes, looking up clinical trial opportunities and reading irrelevant scientific publications from a huge pool of cancer literature.
However, the information (not data) required for this process is all out there, neatly packed in databases and XML dumps, accessible through command-line queries in an automated fashion. For instance, the Drug Gene Interaction database combines information from 15 other databases on which drugs might target a certain gene and which functional requirement a mutation has to carry for the drug to work. Repositories of clinical trials can be searched in the same way, e.g., NIH's clinicaltrials.gov. Abstracts and ultimately whole papers can be indexed using text miners that quantify their relevance according to mutational and pharmacological information - just like Google ranks websites according to relevance.
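To make this concrete, here is a minimal sketch in R of how one such lookup - counting PubMed hits for a gene-drug pair - can be scripted instead of typed into a search box. It assumes the jsonlite package and the public NCBI E-utilities endpoint (which may change over time); the pubmed_hits helper and the BRAF/Vemurafenib pair are merely illustrative, not part of any clinic's actual pipeline.

```r
## Minimal sketch: count PubMed abstracts mentioning a gene AND a drug
## via the NCBI E-utilities esearch endpoint (JSON output).
library(jsonlite)

pubmed_hits <- function(gene, drug) {
  base  <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  query <- sprintf("%s?db=pubmed&term=%s+AND+%s&retmode=json",
                   base, URLencode(gene), URLencode(drug))
  res <- fromJSON(query)                 # fetch and parse the JSON reply
  as.integer(res$esearchresult$count)    # number of matching abstracts
}

pubmed_hits("BRAF", "vemurafenib")       # one line instead of one search box
```

The same pattern - build a query string, fetch, parse - applies to drug-gene interaction databases and clinical trial registries, which is exactly what makes these sequential tasks so amenable to automation.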
Do not get me wrong! This will not replace the expertise of the aforementioned team of experts, who make the final decision on therapy suggestions. But it will make their work much more efficient: replacing sequential tasks like querying a database with an algorithm, selecting a set of relevant papers using text mining instead of clicking and skimming texts for some buzzwords, and so on.
This will ultimately lead to better informed and faster decisions taking into account all available sources.
Friday, November 27, 2015
The Four Horsemen of Statistics
When, in the Book of Revelation, the Lamb of God opens the first four of the seven seals on the scroll in God's right hand, four figures emerge on white, red, black and pale horses. The riders represent conquest, war, famine, and death. They are well known by their group name: the Four Horsemen of the Apocalypse.
In our technological and data-driven world virtually everyone needs statistics in one form or another. However, most people are not initiated into the subtleties and dangers of using statistics on real data. The little bit of learning in your undergrad class is in no way adequate. This is why it is extremely important to have a statistician in your enterprise or to be able to fall back on a statistical consultant. But I digress…
In this series of four posts I want to introduce a new concept: the Four Horsemen of Statistics - four concepts or situations where great danger lies ahead for the uninitiated. To keep the audience captivated I will not disclose the list right now, but start with the first Horseman.
The complications of multiple testing probably ruin the credibility of more publications than any other statistical concept. This is most beautifully illustrated by the 2005 essay of John P. A. Ioannidis with the intriguing title «Why most published research findings are false». In examining the causes of errors in research findings, Ioannidis considers multiple testing a major factor.
Though not fully intuitive, multiple testing can be explained in a few words. Whenever you perform a statistical test, you allow for a certain amount of error. Performing additional tests dramatically accumulates this «allowed» error. Unfortunately, this «allowed» error is necessary for the logic of testing; without it there would never be a decision. This error has many names, like significance level, type I error, etc., and is often indicated by the Greek letter α.
To give an indication of the magnitude of this situation: assume that you choose a significance level of α = 5%. Then, for a single test, the chance of a false positive, i.e., of finding something when there is nothing, is 5%. Performing a single additional test at the same level of significance increases the probability of at least one false positive to 9.75%. When we perform 13 tests, the chance of having at least one wrong test result is already close to an overwhelming 50%.
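For readers who like to check the arithmetic, here is a minimal R sketch of the formula behind these numbers, assuming independent tests; the fwer helper is just an illustrative name, not an established function.

```r
## Family-wise error rate for m independent tests at significance level alpha:
## P(at least one false positive) = 1 - (1 - alpha)^m
fwer <- function(m, alpha = 0.05) 1 - (1 - alpha)^m

fwer(1)   # 0.050  -> a single test
fwer(2)   # 0.0975 -> two tests: 9.75%
fwer(13)  # 0.487  -> thirteen tests: close to 50%
```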
Fortunately, there is a solution available for dealing with this issue. It is quite unspectacularly called multiple testing correction. The idea is to adjust the level of significance α so that, across all tests performed, the probability of making at least one false positive decision stays below the intended level. In our above example this would mean that for the two tests performed we do not use 5% as the significance level, but instead divide it by the number of tests, i.e., two, yielding a new α of 2.5%. When we now compute the chance of observing at least one false positive, we get a mere 4.9%, as intended.
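The same can be written as a small R sketch, under the same independence assumption as above (variable names are illustrative; for real p-values R ships this correction as p.adjust with method = "bonferroni"):

```r
## Bonferroni-style correction: divide alpha by the number of tests m,
## so the family-wise error rate stays below the intended level.
alpha <- 0.05
m     <- 2
alpha_corrected <- alpha / m          # 0.025
1 - (1 - alpha_corrected)^m           # 0.0494 -> roughly the intended 5%
```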
The uncomfortable fact remains that, having performed a couple of thousand tests, the corrected level of significance becomes infinitesimal, leaving researchers, desperately trying to find something to publish, with nothing significant at all.
Soon I will present the second Horseman of Statistics, right here…
Wednesday, November 4, 2015
Bar plots are nonsense
I was once again utterly stupefied by the number of bar plots I had to endure during the CC-PM retreat in the beautiful Kartause Ittingen last weekend. Like this one...
| from Beaton et al., 2015, Mol Met, dx.doi.org/10.1016/j.molmet.2015.08.003 |
OK, to be honest, that one is not from last weekend - but there were many like it. If you don't believe me, take a stopwatch and check how long it takes you on Google Scholar to find one of these useless bar plots - usually less than a minute within the life sciences.
What is bad about these plots, you ask? Well, put simply, they couldn't be more misleading. There are several issues with this nonsensical way of representing different samples of measurements, like, for instance, the amount of 14C-glucose per well.
1. Spread/variation of data versus precision of estimation
The general goal of your average PhD student at a scientific conference, retreat or whatever these events might be called nowadays is to show that a group of measurements she has made on a control is less (or more) than a group of measurements she has made on a sample that was in some way perturbed away from being a control - usually coined the treatment. Finally, a difference between two groups of measurements is assessed using a statistical test, for instance a t-test, if your data is really nice, or a Wilcoxon rank-sum test, if your data is kind of naughty. However, it is - at least from a marketing perspective - useful to find a way to visualize your results.
Now, there are two things you might want to show when illustrating a group of observations:
- The spread/variability of the group.
- How well you estimated some kind of summary of the group, e.g., the mean value.
Anyhow, adding either a confidence interval or a standard error to a mean value describes the precision of that estimated mean - it has no descriptive power whatsoever for the distribution of the data, or its variability, or spread!
2. Bar plots cannot show you differences
Let's look at the following example. On the left we have a selection of six groups, each with 20 observations. Clearly, these groups are not the same when we look at the scatter plot. However, when using a bar plot it seems that everything is the same in these groups. Even the standard error bars indicate no difference. Probably, we messed up the experiment or something.
If instead we use the much more useful box plot, we immediately identify different groups. Even more revealing is the use of violin plots, which show the mirrored probability density of the data and as such allow for the identification of bi- or multimodal distributions.
Try it yourself on https://stekhoven.shinyapps.io/barplotNonsense
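For readers who would rather tinker locally than in the Shiny app, here is a minimal ggplot2 sketch along the same lines; the six simulated groups and their distributions are made up for illustration and are not the app's actual data.

```r
## Six groups with (almost) identical means but very different shapes.
library(ggplot2)

set.seed(1)
d <- data.frame(
  group = rep(LETTERS[1:6], each = 20),
  value = c(rnorm(20, 5, 0.2),              # tightly clustered
            rnorm(20, 5, 2),                # widely spread
            runif(20, 3, 7),                # uniform
            c(rnorm(10, 3), rnorm(10, 7)),  # bimodal
            rexp(20) + 4,                   # skewed
            rnorm(20, 5, 1))                # plain normal
)

## Bar plot of means with standard errors: all six groups look the same.
ggplot(d, aes(group, value)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2)

## Violin plus box plot: differences in spread and shape become obvious.
ggplot(d, aes(group, value)) +
  geom_violin() +
  geom_boxplot(width = 0.1)
```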
3. There might be a bright future
I actually have to be honest with you once more: the first chart I found when looking for a way to show the distribution of multiple groups of measurements was this one:
| from Sonay et al., 2015, Genome Res, doi:10.1101/gr.190868.115 |
Not only is this a great way to indicate the differences between multiple groups - using a combination of violin and box plots (well, you have to pay attention!) - but the authors also use Hadley's ggplot2 ... so maybe there is still a bright future ahead! I am convinced of it!
Monday, October 26, 2015
No clouds in the clinical sky
It was amazing to learn that a well-known hospital in Zurich has decided that it will never use any form of cloud infrastructure for bioinformatics or biostatistics computations requiring more-than-desktop-scale computing power. The reason: they are afraid of appearing in the daily newspaper due to leaked data.
While the reason may or may not be a sound one, it is still very questionable how they arrived at that conclusion - certainly not by asking a professional data security engineer or someone who has experience with using cloud infrastructures.
Most irritating about the decision is that it would also include a newly established cloud infrastructure by their own university.
Moreover, such decisions are always said to be made for reasons of security. I wonder what kind of security it is for a patient when his data are not used for a study relating to his rare disease, or - and this is the point here - when a novel method of analysing this patient's molecular predisposition cannot be run in time because the place lacks computational power.
Observing the ever-increasing importance of computational power for genomics-based medicine, it is only a question of time until the aforementioned hospital falls behind its competitors who have less prejudice towards (new) technologies such as SSH tunnels, virtual machines or cloud computing.
Edit a couple of hours later:
Here's an example of a less prejudiced approach towards modern personalized health research - https://www.systemsbiology.org/research/cancer-genomics-cloud/
Friday, March 13, 2015
Own your data ... all the time
Who owns that X-ray of your shoulder made last year? If it exists as a real print-out, you probably have it at home – because they gave it to you with the comment that it belongs to you anyway. Now, how about the digital version? Is your M.D. up to date, using an X-ray machine that directly produces a DICOM file? Was this given to you, perhaps in the form of a DVD? Did you also receive the report the doctor wrote when assessing your shoulder clinically, also using the X-ray? What are the chances that you will have this information easily available next time you are in a different hospital and need treatment for the same shoulder?
Your medical data belong to you. You pay for their generation and you are their source. Your data as a whole are very valuable. However, the vast majority of them land in storage, stowed away, difficult to access for the individual and not available for research and the development of better diagnostic and therapeutic procedures. By sharing our health data (see my last post, Share or Perish), medical science would be able to gain more insight and thus improve prevention, treatment, and healing. Since it is your data in the first place, why not be rewarded for sharing your property with researchers, the government or a corporation?
What we need are individual electronic medical accounts for each person – not unlike a bank account. You are (probably) the only person who can access your money and send it to somewhere or someone else. If you arrive at a different hospital and the M.D. needs the X-ray and the notes from your private physician or investigator then you grant her access to your medical account – generally or specifically for the data she needs.
But how can we ensure that our data is safe and will only be used in circumstances of which we approve? The answer is simple and very Swiss: the entity governing your data has to be a cooperative – owned by its users – located in Switzerland, with its stringent privacy regulations and its globally trusted way of governance.
The first ever health platform of this type has been established recently and is called healthbank. It is a cooperative and strives for the goals sketched above.
Does it need to be global? Yes, because to get enough data for meaningful research as many people as possible need to participate. Read in the next post to learn why we need such large numbers!
Tuesday, February 24, 2015
Share or Perish
I first heard the phrase «publish or perish» from my Dad when I started my PhD. Its projection, using the word «share», onto the problem of high-dimensional data analysis in the context of human genomics came very intuitively ... agreed, after some previous thinking about the problem unrelated to writing this post.
The fact that a large quantity of data is required in order to understand relationships within genomic data is becoming increasingly well known. Nevertheless, most people (or institutions) draw the wrong conclusion from this: they produce more and more data, since technological advances enable them to, and store it in isolated silos. Often they do so while wasting precious time on collecting enough data instead of analyzing the proper amount.
The correct conclusion would be to realize that the public interest is served best if researchers around the globe coordinated their efforts to make genomic data available to one another. Moreover, these data need to be interoperable and thus adhere to certain standards, such that the overhead of using the data can be kept as small as possible.
Coalitions like the Global Alliance for Genomics and Health or nationally funded resources like ClinGen are already sparking this idea and even going several steps further by suggesting standards for bioinformatics and biostatistics methods as well.
The ongoing reality of centralized yet isolated databases shooting out of the ground like mushrooms, never reaching a critical mass on their own, is a waste of money! Everyone needs to open their mind to sharing data in order to reach insights and discover patterns obscured by the traditional way of data hoarding.