In March, 2016, Morgan Kaufmann, an Elsevier imprint, published my book Data Simplification: Taming Information With Open Source Tools. Those of you who are computer-oriented know that data analysis typically takes much less time and effort than data preparation. Moreover, if you make a mistake in your data analysis, you can often just repeat the process, using different tools or a fresh approach to your original question. As long as the data is prepared properly, you and your colleagues can re-analyze your data to your heart's content. Contrariwise, if your data is not prepared in a manner that supports sensible analysis, there's little you can do to extricate yourself from the situation. For this reason, data preparation is, in my experience, much more important than data analysis.
Throughout my career, I've relied on simple open source utilities and short scripts to simplify my data, producing products that were self-explanatory, permanent, and that could be merged with other types of data. Hence, my book.
Data Simplification: Taming Information With Open Source Tools
Publisher: Morgan Kaufmann; 1 edition (March 23, 2016)
Paperback: 398 pages
Dimensions: 7.5 x 9.2 inches
Chapter 1, The Simple Life, explores the thesis that complexity is the rate-limiting factor in human development. The greatest advances in human civilization and the most dramatic evolutionary improvements in all living organisms have followed the acquisition of methods that reduce or eliminate complexity.
Chapter 2, Structuring Text, reminds us that most of the data on the Web today is unstructured text, produced by individuals, trying their best to communicate with one another. Data simplification often begins with textual data. This chapter provides readers with tools and strategies for imposing some basic structure on free-text.
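Imposing basic structure on free text can be as simple as a few lines of Python. Here is a minimal sketch (my own illustration, not code from the book); the sentence-splitting pattern is deliberately naive and ignores complications such as abbreviations:

```python
import re

def structure_text(free_text):
    """Impose minimal structure on free text: split it into sentences
    and tag each sentence with an ordinal identifier."""
    # Split at whitespace that follows sentence-ending punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', free_text.strip())
    return [{"id": i, "sentence": s} for i, s in enumerate(sentences, 1)]

records = structure_text("Data pours in. It must be structured! Can we cope?")
```

Each record can then carry metadata (source, date, annotations) alongside the raw sentence, which is the first step toward the structured data the chapter describes.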
Chapter 3, Indexing Text, describes the often undervalued benefits of indexes. An index, aided by proper annotation of data, permits us to understand data in ways that were not anticipated when the original content was collected. With the use of computers, multiple indexes designed for different purposes can be created for a single document or data set. As data accrues, indexes can be updated. When data sets are combined, their respective indexes can be merged. A good way of thinking about indexes is that the document contains all of the complexity; the index contains all of the simplicity. Data scientists who understand how to create and use indexes will be in the best position to search, retrieve, and analyze textual data. Methods are provided for automatically creating customized indexes designed for specific analytic pursuits and for binding index terms to standard nomenclatures.
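The idea that an index can be built and updated computationally can be sketched in a few lines of Python (an illustration of the concept, not code taken from the book):

```python
from collections import defaultdict

def build_index(text):
    """Build a simple inverted index: each term maps to the list of
    line numbers on which it appears."""
    index = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), 1):
        for word in line.lower().split():
            term = word.strip('.,;:!?"')
            if term and lineno not in index[term]:
                index[term].append(lineno)
    return dict(index)

doc = ("Indexes simplify documents.\n"
       "Documents contain complexity.\n"
       "Indexes contain simplicity.")
idx = build_index(doc)
```

The index holds all of the simplicity: a lookup such as `idx["indexes"]` returns the locations of a term instantly, and two such dictionaries can be merged when data sets are combined.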
Chapter 4, Understanding Your Data, describes how data can be quickly assessed, prior to formal quantitative analysis, to develop some insight into what the data means. A few simple visualization tricks and simple statistical descriptors can greatly enhance a data scientist's understanding of complex and large data sets. Various types of data objects, such as text files, images, and time-series data, can be profiled with a summary signature that captures the key features that contribute to the behavior and content of the data object. Such profiles can be used to find relationships among different data objects, or to determine when data objects are not closely related to one another.
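A minimal data profile of the kind described here can be computed with the Python standard library alone (the sample values are invented for illustration; values lying more than two standard deviations from the mean are flagged as outliers):

```python
import statistics

def profile(values):
    """Quick pre-analysis profile: size, range, central tendency,
    spread, and any values far from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    outliers = [v for v in values if abs(v - mean) > 2 * stdev]
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": round(mean, 2),
        "stdev": round(stdev, 2),
        "outliers": outliers,
    }

summary = profile([4, 5, 5, 6, 5, 4, 6, 5, 40])
```

A glance at such a summary signature, before any formal analysis, immediately reveals the anomalous value that would otherwise distort later computations.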
Chapter 5, Identifying and Deidentifying Data, tackles one of the most under-appreciated and least understood issues in data science. Measurements, annotations, properties, and classes of information have no informational meaning unless they are attached to an identifier that distinguishes one data object from all other data objects, and that links together all of the information that has been or will be associated with the identified data object. The method of identification and the selection of objects and classes to be identified relates fundamentally to the organizational model of complex data. If the simplifying step of data identification is ignored or implemented improperly, data cannot be shared, and conclusions drawn from the data cannot be believed. All well-designed information systems are, at their heart, identification systems: ways of naming data objects so that they can be retrieved. Only well-identified data can be usefully deidentified. This chapter discusses methods for identifying data and deidentifying data.
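The pairing of identification and deidentification can be sketched with standard Python tools: a UUID serves as the unique identifier, and a one-way hash of that identifier serves as a deidentifier that cannot be reversed, but can be regenerated by anyone holding the original. (The record fields here are hypothetical, for illustration only.)

```python
import hashlib
import uuid

# Assign a unique, permanent identifier to a data object.
object_id = str(uuid.uuid4())

# A one-way hash of the identifier: the original cannot be recovered
# from it, but the holder of object_id can always recompute it.
deidentified = hashlib.sha256(object_id.encode("utf-8")).hexdigest()

record = {"id": object_id, "deidentified_id": deidentified, "value": 42}
```

Because the hash is deterministic, all records deidentified from the same object carry the same deidentified_id and can still be linked to one another, which is what makes well-identified data usefully deidentifiable.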
Chapter 6, Giving Meaning to Data, explores the meaning of meaning, as it applies to computer science. We shall learn that data, by itself, has no meaning. It is the job of the data scientist to assign meaning to data, and this is done with data objects, triples, and classifications (see Glossary items, Data object, Triple, Classification, Ontology). Unfortunately, coursework in the information sciences often omits discussion of the critical issue of "data meaning," advancing from data collection to data analysis without stopping to design data objects whose relationships to other data objects are defined and discoverable. In this chapter, readers will learn how to prepare and classify meaningful data.
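The triple, the unit of meaningful data described here, can be represented directly in Python (an illustration; the identifiers and values below are invented):

```python
# Each assertion about a data object is a triple:
# (identified object, metadata tag, value).
triples = [
    ("uuid-1001", "is_a", "patient"),      # class assignment
    ("uuid-1001", "age", 42),              # a measurement
    ("uuid-1001", "diagnosis", "anemia"),  # an annotation
]

def facts_about(subject, store):
    """Collect everything asserted about one identified object."""
    return {pred: obj for subj, pred, obj in store if subj == subject}

meaning = facts_about("uuid-1001", triples)
```

Because every assertion is bound to an identifier, triples from different sources can be pooled and the full description of any object reassembled on demand, which is precisely how meaning survives data merging.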
Chapter 7, Object-Oriented Data, shows how we can understand data using a few elegant computational principles. Modern programming languages, particularly object-oriented programming languages, use introspective data (i.e., the data with which data objects describe themselves) to modify the execution of a program at run-time; an elegant process known as reflection. Using introspection and reflection, programs can integrate data objects with related data objects. The implementations of introspection, reflection, and integration are among the most important achievements in the field of computer science.
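Introspection and reflection, as described here, look like this in Python (a toy class of my own, not the book's code):

```python
class DataObject:
    """A self-describing data object."""
    def __init__(self, identifier, value):
        self.identifier = identifier
        self.value = value

    def describe(self):
        # Introspection: the object reports its own class and attributes.
        return {
            "class": type(self).__name__,
            "attributes": sorted(vars(self)),
        }

obj = DataObject("uuid-2002", 3.14)
description = obj.describe()

# Reflection: fetch an attribute chosen at run-time, by name.
attr_name = "identifier"
fetched = getattr(obj, attr_name)
```

A program that can ask a data object what it is, and act on the answer at run-time, can integrate that object with related objects it has never seen before.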
Chapter 8, Problem Simplification, demonstrates that it is just as important to simplify problems as it is to simplify data. This final chapter provides simple but powerful methods for analyzing data, without resorting to advanced mathematical techniques. The use of random number generators to simulate the behavior of systems, and the application of Monte Carlo, resampling, and permutative methods to a wide variety of common problems in data analysis, will be discussed. The importance of data reanalysis, following preliminary analysis, is emphasized.
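A classic example of the chapter's approach, using a random number generator to simulate a system, is the Monte Carlo estimate of pi: the fraction of random points in the unit square that fall inside the quarter circle approaches pi/4. The sketch below is illustrative, not from the book:

```python
import random

random.seed(1)  # fixed seed, for a reproducible run

def estimate_pi(trials):
    """Monte Carlo estimate of pi from random points in the unit square."""
    hits = sum(
        1
        for _ in range(trials)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return 4.0 * hits / trials

pi_estimate = estimate_pi(100_000)
```

No calculus is required; the same simulate-and-count strategy extends to resampling and permutation tests on real data sets.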
TABLE OF CONTENTS

Chapter 0. Preface
References for Preface
Glossary for Preface

Chapter 1. The Simple Life
Section 1.1. Simplification drives scientific progress
Section 1.2. The human mind is a simplifying machine
Section 1.3. Simplification in Nature
Section 1.4. The Complexity Barrier
Section 1.5. Getting ready
Open Source Tools for Chapter 1: Perl; Python; Ruby; Text Editors; OpenOffice; Command line utilities; Cygwin, Linux emulation for Windows; DOS batch scripts; Linux bash scripts; Interactive line interpreters; Package installers; System calls
References for Chapter 1
Glossary for Chapter 1

Chapter 2. Structuring Text
Section 2.1. The Meaninglessness of free text
Section 2.2. Sorting text, the impossible dream
Section 2.3. Sentence Parsing
Section 2.4. Abbreviations
Section 2.5. Annotation and the simple science of metadata
Section 2.6. Specifications Good, Standards Bad
Open Source Tools for Chapter 2: ASCII; Regular expressions; Format commands; Converting non-printable files to plain-text; Dublin Core
References for Chapter 2
Glossary for Chapter 2

Chapter 3. Indexing Text
Section 3.1. How Data Scientists Use Indexes
Section 3.2. Concordances and Indexed Lists
Section 3.3. Term Extraction and Simple Indexes
Section 3.4. Autoencoding and Indexing with Nomenclatures
Section 3.5. Computational Operations on Indexes
Open Source Tools for Chapter 3: Word lists; Doublet lists; Ngram lists
References for Chapter 3
Glossary for Chapter 3

Chapter 4. Understanding Your Data
Section 4.1. Ranges and Outliers
Section 4.2. Simple Statistical Descriptors
Section 4.3. Retrieving Image Information
Section 4.4. Data Profiling
Section 4.5. Reducing data
Open Source Tools for Chapter 4: Gnuplot; MatPlotLib; R, for statistical programming; Numpy; Scipy; ImageMagick; Displaying equations in LaTex; Normalized compression distance; Pearson's correlation; The ridiculously simple dot product
References for Chapter 4
Glossary for Chapter 4

Chapter 5. Identifying and Deidentifying Data
Section 5.1. Unique Identifiers
Section 5.2. Poor Identifiers, Horrific Consequences
Section 5.3. Deidentifiers and Reidentifiers
Section 5.4. Data Scrubbing
Section 5.5. Data Encryption and Authentication
Section 5.6. Timestamps, Signatures, and Event Identifiers
Open Source Tools for Chapter 5: Pseudorandom number generators; UUID; Encryption and decryption with OpenSSL; One-way hash implementations; Steganography
References for Chapter 5
Glossary for Chapter 5

Chapter 6. Giving Meaning to Data
Section 6.1. Meaning and Triples
Section 6.2. Driving Down Complexity with Classifications
Section 6.3. Driving Up Complexity with Ontologies
Section 6.4. The unreasonable effectiveness of classifications
Section 6.5. Properties that Cross Multiple Classes
Open Source Tools for Chapter 6: Syntax for triples; RDF Schema; RDF parsers; Visualizing class relationships
References for Chapter 6
Glossary for Chapter 6

Chapter 7. Object-oriented data
Section 7.1. The Importance of Self-explaining Data
Section 7.2. Introspection and Reflection
Section 7.3. Object-Oriented Data Objects
Section 7.4. Working with Object-Oriented Data
Open Source Tools for Chapter 7: Persistent data; SQLite databases
References for Chapter 7
Glossary for Chapter 7

Chapter 8. Problem simplification
Section 8.1. Random numbers
Section 8.2. Monte Carlo Simulations
Section 8.3. Resampling and Permutating
Section 8.4. Verification, Validation, and Reanalysis
Section 8.5. Data Permanence and Data Immutability
Open Source Tools for Chapter 8: Burrows Wheeler transform; Winnowing and chaffing
References for Chapter 8
Glossary for Chapter 8
In March, 2015, Morgan Kaufmann (an Elsevier imprint) published Repurposing Legacy Data: Innovative Case Studies (Computer Science Reviews and Trends).
Repurposing Legacy Data: Innovative Case Studies takes a look at how data scientists repurpose legacy data, whether their own, or legacy data that has been donated to the public domain. The case studies in this book, taken from such diverse fields as cosmology, quantum physics, high-energy physics, microbiology, and medicine, all serve to demonstrate the value of old data.
Data repurposing involves taking preexisting data and performing any of the following:
Most of the listed types of data repurposing efforts are self-explanatory, and all of them will be followed by examples throughout this book.
Table of Contents: Repurposing Legacy Data: Innovative Case Studies
Chapter 1. Introduction
Section 1.1. Why bother?
Section 1.2. What is data repurposing?
Section 1.3. Data worth preserving
Section 1.4. Basic data repurposing tools
Section 1.5. Personal attributes of data repurposers

Chapter 2. Learning from the masters
Section 2.1. New physics from old data
Case Study: Sky charts
Case Study: From Hydrogen spectrum data to quantum mechanics
Section 2.2. Repurposing the physical and abstract property of uniqueness
Case Study: Fingerprints; from personal identifier to data-driven forensics
Section 2.3. Repurposing a 2000 year old classification
Case Study: A dolphin is not a fish
Case Study: The molecular stopwatch
Section 2.4. Decoding the past
Case Study: Mayan glyphs: finding a lost civilization
Section 2.5. What makes data useful for repurposing projects?
Case Study: CODIS
Case Study: Zip codes: From postal code to demographic keystone
Case Study: The classification of living organisms

Chapter 3. Dealing with text
Section 3.1. Thus it is written
Case Study: New associations in old medical literature
Case Study: Ngram analyses of large bodies of text
Section 3.2. Search and retrieval
Case Study: Sentence extraction
Case Study: Term extraction
Section 3.3. Indexing text
Case Study: Creating an index
Case Study: Ranking search results
Section 3.4. Coding text

Chapter 4. New Life for Old data
Section 4.1. New algorithms
Case Study: Lunar Orbiter Image Recovery Project
Case Study: Finding new planets from old data
Case Study: New but ultimately wrong: market prediction
Case Study: Choosing better metrics
Section 4.2. Taking closer looks
Case Study: Expertise is no substitute for data analysis
Case Study: Life on mars
Case Study: The number of human chromosomes
Case Study: Linking global warming to high-intensity hurricanes
Case Study: Inferring climate trends with geologic data
Case Study: Old tidal data, and the iceberg that sank the Titanic

Chapter 5. The purpose of data analysis is to enable data reanalysis
Section 5.1. Every initial data analysis on complex data sets is flawed
Case Study: Sample mix-ups, a pre-analytic error
Case Study: Vanishing exoplanets: an analytic error
Case Study: The new life form that wasn't: a post-analytic error
Section 5.2. Unrepeatability of complex analyses
Case Study: Reality is more complex than we can ever know
Case Study: First biomarker publications are seldom reproducible
Section 5.3. Obligation to verify and validate
Case Study: Reanalysis clarifies earlier findings
Case Study: It is never too late to review old data
Case Study: Vindication through reanalysis
Case Study: Reanalysis bias
Section 5.4. Asking what the data really means
Case Study: Repurposing the logarithmic tables
Case Study: Multimodality in legacy data
Case Study: The significance of narrow data ranges

Chapter 6. Dark Legacy: Making sense of someone else's data
Section 6.1. Excavating treasures from lost and abandoned data mines
Case Study: Re-analysis of old JADE collider data
Case Study: 36 year old satellite resurrected
Section 6.2. Nonstandard standards
Case Study: Standardizing the chocolate teapot
Section 6.3. Specifications, not standards
Case Study: The fungibility of data standards
Section 6.4. Classifications and Ontologies
Case Study: An upper level ontology
Case Study: Population sampling by class
Section 6.5. Identity and uniqueness
Case Study: Faulty identifiers
Case Study: Timestamping data
Section 6.6. When to terminate (or reconsider) a data repurposing project
Case Study: Nonsensical Mayan glyphs
Case Study: Flattened data

Chapter 7. Social and Economic Issues
Section 7.1. Data sharing and reproducible research
Case Study: Sharing public-funded research data with the public
Section 7.2. Acquiring and storing data
Case Study: What is protected data?
Section 7.3. Keeping your data forever
Case Study: A back-up that back-fired
Section 7.4. Data immutability
Case Study: Retrospective deletion of data
Case Study: Reality tampering
Case Study: The dubious right to be forgotten
Section 7.5. Privacy and Confidentiality
Case Study: Text scrubber
Section 7.6. The economics of data repurposing
Case Study: Avoiding the time loop
Case Study: Bargain basement cancer research

Appendix A. Index of Case Studies
Appendix B. Glossary
Author Biography
From the Preface of Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases, by Jules Berman
For a few decades now, I have been interested in writing a book that treats the rare diseases as a separate specialty within medicine. Most of my colleagues were not particularly receptive to the idea. Here is a sample of their advice, paraphrased: "Don't waste your time on the rare diseases. There are about 7,000 rare diseases that are known to modern medicine. The busiest physician, over the length of a long career, will encounter only a tiny fraction of the total number of rare diseases. Surely, an attempt to learn them all would be silly; an exercise in purposeless scholarship. Furthermore, each rare disease accounts for so few people, it is impractical to devote much research funding to these medical outliers. To get the most bang for our bucks, we should concentrate our research efforts on the most common diseases: heart disease, cancer, diabetes, Alzheimer's disease, and so on."
Other colleagues questioned whether rare diseases are a legitimate area of study: "Rare diseases do not comprise a biologically meaningful class of diseases. They are simply an arbitrary construction, differing from common diseases by a numeric accident. A disease does not become scientifically interesting just by being rare." For many of my colleagues, the rare diseases are merely statistical outliers, best ignored.
In biology, there are no outliers; no circumstances that are rare enough to be ignored. Every disease, no matter how rare, operates under the same biological principles that pertain to common diseases. In 1657, William Harvey, the noted physiologist, wrote: "Nature is nowhere accustomed more openly to display her secret mysteries than in cases where she shows tracings of her workings apart from the beaten paths; nor is there any better way to advance the proper practice of medicine than to give our minds to the discovery of the usual law of nature, by careful investigation of cases of rarer forms of disease".
We shall see that the rare diseases are much simpler, genetically, than the common diseases. The rare disease can be conceived as controlled experiments of nature, in which everything is identical in the diseased and the normal organisms, except for one single factor that is the root cause of the ensuing disease. By studying the rare diseases, we can begin to piece together the more complex parts of common diseases.
The book has five large themes that emerge, in one form or another, in every chapter.
Today, there is no recognized field of medicine devoted to the study of rare diseases; but there should be.
Content and Organization of the Book
There are three parts to the book. In Part I (Understanding the Problem), we discuss the differences between the rare and the common diseases, and why it is crucial to understand these differences. To stir your interest, here are just a few of the most striking differences: 1) Most of the rare diseases occur in early childhood, while most of the common diseases occur in adulthood; 2) The genetic determinants of most rare diseases have a simple Mendelian pattern, dependent on whether the disease trait occurs in the father, or mother, or both. Genetic influences in the common diseases seldom display Mendelian inheritance; 3) Rare diseases often occur as syndromes involving multiple organs through seemingly unrelated pathological processes. Common diseases usually involve a single organ, or multiple organs affected by a common pathologic process.
The most common pathological conditions of humans are aging, metabolic diseases (including diabetes, hypertension, and obesity), diseases of the heart and vessels, infectious diseases, and cancer. Each of these disorders is characterized by pathologic processes that bear some relation to the processes that operate in rare diseases. In Part II (Rare lessons for Common Diseases), we discuss the rare diseases that have helped us understand the common diseases. Emphasis is placed on the enormous value of rare disease research. We begin to ask and answer some of the fundamental questions raised in Part I. Specifically, how is it possible for two diseases to share the same pathologic mechanisms without sharing similar genetic alterations? Why are the common diseases often caused, in no small part, by environmental (i.e., non-genetic) influences, while the rare disease counterparts are driven by single genetic flaws? Why are the rare diseases often syndromic (i.e., involving multiple organs with multiple types of abnormalities and dysfunctions), while the so-called complex common diseases often manifest in a single pathological process? In Part II, we will discuss a variety of pathologic mechanisms that apply to classes of rare diseases. We will also see how these same mechanisms operate in the common diseases. We will explore the relationship between genotype and phenotype, and we will address one of the most important questions in modern disease biology: "How is it possible that complex and variable disease genotypes operating in unique individuals will converge to produce one disease with the same biological features from individual to individual?"
In Part III (Fundamental Relationships Between Rare and Common Diseases), we answer the as-yet unanswered questions from Part I, plus the new questions raised in Part II. The reasons why rare diseases are different from common diseases are explained. The convergence of pathologic mechanisms and clinical outcome observed in rare diseases and common diseases, as it relates to the prevention, diagnosis, and treatment of both types of diseases, is described in detail.
The book includes a scientific rationale for funding research in the rare diseases. Currently, there is a vigorous lobbying effort, launched by coalitions of rare disease organizations, to attract research funding and donations. Funding for the rare diseases has always been small, relative to the common diseases. Funding agencies find it impractical to devote large portions of their research budget to the rare diseases, while so many people are suffering from the common diseases. As it turns out, direct funding of the common diseases has not been particularly cost-effective. It is time for funders to re-evaluate their goals and priorities.
Laypersons advocating for rare disease research almost always appeal to our charitable instincts, hoping that prospective donors will respond to the plight of a few individuals. Readers will learn that such supplications are unnecessary and misdirect attention from more practical arguments. When rare diseases are funded, everyone benefits. We will see that it is much easier to find effective targeted treatments for the rare diseases than for common diseases. Furthermore, treatments that are effective against rare diseases will almost always find a place in the treatment of one or more common diseases. This assertion is not based on wishful thinking, and is not based on extrapolation from a few past triumphs wherein some treatment overlap has been found in rare and common diseases. The assertion is based on the observation that rare diseases encapsulate the many biological pathways that drive, in the aggregate, our common diseases. This simple theme is described and justified throughout the book. A better approach might be to increase funding for the rare diseases with the primary goal of curing the common diseases. The final chapter of this book discusses promising new approaches to rare disease research.
Table of Contents: Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases
Foreword
Preface

Part I. Understanding the Problem
Chapter 1. What are the Rare Diseases, and Why do we Care?
1.1. The Definition of Rare Disease
1.2. Remarkable Progress in the Rare Diseases
Chapter 2. What are the Common Diseases?
2.1. The Common Diseases of Humans, a Short but Terrifying List
2.2. The Recent Decline in Progress Against Common Diseases
2.3. Why Medical Scientists have failed to Eradicate the Common Diseases
Chapter 3. Six Observations to Ponder while Reading this Book
3.1. Rare Diseases are Biologically Different from Common Diseases
3.2. Common Diseases Typically Occur in Adults; Rare Diseases are Often Diseases of Childhood
3.3. Rare diseases Usually have a Mendelian Pattern of Inheritance; Common diseases are non-Mendelian
3.4. Rare Diseases Often Occur as Syndromes, Involving Several Organs or Physiologic Systems, Often in Surprising Ways. Common Diseases are Typically Non-syndromic
3.5. Environmental Factors Play a Major Causative Role in the Common Diseases; Less so in the Inherited Rare Diseases
3.6. The Difference in Rates of Occurrence of the Rare Diseases Compared with the Common Diseases is Profound, Often on the Order of a Thousand-fold
3.7. There are Many More Rare Diseases than there are Common Diseases

Part II. Rare lessons for Common Diseases
Chapter 4. Aging
4.1. Normal Patterns of Aging
4.2. Aging and Immortality
4.3. Premature Aging Disorders
4.4. Aging as a Disease of Non-renewable Cells
Chapter 5. Diseases of the Heart and Vessels
5.1. Heart Attacks
5.2. Rare Desmosome-based Cardiomyopathies
5.3. Sudden Death and Rare Diseases Hidden in Unexplained Clinical Events
5.4. Hypertension and Obesity: Quantitative Traits with Cardiovascular Co-morbidities
Chapter 6. Infectious Diseases and Immune Deficiencies
6.1. The Burden of Infectious Diseases in Humans
6.2. Biological Taxonomy: Where Rare Infectious Diseases Mingle with the Common Infectious Diseases
6.3. Biological Properties of the Rare Infectious Diseases
6.4. Rare Diseases of Unknown Etiology
6.5. Fungi as a Model Infectious Organism Causing Rare Diseases
Chapter 7. Diseases of Immunity
7.1. Immune Status and the Clinical Expression of Infectious Diseases
7.2. Autoimmune Disorders
Chapter 8. Cancer
8.1. Rare Cancers are Fundamentally Different from Common Cancers
8.2. The Dichotomous Development of Rare Cancers and Common Cancers
8.3. The Genetics of Rare Cancers and Common Cancers
8.4. Using Rare Diseases to Understand Carcinogenesis

Part III. Fundamental Relationships Between Rare and Common Diseases
Chapter 9. Causation and the Limits of Modern Genetics
9.1. The Inadequate Meaning of Biological Causation
9.2. The Complexity of the So-called Monogenic Rare Diseases
9.3. One Monogenic Disorder, Many Genes
9.4. Gene Variation and The Limits of Pharmacogenetics
9.5. Environmental Phenocopies of Rare Diseases
Chapter 10. Pathogenesis; Causation's Shadow
10.1. The Mystery of Tissue Specificity
10.2. Cell Regulation and Epigenomics
10.3. Disease Phenotype
10.4. Dissecting Pathways Using Rare Diseases
10.5. Precursor Lesions and Disease Progression
Chapter 11. Rare Diseases and Common Diseases: Understanding their Fundamental Differences
11.1. Review of the Fundamentals in Light of the Incidentals
11.2. A Trip to Monte Carlo: How Normal Variants Express a Disease Phenotype
11.3. Associating Genes with Common Diseases
11.4. Mutation versus Variation
Chapter 12. Rare Diseases and Common Diseases: Understanding Their Relationships
12.1. Shared Genes
12.2. Shared Phenotypes
Chapter 13. Shared Benefits
13.1. Shared Prevention
13.2. Shared Diagnostics
13.3. Shared Cures
Chapter 14. Conclusion
14.1. Progress in the Rare Diseases: Social and Political Issues
14.2. Smarter Clinical Trials
14.3. For the Common Diseases, Animals are Poor Substitutes for Humans
14.4. Hubris

Appendix I. List of Genes Causing More Than One Disease
Appendix II. Rules, Some of Which are Always True, and All of Which Are Sometimes True
Glossary
In April, 2014, Amazon published Armchair Science: No Experiments, Just Deduction, as a Kindle eBook.
The book develops the premise that science is not a collection of facts; science is what we can induce from facts. By observing the night sky, without the aid of telescopes, we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist. Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old. Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common.
Armchair Science is written for general readers who are curious about science, and who want to sharpen their deductive skills.
In Armchair Science, the reader is confronted with 129 scientific mysteries, in the fields of cosmology, particle physics, chemistry, biology, and medicine. Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining.
Each scientific mystery begins with an observation (i.e., the "clue" to solving the mystery), then a deduction based on the clue, then a step-by-step analysis that traces how the deduction was obtained.
I hope that the readers of this blog will visit the book at its Amazon site and read the "look inside" pages.
Additional information on this book is available at: http://www.julesberman.info/armchair.htm
Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, is available now in ebook and print versions.
Amazon has created a good look-inside, so you can get a very good sampling of the contents, should you decide to purchase a copy.
Also, viewers of this web site can use the password "MKFRIEND" (case sensitive) at the Elsevier order site to get a 30% discount plus free shipping, for my Big Data book.
From the Preface of Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information
Data pours into millions of computers every moment of every day. It is estimated that the total accumulated data stored on computers worldwide is about 300 exabytes (that's 300 billion gigabytes). Data storage increases at about 28% per year. The data stored is peanuts compared to data that is transmitted without storage. The annual transmission of data is estimated at about 1.9 zettabytes (1,900 billion gigabytes, see Glossary item, Binary sizes). From this growing tangle of digital information, the next generation of data resources will emerge.
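The unit conversions quoted above are easy to verify with a little arithmetic (decimal SI prefixes are assumed here; binary sizes, as the Glossary notes, differ slightly):

```python
# Decimal SI prefixes.
GIGA = 10 ** 9
EXA = 10 ** 18
ZETTA = 10 ** 21

stored_bytes = 300 * EXA          # 300 exabytes stored worldwide
transmitted_bytes = 1.9 * ZETTA   # 1.9 zettabytes transmitted annually

stored_in_gb = stored_bytes / GIGA            # 300 billion gigabytes
transmitted_in_gb = transmitted_bytes / GIGA  # 1,900 billion gigabytes
```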
As we broaden our data reach (i.e., the different kinds of data objects included in the resource) and our data time-line (i.e., accruing data from the future and the deep past), we need to find ways to fully describe each piece of data, so that we do not confuse one data item with another, and so that we can search and retrieve data items when we need them. Astute informaticians understand that if we were to fully describe everything in our universe, we would need an ancillary universe to hold all the information, and the ancillary universe would need to be much, much larger than our physical universe.
In the rush to acquire and analyze data, it is easy to overlook the topic of data preparation. If the data in our Big Data resources (see Glossary item, Big Data resource) are not well-organized, comprehensive, and fully described, then the resources will have no value. The primary purpose of this book is to explain the principles upon which serious Big Data resources are built. All of the data held in Big Data resources must have a form that supports search, retrieval and analysis. The analytic methods must be available for review, and the analytic results must be available for validation.
Perhaps the greatest potential benefit of Big Data is its ability to link seemingly disparate disciplines, to develop and test hypotheses that cannot be approached within a single knowledge domain. Methods by which analysts can navigate different Big Data resources to create new, merged data sets will be reviewed.
What, exactly, is Big Data? Big Data is characterized by the three v's: volume (large amounts of data), variety (includes different types of data), and velocity (constantly accumulating new data). Those of us who have worked on Big Data projects might suggest throwing a few more v's into the mix: vision (having a purpose and a plan), verification (ensuring that the data conforms to a set of specifications), and validation (checking that its purpose is fulfilled; see Glossary item, Validation).
Many of the fundamental principles of Big Data organization have been described in the "metadata" literature. This literature deals with the formalisms of data description (i.e., how to describe data), the syntax of data description (e.g., markup languages such as eXtensible Markup Language, XML), semantics (i.e., how to make computer-parsable statements that convey meaning), the syntax of semantics (e.g., framework specifications such as Resource Description Framework, RDF, and Web Ontology Language, OWL), the creation of data objects that hold data values and self-descriptive information, and the deployment of ontologies, hierarchical class systems whose members are data objects (see Glossary items, Specification, Semantics, Ontology, RDF, XML).
The field of metadata may seem like a complete waste of time to professionals who have succeeded very well in data-intensive fields without resorting to metadata formalisms. Many computer scientists, statisticians, database managers, and network specialists have no trouble handling large amounts of data, and they may not see the need to create a strange new data model for Big Data resources. They might feel that all they really need is greater storage capacity, distributed over more powerful computers that work in parallel with one another. With this kind of computational power, they can store, retrieve, and analyze larger and larger quantities of data. These fantasies apply only to systems that use relatively simple data or data that can be represented in a uniform and standard format. When data is highly complex and diverse, as found in Big Data resources, the importance of metadata looms large. Metadata will be discussed, with a focus on those concepts that must be incorporated into the organization of Big Data resources. The emphasis will be on explaining the relevance and necessity of these concepts, without going into gritty details that are well-covered in the metadata literature.
When data originates from many different sources, arrives in many different forms, grows in size, changes its values, and extends into the past and the future, the game shifts from data computation to data management. I hope that this book will persuade readers that faster, more powerful computers are nice to have, but these devices cannot compensate for deficiencies in data preparation. For the foreseeable future, universities, federal agencies, and corporations will pour money, time and manpower into Big Data efforts. If they ignore the fundamentals, their projects are likely to fail. On the other hand, if they pay attention to Big Data fundamentals, they will discover that Big Data analyses can be performed on standard computers. The simple lesson, that data trumps computation, will be repeated throughout this book, in examples drawn from well-documented events.
There are three crucial topics related to data preparation that are omitted from virtually every other Big Data book: identifiers, immutability, and introspection.
A thoughtful identifier system ensures that all of the data related to a particular data object will be attached to the correct object, through its identifier, and to no other object. It seems simple, and it is, but many Big Data resources assign identifiers promiscuously, with the end result that information related to a unique object is scattered throughout the resource, attached to other objects, and cannot be sensibly retrieved when needed. The concept of object identification is of such overriding importance that a Big Data resource can be usefully envisioned as a collection of unique identifiers to which complex data is attached. Data identifiers are discussed in Chapter 2.
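The "collection of unique identifiers to which complex data is attached" can be sketched in a few lines of Python. This is only an illustration of the principle, not a production design; the class and method names are invented here.

```python
import uuid

class Resource:
    """A toy Big Data resource: unique identifiers with data attached."""

    def __init__(self):
        self._objects = {}  # identifier -> list of attached records

    def new_object(self):
        # Each data object receives one permanent, unique identifier.
        identifier = str(uuid.uuid4())
        self._objects[identifier] = []
        return identifier

    def attach(self, identifier, record):
        # Data can only be attached through an existing identifier,
        # so it cannot drift onto the wrong object.
        if identifier not in self._objects:
            raise KeyError("unknown identifier: " + identifier)
        self._objects[identifier].append(record)

    def retrieve(self, identifier):
        return list(self._objects[identifier])

resource = Resource()
patient = resource.new_object()
resource.attach(patient, {"test": "glucose", "value": 95})
resource.attach(patient, {"test": "glucose", "value": 110})
print(resource.retrieve(patient))
```

Everything known about the object is reachable through its one identifier, and nothing attached to it can be confused with data belonging to any other object.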
Immutability is the principle that data collected in a Big Data resource is permanent, and can never be modified. At first thought, it would seem that immutability is a ridiculous and impossible constraint. In the real world, mistakes are made, information changes, and the methods for describing information change. This is all true, but the astute Big Data manager knows how to accrue information into data objects without changing the pre-existing data. Methods for achieving this seemingly impossible trick are described in detail in Chapter 6.
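One common way to square immutability with a changing world, sketched below with invented names, is an append-only record: a "change" never rewrites an old value, it appends a new timestamped assertion, and the current value is simply the most recent assertion.

```python
class ImmutableRecord:
    """Toy append-only record: old assertions are never altered."""

    def __init__(self):
        self._events = []  # (timestamp, key, value); grows, never edited

    def assert_value(self, key, value, timestamp):
        self._events.append((timestamp, key, value))

    def current(self, key):
        # The current value is the most recently asserted one.
        for ts, k, v in reversed(self._events):
            if k == key:
                return v
        return None

    def history(self, key):
        # Every value ever asserted for key, oldest first.
        return [(ts, v) for ts, k, v in self._events if k == key]

rec = ImmutableRecord()
rec.assert_value("diagnosis", "pneumonia", timestamp=1)
rec.assert_value("diagnosis", "tuberculosis", timestamp=2)  # a correction
print(rec.current("diagnosis"))  # tuberculosis; pneumonia remains on record
```

The mistake is corrected, yet the pre-existing data, and the fact that the mistake was made, are both preserved.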
Introspection is a term borrowed from object-oriented programming, and one not often found in the Big Data literature. It refers to the ability of data objects to describe themselves when interrogated. With introspection, users of a Big Data resource can quickly determine the content of data objects and the hierarchical organization of data objects within the Big Data resource. Introspection allows users to see the types of data relationships that can be analyzed within the resource and clarifies how disparate resources can interact with one another. Introspection will be described in detail in Chapter 4.
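In an object-oriented language, the idea looks like this: an object, when interrogated, reports its own class, its place in the class hierarchy, and its contents. The classes below are hypothetical examples, not part of any real resource.

```python
class DataObject:
    def describe(self):
        # A data object describes itself when interrogated.
        return {
            "class": type(self).__name__,
            "ancestors": [c.__name__ for c in type(self).__mro__[1:-1]],
            "properties": dict(vars(self)),
        }

class Specimen(DataObject):
    pass

class BloodSpecimen(Specimen):
    def __init__(self, volume_ml):
        self.volume_ml = volume_ml

s = BloodSpecimen(5.0)
print(s.describe())
```

A user who has never seen this object before can learn that it is a BloodSpecimen, that a BloodSpecimen is a kind of Specimen, and that it carries a volume, all without consulting external documentation.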
Another subject covered in this book, and often omitted from the literature on Big Data, is data indexing. Though there are many books written on the art and science of so-called back-of-the-book indexes, scant attention has been paid to the process of preparing indexes for large and complex data resources. Consequently, most Big Data resources have nothing that could be called a serious index. They might have a Web page with a few links to explanatory documents, or they might have a short and crude "help" index, but it would be rare to find a Big Data resource with a comprehensive index containing a thoughtful and updated list of terms and links. Without a proper index, most Big Data resources have limited utility for all but a few cognoscenti. It seems odd to me that organizations willing to spend hundreds of millions of dollars on a Big Data resource will balk at investing some thousands of dollars on a proper index.
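The essence of an index is trivially simple to demonstrate: map every meaningful term to the locations where it occurs, so any term can be looked up without rescanning the whole resource. The sketch below, with an invented function name and a token stopword list, shows the idea on a two-line corpus.

```python
import re
from collections import defaultdict

def build_index(lines, stopwords=frozenset({"the", "a", "of", "and", "in", "is"})):
    """Map each term to the (1-based) lines in which it occurs."""
    index = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        # One entry per term per line, regardless of repetition.
        for term in set(re.findall(r"[a-z]+", line.lower())):
            if term not in stopwords:
                index[term].append(lineno)
    return dict(index)

corpus = [
    "The spirochaetes are spiral bacteria.",
    "Treponema is a genus of spirochaetes.",
]
index = build_index(corpus)
print(index["spirochaetes"])  # [1, 2]
```

The document holds all of the complexity; the index holds all of the simplicity. Binding the index terms to a standard nomenclature, as discussed in Chapter 3 of Data Simplification, is a refinement of this same structure.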
Aside from these four topics, which readers would be hard-pressed to find in the existing Big Data literature, this book covers the usual topics relevant to Big Data design, construction, operation, and analysis. Some of these topics include data quality, providing structure to unstructured data, data de-identification, data standards and interoperability issues, legacy data, data reduction and transformation, data analysis, and software issues. For these topics, discussions focus on the underlying principles; programming code and mathematical equations are conspicuously inconspicuous. An extensive Glossary covers the technical or specialized terms and topics that appear throughout the text. As each Glossary term is "optional" reading, I took the liberty of expanding on technical or mathematical concepts that appeared in abbreviated form in the main text. The Glossary provides an explanation of the practical relevance of each term to Big Data, and some readers may enjoy browsing the Glossary as a stand-alone text.
The final four chapters are non-technical; all deal in one way or another with the consequences of our exploitation of Big Data resources. These chapters will cover legal, social, and ethical issues. The book ends with my personal predictions for the future of Big Data, and its impending impact on our lives. When preparing this book, I debated whether these four chapters might best appear in the front of the book, to whet the reader's appetite for the more technical chapters. I eventually decided that some readers would be unfamiliar with some of the technical language and concepts included in the final chapters, necessitating their placement near the end. Readers with a strong informatics background may enjoy the book more if they start their reading at Chapter 12.
Readers may notice that many of the case examples described in this book come from the field of medical informatics. The healthcare informatics field is particularly ripe for discussion because every reader is affected, on economic and personal levels, by the Big Data policies and actions emanating from the field of medicine. Aside from that, there is a rich literature on Big Data projects related to healthcare. As much of this literature is controversial, I thought it important to select examples that I could document, from reliable sources. Consequently, the reference section is large, with over 200 articles from journals, newspaper articles, and books. Most of these cited articles are available for free Web download.
Who should read this book? This book is written for professionals who manage Big Data resources and for students in the fields of computer science and informatics. Data management professionals would include the leadership within corporations and funding agencies who must commit resources to the project, the project directors who must determine a feasible set of goals and who must assemble a team of individuals who, in aggregate, hold the requisite skills for the task: network managers, data domain specialists, metadata specialists, software programmers, standards experts, interoperability experts, statisticians, data analysts, and representatives from the intended user community. Students of informatics, the computer sciences, and statistics will discover that the special challenges attached to Big Data, seldom discussed in university classes, are often surprising, sometimes shocking.
By mastering the fundamentals of Big Data design, maintenance, growth, and validation, readers will learn how to simplify the endless tasks engendered by Big Data resources. Adept analysts can find relationships among data objects held in disparate Big Data resources, if the data is prepared properly. Readers will discover how integrating Big Data resources can deliver benefits far beyond anything attained from stand-alone databases.
Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms presents a new approach to medical microbiology and parasitology by explaining the biological properties that apply to classes of organisms and relating these properties to the collection of pathogenic species that belong to each class. This book drives down the complexity of the field of infectious disease by providing a concise but complete guide to the modern classification of disease organisms. Taxonomic Guide to Infectious Diseases begins with an overview of modern taxonomy, including a description of the kingdoms of life and the evolutionary principles underlying their class hierarchy. Each following chapter describes one of the 40 classes of infectious organisms, providing clinically relevant descriptions of the genera (groups of species) and of the individual diseases that are caused by members of the class. Taxonomic Guide to Infectious Diseases is an indispensable resource for serious healthcare students and providers working in the fields of infectious disease research, pathology, medical microbiology, parasitology, mycology, and virology.
Table of Contents, Taxonomic Guide to Infectious Diseases
Preface
PART I. Principles of Taxonomy (Chapters 1 - 3)
Chapter 1. The magnitude and diversity of infectious diseases
Chapter 2. What is a classification?
Chapter 3. The Tree of Life
PART II. Bacteria (Chapters 4 - 14)
Chapter 4. Overview of Class Bacteria
Chapter 5. The Alpha Proteobacteria
Chapter 6. Beta Proteobacteria
Chapter 7. Gamma Proteobacteria
Chapter 8. Epsilon Proteobacteria
Chapter 9. Spirochaetes
Chapter 10. Bacteroidetes and Fusobacteria
Chapter 11. Mollicutes
Chapter 12. Class Bacilli plus Class Clostridia
Chapter 13. Chlamydiae
Chapter 14. Actinobacteria
PART III. Eukaryotes (Chapters 15 - 24)
Chapter 15. Overview of Class Eukaryota and the soon-to-be-abandoned Class Protoctista
Chapter 16. Metamonada
Chapter 17. Euglenozoa
Chapter 18. Percolozoa
Chapter 19. Apicomplexa
Chapter 20. Ciliophora (Ciliates)
Chapter 21. Heterokontophyta
Chapter 22. Amoebozoa
Chapter 23. Choanozoa
Chapter 24. Archaeplastida
PART IV. Animals (Chapters 25 - 32)
Chapter 25. Overview of Class Animalia
Chapter 26. Platyhelminthes (flatworms)
Chapter 27. Nematoda (roundworms)
Chapter 28. Acanthocephala
Chapter 29. Chelicerata
Chapter 30. Hexapoda
Chapter 31. Crustacea
Chapter 32. Craniata
PART V. Fungi (Chapters 33 - 37)
Chapter 33. Overview of Class Fungi
Chapter 34. Zygomycota
Chapter 35. Basidiomycota
Chapter 36. Ascomycota
Chapter 37. Microsporidia
PART VI. Nonliving Infectious Agents: Viruses and Prions (Chapters 38 - 46)
Chapter 38. Overview of Viruses
Chapter 39. Group I Viruses: double-stranded DNA
Chapter 40. Group II Viruses: single-stranded (+) sense DNA
Chapter 41. Group III Viruses: double-stranded RNA
Chapter 42. Group IV Viruses: single-stranded (+) sense RNA
Chapter 43. Group V Viruses: single-stranded (-) sense RNA
Chapter 44. Group VI Viruses: single-stranded RNA reverse transcriptase viruses with a DNA intermediate in life-cycle
Chapter 45. Group VII Viruses: double-stranded DNA reverse transcriptase viruses
Chapter 46. Prions
References
Appendices
Appendix I. Additional notes on taxonomy
Appendix II. Number of occurrences of some common infectious diseases
Appendix III. Organisms causing infectious disease in humans
Glossary
Index
Machiavelli's Laboratory is a satiric examination of ethics in science and medicine. The book is narrated by an evil scientist who tempts the reader to follow an immoral path towards career advancement. Side-stepping the serious, life-and-death dilemmas that fill traditional ethics texts, Machiavelli's Laboratory focuses on the "manners" of science and medicine. There is a right way and a wrong way for scientists to perform their daily rites: attending meetings, interpreting data, reviewing journal articles, hiring personnel, returning phone calls, and so on. Ill-mannered scientists and healthcare professionals can wreak havoc on themselves and their colleagues. Drawing examples from the history of medicine and from contemporary life, the evil narrator explains how unwary scientists and physicians commit immoral acts, often without realizing what they have done.
Although the book is written as entertainment, readers will learn a great deal about the history of science and medicine. The stories in Machiavelli's Laboratory are chosen from well-documented events occurring in ancient times and modern times. The petty and selfish activities of scientists throughout the ages can help us make better choices, today. This book exposes the common deceptions employed by academics, laboratory researchers, data analysts, journal editors, department chiefs, government bureaucrats, corporate executives, grants writers, physicians, and medical trainees.
Machiavelli's Laboratory is written for students, teachers, healthcare professionals, and scientists who need to know every trick in the book.
The Amazon site has an excellent "look-inside" that includes sample text and a table of contents.