Books by Jules J. Berman

Data Simplification: Taming Information With Open Source Tools,
by Jules J. Berman

In March, 2016, Morgan Kaufmann, an Elsevier imprint, published my book Data Simplification: Taming Information With Open Source Tools. Those of you who are computer-oriented know that data analysis typically takes much less time and effort than data preparation. Moreover, if you make a mistake in your data analysis, you can often just repeat the process, using different tools, or a fresh approach to your original question. As long as the data is prepared properly, you and your colleagues can re-analyze your data to your heart's content. Contrariwise, if your data is not prepared in a manner that supports sensible analysis, there's little you can do to extricate yourself from the situation. For this reason, data preparation is, in my experience, much more important than data analysis.

Throughout my career, I've relied on simple open source utilities and short scripts to simplify my data, producing products that were self-explanatory, permanent, and that could be merged with other types of data. Hence, my book.

Data Simplification: Taming Information With Open Source Tools
Publisher: Morgan Kaufmann; 1st edition (March 23, 2016)
ISBN-10: 0128037814
ISBN-13: 978-0128037812
Paperback: 398 pages
Dimensions: 7.5 x 9.2 inches

Chapter 1, The Simple Life, explores the thesis that complexity is the rate-limiting factor in human development. The greatest advances in human civilization and the most dramatic evolutionary improvements in all living organisms have followed the acquisition of methods that reduce or eliminate complexity.

Chapter 2, Structuring Text, reminds us that most of the data on the Web today is unstructured text, produced by individuals trying their best to communicate with one another. Data simplification often begins with textual data. This chapter provides readers with tools and strategies for imposing some basic structure on free text.

Chapter 3, Indexing Text, describes the often undervalued benefits of indexes. An index, aided by proper annotation of data, permits us to understand data in ways that were not anticipated when the original content was collected. With the use of computers, multiple indexes, designed for different purposes, can be created for a single document or data set. As data accrues, indexes can be updated. When data sets are combined, their respective indexes can be merged. A good way of thinking about indexes is that the document contains all of the complexity; the index contains all of the simplicity. Data scientists who understand how to create and use indexes will be in the best position to search, retrieve, and analyze textual data. Methods are provided for automatically creating customized indexes designed for specific analytic pursuits and for binding index terms to standard nomenclatures.
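The idea that indexes can be built, updated, and merged can be illustrated with a short script. What follows is only a minimal sketch, not a method taken from the book: the function names `build_index` and `merge_indexes`, the sample documents, and the choice to index by line number are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_index(lines):
    """Map each lowercase word to the sorted line numbers where it occurs."""
    index = defaultdict(set)
    for line_number, line in enumerate(lines, start=1):
        for word in re.findall(r"[a-z]+", line.lower()):
            index[word].add(line_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def merge_indexes(first, second, offset):
    """Merge two indexes; offset shifts the second document's line numbers."""
    merged = {word: list(numbers) for word, numbers in first.items()}
    for word, numbers in second.items():
        merged.setdefault(word, []).extend(n + offset for n in numbers)
    return merged


# Hypothetical sample documents, one string per line of text.
doc_a = ["Indexes simplify data", "Data accrues over time"]
doc_b = ["Merged data sets share indexes"]

index_a = build_index(doc_a)
index_b = build_index(doc_b)

# Treat doc_b as if it were appended after doc_a.
combined = merge_indexes(index_a, index_b, offset=len(doc_a))
print(combined["data"])
print(combined["indexes"])
```

The documents themselves stay untouched; only the small word-to-location table is combined, which is the sense in which the index holds the simplicity.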

Chapter 4, Understanding Your Data, describes how data can be quickly assessed, prior to formal quantitative analysis, to develop some insight into what the data means. A few simple visualization tricks and simple statistical descriptors can greatly enhance a data scientist's understanding of complex and large data sets. Various types of data objects, such as text files, images, and time-series data, can be profiled with a summary signature that captures the key features that contribute to the behavior and content of the data object. Such profiles can be used to find relationships among different data objects, or to determine when data objects are not closely related to one another.

Chapter 5, Identifying and Deidentifying Data, tackles one of the most under-appreciated and least understood issues in data science. Measurements, annotations, properties, and classes of information have no informational meaning unless they are attached to an identifier that distinguishes one data object from all other data objects, and that links together all of the information that has been or will be associated with the identified data object. The method of identification and the selection of objects and classes to be identified relates fundamentally to the organizational model of complex data. If the simplifying step of data identification is ignored or implemented improperly, data cannot be shared, and conclusions drawn from the data cannot be believed. All well-designed information systems are, at their heart, identification systems: ways of naming data objects so that they can be retrieved. Only well-identified data can be usefully deidentified. This chapter discusses methods for identifying data and deidentifying data.
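One common way to replace an identifier with a deidentified token is a keyed one-way hash. The sketch below is an illustration under stated assumptions, not the book's own procedure: the function name `deidentify`, the sample key, and the sample record identifier are hypothetical, and a real system would need careful key management.

```python
import hashlib
import hmac


def deidentify(identifier, secret_key):
    """Return a keyed SHA-256 digest that stands in for the identifier."""
    return hmac.new(secret_key, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()


# Hypothetical key and identifiers, for illustration only.
key = b"example-secret-key"
token_1 = deidentify("patient-000123", key)
token_2 = deidentify("patient-000123", key)
token_3 = deidentify("patient-000124", key)

# The same identifier always yields the same token, so records still link;
# different identifiers yield different tokens.
print(token_1 == token_2)
print(token_1 == token_3)
```

Because every record for one data object receives the same token, the deidentified records can still be linked, which is only possible when the data was well identified in the first place.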

Chapter 6, Giving Meaning to Data, explores the meaning of meaning, as it applies to computer science. We shall learn that data, by itself, has no meaning. It is the job of the data scientist to assign meaning to data, and this is done with data objects, triples, and classifications (see Glossary items, Data object, Triple, Classification, Ontology). Unfortunately, coursework in the information sciences often omits discussion of the critical issue of "data meaning," advancing from data collection to data analysis without stopping to design data objects whose relationships to other data objects are defined and discoverable. In this chapter, readers will learn how to prepare and classify meaningful data.

Chapter 7, Object-Oriented Data, shows how we can understand data, using a few elegant computational principles. Modern programming languages, particularly object-oriented programming languages, use introspective data (i.e., the data with which data objects describe themselves) to modify the execution of a program at run-time; an elegant process known as reflection. Using introspection and reflection, programs can integrate data objects with related data objects. The implementations of introspection, reflection, and integration are among the most important achievements in the field of computer science.

Chapter 8, Problem Simplification, demonstrates that it is just as important to simplify problems as it is to simplify data. This final chapter provides simple but powerful methods for analyzing data, without resorting to advanced mathematical techniques. The use of random number generators to simulate the behavior of systems, and the application of Monte Carlo, resampling, and permutative methods to a wide variety of common problems in data analysis, will be discussed. The importance of data reanalysis, following preliminary analysis, is emphasized.
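As an illustration of how a random number generator can stand in for advanced mathematics, here is a minimal Monte Carlo sketch. It is not drawn from the book; the birthday question, the function name `shared_birthday_probability`, and the parameter values are assumptions chosen for the example. It estimates the chance that at least two people in a group share a birthday, assuming 365 equally likely birthdays.

```python
import random


def shared_birthday_probability(group_size, trials, seed=0):
    """Estimate, by simulation, the chance of a shared birthday in a group."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(365) for _ in range(group_size)]
        # A repeated birthday makes the set smaller than the group.
        if len(set(birthdays)) < group_size:
            hits += 1
    return hits / trials


estimate = shared_birthday_probability(group_size=23, trials=20000)
print(round(estimate, 2))
```

The analytic answer for a group of 23 is a little over one half, so the simulated estimate can be checked against it; raising `trials` narrows the gap at the cost of run time.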


Chapter 0. Preface
   References for Preface
   Glossary for Preface

Chapter 1. The Simple Life
   Section 1.1. Simplification drives scientific progress
   Section 1.2. The human mind is a simplifying machine
   Section 1.3. Simplification in Nature
   Section 1.4. The Complexity Barrier
   Section 1.5. Getting ready
   Open Source Tools for Chapter 1
      Text Editors
      Command line utilities
      Cygwin, Linux emulation for Windows
      DOS batch scripts
      Linux bash scripts
      Interactive line interpreters
      Package installers
      System calls
   References for Chapter 1
   Glossary for Chapter 1

Chapter 2. Structuring Text
   Section 2.1. The Meaninglessness of free text
   Section 2.2. Sorting text, the impossible dream
   Section 2.3. Sentence Parsing
   Section 2.4. Abbreviations
   Section 2.5. Annotation and the simple science of metadata
   Section 2.6. Specifications Good, Standards Bad
   Open Source Tools for Chapter 2
      Regular expressions
      Format commands
      Converting non-printable files to plain-text
      Dublin Core
   References for Chapter 2
   Glossary for Chapter 2

Chapter 3. Indexing Text
   Section 3.1. How Data Scientists Use Indexes
   Section 3.2. Concordances and Indexed Lists
   Section 3.3. Term Extraction and Simple Indexes
   Section 3.4. Autoencoding and Indexing with Nomenclatures
   Section 3.5. Computational Operations on Indexes
   Open Source Tools for Chapter 3
      Word lists
      Doublet lists
      Ngram lists
   References for Chapter 3
   Glossary for Chapter 3

Chapter 4. Understanding Your Data
   Section 4.1. Ranges and Outliers
   Section 4.2. Simple Statistical Descriptors
   Section 4.3. Retrieving Image Information
   Section 4.4. Data Profiling
   Section 4.5. Reducing data
   Open Source Tools for Chapter 4
      R, for statistical programming
      Displaying equations in LaTeX
      Normalized compression distance
      Pearson's correlation
      The ridiculously simple dot product
   References for Chapter 4 
   Glossary for Chapter 4

Chapter 5. Identifying and Deidentifying Data
   Section 5.1. Unique Identifiers
   Section 5.2. Poor Identifiers, Horrific Consequences
   Section 5.3. Deidentifiers and Reidentifiers
   Section 5.4. Data Scrubbing
   Section 5.5. Data Encryption and Authentication
   Section 5.6. Timestamps, Signatures, and Event Identifiers
   Open Source Tools for Chapter 5
      Pseudorandom number generators
      Encryption and decryption with OpenSSL
      One-way hash implementations
   References for Chapter 5
   Glossary for Chapter 5

Chapter 6. Giving Meaning to Data
   Section 6.1. Meaning and Triples
   Section 6.2. Driving Down Complexity with Classifications
   Section 6.3. Driving Up Complexity with Ontologies
   Section 6.4. The unreasonable effectiveness of classifications
   Section 6.5. Properties that Cross Multiple Classes
   Open Source Tools for Chapter 6
      Syntax for triples
      RDF Schema
      RDF parsers
      Visualizing class relationships
   References for Chapter 6
   Glossary for Chapter 6

Chapter 7. Object-oriented data
   Section 7.1. The Importance of Self-explaining Data
   Section 7.2. Introspection and Reflection
   Section 7.3. Object-Oriented Data Objects
   Section 7.4. Working with Object-Oriented Data
   Open Source Tools for Chapter 7
      Persistent data
      SQLite databases
   References for Chapter 7
   Glossary for Chapter 7

Chapter 8. Problem simplification
   Section 8.1. Random numbers
   Section 8.2. Monte Carlo Simulations
   Section 8.3. Resampling and Permutating
   Section 8.4. Verification, Validation, and Reanalysis
   Section 8.5. Data Permanence and Data Immutability
   Open Source Tools for Chapter 8
      Burrows Wheeler transform
      Winnowing and chaffing
   References for Chapter 8
   Glossary for Chapter 8

Repurposing Legacy Data: Innovative Case Studies (Computer Science Reviews and Trends),
by Jules J. Berman

In March, 2015, Morgan Kaufmann (an Elsevier imprint) published Repurposing Legacy Data: Innovative Case Studies (Computer Science Reviews and Trends).

Re-Purposing Legacy Data: Innovative Case Studies takes a look at how data scientists repurpose legacy data, whether their own or legacy data that has been donated to the public domain. The case studies in this book, taken from such diverse fields as cosmology, quantum physics, high-energy physics, microbiology, and medicine, all serve to demonstrate the value of old data.

Excerpted from Preface:

Data repurposing involves taking preexisting data and performing any of the following:

Most of the listed types of data repurposing efforts are self-explanatory, and all of them will be followed by examples throughout this book.

Table of Contents: Re-Purposing Legacy Data: Innovative Case Studies

Chapter 1. Introduction

   Section 1.1. Why bother?
   Section 1.2. What is data repurposing?
   Section 1.3. Data worth preserving
   Section 1.4. Basic data repurposing tools
   Section 1.5. Personal attributes of data repurposers

Chapter 2. Learning from the masters

   Section 2.1. New physics from old data
      Case Study: Sky charts 
      Case Study: From Hydrogen spectrum data to quantum mechanics 
   Section 2.2. Repurposing the physical and abstract property of uniqueness
      Case Study: Fingerprints; from personal identifier to data-driven forensics 
   Section 2.3. Repurposing a 2000 year old classification
      Case Study: A dolphin is not a fish 
      Case Study: The molecular stopwatch 
   Section 2.4. Decoding the past
      Case Study: Mayan glyphs: finding a lost civilization 
   Section 2.5. What makes data useful for repurposing projects?
      Case Study: CODIS 
      Case Study: Zip codes: From postal code to demographic keystone 
      Case Study: The classification of living organisms 

Chapter 3. Dealing with text

   Section 3.1. Thus it is written
      Case Study: New associations in old medical literature 
      Case Study: Ngram analyses of large bodies of text 
   Section 3.2. Search and retrieval
      Case Study: Sentence extraction 
      Case Study: Term extraction 
   Section 3.3. Indexing text
      Case Study: Creating an index 
      Case Study: Ranking search results
   Section 3.4. Coding text

Chapter 4. New Life for Old data

   Section 4.1. New algorithms
      Case Study: Lunar Orbiter Image Recovery Project 
      Case Study: Finding new planets from old data 
      Case Study: New but ultimately wrong: market prediction 
      Case Study: Choosing better metrics 
   Section 4.2. Taking closer looks
      Case Study: Expertise is no substitute for data analysis 
      Case Study: Life on mars 
      Case Study: The number of human chromosomes 
      Case Study: Linking global warming to high-intensity hurricanes 
      Case Study: Inferring climate trends with geologic data 
      Case Study: Old tidal data, and the iceberg that sank the Titanic

Chapter 5. The purpose of data analysis is to enable data reanalysis

   Section 5.1. Every initial data analysis on complex data sets is flawed
      Case Study: Sample mix-ups, a pre-analytic error 
      Case Study: Vanishing exoplanets: an analytic error 
      Case Study: The new life form that wasn't: a post-analytic error 
   Section 5.2. Unrepeatability of complex analyses
      Case Study: Reality is more complex than we can ever know 
      Case Study: First biomarker publications are seldom reproducible 
   Section 5.3. Obligation to verify and validate
      Case Study: Reanalysis clarifies earlier findings 
      Case Study: It is never too late to review old data 
      Case Study: Vindication through reanalysis 
      Case Study: Reanalysis bias 
   Section 5.4. Asking what the data really means
      Case Study: Repurposing the logarithmic tables 
      Case Study: Multimodality in legacy data 
      Case Study: The significance of narrow data ranges

Chapter 6. Dark Legacy: Making sense of someone else's data

   Section 6.1. Excavating treasures from lost and abandoned data mines
      Case Study: Re-analysis of old JADE collider data 
      Case Study: 36 year old satellite resurrected 
   Section 6.2. Nonstandard standards
      Case Study: Standardizing the chocolate teapot
   Section 6.3. Specifications, not standards
      Case Study: The fungibility of data standards 
   Section 6.4. Classifications and Ontologies
      Case Study: An upper level ontology 
      Case Study: Population sampling by class 
   Section 6.5. Identity and uniqueness
      Case Study: Faulty identifiers 
      Case Study: Timestamping data 
   Section 6.6. When to terminate (or reconsider) a data repurposing project
      Case Study: Nonsensical Mayan glyphs
      Case Study: Flattened data 

Chapter 7. Social and Economic Issues

   Section 7.1. Data sharing and reproducible research
      Case Study: Sharing public-funded research data with the public 
   Section 7.2. Acquiring and storing data
      Case Study: What is protected data? 
   Section 7.3. Keeping your data forever
      Case Study: A back-up that back-fired 
   Section 7.4. Data immutability
      Case Study: Retrospective deletion of data 
      Case Study: Reality tampering 
      Case Study: The dubious right to be forgotten 
   Section 7.5. Privacy and Confidentiality
      Case Study: Text scrubber 
   Section 7.6. The economics of data repurposing
      Case Study: Avoiding the time loop 
      Case Study: Bargain basement cancer research 

Appendix A. Index of Case Studies
Appendix B. Glossary
Author Biography

Rare Diseases and Orphan Drugs:
Keys to understanding and treating the common diseases, by Jules J. Berman

From the Preface of Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases, by Jules Berman

For a few decades now, I have been interested in writing a book that treats the rare diseases as a separate specialty within medicine. Most of my colleagues were not particularly receptive to the idea. Here is a sample of their advice, paraphrased: "Don't waste your time on the rare diseases. There are about 7,000 rare diseases that are known to modern medicine. The busiest physician, over the length of a long career, will encounter only a tiny fraction of the total number of rare diseases. Surely, an attempt to learn them all would be silly; an exercise in purposeless scholarship. Furthermore, each rare disease accounts for so few people, it is impractical to devote much research funding to these medical outliers. To get the most bang for our bucks, we should concentrate our research efforts on the most common diseases: heart disease, cancer, diabetes, Alzheimer's disease, and so on."

Other colleagues questioned whether rare diseases are a legitimate area of study: "Rare diseases do not comprise a biologically meaningful class of diseases. They are simply an arbitrary construction, differing from common diseases by a numeric accident. A disease does not become scientifically interesting just by being rare." For many of my colleagues, the rare diseases are merely statistical outliers, best ignored.

In biology, there are no outliers; no circumstances that are rare enough to be ignored. Every disease, no matter how rare, operates under the same biological principles that pertain to common diseases. In 1657, William Harvey, the noted physiologist, wrote: "Nature is nowhere accustomed more openly to display her secret mysteries than in cases where she shows tracings of her workings apart from the beaten paths; nor is there any better way to advance the proper practice of medicine than to give our minds to the discovery of the usual law of nature, by careful investigation of cases of rarer forms of disease".

We shall see that the rare diseases are much simpler, genetically, than the common diseases. The rare diseases can be conceived of as controlled experiments of nature, in which everything is identical in the diseased and the normal organisms, except for one single factor that is the root cause of the ensuing disease. By studying the rare diseases, we can begin to piece together the more complex parts of common diseases.

The book has five large themes that emerge, in one form or another, in every chapter.

  1. In the past two decades, there have been enormous advances in the diagnosis and treatment of the rare diseases. In the same period, progress in the common diseases has stagnated. Advances in the rare diseases have profoundly influenced the theory and the practice of modern medicine.

  2. The molecular pathways that are operative in the rare diseases contribute to the pathogenesis of the common diseases. Hence, the rare diseases are not the exceptions to the general rules that apply to common diseases; the rare diseases are the exceptions upon which the general rules of common diseases are based.

  3. Research into the genetics of common diseases indicates that these diseases are much more complex than we had anticipated. Many rare diseases have simple genetics, wherein a mutation in a single gene accounts for a clinical outcome. The same simple pathways found in the rare diseases serve as components of the common diseases. If the common diseases are the puzzles that modern medical researchers are mandated to solve, then the rare diseases are the pieces of the puzzles.

  4. If we fail to study the rare diseases in a comprehensive way, we lose the opportunity to see the important biological relationships among diseases consigned to non-overlapping sub-disciplines of medicine.

  5. Every scientific field must have a set of fundamental principles that describes, explains, or predicts its own operation. Rare diseases operate under a set of principles, and these principles can be inferred from well-documented pathologic, clinical, and epidemiologic observations.

Today, there is no recognized field of medicine devoted to the study of rare diseases; but there should be.

Content and Organization of the Book

There are three parts to the book. In Part I (Understanding the Problem), we discuss the differences between the rare and the common diseases, and why it is crucial to understand these differences. To stir your interest, here are just a few of the most striking differences: 1) Most of the rare diseases occur in early childhood, while most of the common diseases occur in adulthood; 2) The genetic determinants of most rare diseases have a simple Mendelian pattern, dependent on whether the disease trait occurs in the father, or mother, or both. Genetic influences in the common diseases seldom display Mendelian inheritance; 3) Rare diseases often occur as syndromes involving multiple organs through seemingly unrelated pathological processes. Common diseases usually involve a single organ or involve multiple organs affected by a common pathologic process.

The most common pathological conditions of humans are aging, metabolic diseases (including diabetes, hypertension, and obesity), diseases of the heart and vessels, infectious diseases, and cancer. Each of these disorders is characterized by pathologic processes that bear some relation to the processes that operate in rare diseases. In Part II (Rare lessons for Common Diseases), we discuss the rare diseases that have helped us understand the common diseases. Emphasis is placed on the enormous value of rare disease research. We begin to ask and answer some of the fundamental questions raised in Part I. Specifically, how is it possible for two diseases to share the same pathologic mechanisms without sharing similar genetic alterations? Why are the common diseases often caused, in no small part, by environmental (i.e., non-genetic) influences, while the rare disease counterparts are driven by single genetic flaws? Why are the rare diseases often syndromic (i.e., involving multiple organs with multiple types of abnormalities and dysfunctions), while the so-called complex common diseases often manifest in a single pathological process? In Part II, we will discuss a variety of pathologic mechanisms that apply to classes of rare diseases. We will also see how these same mechanisms operate in the common diseases. We will explore the relationship between genotype and phenotype, and we will address one of the most important questions in modern disease biology: "How is it possible that complex and variable disease genotypes operating in unique individuals will converge to produce one disease with the same biological features from individual to individual?"

In Part III (Fundamental Relationships Between Rare and Common Diseases), we answer the as-yet unanswered questions from Part I, plus the new questions raised in Part II. The reasons why rare diseases are different from common diseases are explained. The convergence of pathologic mechanisms and clinical outcome observed in rare diseases and common diseases, as it relates to the prevention, diagnosis, and treatment of both types of diseases, is described in detail.

The book includes a scientific rationale for funding research in the rare diseases. Currently, there is a vigorous lobbying effort, launched by coalitions of rare disease organizations, to attract research funding and donations. Funding for the rare diseases has always been small, relative to the common diseases. Funding agencies find it impractical to devote large portions of their research budget to the rare diseases, while so many people are suffering from the common diseases. As it turns out, direct funding of the common diseases has not been particularly cost-effective. It is time for funders to re-evaluate their goals and priorities.

Laypersons advocating for rare disease research almost always appeal to our charitable instincts, hoping that prospective donors will respond to the plight of a few individuals. Readers will learn that such supplications are unnecessary and misdirect attention from more practical arguments. When rare diseases are funded, everyone benefits. We will see that it is much easier to find effective targeted treatments for the rare diseases than for common diseases. Furthermore, treatments that are effective against rare diseases will almost always find a place in the treatment of one or more common diseases. This assertion is not based on wishful thinking, and is not based on extrapolation from a few past triumphs wherein some treatment overlap has been found in rare and common diseases. The assertion is based on the observation that rare diseases encapsulate the many biological pathways that drive, in the aggregate, our common diseases. This simple theme is described and justified throughout the book. A better approach might be to increase funding for the rare diseases with the primary goal of curing the common diseases. The final chapter of this book discusses promising new approaches to rare disease research.

Table of Contents: Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases


Part I. Understanding the Problem.

Chapter 1. What are the Rare Diseases, and Why do we Care?
     1.1. The Definition of Rare Disease.
     1.2. Remarkable Progress in the Rare Diseases. 

Chapter 2. What are the Common Diseases?
     2.1. The Common Diseases of Humans, a Short but Terrifying List. 
     2.2. The Recent Decline in Progress Against Common Diseases. 
     2.3. Why Medical Scientists have failed to Eradicate the Common Diseases.

Chapter 3. Six Observations to Ponder while Reading this Book
     3.1. Rare Diseases are Biologically Different from Common Diseases. 
     3.2. Common Diseases Typically Occur in Adults; 
          Rare Diseases are Often Diseases of Childhood. 
     3.3. Rare diseases Usually have a Mendelian Pattern of Inheritance; 
          Common diseases are non-Mendelian.
     3.4. Rare Diseases Often Occur as Syndromes, Involving Several Organs or 
          Physiologic Systems, Often in Surprising Ways. 
          Common Diseases are Typically Non-syndromic.
     3.5. Environmental Factors Play a Major Causative Role in the 
          Common Diseases; 
          Less so in the Inherited Rare Diseases. 
     3.6. The Difference in Rates of Occurrence of the Rare Diseases Compared 
          with the Common Diseases is Profound, Often on the Order 
          of a Thousand-fold. 
     3.7. There are Many More Rare Diseases than there are Common Diseases. 

Part II. Rare lessons for Common Diseases.

Chapter 4. Aging
     4.1. Normal Patterns of Aging. 
     4.2. Aging and Immortality. 
     4.3. Premature Aging Disorders. 
     4.4. Aging as a Disease of Non-renewable Cells. 

Chapter 5. Diseases of the Heart and Vessels
     5.1. Heart Attacks.
     5.2. Rare Desmosome-based Cardiomyopathies.
     5.3. Sudden Death and Rare Diseases Hidden in Unexplained Clinical Events.
     5.4. Hypertension and Obesity: Quantitative Traits with Cardiovascular 

Chapter 6. Infectious Diseases and Immune Deficiencies
     6.1. The Burden of Infectious Diseases in Humans.
     6.2. Biological Taxonomy: Where Rare Infectious Diseases Mingle with the 
          Common Infectious Diseases.
     6.3. Biological Properties of the Rare Infectious Diseases.
     6.4. Rare Diseases of Unknown Etiology.
     6.5. Fungi as a Model Infectious Organism Causing Rare Diseases.

Chapter 7. Diseases of Immunity
     7.1.  Immune Status and the Clinical Expression of Infectious Diseases.
     7.2.  Autoimmune Disorders.

Chapter 8. Cancer
     8.1. Rare Cancers are Fundamentally Different from Common Cancers.
     8.2. The Dichotomous Development of Rare Cancers and Common Cancers.
     8.3. The Genetics of Rare Cancers and Common Cancers.
     8.4. Using Rare Diseases to Understand Carcinogenesis.

Part III. Fundamental Relationships Between Rare and Common Diseases.

Chapter 9. Causation and the Limits of Modern Genetics.
     9.1.  The Inadequate Meaning of Biological Causation.
     9.2. The Complexity of the So-called Monogenic Rare Diseases.
     9.3. One Monogenic Disorder, Many Genes.
     9.4. Gene Variation and The Limits of Pharmacogenetics.
     9.5. Environmental Phenocopies of Rare Diseases.

Chapter 10. Pathogenesis; Causation's Shadow
     10.1. The Mystery of Tissue Specificity.
     10.2. Cell Regulation and Epigenomics.
     10.3. Disease Phenotype.
     10.4. Dissecting Pathways Using Rare Diseases.
     10.5. Precursor Lesions and Disease Progression.

Chapter 11. Rare Diseases and Common Diseases: Understanding their 
            Fundamental Differences
     11.1. Review of the Fundamentals in Light of the Incidentals.
     11.2. A Trip to Monte Carlo: How Normal Variants Express a Disease Phenotype. 
     11.3. Associating Genes with Common Diseases.
     11.4. Mutation versus Variation.

Chapter 12. Rare Diseases and Common Diseases: Understanding Their Relationships
     12.1. Shared Genes.
     12.2. Shared Phenotypes. 

Chapter 13. Shared Benefits
     13.1. Shared Prevention.
     13.2. Shared Diagnostics.
     13.3. Shared Cures.

Chapter 14. Conclusion
     14.1. Progress in the Rare Diseases: Social and Political Issues. 
     14.2. Smarter Clinical Trials.
     14.3. For the Common Diseases, Animals are Poor Substitutes for Humans.
     14.4. Hubris.

Appendix I. List of Genes Causing More Than One Disease

Appendix II. Rules, Some of Which are Always True, and All of Which Are Sometimes True


Armchair Science: No Experiments, Just Deduction
by Jules J. Berman

In April, 2014, Amazon published Armchair Science: No Experiments, Just Deduction, as a Kindle eBook.

The book develops the premise that science is not a collection of facts; science is what we can induce from facts. By observing the night sky, without the aid of telescopes, we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist. Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old. Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common.

Armchair Science is written for general readers who are curious about science, and who want to sharpen their deductive skills.

In Armchair Science, the reader is confronted with 129 scientific mysteries, in the fields of cosmology, particle physics, chemistry, biology, and medicine. Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining.

Each scientific mystery begins with an observation (i.e., the "clue" to solving the mystery), then a deduction based on the clue, then a step-by-step analysis that traces how the deduction was obtained.

I hope that the readers of this blog will visit the book at its Amazon site and read the "look inside" pages.

Principles of Big Data: Preparing,
Sharing, and Analyzing Complex Information, by Jules J. Berman

Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information is available now in ebook and print versions.

Amazon has created a good look-inside, so you can get a very good sampling of the contents, should you decide to purchase a copy.

Also, viewers of this web site can use the password "MKFRIEND" (case sensitive) at the Elsevier order site to get a 30% discount plus free shipping, for my Big Data book.

From the Preface of Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information

Data pours into millions of computers every moment of every day. It is estimated that the total accumulated data stored on computers worldwide is about 300 exabytes (that's 300 billion gigabytes). Data storage increases at about 28% per year. The data stored is peanuts compared to data that is transmitted without storage. The annual transmission of data is estimated at about 1.9 zettabytes (1,900 billion gigabytes, see Glossary item, Binary sizes). From this growing tangle of digital information, the next generation of data resources will emerge.
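The arithmetic behind these figures is easy to check. The short sketch below assumes decimal (SI) prefixes, in which a gigabyte is 10^9 bytes, rather than the binary sizes mentioned in the Glossary. The function names are illustrative only, and the doubling time is simply what a steady 28% annual growth rate would imply, not a figure taken from the book.

```python
import math

# Decimal (SI) byte sizes. This is an assumption; binary sizes
# (powers of 1024) would give slightly different numbers.
GIGABYTE = 10**9
EXABYTE = 10**18
ZETTABYTE = 10**21

def to_billion_gigabytes(n_bytes):
    """Express a byte count in billions of gigabytes."""
    return n_bytes / GIGABYTE / 10**9

def doubling_time_years(annual_growth_rate):
    """Years needed to double at a constant annual rate (0.28 means 28%)."""
    return math.log(2) / math.log(1 + annual_growth_rate)

stored = to_billion_gigabytes(300 * EXABYTE)         # accumulated stored data
transmitted = to_billion_gigabytes(1.9 * ZETTABYTE)  # annual transmission
print(stored, transmitted, round(doubling_time_years(0.28), 1))
```

At a constant 28% per year, stored data would double in a little under three years.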

As we broaden our data reach (i.e., the different kinds of data objects included in the resource), and our data time-line (i.e., accruing data from the future and the deep past), we need to find ways to fully describe each piece of data, so that we do not confuse one data item with another, and so that we can search and retrieve data items when we need them. Astute informaticians understand that if we fully describe everything in our universe, we would need to have an ancillary universe to hold all the information, and the ancillary universe would need to be much much larger than our physical universe.

In the rush to acquire and analyze data, it is easy to overlook the topic of data preparation. If the data in our Big Data resources (see Glossary item, Big Data resource) are not well-organized, comprehensive, and fully described, then the resources will have no value. The primary purpose of this book is to explain the principles upon which serious Big Data resources are built. All of the data held in Big Data resources must have a form that supports search, retrieval and analysis. The analytic methods must be available for review, and the analytic results must be available for validation.

Perhaps the greatest potential benefit of Big Data is its ability to link seemingly disparate disciplines, to develop and test hypotheses that cannot be approached within a single knowledge domain. Methods by which analysts can navigate through different Big Data resources to create new, merged data sets will be reviewed.

What, exactly, is Big Data? Big Data is characterized by the three v's: volume (large amounts of data), variety (includes different types of data), and velocity (constantly accumulating new data). Those of us who have worked on Big Data projects might suggest throwing a few more v's into the mix: vision (having a purpose and a plan), verification (ensuring that the data conforms to a set of specifications), and validation (checking that its purpose is fulfilled; see Glossary item, Validation).

Many of the fundamental principles of Big Data organization have been described in the "metadata" literature. This literature deals with the formalisms of data description (i.e., how to describe data), the syntax of data description (e.g., markup languages such as eXtensible Markup Language, XML), semantics (i.e., how to make computer-parsable statements that convey meaning), the syntax of semantics (e.g., framework specifications such as Resource Description Framework, RDF, and Web Ontology Language, OWL), the creation of data objects that hold data values and self-descriptive information, and the deployment of ontologies, hierarchical class systems whose members are data objects (see Glossary items, Specification, Semantics, Ontology, RDF, XML).

The field of metadata may seem like a complete waste of time to professionals who have succeeded very well, in data-intensive fields, without resorting to metadata formalisms. Many computer scientists, statisticians, database managers, and network specialists have no trouble handling large amounts of data, and they may not see the need to create a strange new data model for Big Data resources. They might feel that all they really need is greater storage capacity, distributed over more powerful computers, that work in parallel with one another. With this kind of computational power, they can store, retrieve, and analyze larger and larger quantities of data. These fantasies only apply to systems that use relatively simple data or data that can be represented in a uniform and standard format. When data is highly complex and diverse, as found in Big Data resources, the importance of metadata looms large. Metadata will be discussed, with a focus on those concepts that must be incorporated into the organization of Big Data resources. The emphasis will be on explaining the relevance and necessity of these concepts, without going into gritty details that are well-covered in the metadata literature.

When data originates from many different sources, arrives in many different forms, grows in size, changes its values, and extends into the past and the future, the game shifts from data computation to data management. I hope that this book will persuade readers that faster, more powerful computers are nice to have, but these devices cannot compensate for deficiencies in data preparation. For the foreseeable future, universities, federal agencies, and corporations will pour money, time and manpower into Big Data efforts. If they ignore the fundamentals, their projects are likely to fail. On the other hand, if they pay attention to Big Data fundamentals, they will discover that Big Data analyses can be performed on standard computers. The simple lesson, that data trumps computation, will be repeated throughout this book, in examples drawn from well-documented events.

There are three crucial topics related to data preparation that are omitted from virtually every other Big Data book: identifiers, immutability, and introspection.

A thoughtful identifier system ensures that all of the data related to a particular data object will be attached to the correct object, through its identifier, and to no other object. It seems simple, and it is, but many Big Data resources assign identifiers promiscuously, with the end result that information related to a unique object is scattered throughout the resource, attached to other objects, and cannot be sensibly retrieved when needed. The concept of object identification is of such overriding importance that a Big Data resource can be usefully envisioned as a collection of unique identifiers to which complex data is attached. Data identifiers are discussed in Chapter 2.
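The idea of a resource as "a collection of unique identifiers to which complex data is attached" can be sketched in a few lines of Python. This is a minimal illustration, not the method described in the book: the class name, the method names, and the use of random UUIDs as identifiers are all assumptions made for the example.

```python
import uuid

class Resource:
    """A toy data resource: unique identifiers, each with data attached."""

    def __init__(self):
        self._data = {}   # identifier -> list of (tag, value) pairs

    def new_object(self):
        """Register a new data object and return its unique identifier."""
        identifier = str(uuid.uuid4())   # assumption: a random UUID per object
        self._data[identifier] = []
        return identifier

    def attach(self, identifier, tag, value):
        """Attach data to one existing object, and to no other."""
        if identifier not in self._data:
            raise KeyError(f"unknown identifier: {identifier}")
        self._data[identifier].append((tag, value))

    def retrieve(self, identifier):
        """Return a copy of everything attached to one object."""
        return list(self._data[identifier])

resource = Resource()
specimen = resource.new_object()
patient = resource.new_object()
resource.attach(specimen, "tissue", "liver")
resource.attach(patient, "birth_year", 1950)
print(resource.retrieve(specimen))
```

Because data can only be attached through an identifier that the resource itself issued, nothing is stored against an unknown or misspelled identifier.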

Immutability is the principle that data collected in a Big Data resource is permanent, and can never be modified. At first thought, it would seem that immutability is a ridiculous and impossible constraint. In the real world, mistakes are made, information changes, and the methods for describing information change. This is all true, but the astute Big Data manager knows how to accrue information into data objects without changing the pre-existing data. Methods for achieving this seemingly impossible trick are described in detail in Chapter 6.
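One common way to accrue information without changing pre-existing data is an append-only record in which every assertion carries a time-stamp. The sketch below assumes that approach; it is not taken from Chapter 6, and the class name, method names, and sample identifier are hypothetical.

```python
import time

class ImmutableObject:
    """A data object that only accrues assertions; nothing is edited or deleted."""

    def __init__(self, identifier):
        self.identifier = identifier
        self._assertions = []   # append-only list of (timestamp, tag, value)

    def add(self, tag, value, timestamp=None):
        """Record a new assertion; earlier assertions are left untouched."""
        stamp = time.time() if timestamp is None else timestamp
        self._assertions.append((stamp, tag, value))

    def history(self, tag):
        """All (timestamp, value) pairs recorded for a tag, oldest first."""
        ordered = sorted(self._assertions, key=lambda a: a[0])
        return [(stamp, value) for stamp, t, value in ordered if t == tag]

    def current(self, tag):
        """The most recently time-stamped value for a tag, or None."""
        past = self.history(tag)
        return past[-1][1] if past else None

record = ImmutableObject("patient-0001")          # hypothetical identifier
record.add("surname", "Smith", timestamp=1)
record.add("surname", "Jones", timestamp=2)       # a change is a new assertion
print(record.current("surname"), record.history("surname"))
```

A correction or update never overwrites anything. The old value stays in the history, and the current value is whichever assertion has the latest time-stamp.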

Introspection is a term borrowed from object oriented programming, not often found in the Big Data literature. It refers to the ability of data objects to describe themselves when interrogated. With introspection, users of a Big Data resource can quickly determine the content of data objects and the hierarchical organization of data objects within the Big Data resource. Introspection allows users to see the types of data relationships that can be analyzed within the resource and clarifies how disparate resources can interact with one another. Introspection will be described in detail in Chapter 4.
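Python's built-in introspection features give a feel for what this looks like in practice. In the sketch below, the class hierarchy, the `describe` function, and the sample values are hypothetical; only `type`, `vars`, and the `__mro__` attribute are standard Python.

```python
class DataObject:
    """Root class for every object in a toy resource (hypothetical hierarchy)."""

    def __init__(self, identifier, **data):
        self.identifier = identifier
        self.__dict__.update(data)   # store the supplied data as attributes

class Specimen(DataObject):
    pass

class TissueSpecimen(Specimen):
    pass

def describe(obj):
    """Ask an object for its class, its class lineage, and its contents."""
    return {
        "class": type(obj).__name__,
        "lineage": [c.__name__ for c in type(obj).__mro__ if c is not object],
        "content": dict(vars(obj)),
    }

sample = TissueSpecimen("specimen-0001", organ="liver", weight_grams=12.5)
report = describe(sample)
print(report)
```

The object itself reports its class, its ancestors in the class hierarchy, and the data it holds, so a user needs no external documentation to find out what kind of object it is.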

Another subject covered in this book, and often omitted from the literature on Big Data, is data indexing. Though there are many books written on the art and science of so-called back-of-the-book indexes, scant attention has been paid to the process of preparing indexes for large and complex data resources. Consequently, most Big Data resources have nothing that could be called a serious index. They might have a Web page with a few links to explanatory documents, or they might have a short and crude "help" index, but it would be rare to find a Big Data resource with a comprehensive index containing a thoughtful and updated list of terms and links. Without a proper index, most Big Data resources have limited utility for any but a few cognoscenti. It seems odd to me that organizations willing to spend hundreds of millions of dollars on a Big Data resource will balk at investing some thousands of dollars on a proper index.
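At its simplest, an index is a list of terms, each linked to the places where the term occurs. The sketch below builds one automatically from a handful of records. It is only a starting point under stated assumptions: the record identifiers and text are invented, the length cutoff is a crude stand-in for a stop-word list, and a thoughtful index of the kind described above would also need curated terms and synonyms.

```python
import re
from collections import defaultdict

def build_index(records, min_length=4):
    """Map each term to the sorted identifiers of the records containing it."""
    index = defaultdict(set)
    for identifier, text in records.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            if len(term) >= min_length:   # crude stand-in for a stop-word list
                index[term].add(identifier)
    # Return plain, alphabetized entries, as in a back-of-the-book index.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

records = {   # hypothetical record identifiers and text
    "rec-1": "Biopsy of liver shows hepatocellular carcinoma",
    "rec-2": "Liver function tests within normal limits",
}
index = build_index(records)
for term, locations in index.items():
    print(term, locations)
```

Each entry points back to record identifiers, so the index can be rebuilt or extended whenever new records are added.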

Aside from these four topics, which readers would be hard-pressed to find in the existing Big Data literature, this book covers the usual topics relevant to Big Data design, construction, operation, and analysis. Some of these topics include data quality, providing structure to unstructured data, data de-identification, data standards and interoperability issues, legacy data, data reduction and transformation, data analysis, and software issues. For these topics, discussions focus on the underlying principles; programming code and mathematical equations are conspicuously inconspicuous. An extensive Glossary covers the technical or specialized terms and topics that appear throughout the text. As each Glossary term is "optional" reading, I took the liberty of expanding on technical or mathematical concepts that appeared in abbreviated form in the main text. The Glossary provides an explanation of the practical relevance of each term to Big Data, and some readers may enjoy browsing the Glossary as a stand-alone text.

The final four chapters are non-technical, all dealing in one way or another with the consequences of our exploitation of Big Data resources. These chapters will cover legal, social, and ethical issues. The book ends with my personal predictions for the future of Big Data, and its impending impact on our futures. When preparing this book, I debated whether these four chapters might best appear in the front of the book, to whet the reader's appetite for the more technical chapters. I eventually decided that some readers would be unfamiliar with some of the technical language and concepts included in the final chapters, necessitating their placement near the end. Readers with a strong informatics background may enjoy the book more if they start their reading at Chapter 12.

Readers may notice that many of the case examples described in this book come from the field of medical informatics. The healthcare informatics field is particularly ripe for discussion because every reader is affected, on economic and personal levels, by the Big Data policies and actions emanating from the field of medicine. Aside from that, there is a rich literature on Big Data projects related to healthcare. As much of this literature is controversial, I thought it important to select examples that I could document, from reliable sources. Consequently, the reference section is large, with over 200 articles from journals, newspaper articles, and books. Most of these cited articles are available for free Web download.

Who should read this book? This book is written for professionals who manage Big Data resources and for students in the fields of computer science and informatics. Data management professionals would include the leadership within corporations and funding agencies who must commit resources to the project, and the project directors who must determine a feasible set of goals and who must assemble a team of individuals who, in aggregate, hold the requisite skills for the task: network managers, data domain specialists, metadata specialists, software programmers, standards experts, interoperability experts, statisticians, data analysts, and representatives from the intended user community. Students of informatics, the computer sciences, and statistics will discover that the special challenges attached to Big Data, seldom discussed in university classes, are often surprising, sometimes shocking.

By mastering the fundamentals of Big Data design, maintenance, growth, and validation, readers will learn how to simplify the endless tasks engendered by Big Data resources. Adept analysts can find relationships among data objects held in disparate Big Data resources, if the data is prepared properly. Readers will discover how integrating Big Data resources can deliver benefits far beyond anything attained from stand-alone databases.

Taxonomic Guide to Infectious Diseases, by Jules J. Berman, cover

Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms presents a new approach to medical microbiology and parasitology by explaining the biological properties that apply to classes of organisms and relating these properties to the collection of pathogenic species that belong to each class. This book drives down the complexity of the field of infectious disease by providing a concise but complete guide to the modern classification of disease organisms. Taxonomic Guide to Infectious Diseases begins with an overview of modern taxonomy, including a description of the kingdoms of life and the evolutionary principles underlying their class hierarchy. Each following chapter describes one of the 40 classes of infectious organisms, providing clinically relevant descriptions of the genera (groups of species) and of the individual diseases that are caused by members of the class. Taxonomic Guide to Infectious Diseases is an indispensable resource for serious healthcare students and providers working in the fields of infectious disease research, pathology, medical microbiology, parasitology, mycology, and virology.

Table of Contents, Taxonomic Guide to Infectious Diseases


PART I. Principles of Taxonomy (Chapters 1 - 3)

Chapter 1. The magnitude and diversity of infectious diseases
Chapter 2. What is a classification?
Chapter 3. The Tree of Life

PART II. Bacteria (Chapters 4 - 14)

Chapter 4. Overview of Class Bacteria
Chapter 5. The Alpha Proteobacteria
Chapter 6. Beta Proteobacteria
Chapter 7. Gamma Proteobacteria
Chapter 8. Epsilon Proteobacteria
Chapter 9. Spirochaetes
Chapter 10. Bacteroidetes and Fusobacteria
Chapter 11. Mollicutes
Chapter 12. Class Bacilli plus Class Clostridia
Chapter 13. Chlamydiae
Chapter 14. Actinobacteria

PART III. Eukaryotes (Chapters 15 - 37)

Chapter 15. Overview of Class Eukaryota and the soon-to-be-abandoned 
            Class Protoctista
Chapter 16. Metamonada
Chapter 17. Euglenozoa
Chapter 18. Percolozoa
Chapter 19. Apicomplexa
Chapter 20. Ciliophora (Ciliates)
Chapter 21. Heterokontophyta
Chapter 22. Amoebozoa
Chapter 23. Choanozoa
Chapter 24. Archaeplastida

PART IV. Animals (Chapters 25 - 32)

Chapter 25. Overview of Class Animalia
Chapter 26. Platyhelminthes (flatworms)
Chapter 27. Nematoda (roundworms)
Chapter 28. Acanthocephala
Chapter 29. Chelicerata
Chapter 30. Hexapoda
Chapter 31. Crustacea
Chapter 32. Craniata

PART V. Fungi (Chapters 33 - 37)

Chapter 33. Overview of Class Fungi
Chapter 34. Zygomycota
Chapter 35. Basidiomycota
Chapter 36. Ascomycota
Chapter 37. Microsporidia

PART VI. Nonliving Infectious Agents: Viruses and Prions (Chapters 38 - 46)

Chapter 38. Overview of Viruses
Chapter 39. Group I Viruses: double stranded DNA
Chapter 40. Group II Viruses: single stranded (+) sense DNA
Chapter 41. Group III Viruses: double stranded RNA
Chapter 42. Group IV Viruses: single stranded (+) sense RNA
Chapter 43. Group V Viruses: single stranded (-) sense RNA
Chapter 44. Group VI Viruses: single stranded RNA reverse transcriptase 
            viruses with a DNA intermediate in life-cycle
Chapter 45. Group VII Viruses: double stranded DNA reverse transcriptase 
Chapter 46. Prions
  Appendix I. Additional notes on taxonomy
  Appendix II. Number of occurrences of some common infectious diseases
  Appendix III. Organisms causing infectious disease in humans

Machiavelli's Laboratory, by Jules J. Berman, cover

Machiavelli's Laboratory is a satiric examination of ethics in science and medicine. The book is narrated by an evil scientist who tempts the reader to follow an immoral path towards career advancement. Side-stepping the serious, life-and-death dilemmas that fill traditional ethics texts, Machiavelli's Laboratory focuses on the "manners" of science and medicine. There is a right way and a wrong way for scientists to perform their daily rites: attending meetings, interpreting data, reviewing journal articles, hiring personnel, returning phone calls, and so on. Ill-mannered scientists and healthcare professionals can wreak havoc on themselves and their colleagues. Drawing on examples from the history of medicine, and from contemporary life, the evil narrator explains how unwary scientists and physicians commit immoral acts, often without realizing what they have done.

Although the book is written as entertainment, readers will learn a great deal about the history of science and medicine. The stories in Machiavelli's Laboratory are chosen from well-documented events occurring in ancient times and modern times. The petty and selfish activities of scientists throughout the ages can help us make better choices, today. This book exposes the common deceptions employed by academics, laboratory researchers, data analysts, journal editors, department chiefs, government bureaucrats, corporate executives, grants writers, physicians, and medical trainees.

Machiavelli's Laboratory is written for students, teachers, healthcare professionals, and scientists who need to know every trick in the book.

The Amazon site has an excellent "look-inside" that includes sample text and a table of contents.

Author Information
Book corrections

For additional information on supplemental documents and scripts for my previously published books, please visit my Resource Page.

   Last modified: March 23, 2015

Tags: informatics, perl programming, ruby programming, perl scripts, ruby scripts, ethics, medical ethics, fraud, dishonesty, misconduct, data fabrication, data falsification, data misinterpretation, history of science, history of medicine, politics of science, politics of medicine, cancer biology, precancer, tumors, neoplastic disorders, free ebook, books by Jules J. Berman, big data, heterogeneous data, complex datasets, Jules J. Berman, Ph.D., M.D., immutability, introspection, identifiers, de-identification, deidentification, confidentiality, privacy, massive data, lotsa data