Arguably the biggest bottleneck in AI is the lack of clean, parseable data. A great example of this is the healthcare problem that science.io is trying to solve:

We have a huge amount of healthcare data that could be analyzed and modeled to help diagnose, treat, or cure many diseases. However, this data is locked away in esoteric systems, on loose-leaf paper, or in the minds of doctors. If all of it were easily understood by computers, my guess is that the average lifespan would increase by at least 5%.

Data cleanliness can be approached from two ends:

  • Publish clean data as it is acquired
  • Create programs that can clean data on their own
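
As a rough illustration of the two ends, here is a minimal Python sketch. The blood-pressure record, its field names, and both function names are hypothetical, invented for this example; they are not taken from science.io or any real system. The first function checks a record at the moment it is acquired, and the second tries to recover the same structure from a free-text note after the fact.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    """Hypothetical structured record: one blood-pressure reading."""
    patient_id: str
    systolic: int
    diastolic: int


def publish_clean(patient_id: str, systolic: int, diastolic: int) -> Reading:
    """End 1: validate at acquisition so only clean records get published."""
    if not patient_id.strip():
        raise ValueError("patient_id is required")
    # Assumed plausibility bounds, chosen only for illustration.
    if not (0 < diastolic < systolic < 400):
        raise ValueError("implausible blood pressure")
    return Reading(patient_id.strip(), systolic, diastolic)


# Matches phrasings such as "120/80" or "120 over 80".
_BP = re.compile(r"(\d{2,3})\s*(?:/|over)\s*(\d{2,3})", re.IGNORECASE)


def clean_note(patient_id: str, note: str) -> Optional[Reading]:
    """End 2: recover a structured record from messy free text."""
    match = _BP.search(note)
    if match is None:
        return None  # nothing recoverable in this note
    return Reading(patient_id, int(match.group(1)), int(match.group(2)))
```

The trade-off this is meant to show: validating at the source is simple and reliable but requires cooperation from whoever records the data, while cleaning after the fact works on data that already exists but can only recover what the program knows how to recognize.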