last modified: 2017-10-10
Find the reading list for this session on Pinterest: https://fr.pinterest.com/seinecle/what-is-data-what-is-big-data/
1. Big data is a mess
Jokes aside, defining big data and what it covers needs a bit of precision. Let’s bring some clarity.
2. The 3 Vs
Big data is usually described with the "3 Vs":
V for Volume
The size of datasets available today is staggering (e.g. Facebook had 250 billion pictures in 2016).
Data volumes are also growing at an accelerating rate. Depending on the source, "90% of all the data in the world has been generated over the last two years" (a statement from 2013) or, put differently, "More data will be created in 2017 than the previous 5,000 years of humanity".
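To get a feel for what the first quote implies, here is a small back-of-the-envelope sketch. It assumes (this is an assumption, not something the quote states) that the "90% in two years" pattern holds steadily, so that the growth rate is constant:

```python
import math

share_recent = 0.90   # share of all data created within the window (from the quote)
window_years = 2      # length of the window, in years (from the quote)

# If 90% of the data is recent, the data that existed before the window
# is only 10% of today's total: the total was multiplied by 10 in two years.
growth_over_window = 1 / (1 - share_recent)

# Assumption: constant growth, so the yearly factor is the square root of 10.
growth_per_year = growth_over_window ** (1 / window_years)

# Time needed for the total volume of data to double at that rate.
doubling_time_years = math.log(2) / math.log(growth_per_year)

print(f"x{growth_over_window:.0f} in {window_years} years")
print(f"x{growth_per_year:.2f} per year")
print(f"doubles every {doubling_time_years * 12:.1f} months")
```

Under this assumption, the total volume of data would be multiplied by roughly 3 every year. The quote itself is not a precise measurement, so the figures are only an order of magnitude.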
V for Variety
This one is a bit less intuitive. "Variety" here means that data is increasingly unstructured and messy, and this is an important characteristic of the "big data" phenomenon. To caricature a bit, picture a shift from A to B:
A - Structured data:
phonebooks, accounting books, governmental statistics… anything that can be represented as well organized tables of numbers and short pieces of text with the expected format, size, and conventions of writing.
B - Unstructured data:
datasets made of "unruly" items: text of any length, without proper categorization, encoded in different formats, including possibly pictures, sound, geographical coordinates and what not…
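The contrast between A and B can be sketched in a few lines of Python. All the records below are made up for illustration; the point is only that structured data can be queried by field name right away, while unstructured items must first be identified one by one:

```python
# A - Structured: every record has the same fields, types and conventions.
# (names and phone numbers are invented)
phonebook = [
    {"name": "Alice Martin", "city": "Lyon", "phone": "+33 4 00 00 00 01"},
    {"name": "Bob Durand", "city": "Paris", "phone": "+33 1 00 00 00 02"},
]

# B - Unstructured: items of any kind and length, with no shared format.
unruly = [
    "Loved it!!! will buy again :)",      # short free text
    {"lat": 48.8566, "lon": 2.3522},      # geographical coordinates
    b"\x89PNG\r\n\x1a\n",                 # first bytes of a PNG picture
    "A very long review ... " * 50,       # text of arbitrary length
]

# Structured data can be queried directly by field name.
lyon_names = [row["name"] for row in phonebook if row["city"] == "Lyon"]

# With unstructured data, we first have to find out what each item even is.
kinds = [type(item).__name__ for item in unruly]

print(lyon_names)
print(kinds)
```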
V for Velocity
In a nutshell, the speed of creation and communication of data is accelerating (examples taken from here):
Facebook hosts 250 billion pics? It receives 900 million more pictures per day
Examining tweets can be done automatically (with computers). If you want to connect to Twitter to receive tweets in real time as they are tweeted, be prepared to receive in excess of 500 million tweets per day. Twitter calls this service the "firehose", which reflects the velocity of the stream of tweets.
Sensor data is bound to speed up as well. While pictures, tweets and individual records are single items sent at intervals, more and more sensors send data in a continuous stream (measures of movement, sound, etc.).
So, velocity poses challenges of its own: a system that can handle (store, analyze) say 100 GB of data in a given time (a day or a month) might not be able to do it in, say, a single second. Big data refers to the problems and solutions raised by the velocity of data.
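The daily figures quoted above are easier to grasp once converted to a rate per second. A minimal sketch, assuming for simplicity that items arrive evenly spread over the day (real traffic has peaks, so actual rates at busy times would be higher):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(items_per_day):
    """Average arrival rate, assuming items are spread evenly over the day."""
    return items_per_day / SECONDS_PER_DAY


tweets_per_second = per_second(500_000_000)      # Twitter figure quoted above
pictures_per_second = per_second(900_000_000)    # Facebook figure quoted above

print(f"tweets:   about {tweets_per_second:,.0f} per second")
print(f"pictures: about {pictures_per_second:,.0f} per second")
```

This gives close to 5,800 tweets and more than 10,000 pictures every second, on average. A system built to process one day of data overnight is not necessarily able to keep up with a stream arriving at that pace.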
A 4th V can be added, for Veracity
Veracity relates to trustworthiness and compliance: is the data authentic? Has it been corrupted at any step of its processing?
We will devote a session of this course to data compliance, which is a broad topic covering data privacy, cybersecurity, and the societal impacts of data.
3. What is the minimum size to count as "big data"? It’s all relative
There is no "threshold" or "minimum size" of a dataset where "data" would turn from "small data" to "big data".
It is more of a relative notion: it is big data if current IT systems struggle to cope with the datasets.
(see the Wikipedia definition, which develops this point.)
"Big data" is a relative notion… how so?
1. relative to time
what was considered "big data" in the early 2000s would be considered "small data" today, because we have better storage and computing power today.
this is a never ending race: as IT systems improve to deal with "current big data", data gets generated in still larger volumes, which calls for new progress / innovations to handle it.
2. relative to the industry
what is considered "big data" by non-tech SMEs (small and medium-sized enterprises) can be considered trivial to handle by tech companies.
3. not just about size
the difficulty for an IT system to cope with a dataset can be related to its size (try analyzing 2 TB of data on your laptop…), but also to the content of the data.
For example the analysis of customer reviews in dozens of languages is harder than the analysis of the same number of reviews in just one language.
So the general rule is: the less the data is structured, the harder it is to use it, even if it’s small in size (this relates to the "V" of variety seen above).
4. no correlation between size and value
Big data is often called "the new oil", as if it would flow like oil and would power engines "on demand".
Actually, big data is created: it needs work, conception and design choices to even exist (what do I collect? how do I store it? what structure do I give to it?). The human intervention in creating data determines largely whether data will be of value later.
Example: Imagine customers can write online reviews of your products. These reviews are data. But if you store these reviews without an indication of who authored each review (maybe because reviews can be posted without logging in), then the reviews become much less valuable. Simple design decisions about how the data is collected, stored and structured have a huge impact on the value of the data.
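The review example can be sketched as follows. The products, texts and the `author_id` field are all hypothetical; the sketch only shows that a question as simple as "how many reviews did each customer write?" cannot be answered at all if the author was never stored:

```python
from collections import Counter

# Design A: only the text of the review is stored.
reviews_anonymous = [
    {"product": "kettle", "text": "Broke after a week."},
    {"product": "toaster", "text": "Broke after a week."},
    {"product": "kettle", "text": "Works fine."},
]

# Design B: the same reviews, stored with a (hypothetical) author identifier.
reviews_with_author = [
    {"product": "kettle", "author_id": "u1", "text": "Broke after a week."},
    {"product": "toaster", "author_id": "u1", "text": "Broke after a week."},
    {"product": "kettle", "author_id": "u2", "text": "Works fine."},
]


def reviews_per_author(reviews):
    """Count reviews per author; return None if authors were never stored."""
    if not all("author_id" in review for review in reviews):
        return None
    return Counter(review["author_id"] for review in reviews)


print(reviews_per_author(reviews_anonymous))
print(reviews_per_author(reviews_with_author))
```

With design B we can see that the two identical negative reviews come from the same person, which changes how we read them. With design A this information is lost for good, no matter how many reviews are collected.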
So, in reaction to large, unstructured and badly curated datasets with low value at the end, a notion of "smart data" is sometimes put forward: data which can be small in size but which is well curated and annotated, enhancing its value (see also here).
5. as an expression, "big data" is evolving
It is interesting to note that "hot" expressions, like "big data", tend to wear out fast. They get over-hyped, are used in all circumstances, and become vague and oversold. For big data, we observe that it is peaking in 2017, while new terms appear:
What are the differences between these terms?
"Big data" is by now a generic term
"Machine learning" puts the focus on the scientific and software engineering capabilities that make it possible to do something useful with the data (predict, categorize, score…)
"Artificial intelligence" puts the emphasis on human-like possibilities afforded by machine learning. Often used interchangeably with machine learning.
And "data science"? This is a broad term encompassing machine learning, statistics, … and any analytical methods to work with data and interpret it. Often used interchangeably with machine learning. "Data scientist" is a common job description in the field.
4. Where did big data come from?
1. Data got generated in bigger volumes because of the digitalization of the economy
2. Computers became more powerful
3. Storing data became cheaper every year
4. The mindset changed as to what "counts" as data
Unstructured (see above for the definition of "unstructured") textual data was usually not stored: it takes a lot of space, and the software to query it was not sufficiently developed.
Network data (also known as graphs) (who is friends with whom, who likes the same things as whom, etc.) was usually neglected as "not true observation", and was hard to query. Social networks like Facebook did a lot to make businesses aware of the value of graphs (especially social graphs).
Geographical data has become widely accessible: specific (and expensive) databases have existed for a long time to store and query "place data" (regions, distances, proximity info…) but easy-to-use solutions have multiplied recently.
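To make the idea of network data more concrete, here is a minimal sketch of a tiny graph in Python. The people and interests are invented, and real systems rely on dedicated graph databases or libraries rather than a plain dictionary; this only illustrates the kind of question ("who likes the same things as whom?") that graphs make easy to ask:

```python
# A tiny graph linking people to the things they like (hypothetical data).
likes = {
    "Ana": {"jazz", "cycling"},
    "Ben": {"jazz", "chess"},
    "Chloe": {"chess", "cycling", "jazz"},
    "Dev": {"cooking"},
}


def shared_interests(graph, person):
    """Map every other person to the interests they share with `person`."""
    mine = graph[person]
    return {
        other: mine & theirs          # set intersection: interests in common
        for other, theirs in graph.items()
        if other != person and mine & theirs
    }


print(shared_interests(likes, "Ana"))
print(shared_interests(likes, "Dev"))
```

Here Ana shares jazz with Ben, and both jazz and cycling with Chloe, while Dev has nothing in common with anyone.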
5. With open source software, the rate of innovation accelerated
In the late 1990s, a rapid shift in the habits of software developers kicked in: they tended to use more and more open source software, and to release their own software as open source. Until then, most software was "closed source": you buy software without the possibility to reuse / modify / augment its source code. You just use it as is.
Open source software made it easy to get access to software built by others and use it to develop new things. Today, all the most popular software in machine learning is free and open source.
See the Wikipedia article for a developed history of open source software: https://en.wikipedia.org/wiki/History_of_free_and_open-source_software
6. Hype kicked in
The Gartner hype cycle is a tool measuring the maturity of a technology, differentiating expectations from actual returns:
This graph shows the pattern that all technologies follow over their lifetime:
at the beginning (left of the graph), an invention or discovery is made in a research lab, somewhere. Some news reporting is done about it, but without much noise.
then, the technology starts piquing the interest of journalists, consultants, professors, industries… expectations grow about the possibilities and promises of the tech. "With it we will be able to [insert amazing thing here]"
the top of the bump is the "Peak of Inflated Expectations". All techs tend to be hyped and even over-hyped. This means the tech is expected to deliver more than it actually will. People get carried away.
then follows the "Trough of Disillusionment". Doubt sets in. People realize the tech is not as powerful, easy, cheap or quick to implement as it first seemed. Newspapers start reporting depressing news about the tech, some bad buzz spreads.
then: the "Slope of Enlightenment". Heads cool down, expectations get in line with what the tech can actually deliver. Markets stabilize and consolidate: some firms close and key actors continue to grow.
then: the "Plateau of Productivity". The tech is now mainstream.
(of course, any technology can "die", that is fall into disuse, before reaching the right side of the graph).
In 2014, big data was near the top of the curve: it was getting a lot of attention, but its practical uses in 5 to 10 years were still uncertain. There were "great expectations" about its future, and these expectations drive investment, research and business in big data.
In 2017, "big data" is still at the top of the hyped technologies, but is broken down into "deep learning" and "machine learning". Note also the "Artificial General Intelligence" category:
6. Big data transforms industries, and has become an industry in itself
Firms active in "Big data" divide into many subdomains: the industry that manages the IT infrastructure for big data, the consulting firms, software providers, industry-specific applications, etc…
→ the field is huge.
Matt Turck, VC at FirstMarkCap, creates a sheet every year to visualize the main firms active in these subdomains. This is the 2017 version:
You can find a high res version of this pic, an Excel sheet version, and a very interesting comment all here.
5. What is the future of big data?
1. More data is coming
The Internet of things (IoT) designates the extension of the Internet to objects, not just web pages and emails (see here for details).
These connected objects are used to do things (display stuff on screen, pilot robots, etc.) but also very much to collect data in their environments (through sensors).
The development of connected objects will lead to a tremendous increase in the volume of data collected.
We have a session devoted to IoT later in this course. You can already start reading the documents for this session:
2. Discussions about big data will fuse with AI
Enthusiasm, disappointment, bad buzz, worries, debates, promises… the discourse about AI will grow. AI is fed on data, so the future of big data will intersect with what AI becomes.
We have a session devoted to data science / machine learning / AI later in this course. You can already start reading the documents for this session:
3. Regulatory frameworks will grow in complexity
Societal impacts of big data and AI are not trivial, ranging from racial, financial and medical discrimination to giant data leaks, or economic (un)stability in the age of robots and AI in the workplace.
Public regulations at the national and international levels are trying to catch up with these challenges. As technology evolves quickly, we can anticipate that societal impacts of big data will take center stage.
We have a session devoted to data compliance in this course. You can already start reading the documents for this session: