On Definition of BigData

The term Big Data (or BigData) is widely used in scientific, educational, and business literature; however, there is no single definition that can be unreservedly called “canonical”. Careless use of the term Big Data to promote commercial software further emphasizes the importance of this issue. In this paper, we review definitions of Big Data and highlight the principal features attributed to it. We compare these features with the features of databases as described in Edgar F. Codd’s publications, and show that they are not unique to Big Data and can also be attributed to databases. Having studied C. Lynch’s original work, we propose a definition of Big Data based on the so-called preservation institution. The key point of this definition is a shift from a purely technical attitude towards public institutions. Since the current use of the term Big Data may lead to a loss of meaning, there is a need not only to spread best practices but also to eliminate or minimize the use of dubious or misleading ones.


Introduction
The study of any phenomenon requires a common dictionary of terms that ensures consistent communication and a shared understanding of the object being investigated. The term Big Data is widely used in relation to scientific, educational and business tasks, but there is no single specific definition that can be unreservedly called the "canonical" Big Data definition. The use of the term Big Data to promote commercial software intelligence solutions further aggravates the situation.
Clifford Lynch is considered the person who first introduced the term Big Data [1]. Curiously, his paper does not provide an explicit definition of Big Data. Instead, it discusses the challenges that arise from a significant increase in data volumes and considers new solutions that make it possible to obtain, transform, store, and analyze those huge datasets. As the key solution, C. Lynch formulated the foundation of what he called "preservation institutions".
Attempts to generate added value from data, and to produce new knowledge and methods for dealing with data, are reflected, in particular, in the development of information theory and database theory. For example, in 1970 Edgar F. Codd published the article [2] entitled "A relational model of data for large shared data banks". Keeping in mind that large is larger than big and huge is bigger than large, a few mostly rhetorical questions arise. Do the large data banks of the 1970s refer to bigger things than today's Big Data? Should we expect the appearance of Huge Data in the near future?
This paper presents our attempt to formalize the principal features that make data Big by nature.

Methods
In this investigation we review recently used Big Data definitions and use cases, and contrast them with each other to identify commonly accepted features. A few Big Data definitions are summarized in Table 1. If a definition was not given in English, an English translation is provided. The identified Big Data features are then discussed.

Results
Table 1 shows a few typical examples of modern Big Data definitions. The following features are usually declared to make data Big:
• large volumes of data;
• the need for large-scale computing power;
• the lack of structure in the data;
• the need for specialized hardware, software, and algorithms to deal with the data.
These features are usually stated in relative terms, i.e. in contrast with currently available solutions.

Table 1

| Definition of Big Data | Ref. |
| --- | --- |
| "The term BigData is used to describe massive digital datasets that require innovations in analytical techniques in order to exploit them and create new forms of value. Big data's vastness is not about absolute size but about the required scale of analysis" | [3] |
| Big Data is often understood as: large volumes of data arrays and the need to use large-scale computing power, custom software and methods for extracting value from data in a reasonable amount of time. (Translated by V. Rzhannikova, see original Russian text in [4]) | [4] |
| Big data is a term that defines not only the size of data sets that exceeds the capabilities of conventional databases, but also unstructured information, which can't be processed and analyzed by traditional algorithms. (Translated by V. Rzhannikova, see original Russian text in [5]) | [5] |
| The term Big Data refers to data sets whose size exceeds the capabilities of typical databases for storing, managing and analyzing information. (Translated by V. Rzhannikova, see original Russian text in [6]) | [6] |
| "Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value" | [7] |
| "Big data is a term for massive data sets having large, more varied and complex structure with the difficulties of storing, analyzing and visualizing for further processes or results" | [8] |
| "Big Data is a data that's too big, too fast, or too hard for existing tools to process" | [9] |
| "Datasets which could not be captured, managed, and processed by general computers within an acceptable scope." | [10] |
| "Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources" | [11] |

Discussion
As follows from Table 1, the most frequently mentioned feature of Big Data is the size of the data relative to what can be processed on a "regular" computer. This criterion is not very stable. Since the 1980s, typical random access memory (RAM) capacity has increased from a few kB to GB, i.e. by a factor of 10^6. Similarly, persistent storage capacity (hard disk drives, tapes, etc.) has increased by more than 9 orders of magnitude, from kB to TB, or even more if special devices or cloud storage solutions are considered. Thus, classifying data as Big in this sense depends on the currently available hardware, and the classification will eventually change.
Another widely used criterion is the required computational power (CPU or machine power). It does not change the picture described above, because the performance of computers has also increased greatly over the same period. For example, CPU clock frequencies rose from MHz to GHz, i.e. by a factor of 10^3, or 3 orders of magnitude.
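The growth factors quoted above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: it assumes decimal SI prefixes (k = 10^3, M = 10^6, G = 10^9, T = 10^12), and the names PREFIX_EXPONENT and orders_of_magnitude are hypothetical helpers introduced here, not part of any cited work.

```python
# Decimal SI prefixes as powers of ten (assumption: decimal, not binary, units).
PREFIX_EXPONENT = {"k": 3, "M": 6, "G": 9, "T": 12}


def orders_of_magnitude(old_prefix: str, new_prefix: str) -> int:
    """Return how many decimal orders separate two unit prefixes."""
    return PREFIX_EXPONENT[new_prefix] - PREFIX_EXPONENT[old_prefix]


ram_growth = orders_of_magnitude("k", "G")      # RAM: kB -> GB
storage_growth = orders_of_magnitude("k", "T")  # persistent storage: kB -> TB
clock_growth = orders_of_magnitude("M", "G")    # CPU clock: MHz -> GHz

print(ram_growth, storage_growth, clock_growth)  # prints: 6 9 3
```

The three values correspond to the factors 10^6, 10^9 and 10^3 discussed in the text.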
The requirement for specific hardware and/or software to process Big Data is also not unique to Big Data. For example, Edgar F. Codd considered specific problems of multiprogramming scheduling [12][13][14].
A definition of the term Big Data based solely on the structure of the data is also incomplete. The representation of complexly structured, unstructured, or semistructured data has been a well-known topic in the field of, e.g., relational databases [2][15][16] since their appearance.
The requirement to get the result in a reasonable time does not introduce any new features; it reads like an attempt to define technical (software and hardware) characteristics without formulating them explicitly. Such a criterion may be considered an attempt to move from the low-level technical domain into a domain constrained by business requirements.
The requirement for innovations to deal with data is also not innovative. For example, relational algebra was developed within the framework of relational databases to describe and investigate the properties of relations and the corresponding operations. Moreover, techniques were developed [2][17][18][19] that served as a guide for applying the newly developed relational databases and relational theory to practical business purposes. In addition, a few of the most popular products, innovative for their time, were evaluated against the requirements that a database must meet to be considered relational [20].
The requirement to unlock business value is neither specific to Big Data nor original. In a slightly old-fashioned manner, the same problem was discussed by Edgar F. Codd [21] in terms of "productivity" in his 1981 ACM Turing Award lecture entitled "Relational Database: A Practical Foundation for Productivity". The generation of value is sometimes explained in terms of extracting new knowledge that triggers, e.g., new use cases and user experiences, or significantly changes the way users interact with the data. Similar topics were also covered by Edgar F. Codd in relation to databases. For example, the ability to use natural language as a database query language was considered in [22]. Specific problems related to the representation and extraction of "semantic models" were covered in [23]. A related but different problem, namely how to describe, denote and manipulate (the representation of) missing information, was discussed in [24][25] on the basis of three-valued logic.
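To make the idea of three-valued logic for missing information more tangible, the following Python sketch implements a Kleene-style logic in which None stands for "unknown". It is an illustration under our own assumptions and not a reproduction of the formalism in [24][25]; the helper names (tri_not, tri_and, tri_or, older_than) and the sample records are hypothetical.

```python
from typing import Optional

Tri = Optional[bool]  # True, False, or None meaning "unknown"


def tri_not(a: Tri) -> Tri:
    """Negation: unknown stays unknown."""
    return None if a is None else (not a)


def tri_and(a: Tri, b: Tri) -> Tri:
    """Conjunction: a definite False dominates; otherwise unknown propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def tri_or(a: Tri, b: Tri) -> Tri:
    """Disjunction: a definite True dominates; otherwise unknown propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


# Hypothetical records; the age of the second person is missing.
records = [{"name": "A", "age": 30}, {"name": "B", "age": None}]


def older_than(record: dict, limit: int) -> Tri:
    """Comparison against a missing value yields unknown rather than False."""
    age = record["age"]
    return None if age is None else age > limit


# Only rows for which the condition is definitely True are selected.
selected = [r["name"] for r in records if older_than(r, 18) is True]
print(selected)  # prints: ['A']
```

The design choice worth noting is that a row with a missing value is neither accepted nor rejected by the condition itself: the condition evaluates to unknown, and the query decides how to treat that outcome.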
An attempt to classify data as Big based on its origin (nature) is also not reliable. Data from any research field (including but not limited to chemistry, physics, or computer vision, as, e.g., in [26][27][28]) may produce both huge and small datasets depending on the spatial scale and time step considered.
Below we discuss a few concrete Big Data definitions. Describing Big Data as data "that's too big, too fast, or too hard for existing tools to process" [9] works as a motto, but it is not a robust definition. It may have its place as a means to attract attention, to promote a technology, or to build a community around an ecosystem. Taken literally, it implies that Big Data cannot be processed at all. Such an assumption does not seem reasonable.
Another definition [10], stating that Big Data could not be "… processed by general computers…", requires further explanation. For example, the term "general computers" might refer to general-purpose computers as well as to a typical "average" computer in use for a given architecture or use case.
As one may see, a common problem in defining the term Big Data arises from the attempt to build a frame of reference based on relative conditions. The relative nature of Big Data is explicitly noted, e.g., in [3]. This means that any Big Data definition that directly or indirectly refers to currently available hardware and software capabilities will eventually become outdated.
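The consequence of such a relative definition can be illustrated with a toy sketch. The Python snippet below assumes a hypothetical criterion (a dataset is Big if it exceeds the capacity of a typical machine); the function is_big and the capacities per era are illustrative placeholders, not measurements taken from any source.

```python
def is_big(dataset_bytes: int, typical_capacity_bytes: int) -> bool:
    """Hypothetical relative criterion: data is Big if it exceeds typical capacity."""
    return dataset_bytes > typical_capacity_bytes


DATASET_BYTES = 10**10  # one fixed dataset of 10 GB

# Placeholder capacities of a "regular" computer in three successive eras
# (illustrative values only, not measured data).
capacity_by_era = {"early": 10**6, "middle": 10**9, "late": 10**12}

classification = {
    era: is_big(DATASET_BYTES, capacity)
    for era, capacity in capacity_by_era.items()
}
print(classification)
```

Under these assumptions the same, unchanged dataset is classified as Big in the first two eras and as not Big in the third, which is exactly the instability of a definition tied to current hardware.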
Keeping in mind the original paper [1], we define Big Data as data that requires "preservation institutions" to deal with it. Importantly, preservation institutions are not (only) about hardware or software requirements. They are strongly linked to organizational and legal issues as well as to communication with authorities and the public. Libraries are an example of such a mature institution with a long history. The challenge is to guarantee operation over periods exceeding a human lifetime or even the lifetime of some countries. Such a challenge cannot be addressed by an individual or a non-specialized organization. There is also a well-known Internet-related example, the Web Archive (https://web.archive.org/). Notably, the Web Archive (Internet Archive) is officially registered as a library.
In contrast with libraries, preservation institutions should additionally provide guarantees on specific means to deal with the data (to generate added value or knowledge). The value of such guarantees may be demonstrated by the following case. Users of the PyGlow (https://github.com/timduly4/pyglow) geophysical package were surprised when the National Geophysical Data Center (NGDC) of NOAA (National Oceanic and Atmospheric Administration) stopped providing (updating) several geophysical indices. PyGlow is designed to provide Python wrappers for a set of well-known geophysical models (IRI, the International Reference Ionosphere; HWM, the Horizontal Wind Model; etc.) and by design depended on those indices to run the models.
In this paper we considered definitions of the term Big Data from scientific publications only. Definitions provided with commercial products by corporations (such as Amazon, Google, IBM, Microsoft, Yandex, etc.) are out of the scope of this article. We evaluated the principal Big Data features identified in the available publications for whether they are unique or specific to Big Data. Based only on E. F. Codd's publications [2][12][13][14][15][16][17][18][19][20][21][22][23][24][25], we demonstrated that all the features "specific" to Big Data were considered long before, e.g., within database theory and the implementation of database management systems. We intentionally analyzed decades-old publications to demonstrate that the problems "specific" to Big Data appeared long before the term Big Data itself. Despite the relative youth of computer science, it is possible to illustrate similar problems with even earlier publications. However, we could not find a researcher (other than Edgar F. Codd) who addressed all those problems jointly and consistently.

Conclusion
In this paper we present a review of typical Big Data definitions. They all rely on the following features or a combination of them.
(1) The volume of the data.
(2) Technical characteristics of the required hardware and software.
(3) A few 'natural' or business-like characteristics such as time-to-process or time-to-deliver.
(4) The above characteristics are usually expressed in relative terms, in contrast with currently available data-processing means. Some researchers note the relative nature of Big Data explicitly.
We demonstrated that it is impossible to define the term Big Data in absolute units, because almost any data classified as Big at a given moment in time will eventually become un-Big due to improvements in hardware and software.
Based on the initial paper by C. Lynch [1], we proposed to define Big Data as data that requires preservation institutions to deal with it, i.e. to generate added value or to extract new knowledge. This definition is also relative in a sense, but it fundamentally shifts the key features from purely technical or business domains into the institutional domain.
The term Big Data is also used as an "umbrella" term that hides different branches of IT. Moreover, it is often impossible to recognize the specific technology hidden behind it. This usage of the term supports the "hype" around the corresponding technologies and results in significant development in educational and business applications. However, it also makes it harder to build a common dictionary of terms and obscures the understanding of the object being considered.
Thus, just as Edsger Dijkstra's letter "Go To Statement Considered Harmful" [29] triggered a revolution in software development aimed at abolishing harmful practices, today there is a strong need to eliminate harmful practices in the field of Big Data.