– Narrow specialization. AI algorithms are good at performing targeted tasks, but they do not generalize their knowledge. Unlike a human, an AI trained to play chess cannot play another, similar game such as checkers. Moreover, even deep learning copes poorly with data that deviates from its training examples. To use the same ChatGPT effectively, you need to be a subject-matter expert from the start, formulate a deliberate and clear request, and then verify the correctness of the answer.
– Costs of creation and operation. Creating neural networks requires a lot of money. According to a report by Guosheng Securities, training the natural language processing model GPT-3 costs about $1.4 million, and training a larger model may take up to $2 million. ChatGPT alone requires over 30,000 NVIDIA A100 GPUs to handle all user requests, and the electricity costs about $50,000 a day. A team and resources (money, equipment) are needed to keep such systems running, and the cost of the engineers who maintain them must also be taken into account.
P.S.
Machine learning is moving towards an ever lower threshold of entry. Very soon it will be like a website builder, where basic use requires no special knowledge or skills.
The creation of neural networks and data companies is already developing on an «as a service» model, for example DSaaS – Data Science as a Service.
An introduction to machine learning can begin with AutoML, its free version, or with DSaaS that includes an initial audit, consulting and data labelling. Even data labelling can be obtained for free. All of this lowers the threshold of entry.
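As a purely illustrative sketch of such a low-threshold workflow, here is what an AutoML-style run can look like. Since no specific service is named above, scikit-learn's built-in model search is used as a stand-in, and the dataset and parameter grid are arbitrary examples.

```python
# Illustrative sketch of a low-threshold, AutoML-style workflow:
# the user supplies data, and model tuning happens automatically.
# scikit-learn's grid search stands in for a dedicated AutoML service.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The "automatic" part: try several candidate configurations and keep the best one.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Hold-out accuracy:", search.score(X_test, y_test))
```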
Industry-specific neural networks will be created, and recommendation networks will develop more actively: so-called digital advisers, or solutions of the «decision support system (DSS)» class for various business tasks.
I discussed the topic of AI in detail in a separate series of articles, available via the QR code and link.
Big Data
Big data is the collective name for structured and unstructured data that comes in volumes too large to process manually.
The term is also often understood to include the tools and approaches for working with such data: how to structure it, analyse it and use it for specific tasks and purposes.
Unstructured data is information that has no predefined structure or is not organized in a specific order.
Areas of Application
– Process optimization. For example, large banks use big data to train chatbots: programs that can replace a live employee for simple questions and, if necessary, hand the conversation over to a specialist (a sketch of this routing idea follows after this list). Big data also helps detect where processes generate losses.
– Forecasting. By analysing large volumes of sales data, companies can predict customer behaviour and demand depending on the season or the placement of goods on the shelf. Big data is also used to predict equipment failures.
– Model building. Analysing equipment data helps build models of the most profitable mode of operation or economic models of production activity.
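The routing idea mentioned under process optimization can be shown with a toy sketch. This is not any bank's actual system; the FAQ entries and keywords are invented for illustration.

```python
# Toy illustration of the routing idea: the bot answers simple questions itself
# and hands anything it does not recognize over to a human specialist.
# The FAQ keywords and answers below are invented for illustration.
FAQ = {
    "working hours": "Our branches are open 9:00 to 18:00 on weekdays.",
    "block my card": "You can block your card instantly in the mobile app.",
}

def handle_request(message: str) -> str:
    text = message.lower()
    for keyword, answer in FAQ.items():
        if keyword in text:
            return answer                       # simple question: the bot replies itself
    return "Transferring you to a specialist."  # anything else: escalate to a human

print(handle_request("What are your working hours?"))
print(handle_request("I want to dispute a transaction"))
```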
Sources of Big Data
– Social – all uploaded photos and sent messages, calls, in general everything that a person does on the Internet.
– Machine – generated by machines, sensors and the «Internet of things»: smartphones, smart speakers, light bulbs and smart home systems, video cameras in the streets, weather satellites.
– Transactions – purchases, transfers of money, deliveries of goods and operations with ATMs.
– Corporate databases and archives. Some sources do not count them as big data, and this point is disputed; the main objection is that such data often fails the criterion of «renewability». More on this a little below.
Big Data Categories
– Structured data. It has a defined table-and-tag structure, for example Excel tables that are linked to each other.
– Semi-structured, or loosely structured, data. It does not fit the strict structure of tables and relations, but it contains «tags» that separate semantic elements and give the records a hierarchical structure, like the information in e-mails.
– Unstructured data. It has no structure, order or hierarchy at all: plain text, like in this book, image files, audio and video.
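To make the three categories concrete, here is a small, purely illustrative comparison; the records and e-mail fields are invented examples.

```python
# Purely illustrative examples of the three categories described above.

# Structured: rows with a fixed schema, as in linked tables.
structured = [
    {"order_id": 1, "customer_id": 42, "amount": 19.99},
    {"order_id": 2, "customer_id": 17, "amount": 5.50},
]

# Semi-structured: no rigid table schema, but «tags» (keys) give the record
# a hierarchy, much like the headers and body of an e-mail.
semi_structured = {
    "from": "alice@example.com",
    "to": "bob@example.com",
    "subject": "Quarterly report",
    "body": "Hi Bob, the numbers are attached...",
}

# Unstructured: raw content with no inherent order or hierarchy.
unstructured = "Plain text, like in this book, as well as image, audio and video files."
```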
Such data is processed by special algorithms: first the data is filtered according to conditions set by the researcher, then sorted and distributed among individual computers (nodes). The nodes then process their blocks of data in parallel and pass the results of the computation on to the next stage.
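The pipeline described above (filter, sort, split into blocks, compute in parallel, combine) can be sketched on a single machine. The example below uses Python's multiprocessing pool to stand in for cluster nodes; the filter condition and aggregation are arbitrary.

```python
# Single-machine sketch of the pipeline described above: filter by a
# researcher-defined condition, sort, split into blocks, let the "nodes"
# process the blocks in parallel, then combine the partial results.
from multiprocessing import Pool

def node_compute(block):
    # Each node works on its own block independently.
    return sum(block)

def main():
    records = list(range(1_000))

    # 1. Filter by the researcher's condition (here: keep even values only).
    filtered = [r for r in records if r % 2 == 0]

    # 2. Sort and split into blocks, one per node.
    filtered.sort()
    n_nodes = 4
    blocks = [filtered[i::n_nodes] for i in range(n_nodes)]

    # 3. Nodes compute their blocks in parallel; results go to the next stage.
    with Pool(n_nodes) as pool:
        partial_results = pool.map(node_compute, blocks)

    # 4. The next stage combines the partial results (here: a simple total).
    print("Combined result:", sum(partial_results))

if __name__ == "__main__":
    main()
```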
Characteristics of Big Data
According to different sources, big data has three, four, or by some accounts five, six or even eight components. Let us focus on what I consider the most sensible concept: four components.
– Volume: there must be a lot of information; the usual benchmark starts at around 2 terabytes. Companies can collect enormous amounts of data, and its size becomes a critical factor in analytics.
– Velocity: the data must be constantly updated, otherwise it becomes obsolete and loses value. Almost everything happening around us (search queries, social networks) produces new data, much of which can be used for analysis.
– Variety: the generated information is heterogeneous and can come in different formats: video, text, tables, numerical sequences, sensor readings.
– Veracity: the quality of the data being analysed. The data must be reliable and valuable enough for analysis to be trusted. Low-quality data also contains a high percentage of meaningless information, called noise, which has no value.
Limitations on Implementing Big Data
The main limitations are the quality of the raw data, critical thinking (what do we want to see? what pain point are we addressing? this is where ontological models come in) and the right selection of competencies. And most importantly, people: data scientists are the ones who work with the data. There is a common joke that 90% of data scientists are in fact «data satanists».
Digital Twins
A digital twin is a digital (virtual) model of an object, system, process or person. By design, it accurately reproduces the form and behaviour of the physical original and stays synchronized with it. The deviation between the twin and the real object must not exceed 5%.
It must be understood that creating a completely exact digital twin is almost impossible, so it is important to determine which domain it is rational to model.
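As a minimal sketch of the synchronization idea, the fragment below updates a virtual model from readings of the physical object and flags parameters whose relative deviation exceeds the 5% threshold mentioned above; the class, parameter names and numbers are all invented for illustration.

```python
# Illustrative sketch: a virtual model tracks a physical object's readings and
# flags any parameter whose relative deviation exceeds the 5% threshold.
# The class, parameter names and values are invented for illustration.
MAX_RELATIVE_ERROR = 0.05

class DigitalTwin:
    def __init__(self, initial_state: dict):
        self.state = dict(initial_state)  # the virtual model's view of the object

    def sync(self, sensor_readings: dict) -> dict:
        """Compare the model with fresh sensor data, then update the model."""
        deviations = {}
        for key, measured in sensor_readings.items():
            modeled = self.state.get(key, measured)
            if measured:  # avoid division by zero
                deviations[key] = abs(modeled - measured) / abs(measured)
        self.state.update(sensor_readings)
        return {k: v for k, v in deviations.items() if v > MAX_RELATIVE_ERROR}

twin = DigitalTwin({"temperature_c": 70.0, "rpm": 1500.0})
out_of_sync = twin.sync({"temperature_c": 76.0, "rpm": 1510.0})
print("Parameters above the 5% threshold:", out_of_sync)  # temperature is ~7.9% off
```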
The concept of the digital twin was first described in 2002 by Michael Grieves, a professor at the University of Michigan. In his book «The Origin of Digital Twins» he divided the twin into three main parts:
1) physical product in real space;
2) virtual product in virtual space;
3) data and information that combine virtual and physical products.
The digital twin itself can be:
– a prototype – an analogue of the real object in the virtual world, containing all the data needed to produce the original;
– a copy – a history of operation and data about