
4.2 Background

Knuth (1997) defined an algorithm as a set of rules that defines a sequence of operations such that each rule is effective and definite and such that the sequence terminates in finite time. Under the paradigm of imperative programming, a programmer explicitly formulates these computational rules of the system in a programming language. Studying an algorithm and its set of rules has proven to provide insights into the values and ideas inscribed into information systems. For example, Mackenzie (2013) argued that software developers and programmers live in ‘regimes of anticipation through technical practices’. Suchman (2012) proposed the concept of ‘sociomaterial configurations’ to draw our attention to the ‘imaginaries’ and ‘materialities’ that technologies ‘join together’ (p. 48). Central to these considerations is the assumption that software developers are expert designers of information systems. Interviews with software developers are, therefore, an important method in fields such as critical code studies or software studies.

Such interviews have become a standard method in qualitative study designs. These empirical inquiries rely on a framing of programmers as programming subjects that exert power over design decisions while implementing a system. Such assumptions are also very prominent in software development approaches such as co-design or participatory design, which normatively claim that those affected by (future) information systems should have a say in design decisions, e.g. Bratteteig and Wagner (2014) and Vines et al. (2013). Hence, most software studies frame software developers and programmers as making inscriptions into their code and algorithms, which in turn allows for critical analysis.

In contrast, Machine Learning-based systems are trained, not programmed. While they are still trained by somebody, this process is very different from imperative programming (Mackenzie, 2013). To train an ML system, a mathematical model is formulated and a cost function is defined. The parameters of the mathematical model are optimised to minimise the cost function with respect to certain input and output data. For example, Mackenzie (2013) states that:

The long-standing AI question of how to get machines to ‘learn’ is less important in machine learning today. Rather, machine learning is largely focused on optimising the predictive power of statistical models. In contrast to the many imaginings of AI as some form of omniscient expert, the mundane and increasingly pervasive use of machine learning in diverse domains of social media, finance, security, many natural, clinical and engineering sciences, is largely in the service of increased predictivity. By predictivity, I refer not to prediction as such, but to the scope and diversity of prediction. Predictivity is gauged less by verification, than by optimisation.
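Stripped down to its core, the training step described above can be written as a generic minimisation problem (a schematic formulation; the symbols are ours and not tied to any particular model):

$$\hat{\theta} = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} L\big(f_{\theta}(x_i), y_i\big)$$

where $f_{\theta}$ denotes the mathematical model with parameters $\theta$, $L$ the cost function, and $(x_i, y_i)$ the input and output data.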

With machine learning, very little is gained from studying the generic set of rules used to ‘train’ the model by minimising a cost function. Algorithms like gradient descent or expectation-maximisation, which are used to train ML-based systems, merely describe the optimisation routine and are algorithmically trivial. Therefore, machine learning models escape critical examination using established methods from, e.g., software studies. The code of ML-based systems cannot be studied as text in which intentions of programming subjects are inscribed.
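To make this concrete, consider a minimal sketch of a gradient-descent loop (illustrative only; cost_gradient and initial_params are hypothetical placeholders, not part of any library):

# generic gradient descent: repeatedly step against the gradient of the cost
def gradient_descent(cost_gradient, initial_params, learning_rate=0.01, steps=1000):
    params = initial_params
    for _ in range(steps):
        # move the parameters a small step in the direction that reduces the cost
        params = params - learning_rate * cost_gradient(params)
    return params

For example, gradient_descent(lambda w: 2 * (w - 3.0), initial_params=0.0) converges to w ≈ 3, the minimiser of (w - 3)^2. Nothing in this loop refers to spam, emails, or any particular model; the same routine applies whatever cost function the data defines.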

Figure 4.1 demonstrates this crucial difference using a concrete example. The figure shows a full-fledged Python implementation of an ML-based system that can detect spam messages using a Support Vector Machine. With sufficient examples of spam and non-spam emails in the file emails.csv, the code can be used to train a working spam filter. Due to the user-friendly ML library scikit-learn by Pedregosa et al. (2011b), training such a system only requires three lines of library imports and five lines of code. The code illustrates why a focus different from algorithms is required. We present this code example to highlight a number of things. First, thanks to powerful open-source ML libraries, training ML systems has become a straightforward task that can be accomplished by re-using existing code. Second, there is little that is task-specific in the code of ML systems. Third, ML systems depend entirely on the data used as input/training data.

The yellow highlights in Figure 4.1 emphasise all those aspects in the code that are specific to spam filtering. The highlights show that only the data that is loaded and the dimensionality of the data are specific to the spam filtering application. The file emails.csv could easily be replaced by a file called cancer.csv, thus turning the spam detection system into a cancer screening tool.

# load Python libraries
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# load data
data = read_csv("emails.csv").values

# split data into features (X) and targets (y)
X, y = data[:, 1:19], data[:, 19]

# split data into 80% train and 20% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

# train classifier on training data
clf = SVC()
clf.fit(X_train, y_train)

Figure 4.1: Self-contained Python implementation of an ML system that detects spam in emails. Yellow highlights indicate the aspects that are specific to spam filtering.

For the self-contained example in Figure 4.1, it would also be trivial to replace the ML algorithm used to train the model. The Support Vector Machine could easily be replaced by a Neural Network or a Decision Tree. This would only require changing two lines: importing something other than the Support Vector Classifier (SVC) and assigning this other classifier to the variable clf, as sketched below. The ML system demonstrates clearly how central data is to ML. Algorithmically, there is nothing specific about this example; the code is trivial and could be used by a layperson to train the ML system.
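For instance, swapping in a Decision Tree would amount to the following two changes (a sketch; scikit-learn's DecisionTreeClassifier here stands in for any alternative classifier):

# replaces: from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
# replaces: clf = SVC()
clf = DecisionTreeClassifier()

All remaining lines of Figure 4.1 stay untouched.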

Expertise and the Role of Tutorials in the Formation of ML Practice

The point of departure for our study is Mackenzie’s (2017) book Machine Learners: Archaeology of a Data Practice, which describes ML as a situationally aware calculative knowledge practice. The book explores what it means to learn ML, arguing that ML crossed an epistemic threshold formulated in a statistical apparatus. Mackenzie (2017) further argues that the data practices associated with ML delimit a positivity of knowledge.

In this article, we focus on informal education in the form of online tutorials to understand how ML is framed and defined in practice. The analysis of ML tutorials is an expedient endeavour considering the increasing application of and demand for Machine Learning. In 2018, Tencent estimated the global number of Artificial Intelligence researchers and industry practitioners to be between 200,000 and 300,000 people (Kahn, 2018). Compared to 18 million software developers in the world, this means that only about one per cent of software developers have the skills to engage with AI and ML as novel paradigms of programming. Considering the habitual practice of software developers to self-educate and the increasing demand for ML techniques, informal education in Machine Learning can be expected to grow significantly. A Stack Overflow (2019) survey demonstrated that informal learning and self-teaching are important ways of knowledge acquisition in a world in which development frameworks and technologies rapidly change. 86.8% of professional software developers (N = 71,796) stated that they have taught themselves a new programming language, framework, or tool without taking a formal course. This is considerably higher than the 60.1% that have taken online courses in programming or software development (e.g. a MOOC).

Our focus on ML tutorials is motivated by the important role that informal education plays in software development and computer science. Professional software developers frequently do not have a formal background in computer science and are used to teaching themselves new technologies. This is especially important since many of the technologies used in practice were invented and introduced long after they finished their formal education. The Stack Overflow (2019) survey of professional software developers (N = 71,796) found that 49.1% have a Bachelor’s degree, 25.4% have a Master’s degree, and 3.1% have a Doctoral degree. Of those developers that studied at university level (N = 66,823), only 63.3% named computer science, computer engineering, or software engineering as their undergraduate major. 6.9% majored in information systems, information technology, or system administration, and 3.9% in mathematics or statistics. This means that only about three-fourths (77.6%) of professional software developers have a university-level education. Of those, less than two-thirds have undergraduate training in computer science.

This is problematic because understanding and effectively applying Machine Learning requires practitioners to have a strong background in linear algebra and statistics (Goodfellow et al., 2016). This lack of education regarding the application of ML is even more problematic considering that a large proportion of software developers graduated college before recent advances in machine learning and deep learning were published and before curricula were upgraded to reflect the strong demand for ML practitioners.

While expertise used to be understood as something logical, Evans and Collins (2008) argue that the understanding of it has moved towards ideas of expertise as something practical that is ‘based in what you can do rather than what you can calculate or learn’. They also argue that the distinction between expert and non-expert cannot be neatly mapped onto the boundaries of social institutions and highlight (along with other STS scholars) that expertise and knowledge also ‘exist outside the mainstream scientific community’ (Evans and Collins, 2008, p. 609). Expertise is hence not solely a quality of individuals but also belongs to a community. Within distributed and dispersed communities, the question of how knowledge may be shared and transferred becomes crucial (e.g. Vaast and Walsham, 2009). Orlikowski (2002, p. 249) argues that knowledge is not something static or a stable disposition, but something that is continuously produced and reproduced in everyday practice. Tutorials are one way to reproduce and circulate a community’s knowledge practices. Tutorials constitute the practical doing of a community in material form and can be conceived as ‘boundary objects’ (Star, 2010; Star and Griesemer, 1989) to share the expertise and knowledge of practitioners.

The style of ML tutorials can be described as technical writing, which needs to fulfil requirements such as accuracy, understandability, and accessibility to a variety of readers (Wakkary et al., 2015). In that respect, they may be understood as standardised forms that enforce a particular normative work practice across distant practitioners and provide a shared format for solving problems (Neal, 1998) by explaining components, tools, and processes related to ML. As such, they aim to convey procedural knowledge about a specific technology, service, and/or product (Torrey et al., 2007). In addition, they produce an ideal type of how Machine Learning should be understood, thus framing it in a particular way. Overall, the importance of tutorials, in particular in computing, has increased since the late 1990s. Recently, they have to a great extent replaced in-person tutorials (Anderson, 1997) and ‘have reached an unprecedented level of popularity’ (Wakkary et al., 2015, p. 610).