<span id="einführung-in-computational-intelligence-und-ai"></span>
= Introduction to Computational Intelligence and AI =
 
The idea behind artificial intelligence (Artificial Intelligence - AI) was to reproduce processes of human cognition such as thinking, learning, or pattern recognition. The appearance of AI in the academic world began with the simulation of the behavior of neural networks at IBM in 1955. This led to a conference on the topic, known today as the Dartmouth Conference, which is considered the birth of artificial intelligence.
 
The history of AI as an academic discipline has taken a long, far-reaching, and adventurous path, passing through optimistic as well as doubtful, even disreputable phases, and leading through many scientific fields. It began with the study of "formal" reasoning, which led, among other things, to Alan Turing's theory of computation and the programming language Lisp, as well as to the commercial success of expert systems in the early 1980s. A rather doubtful period followed, while other approaches grew, such as the movement of technical machines.
 
With the rapid development of the computing speed and capabilities of computers, computationally intensive branches of AI emerged by the 1990s at the latest. Computational Intelligence comprises the biologically motivated areas of information processing, such as evolutionary algorithms. Machine learning is a comprehensive branch of AI that includes many computationally intensive subareas. One of them is pattern recognition based on statistical approaches, which led to great commercial success at the end of the 1990s in command &amp; control systems, Large Vocabulary Speech Recognition Systems (LVSRS), and their applications in call centers and in radiology. These approaches flattened out in the 2000s, when the applied statistical approach reached saturation.
 
Meanwhile, the development of ideas for knowledge representation led to more advanced approaches such as ontologies or Knowledge Graphs (KG).
 
The most important development, however, was the revival of neural network research, which led to many successful applications, such as the recognition of handwritten digits. The real breakthrough in the use of neural networks came with deep learning, which can be successfully applied to the classification of large amounts of data, such as images. The success of deep learning led to an enormous increase in interest in and funding of AI. Deep reinforcement learning enables the successful implementation of automatic control tasks.
 
The most recent development in AI is the application of specific Large Language Models (LLM) and advanced representation techniques in predictive transformers, leading to products such as ChatGPT 3.5.
 
Essentially, AI is intended to implement the various capabilities of human intelligence. These include learning, perception, logical reasoning, and abstraction, as well as more complex abilities such as summarization, control, task solving, and much more. One of the long-term goals is the creation of a general intelligence, the so-called Artificial General Intelligence (AGI), which could solve a wide variety of problems similarly to human intelligence.
 
Raymond Kurzweil, a pioneer of, among other things, Optical Character Recognition (OCR), speech synthesis, and speech recognition, and a visionary of artificial intelligence, predicted that (simply put) by 2029 at the latest AI will be smarter than humans. Critics say that he is probably right about the direction of future technological development, but that the predicted time frame is wrong; he is too optimistic.
 
AI is an inevitable future direction of development and one of the core fields of the scientific discipline of Data Science. Note that in a business context Data Science is also known as Business Analytics.
 
<span id="grundlagen-von-computational-intelligence-und-ai"></span>
== Foundations of Computational Intelligence and AI ==
 
<span id="mathematische-grundlagen-von-ki"></span>
=== Mathematical Foundations of AI ===
 
There is no unified definition of AI. One possible expressive definition can be given as follows: AI is the field of study in computer science that develops and studies intelligent machines. However, the "intelligent machine" itself is also to be understood as AI.
 
The goal of AI is to implement a subset of the possible kinds of human intelligent activities. These include reasoning, knowledge representation, planning, learning, natural language processing, perception, and support for robotics.
 
In the following we give a short mathematical description of some selected fundamental concepts of AI. These include
 
# Rule-based methods,
# Knowledge representation,
# Linear regression with the MSE framework,
# Classification - Bayesian decision theory,
# Classification - linear discriminant functions,
# Learning from examples,
# Learning through control - MDP.
 
<span> '''''Rule-based methods'''''</span>
 
Logical reasoning targets the human ability to reason step by step and to draw logical conclusions. This is realized by rule-based methods. They provide a deterministic method for classification in problems in which the classes can be determined from relations of entities representing properties of some entities. Such entity relations are usually represented by the broad class of if-then rules. A simple if-then rule reads:<br />
IF EinerDerFünfGrößtenOrteÖsterreichs(x) THEN Stadt(x)<br />
which of course means that if an object x has the property of being one of the five largest places of Austria, then x is a city.
 
If-then rules can also involve logical connectives, as shown in the following example:<br />
IF Fahrzeug(x) AND AufSchienenFahrendes(x) THEN Bahnfahrzeug(x)<br />
Moreover, functions returning numerical values are also allowed in if-then rules. This is illustrated as follows:<br />
IF Female(x) AND (Age(x) &lt; 16) THEN Girl(x),<br />
where Age(x) is a function returning the numerical age in years, and the expression (Age(x) &lt; 16) returns either logical "True" or "False".
 
A predicate, such as Vögel(x), HatFlügel(x) and HatKörperbedeckungAusFeder(x) and HatSchnabel(x), is a test that returns either logical "True" or "False". There are two main types of if-then rules: propositional (variable-free) and first-order rules. The previous examples are all first-order rules, since they all refer to a variable x. An example of a propositional if-then rule is<br />
IF EineDerFünfGrößtenOrteVonÖsterreich (Wien) THEN Stadt(Wien),<br />
since this rule describes a statement only for the logical constant (= a particular atomic instance) Wien.
 
First-order if-then rules are powerful logical rules, as illustrated by the following example:<br />
IF Parent(x,y) AND Parent(y,z) THEN Grandchild(z,x)
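
To make the semantics of such rules concrete, the following minimal Python sketch (an illustration; the family data and the encoding of the predicates are assumptions, not part of the original material) evaluates the grandchild rule above, reading Parent(x,y) as "x is a parent of y":

<syntaxhighlight lang="python">
# Hypothetical family data: maps each person to their children.
children = {
    "clara": ["bert"],
    "bert": ["anna"],
    "anna": [],
}

def parent(x, y):
    """Predicate Parent(x,y): True if x is a parent of y."""
    return y in children[x]

def grandchild(z, x):
    """Rule: IF Parent(x,y) AND Parent(y,z) THEN Grandchild(z,x)."""
    return any(parent(x, y) and parent(y, z) for y in children)

print(grandchild("anna", "clara"))  # True: clara -> bert -> anna
</syntaxhighlight>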
 
If-then rules can be learned algorithmically. The Sequential Covering learning algorithm is based on learning from training data consisting of a set of positive and negative statements (predicates). The algorithm constructs rules that cover the set of positive predicates but none of the negative predicates. The basic principle of the algorithm is to learn one rule at a time and to repeat this process in order to gradually cover all positive predicates; hence the name "Sequential Covering". Accordingly, the algorithm comprises three steps: learning a single rule that covers as many positive predicates as possible (Learn-One-Rule), deleting the predicates covered by it, and iterating. The result is a set of disjoint rules covering the whole set of positive predicates.
 
'''Example.'''
 
The following statements about furniture items are given as training data:<br />
 
 
<div class="center">
 
<div id="tab:Td_Aussagen_über_Möbelwaren">
 
{| class="wikitable"
|+ Training data - statements about furniture items
|-
! style="text-align: left;"| Statement ID  
! style="text-align: center;"|   Size  
! style="text-align: center;"|   Furniture store  
! style="text-align: center;"|   Expensive
|-
| style="text-align: left;"| 1  
| style="text-align: center;"|   groß  
| style="text-align: center;"|   Möbelix  
| style="text-align: center;"|   False
|-
| style="text-align: left;"| 2  
| style="text-align: center;"|   klein  
| style="text-align: center;"|   Möbelix  
| style="text-align: center;"|   False
|-
| style="text-align: left;"| 3  
| style="text-align: center;"|   groß  
| style="text-align: center;"|   Kika  
| style="text-align: center;"|   True
|-
| style="text-align: left;"| 4  
| style="text-align: center;"|   klein  
| style="text-align: center;"|   Kika  
| style="text-align: center;"|   False
|-
| style="text-align: left;"| 5  
| style="text-align: center;"|   groß  
| style="text-align: center;"|   XXXLutz  
| style="text-align: center;"|   True
|-
| style="text-align: left;"| 6  
| style="text-align: center;"|   klein  
| style="text-align: center;"|   XXXLutz  
| style="text-align: center;"|   True
|}
 
 
</div>
 
</div>
Statements 3, 5 and 6 form the set of positive predicates, while the others (1, 2 and 4) belong to the set of negative predicates. First, the best hypothesis is determined by means of Learn-One-Rule, i.e. the rule covering the most positive predicates:<br />
IF Möbelhaus(x,XXXLutz) THEN Teuer(x),<br />
where x is the variable for furniture items. Then predicates 5 and 6 are deleted from the training data. Applying Learn-One-Rule once more yields the rule<br />
IF Größe(x,groß) AND Möbelhaus(x,Kika) THEN Teuer(x)<br />
With this, all positive predicates have been covered, so the algorithm ends here. Note that the two found rules are disjoint, which follows from the course of the algorithm.
 
The final rule reads as follows:<br />
IF Möbelhaus(x,XXXLutz) (covers examples 5,6)<br />
OR (Größe(x,groß) AND Möbelhaus(x,Kika)) (covers example 3)<br />
THEN Teuer(x)
 
The visualization of the Sequential Covering learning algorithm is shown in Figure 1.1.
 
Figure 1.1 Visualization of the Sequential Covering learning algorithm
 
The previous example shows the construction of a rule by the Sequential Covering learning algorithm for binary classification, in which the possible predicates are assigned to one of two classes: Teuer(x) = True or Teuer(x) = False. Rule-based methods can be applied to classification not only into two but into several classes. Below, the Sequential Covering learning algorithm is shown for the general case of learning rules for classification into a finite number of classes.
 
<span>'''—————————————————————————————'''</span><br />
Algorithm  Sequential Covering Algorithm<br />
<span>'''—————————————————————————————'''</span><br />
Input:<br />
- The set of training data T, consisting of positive and negative predicates<br />
- The ordered set of classes, <math display="inline">C=\{c_1,\ldots,c_k\}</math><br />
Output: An individual rule list for each class,<br />
in the form of a disjunctively (OR-)connected list of the found rules<br />
<math display="inline">R_i</math> for class <math display="inline">c_i, i=1,\ldots,k</math>,<br />
realizing the classification by trying the rule lists <math display="inline">R_i</math><br />
in the order <math display="inline">i=1,\ldots,k</math>.<br />
<span>'''—————————————————————————————'''</span><br />
1 Initialize the rule lists, <math display="inline">R_i = \{\}</math>, <math display="inline">i=1,\ldots,k</math><br />
2 for i = 1:(k-1)<br />
3   while not every positive predicate is covered in class <math display="inline">c_i</math><br />
4     <math display="inline">r</math> &lt;- Learn-One-Rule<br />
5     Remove the predicates covered by <math display="inline">r</math> from the training data<br />
6     Add <math display="inline">r</math> to the end of the rule list <math display="inline">R_i</math><br />
7   end<br />
8 end<br />
<span>'''—————————————————————————————'''</span><br />
 
 
In this way, possibly complex if-then rules are learned. This greedy algorithm is not necessarily optimal, i.e. it does not necessarily deliver the most compact rules. Therefore, the resulting rules can be further simplified in a post-processing step by applying logical reformulation rules. The Learn-One-Rule step can be realized in different ways. An effective approach is beam search. One can speak of a family of Sequential Covering algorithms, since modified and extended variants of the original algorithm exist. Typical variants are the Ripper algorithm, the Foil algorithm, and the CN2 algorithm. If-then rules can also be learned in other ways, e.g. by means of decision trees.
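
For illustration, the following Python sketch implements a simple variant of the Sequential Covering algorithm on the furniture training data above. The Learn-One-Rule step here is an exhaustive greedy search over conjunctions of attribute tests rather than a beam search; this simplification, and the assumption that the data can be separated by such conjunctions, are made for brevity.

<syntaxhighlight lang="python">
from itertools import combinations

# Training data from the furniture example (Größe, Möbelhaus -> Teuer).
data = [
    {"Größe": "groß",  "Möbelhaus": "Möbelix", "Teuer": False},
    {"Größe": "klein", "Möbelhaus": "Möbelix", "Teuer": False},
    {"Größe": "groß",  "Möbelhaus": "Kika",    "Teuer": True},
    {"Größe": "klein", "Möbelhaus": "Kika",    "Teuer": False},
    {"Größe": "groß",  "Möbelhaus": "XXXLutz", "Teuer": True},
    {"Größe": "klein", "Möbelhaus": "XXXLutz", "Teuer": True},
]

def covers(rule, example):
    """A rule is a dict of attribute tests, all of which must hold (AND)."""
    return all(example[a] == v for a, v in rule.items())

def learn_one_rule(pos, neg):
    """Greedy Learn-One-Rule: among all conjunctions of attribute tests taken
    from positive examples, return the one covering the most positives and
    no negatives (assumes such a rule exists)."""
    attrs = ["Größe", "Möbelhaus"]
    candidates = []
    for r in range(1, len(attrs) + 1):
        for subset in combinations(attrs, r):
            for ex in pos:
                candidates.append({a: ex[a] for a in subset})
    valid = [c for c in candidates if not any(covers(c, n) for n in neg)]
    return max(valid, key=lambda c: sum(covers(c, p) for p in pos))

def sequential_covering(data):
    pos = [e for e in data if e["Teuer"]]
    neg = [e for e in data if not e["Teuer"]]
    rules = []
    while pos:  # iterate until all positive examples are covered
        rule = learn_one_rule(pos, neg)
        rules.append(rule)
        pos = [p for p in pos if not covers(rule, p)]  # delete covered ones
    return rules

print(sequential_covering(data))
# [{'Möbelhaus': 'XXXLutz'}, {'Größe': 'groß', 'Möbelhaus': 'Kika'}]
</syntaxhighlight>

The output reproduces the two rules found by hand in the example above.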
 
Rules have the advantage that they are easy to interpret and can be implemented in relational databases. However, the selection of the predicates for the training data strongly depends on the task and is, in practice, a more difficult task than learning the rules. Rule-based methods are an integral part of expert systems in AI.
 
<span> '''''Knowledge representation'''''</span>
 
Representing knowledge on the word level can be realized by a probabilistic model of natural language, which is called a Language Model (LM). Usually an LM is used to estimate the probability of a given word sequence <math display="inline">w_1,w_2,...,w_n</math>, i.e. <math display="inline">P(w_1,w_2,...,w_n)</math>, which can be given as a product of the conditional probabilities as<br />
<math display="inline">P(w_1,w_2,...,w_n) =P(w_1)P(w_2|w_1)...P(w_n|w_{n-1},...,w_2,w_1)</math>.
 
In the N-gram LM, for practical reasons, the word sequence probability <math display="inline">P(w_1,w_2,...,w_n)</math> is approximated as<br />
<math display="inline">P(w_1,w_2,...,w_n) = \prod_{i=1}^{n} P(w_i|w_{i-1},w_{i-2},...,w_{i-N+1})</math>,<br />
where words with non-positive indices are omitted from the conditioning parts. The conditional probabilities <math display="inline">P(w_i|w_{i-1},w_{i-2},...,w_{i-N+1})</math> can be estimated from the relative frequencies of the N-grams and (N-1)-grams as<br />
<math display="inline">P(w_i|w_{i-1},w_{i-2},...,w_{i-N+1}) = \frac{c(w_{i},w_{i-1},...,w_{i-N+1})}{c(w_{i-1},w_{i-2},...,w_{i-N+1})}</math>,<br />
where <math display="inline">c(x)</math> stands for the number of occurrences of string x.
 
The probabilities <math display="inline">P(w_i)</math> for <math display="inline">N=1</math>, <math display="inline">P(w_i,w_j)</math> for <math display="inline">N=2</math> and <math display="inline">P(w_i,w_j,w_k)</math> for <math display="inline">N=3</math> are called unigram, bigram and trigram probabilities, respectively. The N-gram LM is a purely Statistical LM (SLM), since it is trained on N-gram statistics of the training corpora. For the estimates of the conditional probabilities <math display="inline">P(w_i|w_{i-1},w_{i-2},...,w_{i-N+1})</math> to be reliable, the corresponding N-grams and (N-1)-grams must be present in the training corpora. This implies that the training corpus must be large enough. The SLM cannot assign any probability to N-grams which do not appear in the training corpus. This problem is usually solved by applying smoothing techniques, in which a small part of the probability measure is shifted to the unseen grams.
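
The relative-frequency estimation of bigram (N=2) probabilities can be sketched in a few lines of Python; the toy corpus is an assumption for illustration:

<syntaxhighlight lang="python">
from collections import Counter

# Toy corpus (an assumption for illustration).
corpus = "the cat sat on the mat the cat ate".split()

unigrams = Counter(corpus)                 # c(w): unigram counts
bigrams = Counter(zip(corpus, corpus[1:])) # c(w_prev, w): bigram counts

def p_bigram(w, w_prev):
    """P(w | w_prev) = c(w_prev, w) / c(w_prev), as in the formula above."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_bigram("cat", "the"))  # c(the,cat)/c(the) = 2/3
print(p_bigram("dog", "the"))  # 0: unseen bigram, motivating smoothing
</syntaxhighlight>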
 
The most effective method of smoothing is Kneser–Ney smoothing, in which a fixed value is subtracted from the conditional probabilities of N-grams with lower frequencies, and the probability mass gained in this way is spread over the conditional probabilities of all non-appearing N-grams. The equation for the conditional bigram probabilities is given by<br />
<math display="inline">P_{KN}(w_i|w_{i-1}) = \frac{max(c(w_{i-1},w_i)-\delta,0)}{\sum_{w_j} c(w_{i-1},w_j)} + \lambda_{w_{i-1}}P_{KN}(w_i)</math><br />
and<br />
<math display="inline">P_{KN}(w_i) = \frac{|{w_j: 0 < c(w_j,w_i)}|}{|{(w_j,w_k): 0 < c(w_j,w_k)}|}</math>.<br />
Here <math display="inline">\delta</math> is the constant discount value to be subtracted and <math display="inline">\lambda_{w_{i-1}}</math> is a normalization constant, which is set to make the sum of <math display="inline">P_{KN}(w_i|w_{i-1})</math> over all <math display="inline">w_i</math> equal to one. The conditional probabilities of unseen N-grams are determined by the continuation probability term <math display="inline">P_{KN}(w_i)</math>, which estimates the conditional probability of an unseen bigram from the other bigram counts related to <math display="inline">w_i</math>.
 
Besides LMs, other ways of knowledge representation are
 
* Semantic network - a graph representing semantic relations between concepts. The concepts are represented by the nodes and the edges specify the semantic relations. One reason to design semantic networks is to represent knowledge in machine-readable form. Knowledge graphs are semantic networks with a limited set of semantic relations.
* Ontology - a hierarchical representation of concepts and their relationships, which can be managed by a standard ontology language such as OWL or RDF.
 
<span> '''''Linear regression with the MSE framework''''' </span>
 
The task of linear regression is to give a linear prediction for the random scalar output <math display="inline">y \in \mathbb{R}</math> from the random input vector <math display="inline">{\bf x}</math>. Let <math display="inline">\hat{y}</math> be the predicted value of <math display="inline">y</math>. It is assumed that <math display="inline">y</math> depends linearly on the vector <math display="inline">{\bf x}</math>; hence the predicted value <math display="inline">\hat{y}</math> can be given as a linear combination of the components of <math display="inline">{\bf x}</math> as <math display="inline">\hat{y}= w_1 x_1 + \ldots + w_n x_n</math>, or in vector form
 
<math display="block">\hat{y}= {\bf w}^T {\bf x},</math>
 
where <math display="inline">{\bf w} = (w_1,\ldots,w_n)^T</math> is a column vector of parameters. Here <math display="inline">^T</math> stands for the transpose operation, and vectors are by default column vectors. The parameters <math display="inline">w_i</math> can be seen as weights. If the input value <math display="inline">x_i</math> has a positive weight, then increasing <math display="inline">x_i</math> also increases the predicted value <math display="inline">\hat{y}</math>. Similarly, if <math display="inline">x_i</math> has a negative weight, then increasing <math display="inline">x_i</math> decreases the predicted value <math display="inline">\hat{y}</math>. If <math display="inline">w_i</math> is a large value, then <math display="inline">x_i</math> has a large impact on <math display="inline">\hat{y}</math>. If <math display="inline">w_i=0</math>, then <math display="inline">x_i</math> has no effect on <math display="inline">\hat{y}</math>.
 
Suppose that <math display="inline">K</math> examples of the vector <math display="inline">{\bf x}</math>, together with the correct output value <math display="inline">y</math> for each of them, are given. We arrange the input vectors in a matrix <math display="inline">{\bf X}</math> in such a way that vector <math display="inline">{\bf x}_k</math> is placed in the <math display="inline">k</math>-th row of matrix <math display="inline">{\bf X}</math>, <math display="inline">k=1,\ldots,K</math>. The output values and the predicted values are also arranged into column vectors <math display="inline">{\bf y}</math> and <math display="inline">{\bf \hat{y}}</math>, so that the correct value <math display="inline">y_k</math> and the predicted value <math display="inline">\hat{y}_k</math> belonging to the input vector <math display="inline">{\bf x}_k</math> come to the <math display="inline">k</math>-th position in the vectors <math display="inline">{\bf y}</math> and <math display="inline">{\bf \hat{y}}</math>, respectively.
 
The optimal parameter vector <math display="inline">{\bf w}</math> can be determined from an optimum task with respect to a performance measure quantifying the quality of prediction.
 
<span> ''Linear regression with mean squared error''</span>
 
One possible choice for the performance measure is the Mean Squared Error (MSE) of the predictions, which is given by
 
<math display="block">MSE = \frac{1}{K} \sum_{k} ({\bf \hat{y}}_k - {\bf y}_k)^2 = \frac{1}{K} \lVert {\bf \hat{y}} - {\bf y} \rVert_2^2,</math>
 
where <math display="inline">\lVert z \rVert_2</math> stands for the <math display="inline">\mathbb{L}_2</math> norm of <math display="inline">z</math>.
 
Then the optimal value of the parameter vector <math display="inline">{\bf w}</math> is obtained by minimizing the MSE. The necessary condition for a local minimum of the MSE is a value of the parameter vector <math display="inline">{\bf w}</math> for which the gradient of the MSE is 0. It can be shown that in our case the MSE as a function of <math display="inline">{\bf w}</math> is a convex function, in which case there is only one such value of <math display="inline">{\bf w}</math>, and hence it is also the global minimum point. Therefore the best value of the parameter vector <math display="inline">{\bf w}</math> is obtained by solving the equation
 
<math display="block">\nabla_{{\bf w}} MSE = \nabla_{{\bf w}} \left( \frac{1}{K} \lVert {\bf \hat{y}} - {\bf y} \rVert_2^2 \right) = 0,</math>
 
where <math display="inline">\nabla_{{\bf w}}</math> stands for the gradient with respect to <math display="inline">{\bf w}</math>. The set of predicted outputs can be given in vector-matrix form as <math display="inline">{\bf \hat{y}} = {\bf X}{\bf w}</math>. Applying it in the above relation gives the equation to be solved in terms of <math display="inline">{\bf w}</math> as
 
<math display="block">\nabla_{{\bf w}} MSE = \nabla_{{\bf w}} \left( \frac{1}{K} \lVert {\bf X}{\bf w}  - {\bf y} \rVert_2^2 \right) = 0.</math>
 
Omitting the constant <math display="inline">\frac{1}{K}</math>, evaluating the gradient and performing several rearrangements we get
 
<math display="block">\begin{aligned}
\nabla_{{\bf w}} \left( \lVert {\bf X}{\bf w}  - {\bf y} \rVert_2^2 \right) &&\hspace{-5mm}= \nabla_{{\bf w}} \left({\bf X}{\bf w}  - {\bf y} \right)^T \left({\bf X}{\bf w}  - {\bf y} \right) = \nabla_{{\bf w}} \left( {\bf w}^T {\bf X}^T{\bf X}{\bf w} - 2  {\bf w}^T {\bf X}^T {\bf y} +  {\bf y}^T  {\bf y}\right) \nonumber \\
&&\hspace{-5mm}= 2 {\bf X}^T{\bf X}{\bf w} - 2 {\bf X}^T {\bf y} = 0, \nonumber
\end{aligned}</math>
 
where we utilized in the second equality that <math display="inline">{\bf y}^T {\bf X} {\bf w} = {\bf w}^T {\bf X}^T {\bf y}</math>, since <math display="inline">{\bf y}^T {\bf X} {\bf w}</math> is a scalar and thus equals its transpose, and in the last equality that the matrix <math display="inline">{\bf X}^T{\bf X}</math> is symmetric, enabling the above form of <math display="inline">\nabla_{{\bf w}}\left( {\bf w}^T {\bf X}^T{\bf X}{\bf w}\right)</math>.
 
Finally the solution for the parameter vector <math display="inline">{\bf w}</math> can be expressed from the above equation as <math display="block">{\bf w} = \left({\bf X}^T{\bf X}\right)^{-1} {\bf X}^T {\bf y}.</math>
 
Matrix <math display="inline">{\bf X}^T{\bf X}</math> is square, which makes taking its inverse possible. Nevertheless, the inverse exists only if matrix <math display="inline">{\bf X}^T{\bf X}</math> is non-singular, which usually holds, since matrix <math display="inline">{\bf X}</math> is constructed from independent input examples.
 
The above equation written in scalar form leads to a system of equations, which are called the normal equations.
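
The closed-form solution can be sketched with numpy as follows; the toy data and the true weights are assumptions for illustration:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Toy data: K examples of n-dimensional inputs with known true weights.
K, n = 100, 3
X = rng.normal(size=(K, n))                 # k-th row is the input vector x_k
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=K)  # outputs with small noise

# Closed-form solution of the normal equations: w = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over forming the explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # close to w_true

# Cross-check with the library least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_lstsq))  # True
</syntaxhighlight>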
 
Linear regression is also used with a slightly extended prediction model, which is given as <math display="block">\hat{y}= {\bf w}^T {\bf x} + b,</math>
 
where the scalar <math display="inline">b</math> is called the bias; it causes the hyperplane described by the multivariate function <math display="inline">\hat{y}({\bf x})</math> not to pass through the origin. This extended case can be reduced to the base case by introducing the extended vectors <math display="inline">{\bf x}^* = ({\bf x}, 1)^T</math> and <math display="inline">{\bf w}^* = ({\bf w}, b)^T</math>, leading to the equivalent description
 
<math display="block">\hat{y}= {\bf w}^{*T} {\bf x}^*.</math>
 
An illustrative example with one-dimensional input values is shown in Figure 1.
 
<span> ''Connection of linear regression with MSE to maximum likelihood estimation''</span>
 
Let <math display="inline">{\bf z} = (z_1,\ldots,z_M)</math> be a sequence of samples taken from an underlying population producing independent samples, each of them from the same random variable <math display="inline">Z</math> with a likelihood function <math display="inline">p(z|{\mbox{\boldmath$\theta$}})</math> from a known distribution family with unknown parameter vector <math display="inline">{\mbox{\boldmath$\theta$}}</math>. The likelihood function equals the probability (mass) function for discrete distributions and the probability density function (pdf) for continuous distributions. The Maximum Likelihood Estimation (ML estimation or MLE) of the parameter vector <math display="inline">{\mbox{\boldmath$\theta$}}</math> is the value which maximizes the probability of the observed sequence of samples. In other words
 
<math display="block">{\mbox{\boldmath$\theta$}}_{ML} = \arg \max_{{\mbox{\boldmath$\theta$}}} p({\bf z}|{\mbox{\boldmath$\theta$}}) =  \arg \max_{{\mbox{\boldmath$\theta$}}} p(z_1,\ldots,z_M|{\mbox{\boldmath$\theta$}}) = \arg \max_{{\mbox{\boldmath$\theta$}}} \prod_{m=1}^{M} p(z_m|{\mbox{\boldmath$\theta$}}),</math>
 
where we utilized that the samples are taken independently from the same random variable <math display="inline">Z</math>.
 
The product over many likelihoods can cause numerical issues, e.g. numerical underflow. Therefore it is usual to use the logarithm of the likelihood instead of the likelihood itself. This modification does not cause any change in the argmax and max, since the logarithm function is monotone. This gives the usual form of the ML estimate of the parameter vector <math display="inline">{\mbox{\boldmath$\theta$}}</math> as
 
<math display="block">{\mbox{\boldmath$\theta$}}_{ML} = \arg \max_{{\mbox{\boldmath$\theta$}}} log \left( \prod_{m=1}^{M} log p(z_m|{\mbox{\boldmath$\theta$}})\right) = \arg \max_{{\mbox{\boldmath$\theta$}}}\sum_{m=1}^{M} log p(z_m|{\mbox{\boldmath$\theta$}}).</math>
 
Applying the principle of MLE to a conditional likelihood function <math display="inline">p(y|{\bf x})</math>, the conditional likelihood function from a known distribution family with unknown parameter vector <math display="inline">{\mbox{\boldmath$\theta$}}</math> becomes <math display="inline">p(y|{\bf x}, {\mbox{\boldmath$\theta$}})</math>. Thus the ML estimate of the parameter vector <math display="inline">{\mbox{\boldmath$\theta$}}</math> in this case is given by
 
<math display="block">{\mbox{\boldmath$\theta$}}_{ML} = \arg \max_{{\mbox{\boldmath$\theta$}}}\sum_{k=1}^{K} log~ p(y_k|{\bf x}_k, {\mbox{\boldmath$\theta$}}).</math>
 
Now we select the normal distribution as the known distribution family and set the mean of the distribution to the predicted output value <math display="inline">\hat{y}</math> and the variance of the distribution to a fixed value <math display="inline">\sigma^2</math>. Then the conditional likelihood function, which is in this case a conditional pdf, becomes <math display="inline">p(y|{\bf x},{\mbox{\boldmath$\theta$}}) = N(y| \hat{y}({\bf x},{\bf w}),\sigma^2)</math>, where <math display="inline">N(y|\mu,\sigma^2)</math> stands for the pdf of the normal distribution with mean <math display="inline">\mu</math> and variance <math display="inline">\sigma^2</math>. Applying the expression of the pdf of the normal distribution <math display="inline">N(y|\mu,\sigma^2)= \frac{1}{\sqrt{2 \pi \sigma^2}}exp\left(-\frac{1}{2} \frac{(y-\mu)^2}{\sigma^2} \right)</math>, we get for the ML estimate of the predicted output values <math display="inline">{\bf \hat{y}}</math>
 
<math display="block">\begin{aligned}
{\bf \hat{y}} &&\hspace{-5mm}= \arg \max_{\bf \hat{y}}\sum_{k=1}^{K} log N(y_k| \hat{y_k}({\bf x_k},{\bf w}),\sigma^2)\nonumber \\
&&\hspace{-5mm} =  \arg \max_{\bf \hat{y}}\sum_{k=1}^{K} log \left(\frac{1}{\sqrt(2 \pi \sigma^2)}exp\left(-\frac{1}{2} \frac{(y_k-\hat{y_k})^2}{\sigma^2} \right)\right) \nonumber \\
&&\hspace{-5mm}= \arg \max_{\bf \hat{y}} \left(-K log \sigma -\frac{K}{2} log(2 \pi) - \sum_{k=1}^{K} \frac{(y_k-\hat{y_k})^2}{2\sigma^2} \right) = \arg \max_{\bf \hat{y}} \left(-\sum_{k=1}^{K} ((y_k-\hat{y_k})^2\right) \nonumber \\
&&\hspace{-5mm}= \arg \min_{\bf \hat{y}} \sum_{k=1}^{K} ((y_k-\hat{y_k})^2 =  \arg \min_{\bf \hat{y}} \lVert {\bf \hat{y}} - {\bf y} \rVert_2^2 \nonumber.
\end{aligned}</math>
 
Assuming the linear relation between <math display="inline">\hat{y}</math> and <math display="inline">{\bf x}</math>, i.e. <math display="inline">{\bf \hat{y}} = {\bf X}{\bf w}</math>, the above optimization with respect to <math display="inline">{\bf w}</math> results in the same estimate of <math display="inline">{\bf w}</math> as minimizing the MSE. Consequently, minimizing the MSE can be seen as an ML estimation, which justifies the use of the MSE due to the desirable properties of ML estimation.
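
This equivalence can also be checked numerically: minimizing the Gaussian negative log-likelihood with a general-purpose optimizer recovers the closed-form least-squares weights. A sketch with toy data, assuming scipy is available:

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
K, n = 50, 2
X = rng.normal(size=(K, n))                      # k-th row is x_k
y = X @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=K)
sigma2 = 1.0  # fixed variance; its value does not affect the argmax

def neg_log_likelihood(w):
    # Gaussian negative log-likelihood with mean Xw and fixed variance
    # (constant terms dropped, as they do not change the minimizer).
    r = y - X @ w
    return np.sum(r ** 2) / (2 * sigma2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(n)).x
w_mse = np.linalg.solve(X.T @ X, X.T @ y)        # closed-form MSE solution
print(np.allclose(w_mle, w_mse, atol=1e-4))      # True: the two coincide
</syntaxhighlight>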
 
<span> '''''Classification - Bayesian decision theory'''''</span>
 
Bayesian decision theory is a fundamental probabilistic approach to the task of classification. It uses probabilistic arguments and losses associated with decisions.
 
<span> ''Bayes decision rule - with minimizing the error probability''</span>
 
Let <math display="inline">c_1</math> and <math display="inline">c_2</math> be two classes, e.g. two types of sea fish, such as herring and mackerel. Suppose that an observer watching fish arriving along a conveyor belt tries to predict which type the next fish is. The type of the arriving fish is random. It can be assumed that there is a probability <math display="inline">P(c_1)</math> that the next fish is a herring and a probability <math display="inline">P(c_2)=1-P(c_1)</math> that the next fish is a mackerel. These probabilities are called a priori probabilities (or priors), reflecting that they represent the only knowledge about the occurrence of these types of fish. Having only these probabilities available for the decision, it seems logical to use the decision rule: decide <math display="inline">c_1</math> if <math display="inline">P(c_1) > P(c_2)</math>, and otherwise decide <math display="inline">c_2</math>. This decision rule makes sense if there is only one fish to decide about. However, applying this rule to a sequence of fish seems incorrect: although we know that both types of fish appear, we always decide that the next one is of the same type. In this way we introduce an error whose probability is always the smaller one of <math display="inline">P(c_1)</math> and <math display="inline">P(c_2)</math>.
 
Fortunately, in most practical situations we do not have to make decisions with so little information. Let us assume that information on several features of the fish types, e.g. their length, weight, etc., is available in advance. These features can be measured for the next emerging fish and can be organized in a vector <math display="inline">{\bf x} \in \mathbb{R}^d</math>. Now we consider the more general case of <math display="inline">C</math> classes, <math display="inline">c_1,\ldots, c_C</math>. The information on the features belonging to class <math display="inline">c_i</math> can be given by the class-conditional density function <math display="inline">p({\bf x}|c_i)</math>. Suppose that both the a priori probabilities <math display="inline">P(c_i)</math> and the class-conditional density functions <math display="inline">p({\bf x}|c_i)</math> are given. Then the probability of each class, taking into account the information observed in the features, can be given by the conditional probability of that class given the observed feature values, <math display="inline">P(c_i|{\bf x})</math>. This conditional probability can be expressed by means of the Bayes formula as
 
<math display="block">P(c_i|{\bf x}) = \frac{p({\bf x}|c_i)P(c_i)}{p({\bf x})} = \frac{p({\bf x}|c_i)P(c_i)}{\sum_{j=1}^C p({\bf x}|c_j)P(c_j)}.</math>
 
The probability <math display="inline">P(c_i|{\bf x})</math> is called the a posteriori probability (or posterior), reflecting the incorporation of the information on the features into the probability. The likelihood <math display="inline">p({\bf x}|c_i)</math> indicates that, if the priors of the classes are equal, the class with the highest likelihood is most likely to be the right one.
 
At this point it seems logical to establish the decision rule:<br />
Decide <math display="inline">c_i</math> if <math display="inline">P(c_i|{\bf x}) > P(c_j|{\bf x})</math> for every <math display="inline">j \neq i</math>.<br />
The expected value of the probability of error for this decision rule is given by
 
<math display="block">P(error) = \int_{\mathbb{R}^d} P(error,{\bf x} )d{\bf x} = \int_{\mathbb{R}^d} P(error|{\bf x}) p({\bf x}) d{\bf x}.</math>
 
Deciding on the class <math display="inline">c_i</math> with the highest <math display="inline">P(c_i|{\bf x})</math> leads to an error with probability <math display="inline">1 - P(c_i|{\bf x})</math>, for which <math display="inline">1 - P(c_i|{\bf x}) < 1 - P(c_j|{\bf x})</math> for every <math display="inline">j \neq i</math>. Therefore <math display="inline">1 - P(c_i|{\bf x}) = \min P(\mbox{~error~}|{\bf x} )</math>, i.e. this error is the minimal one for the given <math display="inline">{\bf x}</math>. It ensures that <math display="inline">P(\mbox{~error~}|{\bf x} )</math> in the above integral is as small as possible, and hence the integral, and thus also <math display="inline">P(error)</math>, is minimal. So the above decision rule minimizes the probability of error. In fact, it is the Bayes decision rule minimizing the probability of error.
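
A small numerical sketch of this minimum-error-rate rule for the fish example, with one-dimensional Gaussian class-conditional densities; all priors and distribution parameters are assumed values for illustration:

<syntaxhighlight lang="python">
import math

# Assumed priors and Gaussian class-conditional densities p(x|c_i),
# where x is the fish length in cm and (mean, std) parametrize each class.
priors = {"herring": 0.6, "mackerel": 0.4}
params = {"herring": (30.0, 4.0), "mackerel": (40.0, 5.0)}

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(c, x):
    """P(c|x) via the Bayes formula; the evidence p(x) is the normalizer."""
    evidence = sum(gauss_pdf(x, *params[k]) * priors[k] for k in priors)
    return gauss_pdf(x, *params[c]) * priors[c] / evidence

x = 36.0  # observed feature value
decision = max(priors, key=lambda c: posterior(c, x))  # highest posterior wins
print({c: round(posterior(c, x), 3) for c in priors}, "->", decision)
</syntaxhighlight>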
 
<span> ''The general Bayes decision rule''</span>
 
We would like to generalize the Bayes decision rule by introducing a loss function instead of the probability of error. The decision among the classes is performed by taking an action <math display="inline">\alpha_i</math>, <math display="inline">i=1,\ldots, A</math>. If the true class is <math display="inline">c_j</math>, then taking action <math display="inline">\alpha_i</math> incurs a loss <math display="inline">\lambda(\alpha_i|c_j)</math>. Hence the expected loss of taking the action <math display="inline">\alpha_i</math> is given by
 
<math display="block">R(\alpha_i|{\bf x}) =  \sum_{j=1}^{C} \lambda(\alpha_i|c_j) P(c_j|{\bf x})</math>
 
In decision-theoretic terminology, <math display="inline">R(\alpha_i|{\bf x})</math> is called the conditional risk. Let <math display="inline">\alpha({\bf x})</math> be the decision function taking one of the actions <math display="inline">\alpha_i</math> for each <math display="inline">{\bf x}</math>. The overall risk is given by
 
<math display="block">R = \int_{\mathbb{R}^d} R(\alpha({\bf x})|{\bf x}) p({\bf x}) d{\bf x}.</math>
 
The goal is to find the decision function which minimizes the overall risk. To achieve this, for each <math display="inline">{\bf x}</math> the action minimizing the conditional risk for that <math display="inline">{\bf x}</math> must be taken. In other words, take <math display="inline">\alpha_i</math> with <math display="inline">i = \arg\min_{i} R(\alpha_i|{\bf x})</math>. This leads to the Bayes decision rule (also called the Bayes decision procedure):<br />
To minimize the overall risk,
 
# Compute the conditional risk <math display="block">R(\alpha_i|{\bf x}) =  \sum_{j=1}^{C} \lambda(\alpha_i|c_j) P(c_j|{\bf x})</math> for <math display="inline">i=1,\ldots, A</math> and
# Select the action <math display="inline">\alpha_i</math> for which <math display="inline">R(\alpha_i|{\bf x})</math> is minimum, in other words select <math display="inline">\alpha_i</math> with <math display="inline">i = \arg\min_{i} R(\alpha_i|{\bf x})</math>.
 
The overall risk resulting from applying the Bayes decision rule is the Bayes risk, denoted by <math display="inline">R^*</math>, which is the lowest achievable risk.
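
The two steps of the Bayes decision rule translate directly into code; in the following sketch, the posteriors and the loss matrix <math display="inline">\lambda(\alpha_i|c_j)</math> are assumed values:

<syntaxhighlight lang="python">
import numpy as np

posteriors = np.array([0.7, 0.3])   # P(c_1|x), P(c_2|x) for the observed x
loss = np.array([[0.0, 2.0],        # lambda(alpha_i|c_j): rows = actions,
                 [1.0, 0.0]])       # columns = true classes

# Step 1: conditional risks R(alpha_i|x) = sum_j lambda_ij * P(c_j|x).
risks = loss @ posteriors
print(risks)             # [0.6, 0.7]

# Step 2: select the action with the minimal conditional risk.
print(np.argmin(risks))  # 0 -> take action alpha_1 (decide c_1)
</syntaxhighlight>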
 
<span> ''The general case with two classes''</span>
 
Let <math display="inline">\alpha_1</math> and <math display="inline">\alpha_2</math> be interpreted as deciding class <math display="inline">c_1</math> and <math display="inline">c_2</math>, respectively. This enables the simplified notation <math display="inline">\lambda_{i,j}= \lambda(\alpha_i|c_j)</math>, <math display="inline">i,j = 1,2</math>. The conditional risks for <math display="inline">\alpha_1</math> and <math display="inline">\alpha_2</math> can be given as
 
<math display="block">\begin{aligned}
&&\hspace{-5mm} R(\alpha_1|{\bf x}) =  \lambda_{1,1} P(c_1|{\bf x}) + \lambda_{1,2} P(c_2|{\bf x}), \nonumber \\
&&\hspace{-5mm} R(\alpha_2|{\bf x}) =  \lambda_{2,1} P(c_1|{\bf x}) + \lambda_{2,2} P(c_2|{\bf x}). \nonumber
\end{aligned}</math>
 
The Bayes decision rule becomes
 
<math display="block">\mbox{~Decide~} c_1 \mbox{~if~}  R(\alpha_1|{\bf x}) < R(\alpha_2|{\bf x}), \mbox{~otherwise~} c_2.</math>
 
In terms of posteriori probabilities this becomes <math display="block">\mbox{~Decide~} c_1 \mbox{~if~}  (\lambda_{2,1} - \lambda_{1,1}) P(c_1|{\bf x}) > (\lambda_{1,2} - \lambda_{2,2}) P(c_2|{\bf x}), \mbox{~otherwise~} c_2.</math>
 
Usually making an error should lead to a higher loss, which means that both <math display="inline">(\lambda_{2,1} - \lambda_{1,1})</math> and <math display="inline">(\lambda_{1,2} - \lambda_{2,2})</math> are positive.
 
The above inequality can be rearranged by applying the Bayes rule to it, which gives <math display="inline">(\lambda_{2,1} - \lambda_{1,1}) p({\bf x}|c_1 ) P(c_1) > (\lambda_{1,2} - \lambda_{2,2})p({\bf x}|c_2 ) P(c_2)</math>. Hence the Bayes decision rule can alternatively be formulated as <math display="block">\mbox{~Decide~} c_1 \mbox{~if~} \frac{p({\bf x}|c_1 )}{p({\bf x}|c_2 )} > \frac{\lambda_{1,2} - \lambda_{2,2}}{\lambda_{2,1} - \lambda_{1,1}} \frac{P(c_2)}{P(c_1)}, \mbox{~otherwise~} c_2.</math>
 
The quantities <math display="inline">p({\bf x}|c_i)</math> are likelihoods, which makes it possible to interpret this Bayes decision rule as deciding <math display="inline">c_1</math> if the likelihood ratio <math display="inline">\frac{p({\bf x}|c_1 )}{p({\bf x}|c_2 )}</math> exceeds a threshold that is independent of <math display="inline">{\bf x}</math>.
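
A brief sketch of this likelihood-ratio formulation, with assumed losses and priors:

<syntaxhighlight lang="python">
# Assumed losses lambda_ij and priors P(c_1), P(c_2).
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0
p1, p2 = 0.6, 0.4

# The x-independent threshold from the rearranged Bayes decision rule.
threshold = (l12 - l22) / (l21 - l11) * (p2 / p1)

def decide(likelihood_c1, likelihood_c2):
    """Decide c_1 if the likelihood ratio exceeds the threshold."""
    return "c1" if likelihood_c1 / likelihood_c2 > threshold else "c2"

print(threshold)           # 1.333...
print(decide(0.20, 0.10))  # likelihood ratio 2.0 > threshold -> c1
</syntaxhighlight>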
 
<span> ''Criteria for specific Bayes decision rules - minimum error-rate classification''</span>
 
From now on, <math display="inline">\alpha_i</math> is interpreted as deciding class <math display="inline">c_i</math>, and the simplified notation <math display="inline">\lambda_{i,j}= \lambda(\alpha_i|c_j)</math> is used. Let us consider the loss given by the symmetrical or zero-one loss function:
 
<math display="block">\begin{aligned}
\lambda_{i,j} = \left\{
\begin{array}{l}
0, \mbox{~if~} i=j  \\
1, \mbox{~if~} i \neq j
\end{array}
\right\}, ~i,j = 1, \ldots, C. \nonumber
\end{aligned}</math>
 
This loss function assigns 0 loss to a correct decision and a unit loss to any erroneous decision, i.e. all errors are equally penalized. Thus this loss function can be seen as a function indicating the errors. The conditional risk is given by
 
<math display="block">R(\alpha_i|{\bf x}) = \sum_{j=1}^C \lambda_{i,j} P(c_j|{\bf x}) = \sum_{j \neq i}P(c_j|{\bf x}) = 1 - P(c_i|{\bf x}),</math>
 
which is exactly the probability of error, i.e. the error rate. Minimizing the probability of error is equivalent to maximizing the a posteriori probability. Therefore the Bayes decision rule minimizing the conditional risk gives the decision rule
 
<math display="block">\mbox{~Decide~} c_1 \mbox{~if~} P(c_1|{\bf x}) > P(c_2|{\bf x}) , \mbox{~otherwise~} c_2.</math>
 
This is the Bayes decision rule minimizing the error rate, i.e. the error probability, which we already introduced before. Hence this logical decision rule corresponds to selecting the zero-one loss function indicating the errors.
 
<span> ''Criteria for specific Bayes decision rules - minimax criterion''</span>
 
The principle of the minimax criterion will be shown for classification with two classes. We show that the overall risk is a linear function of the prior, i.e. it is linear in, e.g., <math display="inline">P(c_1)</math>. Let <math display="inline">\mathcal{R}_1</math> and <math display="inline">\mathcal{R}_2</math> be the (unknown) regions where the classifier decides <math display="inline">c_1</math> and <math display="inline">c_2</math>, respectively. This enables expressing the overall risk in terms of the conditional risks as
 
<math display="block">\begin{aligned}
R &&\hspace{-5mm}= \int_{\mathcal{R}_1} \left( \lambda_{1,1} P(c_1) p({\bf x}|c_1 ) + \lambda_{1,2} P(c_2) p({\bf x}|c_2 )\right) d{\bf x} \nonumber \\
&&\hspace{-5mm}+ \int_{\mathcal{R}_2} \left( \lambda_{2,1} P(c_1) p({\bf x}|c_1 ) + \lambda_{2,2} P(c_2) p({\bf x}|c_2 )\right) d{\bf x}. \nonumber
\end{aligned}</math>
 
Using the relations <math display="inline">P(c_2) = 1 - P(c_1)</math> and <math display="inline">\int_{\mathcal{R}_1}p({\bf x}|c_i )d{\bf x} + \int_{\mathcal{R}_2}p({\bf x}|c_i )d{\bf x} = 1</math>, for <math display="inline">i=1,2</math> the above expression of the overall risk can be rearranged as
 
<math display="block">\begin{aligned}
R &&\hspace{-5mm}= \lambda_{1,1} P(c_1)  \left(1- \int_{\mathcal{R}_2}  p({\bf x}|c_1 ) d{\bf x} \right) + \lambda_{1,2} \left(1-P(c_1)\right) \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x} \nonumber \\
&&\hspace{-5mm}+  \lambda_{2,1} P(c_1) \int_{\mathcal{R}_2} p({\bf x}|c_1 ) d{\bf x}  + \lambda_{2,2} \left(1- P(c_1)\right) \left(1 - \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x} \right) \nonumber \\
&&\hspace{-5mm}= \lambda_{2,2} + \left(\lambda_{1,2}-\lambda_{2,2}\right)  \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x} \nonumber \\
&&\hspace{-5mm}+ P(c_1) \bigg(\left(\lambda_{1,1}-\lambda_{2,2}\right) +  \left(\lambda_{2,1}-\lambda_{1,1}\right)  \int_{\mathcal{R}_2} p({\bf x}|c_1 ) d{\bf x} - \left(\lambda_{1,2}-\lambda_{2,2}\right)  \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x}\bigg). \nonumber
\end{aligned}</math>
 
Therefore the overall risk usually depends linearly on the prior <math display="inline">P(c_1)</math>. Sometimes it is necessary to have a classifier which performs well over a range of prior probabilities. This is the situation, for example, if the a priori probabilities of the types of fish arriving along a conveyor belt vary, or if we do not know the a priori probabilities at all. Such a classifier can be designed by setting the coefficient of <math display="inline">P(c_1)</math> in the above expression of the overall risk to <math display="inline">0</math>, i.e. by setting<br />
<math display="inline">\bigg(\left(\lambda_{1,1}-\lambda_{2,2}\right) +  \left(\lambda_{2,1}-\lambda_{1,1}\right)  \int_{\mathcal{R}_2} p({\bf x}|c_1 ) d{\bf x} - \left(\lambda_{1,2}-\lambda_{2,2}\right)  \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x}\bigg) = 0</math>,<br />
in which case the overall risk becomes independent of the prior <math display="inline">P(c_1)</math>. This setting is called the minimax solution, and the resulting overall risk <math display="inline">\lambda_{2,2} + \left(\lambda_{1,2}-\lambda_{2,2}\right)  \int_{\mathcal{R}_1} p({\bf x}|c_2 ) d{\bf x}</math> is the minimax risk. It can be seen that the resulting risk is the maximum among the prior-dependent Bayes risks (= minimum risks for a given prior), which explains the name minimax for this criterion.
 
<span> ''Estimation of the posteriors''</span>
 
The design of a classifier based on the Bayes decision rule requires knowledge of the a posteriori probabilities <math display="inline">P(c_i|{\bf x})</math>, for <math display="inline">i=1,\ldots,C</math>, which are determined by the priors <math display="inline">P(c_i)</math> and the likelihoods <math display="inline">p({\bf x}|c_i )</math>. Usually the likelihoods are computed by applying parametric statistical estimation, in which the distribution family is assumed to be either unimodal (e.g. the multivariate normal density) or multimodal.
 
<span> '''''Classification - linear discriminant functions'''''</span>
 
<span> ''General discriminant functions''</span>
 
A general way of representing a classifier is to specify it by a set of discriminant functions <math display="inline">d_i({\bf x})</math>, for <math display="inline">i=1,\ldots,C</math>. The classifier decides to assign the class <math display="inline">c_i</math> to the input feature vector <math display="inline">{\bf x}</math> if
 
<math display="block">d_i({\bf x}) > d_j({\bf x})  \mbox{~for~all~} j \neq i.</math>
 
The Bayes classifier can also be represented by means of discriminant functions. In the general case the discriminant functions can be chosen as
 
<math display="block">d_i({\bf x}) = - R(\alpha_i|{\bf x}).</math>
 
The maximum among the <math display="inline">C</math> discriminant functions determines the class with the minimum conditional risk. A still simpler discriminant function can be given for the minimum error-rate classifier as
 
<math display="block">d_i({\bf x}) = P(c_i|{\bf x}).</math>
 
This can be further simplified by utilizing that the denominator of <math display="inline">P(c_i|{\bf x})= \frac{p({\bf x}|c_i ) P(c_i)}{\sum_{j=1}^{C} p({\bf x}|c_j ) P(c_j) }</math> is independent of <math display="inline">i</math>, which leads to the alternative discriminant function
 
<math display="block">d_i({\bf x}) = p({\bf x}|c_i ) P(c_i)).</math>
 
Any monotone function can be applied to the discriminant functions, since it does not change which of them is maximal. Therefore a further discriminant function can be defined for the minimum error-rate classifier with the help of the monotone <math display="inline">ln()</math> function as <math display="block">d_i({\bf x}) = ln~ p({\bf x}|c_i ) + ln P(c_i).</math>
 
In this case all the above different discriminant functions define the same decision rule. The discriminant functions divide the feature space into disjoint decision regions <math display="inline">{\mathcal{R}_1}, \ldots, {\mathcal{R}_C}</math>. A feature vector <math display="inline">{\bf x}</math> falls into the decision region <math display="inline">{\mathcal{R}_i}</math> if <math display="inline">d_i({\bf x}) > d_j({\bf x})</math> for every <math display="inline">j \neq i</math>. Thus the decision region <math display="inline">{\mathcal{R}_i}</math> represents the set of feature vectors for which the classifier decides class <math display="inline">c_i</math>. The decision regions are separated by decision boundaries in the <math display="inline">d</math>-dimensional space of feature vectors.
 
In the special case of a classifier with two classes it is common to use a single discriminant function

<math display="block">d({\bf x}) = d_1({\bf x}) - d_2({\bf x})</math>

instead of <math display="inline">d_1({\bf x})</math> and <math display="inline">d_2({\bf x})</math>. Hence the decision rule becomes
 
<math display="block">\mbox{~Decide~} c_1 \mbox{~if~} d({\bf x}) > 0 , \mbox{~otherwise~} c_2.</math>
 
 
<span> ''Linear discriminant functions''</span>
 
In the case of discriminant functions for the Bayes classifier, the form of the underlying probability distribution must be assumed to be known. However, discriminant functions without such an assumption can also be designed. Hence, in a limited statistical sense, the procedure of determining such discriminant functions can be seen as nonparametric. An important class of such discriminant functions are the linear discriminant functions.
 
Linear discriminant functions are linear in the components of <math display="inline">{\bf x}</math> and can be given as the linear combination of the components of <math display="inline">{\bf x}</math>, in other words:
 
<math display="block">d({\bf x}) = {\bf w}^T {\bf x} + w_0,</math>
 
where <math display="inline">{\bf w}</math> is the weight vector and <math display="inline">w_0</math> is the bias or threshold weight. The decision boundary becomes a decision surface.
 
For the case of a classifier with two classes there is only one discriminant function, <math display="inline">d({\bf x}) = {\bf w}^T {\bf x} + w_0</math>. Class <math display="inline">c_1</math> is decided if <math display="inline">{\bf w}^T {\bf x}</math> exceeds the threshold <math display="inline">-w_0</math>, and class <math display="inline">c_2</math> otherwise.
 
The principle of operation of the linear classifier with two classes and a d-dimensional feature vector is shown in Figure w1.
 
Figure w1.
 
The decision surface is a hyperplane. Let us consider two feature vectors <math display="inline">{\bf x}_1</math> and <math display="inline">{\bf x}_2</math> both on the decision surface. Then <math display="inline">d({\bf x}_1) = 0 = d({\bf x}_2)</math> holds, which means <math display="inline">{\bf w}^T {\bf x}_1 + w_0 = {\bf w}^T {\bf x}_2 + w_0</math>, from which
 
<math display="block">{\bf w}^T \left({\bf x}_1 - {\bf x}_2\right) = 0.</math>
 
It follows that <math display="inline">{\bf w}</math> is perpendicular to any vector lying on the decision surface, and hence the decision surface is a hyperplane with <math display="inline">{\bf w}</math> as its normal vector.
 
The discriminant function <math display="inline">d({\bf x})</math> has a strong connection to the signed distance of <math display="inline">{\bf x}</math> to the hyperplane, <math display="inline">r</math>. This can be seen by expressing <math display="inline">{\bf x}</math> as the sum of its projection to the hyperplane <math display="inline">{\bf x}_p</math> and <math display="inline">r</math> times the unit vector in the direction of <math display="inline">{\bf w}</math>, i.e.
 
<math display="block">{\bf x} = {\bf x}_p + r \frac{{\bf w}}{\lVert {\bf w} \rVert}.</math>
 
Using <math display="inline">d({\bf x}_p) = 0</math>, for <math display="inline">d({\bf x})</math> we get
 
<math display="block">d({\bf x}) = {\bf w}^T {\bf x}_p + {\bf w}^T r \frac{{\bf w}}{\lVert {\bf w} \rVert} + w_0 = d({\bf x}_p) + r \frac{\lVert {\bf w} \rVert^2}{\lVert {\bf w} \rVert} = r \lVert {\bf w} \rVert,</math> from which <math display="block">r = \frac{d({\bf x})}{\lVert {\bf w} \rVert}.</math>
 
The normal vector <math display="inline">{\bf w}</math> points into the decision region <math display="inline">{\mathcal{R}_1}</math>: if <math display="inline">{\bf x} \in {\mathcal{R}_1}</math>, then <math display="inline">d({\bf x}) > 0</math> and thus <math display="inline">{\bf w}</math> points to the same side of the hyperplane as <math display="inline">{\bf x}</math>.
 
The signed distance of the origin to the hyperplane is <math display="inline">\frac{w_0}{\lVert {\bf w} \rVert}</math>. The origin falls into the decision region <math display="inline">{\mathcal{R}_1}</math> if <math display="inline">w_0 > 0</math> and into <math display="inline">{\mathcal{R}_2}</math> if <math display="inline">w_0 < 0</math>. If <math display="inline">w_0= 0</math>, then the hyperplane passes through the origin and the linear discriminant function <math display="inline">d({\bf x})</math> has the homogeneous form <math display="inline">d({\bf x})= {\bf w}^T {\bf x}</math>.
 
The above properties are illustrated geometrically in Figure w2.
 
Figure w2
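
These geometric properties can also be verified numerically; in the following sketch, the weight vector and the bias are assumed values:

<syntaxhighlight lang="python">
import numpy as np

w = np.array([3.0, 4.0])   # assumed weight (normal) vector, ||w|| = 5
w0 = -5.0                  # assumed bias (threshold weight)

def d(x):
    """Linear discriminant d(x) = w^T x + w0."""
    return w @ x + w0

x = np.array([2.0, 1.0])
r = d(x) / np.linalg.norm(w)   # signed distance of x to the hyperplane
print(d(x), r)                 # 5.0, 1.0 -> x lies in R_1, 1 unit away

# Two points on the decision surface: w is perpendicular to their difference.
x1 = np.array([3.0, -1.0])     # d(x1) = 9 - 4 - 5 = 0
x2 = np.array([-1.0, 2.0])     # d(x2) = -3 + 8 - 5 = 0
print(w @ (x1 - x2))           # 0.0
</syntaxhighlight>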
 
For the general case of a classifier with <math display="inline">C</math> classes, the linear discriminant functions are given as

<math display="block">d_i({\bf x}) = {\bf w}_i^T {\bf x} + w_{i0}.</math>
 
The decision rule is the same as the one for general discriminant functions:
 
<math display="block">\mbox{~Decide~} c_i \mbox{~if~} d_i({\bf x}) > d_j({\bf x})  \mbox{~for~all~} j \neq i.</math>
 
If two or more discriminant functions take the same value for some <math display="inline">{\bf x}</math> (which is called a tie, as in statistics), the decision is undefined. The classifier with <math display="inline">C</math> linear discriminant functions is also called a linear machine. It divides the d-dimensional feature space into <math display="inline">C</math> disjoint decision regions. For the boundary between two neighboring decision regions <math display="inline">{\mathcal{R}_i}</math> and <math display="inline">{\mathcal{R}_j}</math>, <math display="inline">d_i({\bf x}) = d_j({\bf x})</math> holds, or equivalently
 
<math display="block">\left({\bf w}_i-{\bf w}_j\right)^T {\bf x} + \left(w_{i0}- w_{j0}\right) = 0.</math>
 
It follows from the considerations made for the classifier with two classes that the boundary is a hyperplane with normal vector <math display="inline">\left({\bf w}_i-{\bf w}_j\right)</math>. Furthermore, the signed distance of <math display="inline">{\bf x}</math> to this hyperplane is given as <math display="inline">\frac{d_i({\bf x}) - d_j({\bf x})}{\lVert {\bf w}_i - {\bf w}_j  \rVert}</math>. These expressions show that in a linear machine the differences of the weights matter, not the weights themselves. The total number of hyperplanes can be less than the <math display="inline">\frac{C (C-1)}{2}</math> possible number of pairs. It can be shown that the decision regions in <math display="inline">{\bf x}</math> are convex.
 
Generalized linear discriminant functions are linear in some given functions of <math display="inline">{\bf x}</math> and have the form

<math display="block">d({\bf x}) = \sum_{i=1}^{d} a_i y_i({\bf x}).</math>
 
Truncated series expansions of <math display="inline">d({\bf x})</math> with arbitrary functions <math display="inline">y_i</math> lead to polynomial discriminant functions as a subclass of generalized linear discriminant functions. An example is the quadratic discriminant function, which can be given in the form

<math display="block">d({\bf x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij} x_i x_j.</math>
 
While the decision regions in <math display="inline">{\bf y}</math> are convex, they can have an arbitrary shape in <math display="inline">{\bf x}</math>. Hence generalized linear discriminant functions can describe more general feature spaces, which is a beneficial property and a motivation for using them.
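
The following sketch makes the linearity in the expanded features explicit for the quadratic case; the input and the coefficient vector are assumed values:

<syntaxhighlight lang="python">
import numpy as np

def quadratic_features(x):
    """Map x to y(x) = (1, x_1, ..., x_d, all products x_i * x_j)."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

x = np.array([2.0, -1.0])
y = quadratic_features(x)           # 1 + d + d^2 = 7 components for d = 2

a = np.arange(len(y), dtype=float)  # assumed coefficient vector a_i
d_value = a @ y                     # d(x) = sum_i a_i y_i(x): linear in y(x)
print(d_value)
</syntaxhighlight>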
 
<span> ''Finding linear discriminant functions''</span>
 
The parameters of a classifier with linear discriminant functions are the weights. Thus, training the parameters of the classifier means finding proper linear discriminant functions. This is done by formulating the task as an optimization problem. A natural choice for the optimization criterion is to minimize the training error. To perform such a minimization in a general setting, well-known numerical multivariate optimization algorithms, e.g. gradient methods, are used and adapted to the specifics of finding linear discriminant functions.
 
<span> '''''Learning from examples'''''</span> Learning from examples is a fundamental approach in machine learning. The goal is to build up knowledge from examples by training a statistical model, which represents the knowledge also for unseen examples. In this way the statistical model represents a generalization of the knowledge extracted from the examples. In some contexts such learning is also called training.
 
The classical form of such learning assumes that the correct output for the input is provided in each example. It is usual to speak of a labelled training set, which means that in each example the input is labelled with the correct output. This type of learning is called supervised learning.
 
Example: estimation of the transition probability matrix of a DTMC (via the method of Lagrange multipliers). As an example of learning from examples we consider the estimation of the transition probability matrix of a DTMC from observed state sequences.
 
<span> ''The model''</span>
 
Let <math display="inline">\mathcal{S}=\{s_1, \ldots, s_{|\mathcal{S}|}\}</math> be the set of states of the DTMC. Furthermore, let <math display="inline">\overrightarrow{\bf z} = (z_0, z_1,\ldots,z_T)</math> denote an observed state sequence. The transition probability matrix of the DTMC is denoted by <math display="inline">{\bf A}</math>, i.e. <math display="inline">{\bf A}_{i,j}= P(z_t=s_j|z_{t-1}=s_i)</math> for <math display="inline">s_i, s_j \in \mathcal{S}</math>. Observe that a given state sequence <math display="inline">\overrightarrow{\bf z}</math> implicitly includes many labelled examples of the form <math display="inline">(z_{t-1}, z_t)</math>, where the input is <math display="inline">z_{t-1}</math> and the next state <math display="inline">z_t</math> gives the corresponding correct output.
 
<span> ''Problem formulation''</span> The task of estimating the parameter matrix <math display="inline">{\bf A}</math> can be formulated as the optimization problem
 
<math display="block">{\bf A}^* = \arg\max_{{\bf A}} P(\overrightarrow{\bf z}|{\bf A}),</math> where <math display="inline">{\bf A}^*</math> is the estimated parameter matrix. We estimate the parameter matrix <math display="inline">{\bf A}^*</math> by applying the maximum likelihood (ML) statistical point estimation method. Therefore we maximize the log-likelihood of <math display="inline">\overrightarrow{\bf z}|{\bf A}</math>, which attains its maximum at the same point as the likelihood itself.
 
Omitting the condition to simplify notation, the log-likelihood <math display="inline">\log P(\overrightarrow{\bf z})</math> can be expressed as
 
<math display="block">\begin{aligned}
\log P(\overrightarrow{\bf z}) &= \log P(z_0, z_1,\ldots,z_T) = \log P(z_T|z_{T-1},\ldots, z_0)P(z_{T-1}|z_{T-2},\ldots, z_0) \cdots P(z_1|z_0) P(z_0) \\
&= \log P(z_T|z_{T-1}) P(z_{T-1}|z_{T-2})\cdots P(z_1|z_0)P(z_0) = \log \prod_{t=1}^{T}P(z_{t}|z_{t-1})\, P(z_0) \\
&= \log \prod_{t=1}^{T}{\bf A}_{z_{t-1},z_{t}}\, P(z_0) = \sum_{t=1}^{T} \log {\bf A}_{z_{t-1},z_{t}} + \log P(z_0),
\end{aligned}</math> where we used the chain rule of probability (multiplication theorem) in the first equality and the Markov property in the second.
 
This can be further rearranged by introducing an indicator variable as
 
<math display="block">\log P(\overrightarrow{\bf z})=\sum_{i=1}^{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} \log {\bf A}_{i,j} + \log P(z_0).</math>
 
Omitting the term <math display="inline">\log P(z_0)</math> does not influence the position of the maximum of the log-likelihood, since it does not depend on matrix <math display="inline">{\bf A}</math>. Furthermore, some constraints on matrix <math display="inline">{\bf A}</math> must be taken into account in the optimization. Matrix <math display="inline">{\bf A}</math> is stochastic and hence 1. <math display="inline">{\bf A}_{i,j} \geq 0</math>, i.e. the elements of matrix <math display="inline">{\bf A}</math> are non-negative, and 2. <math display="inline">\sum_{j=1}^{|\mathcal{S}|} {\bf A}_{i,j} = 1</math> for <math display="inline">i = 1, \ldots, |\mathcal{S}|</math>, i.e. the row sums of matrix <math display="inline">{\bf A}</math> are equal to 1.
 
We apply the method of Lagrange multipliers to this optimization. The non-negativity constraint need not be enforced explicitly, since the resulting matrix <math display="inline">{\bf A}</math> turns out to be non-negative automatically. Therefore the task can be formulated as the following constrained optimization problem:
 
<math display="block">\arg\max_{{\bf A}} \sum_{i=1}^{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} \log {\bf A}_{i,j}, ~ \mbox{~subject~to~}~\sum_{j=1}^{|\mathcal{S}|} {\bf A}_{i,j} = 1, \mbox{~for~} i = 1, \ldots, |\mathcal{S}|.</math>
 
<span> ''Solution by applying the method of Lagrange multipliers''</span>
 
Applying the method of Lagrange multipliers requires the introduction of <math display="inline">|\mathcal{S}|</math> Lagrange multipliers <math display="inline">\alpha_i</math>, <math display="inline">i = 1, \ldots, |\mathcal{S}|</math>, which can be arranged in a vector <math display="inline">\boldsymbol{\alpha}</math>. Thus the Lagrange function can be given as
 
<math display="block">\mathcal{L}({\bf A}, \boldsymbol{\alpha}) = \sum_{i=1}^{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} \log {\bf A}_{i,j} + \sum_{i=1}^{|\mathcal{S}|} \alpha_i \left(1 - \sum_{j=1}^{|\mathcal{S}|} {\bf A}_{i,j} \right).</math>
 
Taking the first derivative of the Lagrange function with respect to <math display="inline">{\bf A}_{i,j}</math> and setting it equal to 0 yields <math display="block">\frac{\partial \mathcal{L}}{\partial {\bf A}_{i,j}} = \frac{1}{{\bf A}_{i,j}}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} - \alpha_i = 0, \quad i,j = 1, \ldots, |\mathcal{S}|,</math> from which <math display="inline">{\bf A}_{i,j}</math> can be expressed as <math display="block">{\bf A}_{i,j} = \frac{1}{\alpha_i}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j}, \quad i,j = 1, \ldots, |\mathcal{S}|.</math>
 
Now taking the first derivative of the Lagrange function with respect to <math display="inline">\alpha_i</math>, setting it equal to 0 and applying the above expression for <math display="inline">{\bf A}_{i,j}</math> yields <math display="block">\frac{\partial \mathcal{L}}{\partial \alpha_i} = 1 - \sum_{j=1}^{|\mathcal{S}|} {\bf A}_{i,j} = 1 - \sum_{j=1}^{|\mathcal{S}|} \frac{1}{\alpha_i}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} = 0,</math> from which we get the solution for <math display="inline">\alpha_i</math> as <math display="block">\alpha_i = \sum_{j=1}^{|\mathcal{S}|} \sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} = \sum_{t=1}^{T} 1_{z_{t-1} = s_i}, \quad i = 1, \ldots, |\mathcal{S}|.</math>
 
Applying the final expression for <math display="inline">\alpha_i</math> in the expression for <math display="inline">{\bf A}_{i,j}</math>, we get for the estimate of <math display="inline">{\bf A}_{i,j}</math> <math display="block">{\bf A}_{i,j} = \frac{\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j}}{\sum_{t=1}^{T} 1_{z_{t-1} = s_i}}, \quad i,j = 1, \ldots, |\mathcal{S}|.</math>
 
This estimate of <math display="inline">{\bf A}_{i,j}</math> can be interpreted as the count of the transition <math display="inline">s_i \rightarrow s_j</math> divided by the count of <math display="inline">s_i</math>, i.e. the number of times the DTMC was observed in state <math display="inline">s_i</math>. This estimate matches intuition.
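Since the ML estimate is just transition counting, it can be implemented in a few lines; the state sequence below is a hypothetical toy example.

<syntaxhighlight lang="python">
import numpy as np

def estimate_transition_matrix(z, n_states):
    """ML estimate of a DTMC transition matrix from one observed
    state sequence z (states encoded as integers 0..n_states-1):
    A[i, j] = count(i -> j) / count(state i followed by anything)."""
    counts = np.zeros((n_states, n_states))
    for s_prev, s_next in zip(z[:-1], z[1:]):   # each labelled example (z_{t-1}, z_t)
        counts[s_prev, s_next] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of never-visited states are left as zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Toy observed sequence over the states {0, 1, 2} (hypothetical data).
z = [0, 1, 1, 2, 0, 1, 2, 2, 0, 1]
A = estimate_transition_matrix(z, n_states=3)
print(A)    # rows are non-negative and sum to 1 where state i was visited
</syntaxhighlight>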
 
<span> '''''Learning by control - MDP'''''</span> Learning by means of control is an iterative process, in each step of which the actor reacts with an action to the environment's response to its previous action. Thus the actor learns its behaviour stepwise. Such processes arise very often in nature, e.g. in the control of animal movement while foraging for food.
 
Such a learning process can be described by the mathematical model of the Markov decision process (MDP). An MDP is a discrete-time stochastic control process. In each state <math display="inline">s</math> of the MDP, the actor can take an action <math display="inline">a</math> from the set of actions allowed in that state. This causes the MDP to move into the state <math display="inline">s^{'}</math> at the next time step and to give the actor a reward <math display="inline">R_a(s,s^{'})</math> corresponding to the action and the state transition. In this way the MDP realizes learning based on actions and reward feedback. The workflow of the MDP model is shown in Figure x.
 
Figure x
 
The mathematical definition of an MDP is as follows. An MDP is a 4-tuple <math display="inline">(\mathcal{S}, \mathcal{A}, P, \mathcal{R})</math>, where
 
* <math display="inline">\mathcal{S}</math> is the state space, i.e. the set of states,
* <math display="inline">\mathcal{A}</math> is the action space, i.e. the set of actions,
* <math display="inline">P</math>, in function form <math display="inline">P_a(s,s^{'}) = P(s(t+1) = s^{'}| s(t) = s, a(t) = a)</math>, describes the probability of the state transition from <math display="inline">s</math> to <math display="inline">s^{'}</math> when action <math display="inline">a</math> is taken, and
* <math display="inline">\mathcal{R}</math> is the set of possible rewards, and <math display="inline">R_a(s,s^{'})</math> is the reward for the state transition from <math display="inline">s</math> to <math display="inline">s^{'}</math> when the action <math display="inline">a</math> is taken.
 
An example diagram of an MDP is shown in Figure y.
 
Figure y
 
The MDP is an extension of the discrete-time Markov chain (DTMC), obtained by adding actions (allowing choice) and rewards (providing motivating feedback). Consequently, an MDP with only one action per state and a common reward reduces to a DTMC. MDPs can be used for solving optimization tasks, and their solution is usually implemented by dynamic programming.
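As a sketch of such a dynamic programming solution, the following value iteration repeatedly applies the Bellman recursion <math display="inline">Q(a,s) = \sum_{s^{'}} P_a(s,s^{'})\left(R_a(s,s^{'}) + \gamma V(s^{'})\right)</math>; the two-state, two-action MDP and the discount factor <math display="inline">\gamma</math> are illustrative assumptions (the discount factor is an addition to the plain 4-tuple definition above).

<syntaxhighlight lang="python">
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Dynamic programming for an MDP given as arrays
    P[a, s, s'] (transition probabilities) and R[a, s, s'] (rewards).
    Returns the optimal state values V and a greedy policy."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_{s'} P_a(s, s') * (R_a(s, s') + gamma * V(s'))
        Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# Hypothetical MDP with 2 states and 2 actions.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # action 0
              [[0.1, 0.9], [0.9, 0.1]]])   # action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [2.0, 0.0]]])
V, policy = value_iteration(P, R)
print("V =", V, "policy =", policy)
</syntaxhighlight>

With only one available action, the maximization disappears and the loop reduces to evaluating a fixed DTMC with rewards, mirroring the reduction noted above.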
 
<span id="ci"></span>
=== CI ===
 
Computational Intelligence (CI) is a subarea of AI covering biologically motivated and hence computationally intensive methodologies and approaches.
 
<span> '''''Definition of CI'''''</span>
 
There is no commonly accepted unique definition of CI. The first definition originates from Bezdek (1994) and defines CI as having all of the following characteristics:
 
* deals with numerical data,
* has a pattern-recognition component instead of represented knowledge,
* has a computationally adaptive and fault-tolerant methodology and
* approximates human-like performance in terms of speed and error rate.
 
CI can also be defined by means of its major characteristics:
 
* a sub-field of AI,
* a set of nature-inspired approaches,
* addresses complex real-world problems, which may contain uncertainties or be stochastic in nature.
 
<span> '''''Positioning of CI relative to AI and of its history within the history of AI'''''</span>
 
CI and AI both have the long-term goal of reaching general intelligence. However, classical AI is based on hard computing techniques following binary logic with its two values 0 and 1. This can be interpreted as an element either belonging to a set or not. In contrast, CI is based on soft computing techniques following fuzzy logic, which allows real values between 0 and 1 that can be seen as degrees of membership of an element in a set. This is a clear difference between AI and CI. On the other hand, CI can be seen as a sub-area of AI in a broader sense.
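The difference can be made tangible with a tiny sketch contrasting crisp and fuzzy membership for a hypothetical set of "tall persons"; the thresholds and the ramp shape are arbitrary illustrative choices.

<syntaxhighlight lang="python">
def crisp_tall(height_cm):
    """Hard computing view: an element is in the set or not (0 or 1)."""
    return 1 if height_cm >= 180 else 0

def fuzzy_tall(height_cm):
    """Soft computing view: degree of membership in [0, 1],
    here a simple linear ramp between 160 cm and 190 cm (hypothetical)."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30.0

for h in (155, 172, 185, 193):
    print(h, crisp_tall(h), round(fuzzy_tall(h), 2))
</syntaxhighlight>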
 
While the history of AI began already in the 1950s, the term CI was first used in 1990 by the IEEE Neural Networks Council, which dealt with the development of biological and artificial neural networks. In 2001 the Council became the IEEE Computational Intelligence Society, and several years later new areas such as fuzzy systems and evolutionary computation were included in the society's scope.
 
<span> '''''Main components of CI'''''</span>
 
The major components of CI are:
 
# Artificial neural networks, modelled on biological ones, which
#* enable processing of and learning from experiential data and
#* provide fault tolerance to some extent.
# Fuzzy logic, one of the major disciplines of CI, which
#* enables soft computing techniques for modelling real-life complex processes by applying probabilistic methods and
#* can be applied for approximate reasoning, but not for learning.
# Evolutionary computation (EC), providing algorithms (Evolutionary Algorithms - EA) for global optimization, which
#* are inspired by biological evolution and
#* are usually based on a population of candidate solutions.
 
<span> '''''Evolutionary Algorithms (EA)'''''</span>
 
In this subsection we give brief mathematical descriptions of the two most broadly known and applied EA algorithms:
 
* Particle Swarm Optimisation (PSO) and
* Genetic Algorithms.
 
<span> '''''Particle Swarm Optimisation (PSO)'''''</span>
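A minimal sketch of the standard global-best PSO, with the usual velocity update <math display="inline">{\bf v} \leftarrow w {\bf v} + c_1 r_1 ({\bf p} - {\bf x}) + c_2 r_2 ({\bf g} - {\bf x})</math>; the objective function, search box and coefficient values are illustrative assumptions, not values prescribed by this text.

<syntaxhighlight lang="python">
import numpy as np

def pso(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best particle swarm optimisation of f over [-5, 5]^dim."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))   # positions
    v = np.zeros_like(x)                         # velocities
    pbest, pbest_val = x.copy(), np.apply_along_axis(f, 1, x)
    g = pbest[pbest_val.argmin()].copy()         # global best position
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.apply_along_axis(f, 1, x)
        better = vals < pbest_val                # update personal bests
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[pbest_val.argmin()].copy()     # update global best
    return g, f(g)

# Example: minimise the sphere function f(x) = sum(x_i^2).
best, best_val = pso(lambda x: float(np.sum(x * x)), dim=5)
print(best_val)   # close to 0
</syntaxhighlight>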
 
<span> '''''Genetic Algorithms (GA)'''''</span>
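Likewise, a minimal sketch of a standard genetic algorithm with tournament selection, one-point crossover and bit-flip mutation, maximizing the toy one-max fitness; all parameter values are illustrative assumptions.

<syntaxhighlight lang="python">
import random

def genetic_algorithm(fitness, n_bits=20, pop_size=40, generations=100,
                      p_cross=0.9, p_mut=0.02, seed=0):
    """Standard GA: tournament selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        def tournament():
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < p_cross:                 # one-point crossover
                cut = rng.randrange(1, n_bits)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            for c in (c1, c2):                         # bit-flip mutation
                children.append([1 - g if rng.random() < p_mut else g for g in c])
        pop = children[:pop_size]
    return max(pop, key=fitness)

# Example: one-max, i.e. maximise the number of 1-bits.
best = genetic_algorithm(fitness=sum)
print(best, sum(best))
</syntaxhighlight>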
 
<span id="anwendungsgebiete-von-ai"></span>
== Application areas of AI ==
 
<span id="anwendungen-von-kg"></span>
=== Applications of KGs ===
 
In this subsection we discuss applications of knowledge graphs (KGs).
 
(1) Information retrieval (IR) systems include tasks like
* image retrieval,
* music retrieval,
* web search,
* domain-specific retrieval (e.g. geographic, legal, ...),
* cross-lingual retrieval and
* text summarization (mainly as extracted parts of the original text).
 
(2) Semantic search is realized by integrating a KG into the results of the search engine, leading to an improved "Big Data" search engine. It extends the conventional search engine with the following new features:
* offering more relevant information,
* identifying and disambiguating entities in text,
* providing links to related entities (exploratory search).
Examples include
* the Google search engine, implemented by integrating Google's KG, and
* Bing, the Microsoft search engine, realized by integrating Microsoft's knowledge base Satori.
 
(3) Question Answering (QA) systems incorporate semantic information from KGs in order to enhance search results for semantic questions, thereby providing semantically aware question answering services. Examples include
* Watson, IBM's QA system, which uses among others YAGO, DBpedia and Freebase, and
* digital/virtual assistants like Siri, Cortana and Google Now.
QA systems can be classified as extractive or generative.
* In extractive QA the answer is extracted from the context (BERT-like models).
* In generative QA the model generates free text based on the context, using text generation models.
QA systems can also be classified as open or closed.
* In open QA the answer is taken from a provided context.
* In closed QA no context is provided and thus the answer is generated completely.
Fast QA over many documents works by first ranking the passages using passage ranking models. QA datasets are provided for academic benchmark tests. The most widely used QA dataset for academic benchmarks of extractive QA systems is the Stanford Question Answering Dataset (SQuAD), consisting of 100,000+ QA pairs on 500+ articles.
 
(4) Recommendation/recommender systems give personalized recommendations based on the user's behaviour, common preferences and interests. By analysing user clicks, content platforms can suggest further content for users to view or read. In such applications KGs help to improve accuracy, increase the diversity of recommended items and bring interpretability to recommendations. Typical examples of such KG applications are content platforms, e.g. Netflix, Search Engine Optimization (SEO), which is a subfield of search engine marketing, and social media.
 
(5) Natural Language Processing (NLP) addresses processing and understanding text and spoken speech, mainly by applying machine learning techniques. Information extraction (IE) is an NLP technique targeting the extraction of structured information from typically unstructured text; more precisely, IE lies at the intersection of IR and NLP. There is a mutually beneficial relationship between KGs and NLP. On the one hand, KGs are used to realize NLP techniques such as text summarization. On the other hand, NLP techniques such as (named) entity recognition and relation extraction are used to create KGs.
 
(6) Enterprise knowledge management is about applying KGs in the industrial sector in order to utilize the benefits they provide. These benefits include
* applying big data in order to create new business value,
* creating business viewpoints on different granularity levels,
* providing new insights through the hierarchical inclusion of company-relevant data and
* providing advanced accessibility of company information to employees.
 
(7) Domain-specific applications enable creating added value by applying KGs in different fields.
* Biomedical KG applications enable question answering and decision support in the life sciences by integrating multiple sources of biomedical information.
* Medical applications utilize KGs for integrating textual medical knowledge. One example is retrieving specific information via inference.
* In cybersecurity applications KGs are used to store modelling patterns in large datasets of transactions and organisations. This serves as a base for building applications such as detecting and predicting attacks, identifying potential frauds (various kinds of white-collar crime) and identifying security threats via suspicious transactions, abnormal user behaviour or fake accounts.
* In e-commerce applications KGs are used to store semantic information about the interactions between customers and products, and their characteristics. These can be used for a variety of tasks, including finding similar or compatible products, identifying product classes, retrieving relevant products for a given search term and retrieving similar search terms.
* In financial applications a KG is created by identifying named entities in company news and by extracting business links between relevant stocks. Such KGs are used for tasks like predicting a stock's price movement.
* KGs are used in the field of news to implement fake news detection.
* In the field of education KG nodes are used to store instructional concepts, which are used in tasks like learning resource recommendation and concept visualisation.
* In geoscience applications KGs store textual geoscience data used for information extraction and knowledge discovery services.
 
<span id="anwendungen-des-mls"></span>
=== Applications of ML ===
 
ML is a technology with many applications, both commonly known ones and ones in a wide variety of specialist fields, including finance, healthcare, agriculture and many more.
 
In the following we give brief descriptions of several commonly known applications.
 
Speech recognition is the process of converting spoken speech into written text. It is also called speech-to-text (STT) or automatic speech recognition (ASR). Speech recognition is used e.g. in speech-automated call centers, in voice dialling (also called speech-enabled dialling) or in Apple's speech assistant Siri.
 
Speaker recognition (also called voice recognition) is the task of identifying the speaker from a short (usually a few seconds long) speech sample.
 
Speech synthesis covers the task of converting written text into human speech. It is also called text-to-speech (TTS) and is the reverse task of speech recognition.
 
Language translation tools, e.g. Google Translate, can translate written text in one language into the corresponding text in another language. They are based on Natural Language Processing (NLP) techniques including POS tagging and Named Entity Recognition (NER).
 
Image recognition is the task of identifying patterns in an image or video. It can be used, among others, for face detection and face recognition, or as a basis for further analysis.
 
Traffic pattern prediction generates predictions about upcoming traffic based on the current situation. It is used e.g. to identify and suggest the fastest route to the required destination.
 
Data mining has the goal of extracting information from a (usually large) data set and converting it into a structure appropriate for further use. Data mining is applied e.g. by intelligence services or for outlier detection in statistics.
 
E-commerce product recommendation is a marketing application generating product recommendations based on collected data about past customer behaviour and past shopping.
 
E-mail spam detection (email filtering) works by means of spam filters using ML algorithms to classify incoming e-mails as spam or not.
 
Malware detection is realized by a pattern recognition system trained on features extracted by analysing suspicious activities.
 
Computer vision uses ML techniques to process, analyse, visualize or interpret high-dimensional real-life data.
 
Transportation services (e.g. Uber) use ML to analyse traffic conditions in order to compute the estimated time of arrival (ETA) at the required destination.
 
ML applied to business problems is also called predictive analytics. Here we give short descriptions of several ML application examples in finance.
 
* Stock market and day trading. In this application ML models are trained to predict the evolution of stock market prices.
* Focused account holder targeting. Here ML algorithms classify account holders into segments with predefined balances and loans.
* Fraud detection (in banking transactions) is implemented by an ML algorithm giving each transaction a score that represents the probability of fraud. The training of the algorithm is based on patterns of unusual behaviour identified from large amounts of transactional data. This is the most important use of ML in the banking and finance area.
* Loan eligibility prediction is realized by different ML classifiers (e.g. Random Forest) judging customers' eligibility for being granted a loan.
 
Next, several applications of ML in healthcare are briefly considered.
 
* In personalized treatment/medication, ML is used to identify patient gene patterns (response markers) that could enable targeted therapies.
* In genetics and genomics, ML is used to identify gene sequences in genome sequencing, in gene modification and in genetic research in general.
* In cancer prognosis and prediction (especially with ANNs), ML is used to construct prediction models that support decisions on therapy and predict its course.
* In drug development, the process of drug discovery can be sped up by applying ML techniques.
 
Artificial Neural Networks (ANNs) combine ML with the neural network model. The large majority of applications have benefited from using ANNs instead of pure ML algorithms, e.g. image recognition, medical diagnosis, speech recognition, machine translation, computer vision, cancer prognosis and prediction, social network filtering, and playing board and video games.
 
Embedded Machine Learning is a subfield of ML in which ML is applied to embedded systems with limited resources. Such embedded systems include microcontrollers, wearable computers and edge devices. Using ML methods in embedded systems makes transferring and storing data on cloud servers unnecessary. Embedded Machine Learning techniques include, among others, approximate computing and hardware acceleration.
 
For more details on applications of ML see e.g. the Wikipedia page [https://en.wikipedia.org/wiki/Machine_learning].
 
<span id="weitere-anwendungsgebiete-von-ai"></span>
=== Further application areas of AI ===
 
Further application areas of AI include
 
# Educational sector, e.g. by creating automated messages to students or by designing content based on the user's interests (smart content creation),
# Robotics, e.g. by making real-time decisions possible based on NLP, object recognition and human-robot interaction (HRI),
# Navigation, e.g. by computing the best route based on GPS and AI technologies,
# Healthcare, e.g. by patient monitoring or surgical assistance,
# Automobiles, e.g. by Advanced Driving Assistance Systems (ADAS) or autonomous driving,
# Agriculture, e.g. by crop monitoring, supply chain maintenance or weather forecasting,
# Human resources, e.g. by screening,
# Lifestyle, e.g. by personalized recommendations or virtual assistance,
# Gaming, e.g. by animation,
# Astronomy, e.g. by the analysis of astronomical data and the detection of, e.g., exoplanets,
# Travel and transport, e.g. by enabling truck platooning,
# Military, e.g. by detecting cyberattacks and by decision support, e.g. for resource allocation.
 
Feeding the recent developments in neural networks back into older AI concepts has resulted in modern combinations of AI models. Perhaps the two most important ones are Neural Language Models and Large Language Models.
 
Neural Language Models (NLMs) are LMs based on recurrent neural networks. They are also called continuous-space language models. They transform the LM representation of words into a continuous low-dimensional vector space, called the embedding space. This has the advantage of locating semantically similar units closer to each other in the embedding space. Such units can be parts of a corpus, sentences, sub-words or characters. Furthermore, NLMs avoid the data sparsity problem of statistical LMs (SLMs), since they represent words as non-linear combinations of weights in a neural net. NLMs can be categorized as either non-contextual or contextual embeddings. Non-contextual embeddings apply the same embedding to a given semantic unit regardless of the context. Examples of non-contextual embeddings include Word2Vec or Wikipedia2Vec. In contrast, contextual embeddings can represent different semantics of the semantic units in different contexts. Examples of contextual embeddings include Bidirectional Encoder Representations from Transformers (BERT), introduced by Google, and Sentence-BERT (SBERT), a fine-tuned version of BERT. The transformer model is a specific NN architecture introduced by Vaswani et al.
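To illustrate how embeddings place semantically similar units closer together, here is a toy cosine-similarity computation on made-up 3-dimensional vectors; real embeddings such as Word2Vec are learned and have hundreds of dimensions, and the numbers below are purely hypothetical.

<syntaxhighlight lang="python">
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors; 1 = same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up toy embeddings (real ones are learned and much higher-dimensional).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.75, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}
print(cosine_similarity(emb["king"], emb["queen"]))  # high: semantically close
print(cosine_similarity(emb["king"], emb["apple"]))  # lower: semantically distant
</syntaxhighlight>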
 
Large Language Models (LLMs) are transformer-based LMs that are pre-trained using self-supervised learning, learn billions of parameters during training and need large computational resources for training and operation. They can achieve general-purpose language understanding and generate answers in the form of human-like text. Such LLMs are used in generative AI systems. Recent versions can complete specific tasks via prompt engineering, i.e. by providing a short structured description of the task that can be interpreted by the LLM. The most prominent examples are OpenAI's GPT-3.5 and GPT-4 models (used in ChatGPT) and Google's PaLM (used in Bard).
 
<span id="ethik-in-der-ki"></span>
== Ethics in AI ==
 
Ethics is, in general, a set of moral rules and guidelines that help people decide between right and wrong. AI can be expected to have considerable effects on all of humanity and the world in the near future. It is therefore important to pay substantial attention to ethical questions related to AI.
 
In recent years, organisations have deployed various big data algorithms and AI solutions, mostly to improve their business results through automation and data-driven decision making. In doing so, some companies were confronted with unexpected negative consequences of their AI applications, in particular due to unfair results and the use of biased datasets. This has forced leading companies as well as the AI research and data science communities to engage thoroughly with the ethical aspects of AI. A lack of appropriate rules in this area can lead to reputational damage, high fines, and regulatory and legal problems.
 
The ethics of AI (AI ethics for short) is a subfield of applied ethics that deals, among others, with the following questions:
 
* the intended role of AI systems and the ethical aspects that enable their use according to these roles,
* ethical rules and guidelines for people who design, build, test, certify and use AI systems, and
* the desired ethical behaviour of AI systems (machine ethics).
 
As a guideline for ethics in experimental research and the development of algorithms, the Belmont Report ([https://www.hhs.gov/ohrp/sites/default/files/the-belmont-report-508c_FINAL.pdf]) is widely used in the academic community. Its three essential principles are:
 
# Respect for persons
# Beneficence
# Justice (regarding fairness and equality)
 
Although a multitude of ethical elements, principles and guidelines for AI have been proposed, there is currently no unified guideline for AI ethics. However, there is some consensus on integrating the following two elements into AI ethics:
 
* Governance - to ensure compliance with legal regulations in cooperation with government agencies.
* Explainability - to explain how AI systems work (transparency) in order to create trust in AI systems.
 
There are several organisations whose goal is to promote AI ethics:
 
* CHAI: The Center for Human-Compatible Artificial Intelligence ([https://humancompatible.ai/]) is a working cooperation of several universities and institutes that promotes trustworthy AI and provably beneficial systems.
* DARPA: The Defense Advanced Research Projects Agency of the US Department of Defense ([https://www.darpa.mil/work-with-us/ai-next-campaign]) promotes research on explainable AI.
* NASCAI: The National Security Commission on Artificial Intelligence ([https://www.nscai.gov/]) is a US commission "that considers the methods and means necessary to advance the development of artificial intelligence, machine learning, and associated technologies to comprehensively address the national security and defense needs of the United States."
* AlgorithmWatch: A non-profit organisation ([https://algorithmwatch.org/en/]) that advocates the use of explainable and traceable decision processes and algorithms.
* AI Now Institute: A non-profit organisation at New York University ([https://ainowinstitute.org/]) that studies the social implications of AI.
 
According to Luciano Floridi and Josh Cowls, there is broad agreement that the ethical principles for a socially beneficial use of AI can be built on the four principles of medical ethics plus an additional fifth principle, explainability:
 
"1. Beneficence: AI technology should benefit humanity, promote well-being, preserve human dignity and help sustain the planet.<br />
2. Non-maleficence: Negative consequences of excessive or abusive use of AI must be avoided; AI should only be deployed within a safe framework.<br />
3. Autonomy: People must retain full freedom of decision about whether and which tasks they delegate to AI systems; their freedom to set their own norms and standards must be preserved.<br />
4. Justice: Preserving solidarity, avoiding injustice, and providing equal access to the benefits of AI.<br />
5. Explainability (intelligibility): Principles 1 to 4 can only be realized if laypersons, too, can understand what benefit or harm AI brings to society and who is responsible for which effects."
 
There are currently a number of ethical debates concerning AI systems, arising in several areas. Autonomous driving is a prime example. Automated systems are expected to achieve at least a reduction of harm compared to human driving performance. Since current AI systems produce erroneous reactions, albeit very rarely, it is at least questionable whether autonomous vehicles can meet this expectation. It is also problematic how an autonomous vehicle should react in dilemma situations that cannot be standardized, such as weighing life against life in suddenly arising dangers. Similar debates also occur in the areas of autonomous weapon systems and machine ethics. In general, the following sources of debate can be identified:
 
# AI systems may make mistakes, both through limitations of data collection, documentation and algorithmic rules, and through the incompleteness of the corpora used in decision making (it is not possible to feed all the information channels into the corpora that would be needed for the AI to reach the same decision a human would make).
# Concerns regarding the design goals of the system, e.g. creating better business results instead of serving the public interest.

A further debate is formed by considerations about the technological singularity, i.e. the point at which AI surpasses human intelligence. Although such superintelligence is not imminent, it is associated with fear of danger to humanity. This fear, however, is at least partly unfounded, since it is triggered by the fact that the functioning of AI systems is unknown to most people, i.e. by the lack of transparency.


Raymond Kurzweil, Pionier unter anderen der optischen Texterkennung (Optical Character Recognition - OCR), Sprachsynthese (speech synthesis), Spracherkennung, ein Visionär der künstlichen Intelligenz hat vorhergesagt, dass (einfach formuliert) spätestens im Jahr 2029 die KI klüger sein werden als das Mensch. Kritiker sagen, dass in der Richtung der Entwicklung zukünftiger Technologien er wahrscheinlich Recht hat, aber die prognostizierte Zeitspanne stimmt nicht, er ist zu optimistisch.

KI ist eine unvermeidliche zukünftige Entwicklungsrichtung und eine der Kernfachgebiete der Wissenschaftsfeld Data Science. Beachten Sie, dass Data Science in wirtschaftlichen Kontext auch als Business Analytics bekannt ist.

Grundlagen von Computational Intelligence und AI

Mathematische Grundlagen von KI

Es gibt keine Einheitliche Definition von KI. Eine mögliche ausdrucksvolle Definition kann wie folgt angegeben werden: KI ist the field of study in computer science that develops and studies intelligent machines. Allerdings unter KI ist auch die ,,intelligente Maschine“ selbst zu verstehen.

Die Ziele von KI sind Teilmenge möglicher Arten menschlicher intelligenter Aktivitäten zu implementieren. Dazu gehört Argumentation (reasoning), Wissensrepräsentation (knowledge representation), Planung (planning), Lernen (learning), Verarbeitung natürlicher Sprache (natural language processing), Wahrnehmung (perception) und Unterstützung für Robotik (support for robotics).

Im Folgenden geben wir eine kurze mathematische Beschreibung einiger ausgewählten grundlegender Konzepte der KI. Das beinhaltet

  1. Regelbasierte Methoden,
  2. Wissensrepräsentation,
  3. Linear regression mit MSE-Framework
  4. Klassifikation - Bayesian decision theory
  5. Klassifikation - lineare Diskriminanzfunktionen
  6. Lernen aus Beispiele
  7. Learning durch Steuereung - MDP

Regelbasierte Methoden

Logische Argumentation zielt auf menschliche Fähigkeit ab, Schritt für Schritt zu argumentieren oder/und logische Schlussfolgerungen zu ziehen. Dies wird realisiert durch Regelbasierte Methoden (rule-based methods). Sie bieten eine deterministische Methode zur Klassifizierung in Problemen, in welchen Klassen aus Beziehungen von Entitäten bestimmt werden können, die Eigenschaften einiger Entitäten darstellen. Solche Entitätenbeziehungen werden üblicherweise durch die breite Klasse der if-then rules (Wenn-Dann-Regeln) dargestellt. Eine einfache if-then rule lautet:
IF EinerDerFünfGrößtenOrteÖsterreichs(x) THEN Stadt(x)
was natürlich bedeutet, dass wenn ein Objekt x der Eigenschaft hat, einer der fünf größten Orte von Österreichs zu sein, dann ist x eine Stadt.

If-then rules können auch logische Verknüpfungen einbeziehen, wie es im folgenden Beispiel gezeigt wird:
IF Fahrzeug(x) AND AufSchienenFahrendes(x) THEN Bahnfahrzeug(x)
Darüber hinaus sind in if-then rules auch Funktionen erlaubt, die numerische Werte zurückgeben. Dies wird wie folgt veranschaulicht:
IF Female(x) AND (Age(x) < 16) THEN Girl(x),
wobei der Age(x) ist eine Funktion, die das numerische Alter in Jahren zurückgibt, und der Ausdruck (Age(x) < 16) gibt entweder logisch ,,True“ oder ,,False“ zurück.

Eine predicate (Aussage), wie z.B. Vögel(x), HatFlügel(x) and HatKörperbedeckungAusFeder(x) and HatSchnabel(x), ist ein Test, der entweder logisch ,,True“ oder ,,False“ zurückgibt. Es gibt zwei Haupttypen von if-then rules, die propositional (variablenfrei Aussage) and first-order Regeln. Die bisherige Beispiele sind alle first-order Regel, da alle sich auf eine Variable x beziehen. Ein Beispiel für propositional if-then rules ist
IF EineDerFünfGrößtenOrteVonÖsterreich (Wien) THEN Stadt(Wien),
da diese Regel eine Aussage nur für die logische Konstante (=bestimmte atomare Instanz) Wien beschreibt.

Die first-order if-then rules sind leistungsstarke logische Regeln, was durch folgendes Beispiel verdeutlicht wird:
IF Parent(x,y) AND Parent(y,z) THEN Grandchild(z,x)

If-then rules können auf algorithmischem Wege erlernt werden. Der Sequential Covering Learning Algorithmus basiert auf dem Lernen aus Trainingsdaten, die aus einer Menge positiver und negativer Aussagen (Prädikate) bestehen. Der Algorithmus erstellt Regeln, die die Menge der positiven Prädikate abdecken, aber keines der negativen Prädikate. Das Grundprinzip des Algorithmus besteht darin, jeweils eine Regel zu lernen und diesen Vorgang zu wiederholen um schrittweise alle positiven Prädikate abzudecken. Daher heißt der "Sequential Covering". Entsprechend umfasst der Algorithmus drei Schritte: Lernen einer einzelnen Regel um am meisten positive Prädikate abzudecken (Learn-One-Rule), Löschen der dadurch abgedeckte Prädikate und Iterieren. Das Ergebnis ist ein disjunktes Regelwerk, das die gesamte Menge der positiven Prädikate abdeckt.

Beispiel .

Als Trainingsdaten werden die folgende Aussagen über Möbelwaren angegeben:


Trainingsdaten - Aussagen über Möbelwaren
Aussage-ID     Größe     Möbelhaus     Teuer
1     groß     Möbelix     False
2     klein     Möbelix     False
3     groß     Kika     True
4     klein     Kika     False
5     groß     XXXLutz     True
6     klein     XXXLutz     True


Die Aussagen 3,5 und 6 bilden die Menge der positiven Prädikaten, während die anderen (1,2 und 4) der Menge der negativen Prädikaten gehören. Zuerst werden mittels Learn-One-Rule die beste Hypothese bestimmt, d.h. die Regel die am meisten positive Prädikate abdeckt:
IF Möbelhaus(x,XXXLutz) THEN Teuer(x),
wobei x die Variable für Möbelwaren ist. Dann werden die Prädikaten 5,6 aus dem Trainingsdaten gelöscht. Wendet man Learn-One-Rule noch einmal an, bekommt man die Regel
IF Größe(x,groß) AND Möbelhaus(x,Kika) THEN Teuer(x)
Damit alle positiven Prädikate wurde abgedekt, also endet der Algorithmus hier. Beachten Sie, dass die beiden gefundenen Regeln disjunkt sind, was aus dem Verlauf des Algorithmus folgt.

Die endgültige Regel lautet wie folgt:
IF Möbelhaus(x,XXXLutz) (deckt die Beispiele 5,6)
OR (Größe(x,groß) AND Möbelhaus(x,Kika)) (deckt das Beispiel 3)
Then Teuer(x)

Die Visualisierung des Sequential Covering Learning Algorithmus wird in Abbildung 1.1 gezeigt.

Abbildung 1.1 Visualisierung des Sequential Covering Learning Algorithmus

Das vorige Beispiel zeigt den Aufbau eines Regels mittels Sequential Covering Learning Algorithmus für Binäre Klassifizierung, welche die mögliche Prädikate in eine von zwei Klassen eingeteilt werden: Teuer(x) = True oder Teuer(x) = False. Die Regelbasierte Methoden können für Klassifizierung nicht nur in zwei sondern in mehrere Klassen angewendet werden. Nachfolgend wird der Sequential Covering Learning Algorithmus für den allgemeinen Fall von Lernregeln für die Klassifizierung in eine endliche Anzahl von Klassen gezeigt.

—————————————————————————————
Algorithm  Sequential Covering Algorithm
—————————————————————————————
Eingabe:
- Die Menge T, Trainingsdaten, bestehend aus positiver und negativer Prädikaten
- Die geordnete Menge von Klassen,
Output: Individuelle Regel für jede Klasse
in Form eine mit AND verbundener Liste der gefundenen Regeln,
für die Klasse ,
die die Klassifikation durch Ausprobieren der Regellisten, ,
in der Reihenfolge , realisieren.
—————————————————————————————
1 Initialize the rule lists, ,
2 for i= 1:(k-1)
3   while not every positive predicate is covered in class
4      <– Learn-One-Rule
5     Removing predicates from training data covered by
6     Add r to the end of the rule list
7   end
8 end
—————————————————————————————


Auf diese Weise werden mögliche komplexe if-then rules erlernt. Dieser Greedy-Algorithmus ist nicht unbedingt optimal, d. h. er liefert nicht unbedingt die kompaktesten Regeln. Daher können die resultierenden Regeln in einem Nachbearbeitungsschritt durch die Anwendung logischer Umformulierungsregeln weiter vereinfacht werden. Der Schritt Learn-One-Rule kann auf auf unterschiedliche Art und Weise realisiert werden. Eine effektiver Ansatz ist beam search. Man kann über Sequential Covering Algorithmenfamilie reden, da geänderte und erweiterte Varianten des ursprünglichen Algorithmus existieren. Typische Varianten sind Ripper Algorithmus, Foil Algorithmus oder CN2 Algorithmus. If-then rules können auch auf andere Weise, z. B. mittels Entscheidungsbäume auch gelernt werden.

Regeln haben den Vorteil, dass sie einfach zu interpretieren sind und in relationalen Datenbanken implementiert werden können. Allerdings hängt die Auswahl der Prädikate für die Trainingsdaten stark von der Aufgabe ab und ist in der Praxis eine schwierigere Aufgabe als das Erlernen der Regeln. Regelbasierte Methoden sind integraler Bestandteil von Expertensystemen in der KI.

Wissensrepräsentation

Representing knowledge on word level can be realized by probabilistic model of natural language, which is called Language Model (LM) Usually an LM is used to estimate the probability of a given word sequence , i.e. , which can be given as a product of the conditional probabilities as
.

In the N-gram LM, due to practical reasons, the word sequence probability, is approximated as
,
where the words with negative index must be omitted from the probabilities. The conditional probabilities can be estimated from the relative frequencies of the N-grams and (N-1)-gramms as
,
where stands for the number of occurrences of string x.

The probabilities for and for are called unigram, bigram and trigram probabilities, respectively. The N-gram LM is purely Statistical LM (SLM), since it is trained on N-gram statistics of the training corpora. The estimates of the conditional probabilities to be reliable requires that the corresponding N-grams and (N-1)-grams must be present in the training corpora. This implies that the training corpus must be large enough. The SLM can not assign any probability to N-grams, which does not appear in the training corpus. This problem is usually solved by applying smoothing techniques, in which a small part of the probability measure is shifted to the unseen grams.

The most effective method of smoothing is the Kneser–Ney smoothing, in which a fixed value is subtracted from the conditional probability’s of N-grams with lower frequencies and the so won probabilities are spread over the conditional probabilities of all non-appearing N-grams. The equation for the conditional bigram probabilities is given by

and
.
Here is the constant discount value to be subtracted and is a normalization constant, which is set to make the sum of over all equal to one. The conditional probability’s of N-grams with lower frequencies will be determined by the unusual probability term , which realizes an estimation for the conditional probability of the unseen bigram form other bigram counts related to .

Besides of LMs, another ways of knowledge representation are

  • Semantic network - a graph representing semantic relations between concepts. The concepts are represented by the nodes and the edges specify the semantic relations. One reason to design semantic networks is to represent knowledge in machine readable form. Knowledge graphs are semantic networks with limited set of semantic relations.
  • Ontology- hierarchical representation of concepts and their relationships, which can be managed by standard ontology language such as OWL or RDF.

Linear regression mit MSE-Framework

The task of linear regression is to give an linear prediction for the random scalar output from the random input vector . Let be the predicted value of . It is assumed that the depends linearly from the vector , hence the predicted value can be given as linear combination of as or in vector form

where is a column vector of parameters. Here stands for the transpose operation and the vectors are by default column vectors. The parameters can be seen as weights. If input value has positive weight, then increasing also increases the predicted value . Similarly if has negative weight, then increasing decreases the predicted value . If is a large value then has a large impact on . If then has no effect on .

Suppose that examples of the vector and the correct value of output to each of them are given. We arrange the input vectors in a matrix on that way that vector is placed into the -th row of matrix , . The output values and the predicted values are also arranged into a column vector and so, that the correct value and the predicted value belonging to the input vector comes to the -th position in the vector and , respectively.

The optimal parameter vector can be determined from an optimum task with respect to a performance measure quantifying the quality of prediction.

Linear regression with mean squared error

One possible choice for the performance measure is the Mean Squared Error (MSE) of the predictions, which is biven by

where stands for the norm of .

Then the optimal value of parameter vector is obtained by minimizing the MSE. The necessary condition of having a local minimum of MSE is to have a value of parameter vector , for which the gradient of MSE is 0. It can be shown, that in our case, MSE as function of is a convex function, in which case there is only one such value of , and hence it is also the global minimum point. Therefore the best value of parameter vector is obtained by solving the equation

where stands for the gradient with respect to . The set of predicted outputs can be given in vector-matrix form as . Applying it in the above relation gives the equation to be solved in terms of as

Omitting the constant , evaluating the gradient and performing several rearrangements we get

Fehler beim Parsen (Unbekannte Funktion „\hspace“): {\displaystyle \begin{aligned} \nabla_{{\bf w}} \left( \lVert {\bf X}{\bf w} - {\bf y} \rVert_2^2 \right) &&\hspace{-5mm}= \nabla_{{\bf w}} \left({\bf X}{\bf w} - {\bf y} \right)^T \left({\bf X}{\bf w} - {\bf y} \right) = \nabla_{{\bf w}} \left( {\bf w}^T {\bf X}^T{\bf X}{\bf w} - 2 {\bf w}^T {\bf X}^T {\bf y} + {\bf y}^T {\bf y}\right) \nonumber \\ &&\hspace{-5mm}= 2 {\bf X}^T{\bf X}{\bf w} - 2 {\bf X}^T {\bf y} = 0, \nonumber \end{aligned}}

where we utilized in the second equality that , since is a scalar and thus it equals to its transpose, and in the last equality that matrix is symmetric enabling the above form of .

Finally the solution for the parameter vector can be expressed from the above equation as

Matrix is quadratic enabling taking the inverse of it. Nevertheless the inverse exists only if matrix is non-singular, which usually holds, since matrix is constructed from independent input examples.

The above equations in scalar form lead to system of equations, which are called as normal equations.

Linear regression is also used for a slightly extended prediction model, which is given as

where the scalar is called bias and results that the hyperplane described by the multivariate function does not pass through the origin. This extended case can be led back to the base case by introducing the extended vectors and leading to the equivalent description

An illustrating example with one-dimensional input values is shown in Figure 1.

Connection of linear regression with MSE to maximum likelihood estimation

Let be a sequence of samples taken from the underlying population producing independent samples, each of them from the same random variable with a likelihood function Fehler beim Parsen (Syntaxfehler): {\textstyle p(z|{\mbox{\boldmath$\theta$}})} from a known distribution family with unknown parameter vector Fehler beim Parsen (Syntaxfehler): {\textstyle {\mbox{\boldmath$\theta$}}} . The likelihood function equals to the probability (mass) function and the probability density function (pdf) for discrete and continuous distribution, respectively. The Maximum Likelihood Estimation (ML estimation or MLE) of the parameter vector is the value, which maximizes the probability of the observed sequence of samples. In other words

Fehler beim Parsen (Syntaxfehler): {\displaystyle {\mbox{\boldmath$\theta$}}_{ML} = \arg \max_{{\mbox{\boldmath$\theta$}}} p({\bf z}|{\mbox{\boldmath$\theta$}}) = \arg \max_{{\mbox{\boldmath$\theta$}}} p(z_1,\ldots,z_M|{\mbox{\boldmath$\theta$}}) = \arg \max_{{\mbox{\boldmath$\theta$}}} \prod_{m=1}^{M} p(z_m|{\mbox{\boldmath$\theta$}}),}

where we utilized that the samples are taken independently from the same random variable .

The product over more likelihoods can cause numerical issues, like e.g. numerical underflow. Therefore it is usual to use the logarithm of the likelihood instead of the likelihood. This modification does not cause any change in argmax and max, since the logarithm function is monotone. This gives the usual form of the ML estimate of the parameter vector as

Fehler beim Parsen (Syntaxfehler): {\displaystyle {\mbox{\boldmath$\theta$}}_{ML} = \arg \max_{{\mbox{\boldmath$\theta$}}} log \left( \prod_{m=1}^{M} log p(z_m|{\mbox{\boldmath$\theta$}})\right) = \arg \max_{{\mbox{\boldmath$\theta$}}}\sum_{m=1}^{M} log p(z_m|{\mbox{\boldmath$\theta$}}).}

Applying the principle of MLE to a conditional likelihood function , the conditional likelihood function from a known distribution family with unknown parameter vector Fehler beim Parsen (Syntaxfehler): {\textstyle {\mbox{\boldmath$\theta$}}} becomes Fehler beim Parsen (Syntaxfehler): {\textstyle p(y|{\bf x}, {\mbox{\boldmath$\theta$}})} . Thus the ML estimate of the parameter vector in such case is given by

Fehler beim Parsen (Syntaxfehler): {\displaystyle {\mbox{\boldmath$\theta$}}_{ML} = \arg \max_{{\mbox{\boldmath$\theta$}}}\sum_{k=1}^{K} log~ p(y_k|{\bf x}_k, {\mbox{\boldmath$\theta$}}).}

Now we select the normal distribution as the known distribution family and set the mean of the distribution to be the predicted output value, and the variance of the distribution to a fixed value . Then the conditional likelihood function, which in this case a conditional pdf, becomes Fehler beim Parsen (Syntaxfehler): {\textstyle p(y|{\bf x},{\mbox{\boldmath$\theta$}}) = N(y| \hat{y}({\bf x},{\bf w}),\sigma^2)} , where stands for the pdf of the normal distribution with mean and variance . Applying the expression of the pdf of the normal distribution , we get for the ML estimate of the predicted output values,

Fehler beim Parsen (Unbekannte Funktion „\hspace“): {\displaystyle \begin{aligned} {\bf \hat{y}} &&\hspace{-5mm}= \arg \max_{\bf \hat{y}}\sum_{k=1}^{K} log N(y_k| \hat{y_k}({\bf x_k},{\bf w}),\sigma^2)\nonumber \\ &&\hspace{-5mm} = \arg \max_{\bf \hat{y}}\sum_{k=1}^{K} log \left(\frac{1}{\sqrt(2 \pi \sigma^2)}exp\left(-\frac{1}{2} \frac{(y_k-\hat{y_k})^2}{\sigma^2} \right)\right) \nonumber \\ &&\hspace{-5mm}= \arg \max_{\bf \hat{y}} \left(-K log \sigma -\frac{K}{2} log(2 \pi) - \sum_{k=1}^{K} \frac{(y_k-\hat{y_k})^2}{2\sigma^2} \right) = \arg \max_{\bf \hat{y}} \left(-\sum_{k=1}^{K} ((y_k-\hat{y_k})^2\right) \nonumber \\ &&\hspace{-5mm}= \arg \min_{\bf \hat{y}} \sum_{k=1}^{K} ((y_k-\hat{y_k})^2 = \arg \min_{\bf \hat{y}} \lVert {\bf \hat{y}} - {\bf y} \rVert_2^2 \nonumber. \end{aligned}}

Assuming the linear relation among and as , the above optimization with respect to results in the same estimation for as minimizing the MSE. Consequently the minimizing MSE can be seen as a ML estimation, which justifies the use of MSE due to the desirable properties of the ML estimation.

Klassifikation - Bayesian decision theory

Bayesian decision theory is a fundamental probabilistic approach to the task of classification. This approach uses probabilistic arguments and losses associated to decisions.

Bayes decision rule - with minimizing the error probability

Let and be two classes, e.g. two types of see fishes, like herring and mackerel. Suppose that an observer watching fishes arriving along a conveyor belt tries to predict which type of fish is the next one. The arriving type of fishes is random. It can be assumed that there is a probability that the next fish is herring and a probability that the next fish is mackerel. These probabilities are called a priori probabilities (or priors) reflecting that they represent the only knowledge on the emergence of these types of fishes. Having only these probabilities available for the decision, it seems logical to use a decision rule: Decide if and otherwise decide . This decision rule makes sense if there is only one fish to decide about its type. However applying this rule for a sequence of fishes seems to be incorrect: although we know that both types of fishes appear, we always decide the next one to be the same type. On this way we introduce an error whose probability is always the smaller one among and .

Fortunately, in most practical situations we do not have to make decisions with so little information. Let us assume that information on several features of the fish types, like e.g. their length, weight, etc., is available in advance. These features can be measured for the next emerging fish and organized in a feature vector <math>\mathbf{x}</math>. Now we assume the more general case of having more classes, <math>c_1, \ldots, c_C</math>. The information on the features belonging to class <math>c_i</math> can be given by the class-conditional density function <math>p(\mathbf{x}|c_i)</math>. Suppose that both the a priori probabilities <math>P(c_i)</math> and the class-conditional density functions <math>p(\mathbf{x}|c_i)</math> are given. Then the probability of each class, taking into account the information observed in the features, can be given by the conditional probability of that class given the observed values of the features, <math>P(c_i|\mathbf{x})</math>. This conditional probability can be expressed by means of the Bayes formula as

<math>P(c_i|\mathbf{x}) = \frac{p(\mathbf{x}|c_i)\, P(c_i)}{p(\mathbf{x})}, \quad \text{where } p(\mathbf{x}) = \sum_{j=1}^{C} p(\mathbf{x}|c_j)\, P(c_j).</math>

The probability <math>P(c_i|\mathbf{x})</math> is called the a posteriori probability (or posterior), reflecting the incorporation of the information on the features into the probability. The likelihood <math>p(\mathbf{x}|c_i)</math> indicates that, if the priors of the classes are equal, then the class with the highest likelihood is most likely to be the right class.

At this point it seems logical to establish the decision rule:
Decide <math>c_i</math> with <math>P(c_i|\mathbf{x}) \geq P(c_j|\mathbf{x})</math> for every <math>j \neq i</math>.
The expected value of the probability of error for this decision rule is given by

<math>P(\text{error}) = \int P(\text{error}\,|\,\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}.</math>

Deciding the class <math>c_i</math> with the highest posterior <math>P(c_i|\mathbf{x})</math> leads to an error with probability <math>P(\text{error}\,|\,\mathbf{x}) = 1 - P(c_i|\mathbf{x})</math>, for which <math>P(c_i|\mathbf{x}) \geq P(c_j|\mathbf{x})</math> for every <math>j \neq i</math>. Therefore <math>1 - P(c_i|\mathbf{x}) = \min_j \left(1 - P(c_j|\mathbf{x})\right) = \min P(\text{error}\,|\,\mathbf{x})</math>, i.e. this error is the minimal one for the given <math>\mathbf{x}</math>. It ensures that <math>P(\text{error}\,|\,\mathbf{x})</math> in the above integral is as small as possible, hence the integral, and thus also <math>P(\text{error})</math>, is minimal. So the above decision rule minimizes the probability of error; in fact, it is the Bayes decision rule minimizing the probability of error.
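The following small Python sketch illustrates this decision rule for the fish example, with assumed priors and assumed one-dimensional Gaussian class-conditional densities (all numbers are illustrative, not taken from the text):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

# Assumed toy setup: two classes with Gaussian class-conditional densities
priors = np.array([0.6, 0.4])                    # P(c1), P(c2)
likelihoods = [norm(loc=30, scale=4),            # p(x|c1), e.g. herring length
               norm(loc=40, scale=5)]            # p(x|c2), e.g. mackerel length

def posterior(x):
    """Bayes formula: P(c_i|x) = p(x|c_i) P(c_i) / p(x)."""
    joint = np.array([l.pdf(x) * p for l, p in zip(likelihoods, priors)])
    return joint / joint.sum()

def decide(x):
    """Bayes decision rule minimizing the error probability."""
    return np.argmax(posterior(x)) + 1           # class index 1 or 2

print(posterior(34.0), decide(34.0))
</syntaxhighlight>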

The general Bayes decision rule

We would like to generalize the Bayes decision rule by introducing a loss function instead of the probability of error. The decision among the classes is performed by taking one of the actions <math>\alpha_i</math>, <math>i = 1, \ldots, A</math>. However, if the true class is <math>c_j</math>, then taking action <math>\alpha_i</math> incurs the loss <math>\lambda_{i,j}</math>. Hence the expected loss of taking action <math>\alpha_i</math> is given by

<math>R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{C} \lambda_{i,j}\, P(c_j|\mathbf{x}).</math>

In decision-theoretic terminology, <math>R(\alpha_i|\mathbf{x})</math> is called the conditional risk. Let <math>\alpha(\mathbf{x})</math> be the decision function, taking one of the actions <math>\alpha_1, \ldots, \alpha_A</math> for each <math>\mathbf{x}</math>. The overall risk is given by

<math>R = \int R(\alpha(\mathbf{x})|\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}.</math>

The goal is to find the decision function which minimizes the overall risk. To achieve this, the action which minimizes the conditional risk must be taken for each <math>\mathbf{x}</math>. In other words, take <math>\alpha_i</math> with <math>R(\alpha_i|\mathbf{x}) \leq R(\alpha_j|\mathbf{x})</math> for every <math>j \neq i</math>. This leads to the Bayes decision rule (also called the Bayes decision procedure):
To minimize the overall risk,

  1. Compute the conditional risk <math>R(\alpha_i|\mathbf{x}) = \sum_{j=1}^{C} \lambda_{i,j}\, P(c_j|\mathbf{x})</math>
    for <math>i = 1, \ldots, A</math> and
  2. Select the action <math>\alpha_i</math> for which <math>R(\alpha_i|\mathbf{x})</math> is minimal, in other words select <math>\alpha_i</math> with <math>R(\alpha_i|\mathbf{x}) \leq R(\alpha_j|\mathbf{x})</math> for every <math>j</math>.

The overall risk resulting from applying the Bayes decision rule is the Bayes risk, denoted by <math>R^*</math>, which is the lowest achievable risk.
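As an illustration, the following sketch computes the conditional risks for an assumed loss matrix <math>\lambda_{i,j}</math> (purely illustrative numbers) and given posteriors, and selects the risk-minimizing action:

<syntaxhighlight lang="python">
import numpy as np

# Assumed loss matrix: loss[i, j] = loss of taking action i when true class is j
loss = np.array([[0.0, 2.0],     # action alpha_1 (decide c1)
                 [1.0, 0.0]])    # action alpha_2 (decide c2)

def bayes_action(posteriors):
    """General Bayes decision rule: pick the action with minimal conditional risk
    R(alpha_i | x) = sum_j loss[i, j] * P(c_j | x)."""
    risks = loss @ posteriors
    return np.argmin(risks), risks

action, risks = bayes_action(np.array([0.3, 0.7]))
print(action, risks)             # the chosen action and both conditional risks
</syntaxhighlight>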

The general case with two classes

Let <math>\alpha_1</math> and <math>\alpha_2</math> be interpreted as deciding class <math>c_1</math> and <math>c_2</math>, respectively. This enables the simplified notation <math>\lambda_{i,j} = \lambda(\alpha_i|c_j)</math>, <math>i,j = 1,2</math>. The conditional risks for <math>\alpha_1</math> and <math>\alpha_2</math> can be given as

<math>\begin{aligned} R(\alpha_1|\mathbf{x}) &= \lambda_{1,1} P(c_1|\mathbf{x}) + \lambda_{1,2} P(c_2|\mathbf{x}), \\ R(\alpha_2|\mathbf{x}) &= \lambda_{2,1} P(c_1|\mathbf{x}) + \lambda_{2,2} P(c_2|\mathbf{x}). \end{aligned}</math>

The Bayes decision rule becomes

<math>\text{Decide } c_1 \text{ if } R(\alpha_1|\mathbf{x}) < R(\alpha_2|\mathbf{x}), \text{ otherwise } c_2.</math>

In terms of the a posteriori probabilities this becomes

<math>\text{Decide } c_1 \text{ if } (\lambda_{2,1} - \lambda_{1,1})\, P(c_1|\mathbf{x}) > (\lambda_{1,2} - \lambda_{2,2})\, P(c_2|\mathbf{x}), \text{ otherwise } c_2.</math>

Usually making an error should lead to a higher loss, which means that both <math>\lambda_{2,1} - \lambda_{1,1}</math> and <math>\lambda_{1,2} - \lambda_{2,2}</math> are positive.

The above inequality can be rearranged by applying the Bayes formula in it, which gives <math>(\lambda_{2,1} - \lambda_{1,1})\, p(\mathbf{x}|c_1)\, P(c_1) > (\lambda_{1,2} - \lambda_{2,2})\, p(\mathbf{x}|c_2)\, P(c_2)</math>. Hence the Bayes decision rule can alternatively be formulated as

<math>\text{Decide } c_1 \text{ if } \frac{p(\mathbf{x}|c_1)}{p(\mathbf{x}|c_2)} > \frac{\lambda_{1,2} - \lambda_{2,2}}{\lambda_{2,1} - \lambda_{1,1}} \cdot \frac{P(c_2)}{P(c_1)}, \text{ otherwise } c_2.</math>

The quantities <math>p(\mathbf{x}|c_i)</math> are likelihoods, which enables interpreting this Bayes decision rule as deciding <math>c_1</math> if the likelihood ratio exceeds a threshold that is independent of the observation <math>\mathbf{x}</math>.
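A small sketch of this likelihood-ratio formulation, again with assumed priors, losses and Gaussian likelihoods (illustrative numbers only), shows that the threshold does not depend on <math>\mathbf{x}</math>:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

# Assumed two-class setup with a general loss matrix
priors = np.array([0.6, 0.4])
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0]])
p1, p2 = norm(30, 4), norm(40, 5)        # p(x|c1), p(x|c2)

# x-independent threshold of the likelihood-ratio formulation
threshold = ((loss[0, 1] - loss[1, 1]) / (loss[1, 0] - loss[0, 0])
             * priors[1] / priors[0])

def decide(x):
    """Decide c1 if the likelihood ratio exceeds the fixed threshold."""
    return 1 if p1.pdf(x) / p2.pdf(x) > threshold else 2

print(threshold, decide(33.0), decide(38.0))
</syntaxhighlight>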

Criteria for specific Bayes decision rules - minimum error-rate classification

From now on, <math>\alpha_i</math> is interpreted as deciding class <math>c_i</math> and the simplified notation <math>\lambda_{i,j}</math> is used. Let us consider the loss given by the symmetrical or zero-one loss function:

<math>\lambda_{i,j} = \begin{cases} 0, & \text{if } i = j \\ 1, & \text{if } i \neq j \end{cases}, \quad i,j = 1, \ldots, C.</math>

This loss function assigns zero loss to a correct decision and a unit loss to any erroneous decision, i.e. all errors are equally punished. Thus this loss function can be seen as a function indicating the errors. The conditional risk is given by

<math>R(\alpha_i|\mathbf{x}) = \sum_{j \neq i} P(c_j|\mathbf{x}) = 1 - P(c_i|\mathbf{x}),</math>

which is exactly the probability of error, i.e. the error rate. Minimizing the probability of error is equivalent to maximizing the posterior probability. Therefore, the Bayes decision rule minimizing the conditional risk gives, in the two-class case, the decision rule

<math>\text{Decide } c_1 \text{ if } P(c_1|\mathbf{x}) > P(c_2|\mathbf{x}), \text{ otherwise } c_2.</math>

This is the Bayes decision rule minimizing the error rate, i.e. the error probability, which we already introduced before. Hence this logical decision rule corresponds to selecting the zero-one loss function indicating the errors.
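The reduction of the general rule to the maximum-posterior rule under the zero-one loss can be checked numerically; in the sketch below (with an assumed, illustrative posterior vector) the minimum-risk action and the maximum-posterior class coincide:

<syntaxhighlight lang="python">
import numpy as np

C = 3
zero_one = 1.0 - np.eye(C)                  # lambda[i, j] = 0 if i == j else 1
post = np.array([0.2, 0.5, 0.3])            # assumed posteriors P(c_j | x)

risks = zero_one @ post                     # R(alpha_i | x) = 1 - P(c_i | x)
print(np.allclose(risks, 1.0 - post))       # True
print(np.argmin(risks) == np.argmax(post))  # True: min risk == max posterior
</syntaxhighlight>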

Criteria for specific Bayes decision rules - minimax criterion

The principle of the minimax criterion will be shown for classification with two classes. We show that the overall risk is a linear function of the prior, i.e. it is linear in e.g. <math>P(c_1)</math>. Let <math>\mathcal{R}_1</math> and <math>\mathcal{R}_2</math> be the (unknown) regions where the classifier decides <math>c_1</math> and <math>c_2</math>, respectively. This enables expressing the overall risk in terms of the conditional risks as

<math>\begin{aligned} R &= \int_{\mathcal{R}_1} \left( \lambda_{1,1} P(c_1)\, p(\mathbf{x}|c_1) + \lambda_{1,2} P(c_2)\, p(\mathbf{x}|c_2)\right) d\mathbf{x} \\ &+ \int_{\mathcal{R}_2} \left( \lambda_{2,1} P(c_1)\, p(\mathbf{x}|c_1) + \lambda_{2,2} P(c_2)\, p(\mathbf{x}|c_2)\right) d\mathbf{x}. \end{aligned}</math>

Using the relations <math>P(c_2) = 1 - P(c_1)</math> and <math>\int_{\mathcal{R}_1} p(\mathbf{x}|c_i)\, d\mathbf{x} = 1 - \int_{\mathcal{R}_2} p(\mathbf{x}|c_i)\, d\mathbf{x}</math> for <math>i = 1,2</math>, the above expression of the overall risk can be rearranged as

<math>\begin{aligned} R &= \lambda_{1,1} P(c_1) \left(1- \int_{\mathcal{R}_2} p(\mathbf{x}|c_1)\, d\mathbf{x} \right) + \lambda_{1,2} \left(1-P(c_1)\right) \int_{\mathcal{R}_1} p(\mathbf{x}|c_2)\, d\mathbf{x} \\ &+ \lambda_{2,1} P(c_1) \int_{\mathcal{R}_2} p(\mathbf{x}|c_1)\, d\mathbf{x} + \lambda_{2,2} \left(1- P(c_1)\right) \left(1 - \int_{\mathcal{R}_1} p(\mathbf{x}|c_2)\, d\mathbf{x} \right) \\ &= \lambda_{2,2} + \left(\lambda_{1,2}-\lambda_{2,2}\right) \int_{\mathcal{R}_1} p(\mathbf{x}|c_2)\, d\mathbf{x} \\ &+ P(c_1) \bigg(\left(\lambda_{1,1}-\lambda_{2,2}\right) + \left(\lambda_{2,1}-\lambda_{1,1}\right) \int_{\mathcal{R}_2} p(\mathbf{x}|c_1)\, d\mathbf{x} - \left(\lambda_{1,2}-\lambda_{2,2}\right) \int_{\mathcal{R}_1} p(\mathbf{x}|c_2)\, d\mathbf{x}\bigg). \end{aligned}</math>

Therefore, the overall risk generally depends on the prior in a linear way. Sometimes it is necessary to have a classifier which performs well over a range of prior probabilities. This is the situation, for example, if the a priori probabilities of the types of fish arriving along a conveyor belt vary, or if we do not know the a priori probabilities at all. Such a classifier can be designed by setting the coefficient of <math>P(c_1)</math> in the above expression of the overall risk to 0, i.e. by setting

<math>\left(\lambda_{1,1}-\lambda_{2,2}\right) + \left(\lambda_{2,1}-\lambda_{1,1}\right) \int_{\mathcal{R}_2} p(\mathbf{x}|c_1)\, d\mathbf{x} - \left(\lambda_{1,2}-\lambda_{2,2}\right) \int_{\mathcal{R}_1} p(\mathbf{x}|c_2)\, d\mathbf{x} = 0,</math>
in which case the overall risk becomes independent of the prior <math>P(c_1)</math>. This setting is called the minimax solution and the resulting overall risk is the minimax risk. It can be seen that the resulting risk is the maximum among the prior-dependent Bayes risks (= minimum risks for a given prior), which explains the name minimax for this criterion.
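The linearity of the overall risk in the prior, for a fixed decision rule, can be seen in the following sketch with assumed one-dimensional Gaussian likelihoods, zero-one loss and a fixed decision threshold (all numbers are illustrative assumptions):

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

# Zero-one loss, 1-D Gaussian likelihoods, fixed decision threshold t:
# decide c1 if x < t. Then R = P(c1) * P(x >= t | c1) + (1 - P(c1)) * P(x < t | c2),
# which is linear in the prior P(c1) for a FIXED decision rule.
p1, p2 = norm(30, 4), norm(40, 5)
t = 35.0

def overall_risk(prior1):
    eps1 = 1 - p1.cdf(t)        # error mass of class 1 (decided as c2)
    eps2 = p2.cdf(t)            # error mass of class 2 (decided as c1)
    return prior1 * eps1 + (1 - prior1) * eps2

for p in (0.1, 0.5, 0.9):
    print(p, overall_risk(p))   # the risk varies linearly in the prior
</syntaxhighlight>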

Estimation of the posteriors

The design of a classifier based on the Bayes decision rule requires knowledge of the a posteriori probabilities <math>P(c_i|\mathbf{x})</math>, for <math>i = 1, \ldots, C</math>, which are determined by the priors <math>P(c_i)</math> and the likelihoods <math>p(\mathbf{x}|c_i)</math>. Usually the likelihoods are computed by applying parametric statistical estimation, in which the distribution family is assumed to be either unimodal (like e.g. the multivariate normal density) or multimodal.

Classification - linear discriminant functions

General discriminant functions

A general way of representing a classifier is to specify it by a set of discriminant functions <math>d_i(\mathbf{x})</math>, for <math>i = 1, \ldots, C</math>. The classifier assigns the class <math>c_i</math> to the input feature vector <math>\mathbf{x}</math> if

<math>d_i(\mathbf{x}) > d_j(\mathbf{x}) \text{ for all } j \neq i.</math>

The Bayes classifier can also be represented by means of discriminant functions. In the general case the discriminant functions can be chosen as

<math>d_i(\mathbf{x}) = -R(\alpha_i|\mathbf{x}).</math>

The maximal among the discriminant functions then determines the class with the minimum conditional risk. A still simpler discriminant function can be given for the minimum error-rate classifier as

<math>d_i(\mathbf{x}) = P(c_i|\mathbf{x}).</math>

This can be further simplified by utilizing that the denominator of the Bayes formula is independent of <math>i</math>, which leads to the alternative discriminant function

<math>d_i(\mathbf{x}) = p(\mathbf{x}|c_i)\, P(c_i).</math>

Any monotone increasing function can be applied to the discriminant functions, since it does not change which of them is maximal. Therefore, a further discriminant function can be defined for the minimum error-rate classifier with the help of the monotone function <math>\log</math> as

<math>d_i(\mathbf{x}) = \log p(\mathbf{x}|c_i) + \log P(c_i).</math>

In this case all the above different discriminant functions define the same decision rule. The discriminant functions divide the feature space into disjoint decision regions <math>\mathcal{R}_1, \ldots, \mathcal{R}_C</math>. A feature vector <math>\mathbf{x}</math> falls into the decision region <math>\mathcal{R}_i</math> if <math>d_i(\mathbf{x}) > d_j(\mathbf{x})</math> for every <math>j \neq i</math>. Thus decision region <math>\mathcal{R}_i</math> represents the set of feature vectors for which the classifier decides class <math>c_i</math>. The decision regions are separated by decision boundaries in the d-dimensional space of feature vectors.

In the special case of a classifier with two classes it is common to use a single discriminant function

<math>d(\mathbf{x}) = d_1(\mathbf{x}) - d_2(\mathbf{x})</math>

instead of <math>d_1(\mathbf{x})</math> and <math>d_2(\mathbf{x})</math>. Hence the decision rule becomes

<math>\text{Decide } c_1 \text{ if } d(\mathbf{x}) > 0, \text{ otherwise } c_2.</math>

Figure z.

Linear discriminant functions

In the case of discriminant functions for the Bayes classifier, the form of the underlying probability distribution must be assumed to be known. However, discriminant functions can also be designed without such an assumption. Hence, in a limited, statistical sense, the procedure of determining such discriminant functions can be seen as nonparametric. An important class of such discriminant functions are the linear discriminant functions.

Linear discriminant functions are linear in the components of <math>\mathbf{x}</math> and can be given as a linear combination of the components of <math>\mathbf{x}</math>, in other words:

<math>d(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0,</math>

where <math>\mathbf{w}</math> is the weight vector and <math>w_0</math> is the bias or threshold weight. The decision boundary becomes a decision surface.

For the case of a classifier with two classes there is only one discriminant function, <math>d(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0</math>. The class <math>c_1</math> is decided if <math>\mathbf{w}^T \mathbf{x}</math> exceeds the threshold <math>-w_0</math>, and the class <math>c_2</math> otherwise.
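A minimal sketch of such a two-class linear classifier, with assumed toy weights, is given below; the last line anticipates the signed-distance property derived further down:

<syntaxhighlight lang="python">
import numpy as np

# Assumed toy weights for a two-class linear classifier in 2-D
w = np.array([2.0, -1.0])    # weight vector (normal of the decision hyperplane)
w0 = -3.0                    # bias / threshold weight

def d(x):
    """Linear discriminant function d(x) = w^T x + w0."""
    return w @ x + w0

def decide(x):
    return 1 if d(x) > 0 else 2          # decide c1 if d(x) > 0

x = np.array([2.5, 1.0])
print(d(x), decide(x))
# Signed distance of x to the hyperplane d(x) = 0:
print(d(x) / np.linalg.norm(w))
</syntaxhighlight>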

The principle of the operation of the linear classifier with two classes and a d-dimensional feature vector is shown in Figure w1.

Figure w1.

The decision surface is a hyperplane. Let us consider two feature vectors <math>\mathbf{x}_1</math> and <math>\mathbf{x}_2</math>, both on the decision surface. Then <math>d(\mathbf{x}_1) = d(\mathbf{x}_2) = 0</math> holds, which means <math>\mathbf{w}^T \mathbf{x}_1 + w_0 = \mathbf{w}^T \mathbf{x}_2 + w_0</math>, from which

<math>\mathbf{w}^T (\mathbf{x}_1 - \mathbf{x}_2) = 0.</math>

It follows that <math>\mathbf{w}</math> is perpendicular to any vector lying in the decision surface, and hence the decision surface is a hyperplane with <math>\mathbf{w}</math> as its normal vector.

The discriminant function <math>d(\mathbf{x})</math> has a strong connection to the signed distance <math>r</math> of <math>\mathbf{x}</math> to the hyperplane. This can be seen by expressing <math>\mathbf{x}</math> as the sum of its projection <math>\mathbf{x}_p</math> onto the hyperplane and <math>r</math> times the unit vector in the direction of <math>\mathbf{w}</math>, i.e.

<math>\mathbf{x} = \mathbf{x}_p + r\, \frac{\mathbf{w}}{\lVert \mathbf{w} \rVert}.</math>

Using <math>d(\mathbf{x}_p) = 0</math>, for <math>d(\mathbf{x})</math> we get

<math>d(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 = r\, \lVert \mathbf{w} \rVert,</math>

from which

<math>r = \frac{d(\mathbf{x})}{\lVert \mathbf{w} \rVert}.</math>

The normal vector <math>\mathbf{w}</math> points into the decision region <math>\mathcal{R}_1</math>, since <math>d(\mathbf{x}) > 0</math>, and thus <math>\mathbf{x}</math> lies on the same side of the hyperplane as <math>\mathbf{w}</math>, if <math>\mathbf{x} \in \mathcal{R}_1</math>.

The distance of the origin to the hyperplane is <math>|w_0| / \lVert \mathbf{w} \rVert</math>. The origin falls into the decision region <math>\mathcal{R}_1</math> if <math>w_0 > 0</math>, and into <math>\mathcal{R}_2</math> if <math>w_0 < 0</math>. If <math>w_0 = 0</math>, then the hyperplane passes through the origin and the linear discriminant function has the homogeneous form <math>d(\mathbf{x}) = \mathbf{w}^T \mathbf{x}</math>.

The above properties are illustrated geometrically in Figure w2.

Figure w2

For the general case of a classifier with <math>C</math> classes, the linear discriminant functions are given as

<math>d_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i,0}, \quad i = 1, \ldots, C.</math>

The decision rule is the same as the one for the general discriminant functions

<math>\text{Decide } c_i \text{ if } d_i(\mathbf{x}) > d_j(\mathbf{x}) \text{ for all } j \neq i.</math>

In case the values of two or more discriminant functions are the same for some <math>\mathbf{x}</math>, which is called a tie (like in statistics), the decision is undefined. A classifier with linear discriminant functions is also called a linear machine. It divides the d-dimensional feature space into disjoint decision regions. For the boundary between two neighbouring decision regions <math>\mathcal{R}_i</math> and <math>\mathcal{R}_j</math>, <math>d_i(\mathbf{x}) = d_j(\mathbf{x})</math> holds, or equivalently

<math>(\mathbf{w}_i - \mathbf{w}_j)^T \mathbf{x} + (w_{i,0} - w_{j,0}) = 0.</math>

It follows from the considerations made for the two-class classifier that this boundary is a hyperplane with normal vector <math>\mathbf{w}_i - \mathbf{w}_j</math>. Furthermore, the signed distance of <math>\mathbf{x}</math> to this hyperplane is given as <math>(d_i(\mathbf{x}) - d_j(\mathbf{x})) / \lVert \mathbf{w}_i - \mathbf{w}_j \rVert</math>. These expressions show that in a linear machine the differences of the weights are important, not the weights themselves. The total number of hyperplanes can be less than the possible number of pairs. It can be shown that the decision regions of a linear machine are convex.
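The following sketch implements a linear machine with assumed toy weights for three classes in two dimensions (all numbers are illustrative):

<syntaxhighlight lang="python">
import numpy as np

# Assumed toy linear machine with C = 3 classes in 2-D:
W = np.array([[ 1.0,  0.5],
              [-0.5,  1.0],
              [ 0.0, -1.0]])          # row i = weight vector w_i
w0 = np.array([0.0, 0.2, -0.1])       # bias w_{i,0} per class

def decide(x):
    """Linear machine: pick the class with maximal d_i(x) = w_i^T x + w_{i,0}."""
    d = W @ x + w0
    return np.argmax(d) + 1           # class index in {1, 2, 3}

print(decide(np.array([0.3, 0.8])))
</syntaxhighlight>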

Generalized linear discriminant functions are linear in some given functions <math>\phi_m(\mathbf{x})</math> of <math>\mathbf{x}</math> and have the form

<math>d(\mathbf{x}) = \sum_{m=1}^{M} w_m\, \phi_m(\mathbf{x}) + w_0.</math>

Truncated series expansions of <math>d(\mathbf{x})</math> with suitable <math>\phi_m</math>-s lead to polynomial discriminant functions as a subclass of generalized linear discriminant functions. An example is the quadratic discriminant function, which can be given in the form

<math>d(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{i,j}\, x_i x_j.</math>

While the decision regions are convex in the space of the functions <math>\phi_m(\mathbf{x})</math>, in the original feature space <math>\mathbf{x}</math> they can have an arbitrary shape. Hence generalized linear discriminant functions can describe more general decision regions, which is their beneficial property and provides the motivation to use them.
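As an illustration, the sketch below builds a quadratic discriminant function as a generalized linear discriminant via a fixed feature map <math>\phi</math> (the weights are assumed, purely illustrative):

<syntaxhighlight lang="python">
import numpy as np

# Generalized linear discriminant via a fixed quadratic feature map phi(x):
# d(x) is linear in phi(x) but quadratic in x.
def phi(x):
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x1 * x2, x2 * x2])

w = np.array([1.0, -0.5, 0.8, 0.1, -0.3])   # assumed weights for mapped features
w0 = -0.2

def d(x):
    return w @ phi(x) + w0                  # linear in phi(x), not in x

print(d(np.array([0.5, 1.0])))
</syntaxhighlight>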

Finding linear discriminant functions

The parameters of a classifier with linear discriminant functions are the weights. Thus, training the parameters of the classifier means finding proper linear discriminant functions. This is done by formulating the task as an optimization problem. A natural choice for the optimization criterion is to minimize the training error. To perform such a minimization in a general setting, well-known numerical multivariate optimization algorithms, like e.g. gradient methods, are used and adapted to the specifics of finding linear discriminant functions.
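As one concrete instance of such a numerical procedure, the following sketch trains a two-class linear discriminant with a perceptron-style gradient rule on assumed toy data; this is one possible training scheme under these assumptions, not the only one:

<syntaxhighlight lang="python">
import numpy as np

# Assumed toy data: two roughly separable clusters; labels y in {-1, +1},
# decide c1 if w^T x + w0 > 0.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[0, 0], size=(50, 2)),
               rng.normal(loc=[3, 3], size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

w, w0, lr = np.zeros(2), 0.0, 0.1
for _ in range(100):                       # gradient steps on misclassified points
    for xi, yi in zip(X, y):
        if yi * (w @ xi + w0) <= 0:        # misclassified: move the hyperplane
            w += lr * yi * xi
            w0 += lr * yi

errors = np.sum(np.sign(X @ w + w0) != y)  # remaining training errors
print(w, w0, errors)
</syntaxhighlight>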

Learning from examples

Learning from examples is a fundamental approach in machine learning. The goal is to build up knowledge from examples by training a statistical model, which represents the knowledge also for unseen examples. In this way the statistical model represents a generalization of the knowledge extracted from the examples. In some contexts such learning is also called training.

The classical form of such learning assumes that the correct output for the input is provided in each example. It is usual to say that we have a labelled training set, which means that in each example the input is labelled with the correct output. This type of learning is called supervised learning.

Example - estimation of the transition probability matrix of a DTMC (using the method of Lagrange multipliers)

As an example of learning from examples, we provide the estimation of the transition probability matrix of a DTMC from observed state sequences.

The model

Let <math>\mathcal{S} = \{s_1, \ldots, s_{|\mathcal{S}|}\}</math> be the set of states of the DTMC. Furthermore, let <math>\overrightarrow{\mathbf{z}} = (z_0, z_1, \ldots, z_T)</math> denote the observed state sequence. The transition probability matrix of the DTMC is denoted by <math>\mathbf{A}</math>, i.e. <math>\mathbf{A}_{i,j} = P(z_t = s_j | z_{t-1} = s_i)</math> for <math>i,j = 1, \ldots, |\mathcal{S}|</math>. Observe that a given state sequence implicitly includes many labelled examples in the form <math>(z_{t-1}, z_t)</math>, where the input is <math>z_{t-1}</math> and the next state <math>z_t</math> gives the corresponding correct output.

Problem formulation

The task of estimating the parameter matrix <math>\mathbf{A}</math> can be formulated as the optimization problem

<math>\hat{\mathbf{A}} = \arg\max_{\mathbf{A}} P(\overrightarrow{\mathbf{z}} \,|\, \mathbf{A}),</math>

where <math>\hat{\mathbf{A}}</math> is the estimated parameter matrix. We estimate the parameter matrix by applying the Maximum-Likelihood statistical point estimation method. Therefore we maximize the log-likelihood of <math>\overrightarrow{\mathbf{z}}</math>, which has its maximum at the same point as the likelihood.

By omitting the condition on <math>\mathbf{A}</math> for simpler notation, the log-likelihood can be expressed as

<math>\begin{aligned} \log P(\overrightarrow{\mathbf{z}}) &= \log P(z_0, z_1,\ldots,z_T) = \log P(z_T|z_{T-1},\ldots, z_0)\, P(z_{T-1}|z_{T-2},\ldots, z_0) \cdots P(z_1|z_0)\, P(z_0) \\ &= \log P(z_T|z_{T-1})\, P(z_{T-1}|z_{T-2})\cdots P(z_1|z_0)\, P(z_0) = \log \prod_{t=1}^{T}P(z_{t}|z_{t-1})\, P(z_0) \\ &= \log \prod_{t=1}^{T}\mathbf{A}_{z_{t-1},z_{t}}\, P(z_0) = \sum_{t=1}^{T} \log \mathbf{A}_{z_{t-1},z_{t}} + \log P(z_0), \end{aligned}</math>

where we used the multiplication theorem in the first and the Markov property in the second equality.

This can be further rearranged by introducing the indicator variable <math>1_{z_{t-1} = s_i, z_t = s_j}</math>, which equals 1 if the transition from <math>s_i</math> to <math>s_j</math> occurs at time <math>t</math> and 0 otherwise, as

<math>\log P(\overrightarrow{\mathbf{z}}) = \sum_{i=1}^{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} \log \mathbf{A}_{i,j} + \log P(z_0).</math>

Omitting the term <math>\log P(z_0)</math> does not influence the place of the maximum of the log-likelihood, since it does not depend on the matrix <math>\mathbf{A}</math>. Furthermore, some constraints on the matrix <math>\mathbf{A}</math> must be taken into account in the optimization. Matrix <math>\mathbf{A}</math> is stochastic and hence
  1. <math>\mathbf{A}_{i,j} \geq 0</math>, i.e. the elements of matrix <math>\mathbf{A}</math> are non-negative,
  2. <math>\sum_{j=1}^{|\mathcal{S}|} \mathbf{A}_{i,j} = 1</math> for <math>i = 1, \ldots, |\mathcal{S}|</math>, i.e. the row sums of matrix <math>\mathbf{A}</math> are equal to 1.

We apply the method of Lagrange multipliers for the optimization. The form of the resulting solution will ensure the non-negativity of the estimated matrix <math>\hat{\mathbf{A}}</math>, so only the row-sum constraints need to be enforced explicitly. Therefore the task can be formulated as the following constrained optimization problem:

<math>\arg\max_{\mathbf{A}} \sum_{i=1}^{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} \log \mathbf{A}_{i,j}, \quad \text{subject to } \sum_{j=1}^{|\mathcal{S}|} \mathbf{A}_{i,j} = 1, \text{ for } i = 1, \ldots, |\mathcal{S}|.</math>

Solution by applying the method of Lagrange multipliers

Applying the method of Lagrange multipliers requires the introduction of <math>|\mathcal{S}|</math> Lagrange multipliers <math>\alpha_i</math>, for <math>i = 1, \ldots, |\mathcal{S}|</math>, which can be arranged in a vector <math>\boldsymbol{\alpha}</math>. Thus the Lagrange function can be given as

<math>\mathcal{L}(\mathbf{A}, \boldsymbol{\alpha}) = \sum_{i=1}^{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|}\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} \log \mathbf{A}_{i,j} + \sum_{i=1}^{|\mathcal{S}|} \alpha_i \left(1 - \sum_{j=1}^{|\mathcal{S}|} \mathbf{A}_{i,j} \right).</math>

Taking the first derivative of the Lagrange function with respect to <math>\mathbf{A}_{i,j}</math> and setting it equal to 0 yields

<math>\frac{\partial \mathcal{L}}{\partial \mathbf{A}_{i,j}} = \frac{\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j}}{\mathbf{A}_{i,j}} - \alpha_i = 0,</math>

from which <math>\mathbf{A}_{i,j}</math> can be expressed as

<math>\mathbf{A}_{i,j} = \frac{\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j}}{\alpha_i}.</math>

Now taking the first derivative of the Lagrange function with respect to <math>\alpha_i</math>, setting it equal to 0 and applying the above expression of <math>\mathbf{A}_{i,j}</math> yields

<math>1 - \sum_{j=1}^{|\mathcal{S}|} \frac{\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j}}{\alpha_i} = 0,</math>

from which we get the solution for <math>\alpha_i</math> as

<math>\alpha_i = \sum_{j=1}^{|\mathcal{S}|} \sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j} = \sum_{t=1}^{T} 1_{z_{t-1} = s_i}.</math>

Applying the final expression of <math>\alpha_i</math> in the expression of <math>\mathbf{A}_{i,j}</math>, we get for the estimate of <math>\mathbf{A}_{i,j}</math>

<math>\hat{\mathbf{A}}_{i,j} = \frac{\sum_{t=1}^{T} 1_{z_{t-1} = s_i, z_{t} = s_j}}{\sum_{t=1}^{T} 1_{z_{t-1} = s_i}}.</math>

This estimate of <math>\mathbf{A}_{i,j}</math> can be interpreted as the count of the transitions <math>s_i \to s_j</math> divided by the count of visits to <math>s_i</math>, i.e. the count of observing the DTMC in state <math>s_i</math>. This estimate fits the intuition.
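The closed-form estimate translates directly into code: count the transitions and normalize each row. A minimal sketch (with an assumed toy state sequence) follows:

<syntaxhighlight lang="python">
import numpy as np

def estimate_transition_matrix(z, n_states):
    """ML estimate of a DTMC transition matrix from a state sequence z
    (states encoded as 0..n_states-1): count transitions, normalize rows."""
    counts = np.zeros((n_states, n_states))
    for prev, nxt in zip(z[:-1], z[1:]):
        counts[prev, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for states that were never visited
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

z = [0, 1, 1, 2, 0, 1, 2, 2, 0, 1]        # assumed toy observed sequence
print(estimate_transition_matrix(z, 3))
</syntaxhighlight>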

Learning by control - MDP

Learning by means of control is an iterative process, in each step of which the actor reacts with an action to the environment's answer to its previous action. Thus the actor learns its behaviour stepwise. Such processes arise very often in nature, e.g. when controlling the movement of animals looking for food.

Such a learning process can be described by the mathematical model of the Markov decision process (MDP), which is a discrete-time stochastic control process. In each state of the MDP, the actor can take an action from the allowed set of actions at that state. This causes the MDP to move into a next state and to give a reward to the actor, corresponding to the action and the state transition, both at the next time step. In this way the MDP realizes action- and reward-feedback based learning. The working flow of the MDP model is shown in Figure x.

Figure x

The mathematical definition of the MDP is given as follows. An MDP is a 4-tuple <math>(\mathcal{S}, \mathcal{A}, P, \mathcal{R})</math>, where

  • <math>\mathcal{S}</math> is the state space, i.e. the set of states,
  • <math>\mathcal{A}</math> is the action space, i.e. the set of actions,
  • <math>P_a(s, s')</math>, in function form <math>P(s' | s, a)</math>, describes the state transition from <math>s</math> to <math>s'</math> while the action <math>a</math> is taken, and
  • <math>\mathcal{R}</math> is the set of possible rewards, and <math>R_a(s, s')</math> is the reward for the state transition from <math>s</math> to <math>s'</math> when the action <math>a</math> is taken.

An example diagram of an MDP is shown in Figure y.

Figure y

The MDP is an extension of the Discrete-Time Markov Chain (DTMC), adding actions (allowing selection opportunities) and rewards (providing motivating feedback) to it. This implies that an MDP with only one action and a common reward reduces to a DTMC. MDPs can be used for solving optimization tasks and are usually solved by dynamic programming.
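Since MDPs are usually solved by dynamic programming, the following sketch applies value iteration, a standard dynamic-programming method, to a small MDP with assumed toy transition probabilities and rewards (the discount factor is also an assumption):

<syntaxhighlight lang="python">
import numpy as np

# A minimal MDP sketch (assumed toy numbers): 2 states, 2 actions.
# P[a, s, s'] = transition probability, R[a, s, s'] = reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # action 0
              [[0.5, 0.5], [0.6, 0.4]]])    # action 1
R = np.array([[[1.0, 0.0], [0.0,  2.0]],
              [[0.5, 0.5], [1.0, -1.0]]])
gamma = 0.9                                  # assumed discount factor

# Value iteration to find the optimal state values
V = np.zeros(2)
for _ in range(200):
    Q = (P * (R + gamma * V)).sum(axis=2)    # Q[a, s] = expected return
    V = Q.max(axis=0)

print(V, Q.argmax(axis=0))                   # optimal values and greedy policy
</syntaxhighlight>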

CI

Computational Intelligence (CI) is a subarea of AI covering biologically motivated and hence computationally intensive methodologies and approaches.

Definition of CI

There is no commonly accepted unique definition of CI. The first definition originates from 1994 by Bezdek (see in ) and defines CI as having all of the following characteristics:

  • deals with numerical data,
  • has a pattern-recognition component instead of represented knowledge,
  • has a computationally adaptive and fault-tolerant methodology and
  • approximates human-like performance in terms of speed and error rate.

CI can also be defined by means of its major characteristics:

  • a sub-field of AI,
  • a set of nature-inspired approaches,
  • addresses complex real-world problems, which may contain uncertainties or have a stochastic nature.

Positioning of CI relative to AI, and of its history relative to the history of AI

CI and AI both have the long-term goal of reaching general intelligence. However, AI is based on hard computing techniques following binary logic with two values, 0 or 1. This can be interpreted as an element either being included in a set or not. In contrast, CI is based on soft computing techniques following fuzzy logic, which enables real values between 0 and 1, interpreted as the degree of membership of an element in a set. This is a clear difference between AI and CI. On the other hand, CI can be seen as a sub-area of AI in a broader sense.

While the history of AI began already in the 1950s, the term CI was first used in 1990 by the IEEE Neural Networks Council, which dealt with the development of biological and artificial neural networks. In 2001 the Council became the IEEE Computational Intelligence Society, and several years later new areas such as fuzzy systems and evolutionary computation were included in the area of interest of the society.

Main components of CI

The major components of CI are:

  1. Artificial Neural Networks, based on the biological ones, which
    • enable processing of and learning from experiential data and
    • provide fault tolerance to some extent.
  2. Fuzzy logic, one of the major disciplines of CI, which
    • enables soft computing techniques for modelling real-life complex processes by applying probabilistic methods and
    • can be applied for approximate reasoning, but not for learning.
  3. Evolutionary computation (EC), providing algorithms (Evolutionary Algorithms - EA) for global optimization, which
    • are inspired by biological evolution and
    • are usually based on a population of candidate solutions.

Evolutionary Algorithms (EA)

In this subsection we give brief mathematical descriptions of the two most widely known and applied EA algorithms:

  • Particle Swarm Optimisations (PSO) and
  • Genetic Algorithms.

Particle Swarm Optimisation (PSO)

Genetic Algorithms

Application areas of AI

Applications of KGs

In this subsection we discuss the applications of KGs.

(1) Information retrieval (IR) systems include tasks like image retrieval, music retrieval, web search, domain-specific retrieval (e.g. geographic, legal, ...), cross-lingual retrieval and text summarization (mainly as extracted parts of the original text).

(2) Semantic search is realized by integrating KGs into the results of the search engine. This leads to an improved "Big Data" search engine. It extends the conventional search engine with the following new features:
  • offering more relevant information,
  • identifying and disambiguating entities in text,
  • providing links to related entities (exploratory search).
Examples include the Google search engine, implemented via integrating Google's KG, and Bing, the Microsoft search engine, realized via integrating Microsoft's KB Satori.

(3) Question Answering Systems (QA) incorporate semantic information from KGs in order to enhance search results for semantic questions, thereby providing semantically aware question answering services. Examples include Watson, IBM's QA system, which uses among others YAGO, DBpedia and Freebase, and digital/virtual assistants like Siri, Cortana and Google Now. QA systems can be classified into extractive or generative QA systems:
  • In extractive QA the answer is extracted from the context (BERT-like models).
  • In generative QA the model generates free text based on the context, for which it uses text generation models.
QA systems can also be classified as open or closed QA systems:
  • In open QA the answer is taken from the context.
  • In closed QA no context is provided and thus the answer is completely generated.
Fast QA over many documents works by first inserting a ranking step using passage ranking models. There are QA datasets provided for academic benchmark tests. The most used QA dataset for academic benchmarks of extractive QA systems is the Stanford Question Answering Dataset (SQuAD), consisting of 100,000+ QA pairs on 500+ articles.

(4) Recommendation/recommender systems give personalized recommendations based on the user's behaviour, common preferences and interests. Based on analysing user clicks, content platforms can suggest further content for users to view or read. In such applications KGs help to improve accuracy, increase the diversity of recommended items and bring interpretability to recommendations. Typical examples of such KG applications are content platforms like Netflix, Search Engine Optimization (SEO), which is a subfield of search engine marketing, and social media.

(5) Natural Language Processing (NLP) addresses processing and understanding text and spoken speech, mainly by applying machine learning techniques. Information extraction (IE) is an NLP technique targeting the extraction of structured information from typically unstructured text. More precisely, IE lies at the intersection of IR and NLP. There exists a mutually beneficial relationship between KGs and NLP. On the one hand, KGs are used to realize NLP techniques, like e.g. text summarization. On the other hand, NLP techniques, like (named) entity recognition and relation extraction, are used to create KGs.

(6) Enterprise knowledge management is about applying KGs in the industrial sector in order to utilize the benefits provided by KGs. These benefits include:
  • applying big data in order to create new business value,
  • creating business viewpoints on different granularity levels,
  • providing new insights through hierarchical inclusion of company-relevant data and
  • providing advanced accessibility of company information to employees.

(7) Domain-specific applications enable creating added value by applying KGs in different fields.
  • Biomedical KG applications enable question answering and decision support in the life sciences by integrating multiple sources of biomedical information.
  • Medical applications utilize KGs for integrating textual medical knowledge. One such example is retrieving specific information via inference.
  • In cybersecurity applications, KGs are used to store modelled patterns in large datasets of transactions and organisations. This serves as a base for building applications like detecting and predicting attacks, identifying potential frauds (various kinds of white-collar crime) and identifying security threats via suspicious transactions, abnormal user behaviour or fake accounts.
  • In e-commerce applications, KGs are used to store semantic information about the interactions between customers and products, and their characteristics. These can be used for a variety of tasks, including finding similar or compatible products, identifying product classes, retrieving relevant products for a given search term and retrieving similar search terms.
  • In financial applications, KGs are created by identifying named entities in company news and by extracting business links between relevant stocks. Such KGs are used for tasks like predicting a stock's price movement.
  • In the field of news, KGs are used to implement fake news detection.
  • In the field of education, KG nodes are used to store instructional concepts, which are used in tasks like learning resource recommendation and concept visualisation.
  • In geoscience applications, KGs store textual geoscience data used for information extraction and knowledge discovery services.

Applications of ML

ML is a technology that has many applications, both commonly known ones and in a wide variety of fields, including finance, healthcare, agriculture and many more.

In the following we give brief descriptions of several commonly known applications.

Speech recognition is the process of converting spoken speech into written text. It is also called speech-to-text (STT) or automatic speech recognition (ASR). Speech recognition is used e.g. in speech-automated call centers, voice dialling (also called speech-enabled dialling) or in Apple's speech assistant Siri.

Speaker recognition (also called voice recognition) is the task of identifying the speaker from a short (usually several seconds long) speech sample.

Speech synthesis covers the task of converting written text into human speech. It is also called text-to-speech (TTS) and is the reverse task of speech recognition.

Language translation systems, like e.g. Google Translate, can translate written text in one language into the corresponding text in another language. They are based on Natural Language Processing (NLP) techniques including POS tagging and Named Entity Recognition (NER).

Image recognition is the task of identifying some pattern in an image or video. It can be used, among others, for face detection, face recognition or as a basis for further analysis.

Traffic pattern prediction generates predictions about the upcoming traffic based on the current traffic, which is used e.g. to identify and suggest the fastest route to the required destination.

Data mining has the goal of extracting information from a (usually large) data set and converting it into a structure appropriate for further use. Data mining is applied e.g. by secret services or for outlier detection in statistics.

E-commerce product recommendation is a marketing application generating product recommendations based on collected data on past customer behaviour and past shopping.

E-mail spam detection (email filtering) works by means of spam filters using ML algorithms to classify incoming e-mails as spam or not.

Malware detection is a pattern recognition system trained on features extracted by analysing suspicious activities.

Computer vision uses ML techniques to process, analyse, visualize or interpret high-dimensional real-life data.

Transportation services (e.g. Uber) use ML to analyse traffic conditions in order to compute the Estimated Time of Arrival (ETA) to the required destination.

ML applied to business problems is also called predictive analytics. Here we give short interpretations of several ML application examples in finance.

  • Stock Market and Day Trading. In this application ML is trained to predict the evolution of prices in the stock market.
  • Focused Account Holder Targeting. Here ML algorithms classify account holders into segments with predefined balances and loans.
  • Fraud Detection (in banking transactions) is implemented by an ML algorithm giving a score for each transaction, which represents the probability of it being a fraud. The training of the algorithm is based on patterns of unusual behaviour identified from large amounts of transactional data. This is the most important use of ML in the banking and finance area.
  • Loan Eligibility Prediction is realized by different ML classifiers (like e.g. Random Forest) to judge a customer's eligibility for being granted a loan.

Next, several applications of ML in healthcare are briefly considered.

  • In Personalized Treatment/Medication, ML is used to identify patient gene patterns (response markers) that could enable targeted therapies.
  • In Genetics and Genomics, ML is used to identify gene sequences in genome sequencing, in gene modification and in genetic research in general.
  • In Cancer Prognosis and Prediction (especially with ANNs), ML is used to construct prediction models to support decisions on therapy and to predict its evolution.
  • In Drug Development, the process of drug discovery can be sped up by applying ML techniques.

Artificial Neural Networks (ANN) combine ML with the NN model. The large majority of applications have benefited from the usage of ANNs instead of pure ML algorithms. Such applications include image recognition, medical diagnosis, speech recognition, machine translation, computer vision, cancer prognosis and prediction, social network filtering, and playing board and video games.

Embedded Machine Learning is a subfield of ML in which ML is applied to embedded systems with limited resources. Such embedded systems include microcontrollers, wearable computers and edge devices. Using ML methods in embedded systems makes transferring and storing data on cloud servers unnecessary. Embedded Machine Learning techniques include, among others, approximate computing and hardware acceleration.

For more details on applications of ML see e.g. the Wikipedia site .

Further application areas of AI

Further application areas of AI include

  1. Educational sector, e.g. by creating automated messages to students, or by designing content based on the user's interest (Smart Content Creation),
  2. Robotics, e.g. by making real-time decisions possible based on NLP, Object Recognition and Human-Robot Interaction (HRI),
  3. Navigation, e.g. by computing the best route based on GPS and AI technologies,
  4. Healthcare, e.g. by patient monitoring or surgical assistance,
  5. Automobiles, e.g. by Advanced Driving Assistance Systems (ADAS) or Autonomous Driving,
  6. Agriculture, e.g. by crop monitoring, supply chain maintenance or weather forecasting,
  7. Human Resources, e.g. by screening,
  8. Lifestyle, e.g. by personalized recommendations or virtual assistance,
  9. Gaming, e.g. by animation,
  10. Astronomy, e.g. by analysis of astronomical data and detection of e.g. exoplanets,
  11. Travel and Transport, e.g. enabling truck platooning,
  12. Military, e.g. by detecting cyberattacks and decision support, e.g. for resource allocation.

The feedback of recent NN developments into older AI concepts has resulted in modern AI model combinations. Perhaps the two most important ones are Neural Language Models and Large Language Models.

Neural Language Models (NLM) are LMs based on recurrent NNs. They are also called continuous space language models. They transform the LM representation of words into a continuous low-dimensional vector space, called the embedding space. This has the advantage of locating semantically similar units closer to each other in the embedding space. Such units can be parts of a corpus, sentences, sub-words or characters. Furthermore, NLMs avoid the data sparsity problem of SLMs, since they represent words as non-linear combinations of weights in a neural net. NLMs can be categorized as either non-contextual embeddings or contextual embeddings. Non-contextual embeddings apply the same embedding to a given semantic unit regardless of the context. Examples of non-contextual embeddings include Word2Vec or Wikipedia2Vec. In contrast, contextual embeddings can represent different semantics of the semantic units in different contexts. Examples of contextual embeddings include Bidirectional Encoder Representations from Transformers (BERT), introduced by Google, or Sentence-BERT (SBERT), a fine-tuned version of BERT. The transformer model is a specific NN architecture introduced by Vaswani et al.

Large Language Models (LLM) are transformer-NN-based LMs, which are pre-trained using self-supervised learning, learn billions of parameters during training and need large computational resources for training and during operation. They can achieve general-purpose language understanding and generate answers in the form of human-like generated text. These LLMs are used in generative AI systems. Recent versions can complete specific tasks via prompt engineering, i.e. by providing a short structured description of the task that can be interpreted by the LLM. The most prominent examples are OpenAI's GPT-3.5 and GPT-4 models (used in ChatGPT) and Google's PaLM (used in Bard).

Ethics in AI

Ethics is, in general, a set of moral rules and guidelines that help people decide between right and wrong. It can be expected that AI will have a considerable impact on all of humanity and the world in the near future. Therefore, it is important to pay significant attention to ethical questions in connection with AI.

In recent years, organisations have deployed various big data algorithms and AI solutions, mostly to improve their business results through automation and data-driven decision making. In doing so, some companies have been confronted with unexpected negative consequences of their AI applications, in particular due to unfair results and the use of biased datasets. This has led leading companies as well as research and data science communities in the AI field to engage thoroughly with the ethical aspects of AI. A lack of adequate rules in this area can lead to reputational damage, high fines, and regulatory and legal problems.

The ethics of AI (AI ethics for short) is a subfield of applied ethics that deals, among others, with the following questions:

  • the intended role of AI systems and the ethical aspects that enable their use according to their roles,
  • ethical rules and guidelines for people who design, build, test, certify and use AI systems,
  • the desired ethical behaviour of AI systems (machine ethics).

As a guideline for ethics in experimental research and the development of algorithms, the Belmont Report () is widely used in the academic community. The three essential principles of the Belmont Report are:

  1. Respect for persons
  2. Beneficence
  3. Justice (regarding fairness and equality)

Although a large number of ethical elements, principles and guidelines for AI have been proposed, there are currently no unified guidelines for AI ethics. However, there is a certain consensus on integrating the following two elements into AI ethics:

  • Governance - to ensure compliance with legal regulations in cooperation with government agencies.
  • Explainability - to explain how AI systems work (transparency) in order to create trust in AI systems.

There are several organisations whose goal is to promote AI ethics. These are the following:

  • CHAI: The Center for Human-Compatible Artificial Intelligence () is an active cooperation of various universities and institutes that promotes trustworthy AI and provably beneficial systems.
  • DARPA: The Defense Advanced Research Projects Agency of the US Department of Defense () promotes research on explainable AI.
  • NASCAI: The National Security Commission on Artificial Intelligence () is a US commission "that reviews the methods and means necessary to advance the development of artificial intelligence, machine learning and associated technologies in order to comprehensively address the national security and defense needs of the United States."
  • AlgorithmWatch: A non-profit organisation () that aims at the use of explainable and traceable decision processes and algorithms.
  • AI Now Institute: A non-profit organisation at New York University () that studies the social implications of AI.

According to Luciano Floridi and Josh Cowls, there is broad agreement that it is possible to build the ethical principles for a societally beneficial use of AI on the four principles of medical ethics plus an additional fifth principle, explicability:

"1. Beneficence: AI technology should benefit humanity, promote well-being, preserve human dignity and serve the preservation of the planet.
2. Non-maleficence: Negative consequences of excessive or abusive use of AI must be avoided; AI should only be used within a safe framework.
3. Autonomy: Humans must have full freedom of decision about whether and which tasks they delegate to AI systems; their freedom to set their own norms and standards must be preserved.
4. Justice: Preserving solidarity, avoiding injustice, and ensuring equal access to the benefits of AI.
5. Explicability (intelligibility): Principles 1 to 4 can only be realized if laypersons can also understand what benefit or harm AI brings to society and who is responsible for which effects."

There are currently a number of ethical debates concerning AI systems. Such debates involving AI ethics arise in several areas. Autonomous driving is a prime example. Automated systems are expected to achieve at least a reduction of harm compared to human driving performance. Since current AI systems rarely, but still occasionally, produce faulty reactions, it is at least questionable whether autonomous vehicles can fulfil this expectation. Moreover, it is also problematic how an autonomous vehicle should react in non-standardizable dilemma decisions, e.g. life against life in the case of suddenly occurring dangers. Similar debates also arise in the areas of autonomous weapon systems and machine ethics. In general, the following sources triggering such debates can be identified:

  1. AI systems may make mistakes - due to limitations of data collection and documentation as well as of the algorithmic rules, and due to the incompleteness of the corpora used in decision making (providing all the information channels in the corpora that would be needed for the AI to reach the same decision a human would make is not possible).
  2. Concerns regarding the design goals of the system, e.g. creating better business results instead of serving the public interest.

Another debate is formed by the considerations about the technological singularity, i.e. the point when AI surpasses human intelligence. Although this superintelligence is not imminent, it is associated with fear of danger to humanity. However, this fear is at least partially unfounded, since it is triggered by the fact that the functioning of AI systems is unknown to most people, i.e. by the lack of transparency.
  2. Bedenken hinsichtlich der Designziele des Systems, z.B. bessere Geschäftsergebnis zu schaffen statt öffentliches Interesse zu folgen. Eine andere Debatte bilden die Überlegungen über die technologische Singularität, i.e. wann die KI das Übertreffen die menschliche Intelligenz erreicht. Obwohl diese Superintelligenz steht nicht unmittelbar bevor, ist Sie mit Angst vor Gefahr auf die Menschheit verbunden. Allerdings ist diese Angst zumindest Teilweise unbegründet, da diese dadurch ausgelöst, dass die Funktionsweise der KI Systeme für meistens die Menschen nicht bekannt ist, also durch die fehlende Transparenz.