Scikit-learn's workflow: several advantages of an integrated implementation

Introduction to scikit-learn

Scikit-learn is Python's most popular machine learning library. It has the following attractive features:

Simple, efficient and exceptionally rich data mining/data analysis algorithms;

Built on NumPy, SciPy, and matplotlib, so the whole process, from exploratory data analysis and data visualization to algorithm implementation, is integrated;

The advantages of this integrated implementation are especially prominent when we want to compare and evaluate the performance of various algorithms.

Since the scikit-learn module is so important, let's skip the small talk and get started!

Project organization and file loading

Project organization

Working path: `D:\my_python_workfile\Thesis\sklearn_exercise`

    |--data: used to store data
        |--20news-bydate: practice data set
            |--20news-bydate-train: training set
            |--20news-bydate-test: test set

File loading

Suppose the data we need to load is organized as follows:

    container_folder/
        category_1_folder/
            file_1.txt
            file_2.txt
            ...
            file_42.txt
        category_2_folder/
            file_43.txt
            file_44.txt
            ...

You can use the following functions to load data:

`sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)`

Parameter explanation:

`container_path`: the path of `container_folder`;

`load_content = True`: Whether to load the contents of the file into memory;

`encoding = None`: encoding method. The current text file is generally encoded as "utf-8". If you do not specify the encoding method (encoding=None), the file contents will be processed in bytes instead of unicode.
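The difference between bytes and decoded text can be seen with plain Python, independent of scikit-learn. A minimal sketch, in which the sample string is made up for illustration:

```python
# Raw file content as it sits on disk: a sequence of bytes (the sample text is made up)
raw = "Subject: caf\u00e9 prices".encode("utf-8")

print(type(raw))    # bytes: what you work with when no encoding is specified
print(len(raw))     # counts bytes, so the accented letter contributes 2

text = raw.decode("utf-8")
print(type(text))   # str: what you work with once the encoding is specified
print(len(text))    # counts characters, so the accented letter contributes 1
```

Byte strings and text strings cannot be mixed freely in Python 3, which is why it matters whether the loader hands back one or the other.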

Return value: a `Bunch`, a dictionary-like object. The main attributes are:

`data`: raw data;

`filenames`: the name of each file;

`target`: category label (integer index starting from 0);

`target_names`: the specific meaning of each category label (determined by the subfolder names `category_1_folder`, etc.).
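To make the mapping from folders to labels concrete, here is a simplified pure-Python sketch of a folder-per-category loader. The function `load_text_folders` and the file contents are hypothetical; this is not scikit-learn's implementation, which, among other things, returns a `Bunch` and shuffles the samples by default (`shuffle=True` in the signature above).

```python
import os
import tempfile

def load_text_folders(container_path):
    """Hypothetical, simplified stand-in for a folder-per-category loader."""
    # Each subfolder name becomes one category; its sorted position is the integer label
    target_names = sorted(
        d for d in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, d))
    )
    data, filenames, target = [], [], []
    for label, name in enumerate(target_names):
        folder = os.path.join(container_path, name)
        for fname in sorted(os.listdir(folder)):
            path = os.path.join(folder, fname)
            with open(path, "rb") as fh:      # read bytes, as with encoding=None
                data.append(fh.read())
            filenames.append(path)
            target.append(label)
    return {"data": data, "filenames": filenames,
            "target": target, "target_names": target_names}

# Build a tiny container folder to try it on (file names and contents are made up)
root = tempfile.mkdtemp()
layout = {"category_1_folder": ["file_1.txt", "file_2.txt"],
          "category_2_folder": ["file_43.txt"]}
for category, files in layout.items():
    os.makedirs(os.path.join(root, category))
    for fname in files:
        with open(os.path.join(root, category, fname), "w") as fh:
            fh.write("text of " + fname)

bunch = load_text_folders(root)
print(bunch["target_names"])
print(bunch["target"])
```

Looking up `bunch["target_names"][bunch["target"][i]]` recovers the category name of sample `i`, which is the same lookup used on the real data further down.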

Next, we run a demonstration using the test data set [The 20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/). First download the data set from the Internet, and then load the data locally.

```python
# Load libraries
import os
import sys

# Configure a UTF-8 output environment (Python 2 only)
# reload(sys)
# sys.setdefaultencoding("utf-8")

# Set the current working path
os.chdir("D:\\my_python_workfile\\Thesis\\sklearn_exercise")

# Load data
from sklearn import datasets
twenty_train = datasets.load_files("data/20news-bydate/20news-bydate-train")
twenty_test = datasets.load_files("data/20news-bydate/20news-bydate-test")
```

```python
len(twenty_train.target_names), len(twenty_train.data), len(twenty_train.filenames), len(twenty_test.data)
```

(20, 11314, 11314, 7532)

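```python
# The files were loaded as bytes (encoding=None), so decode before splitting into lines
first_doc = twenty_train.data[0].decode("utf-8", "ignore")
print("\n".join(first_doc.split("\n")[:3]))
```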

From: keley.edu ( )

Subject: Re: Cubs behind Marlins? How?

Article-ID: agate.1pt592$f9a

```python
print(twenty_train.target_names[twenty_train.target[0]])
```

rec.sport.baseball

```python
twenty_train.target[:10]
```

array([ 9, 4, 11, 4, 0, 4, 5, 5, 13, 12])

It can be seen that the file has been successfully loaded.

Of course, for getting started we can also use the "toy example" data sets that `scikit-learn` provides for testing and experimenting. Below, I will show how to load this data set through scikit-learn's own loader.

```python
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
```

Text feature extraction

Text is unstructured data, and it generally has to be converted into structured data before machine learning algorithms can be applied to classify it.

A common practice is to convert the text into a "document-term matrix", whose elements can be word frequencies, TF-IDF values, and so on.
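As an illustration of what a document-term matrix holds, the sketch below builds one by hand for a made-up three-document corpus. Tokenizing on whitespace is a deliberate simplification; real vectorizers also handle casing, punctuation, and stop words.

```python
from collections import Counter

docs = ["the cat sat", "the cat ate the fish", "dogs bark"]   # made-up corpus

# Naive whitespace tokenization
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct term, mapped to a column index
terms = sorted({t for tokens in tokenized for t in tokens})
vocabulary = {term: col for col, term in enumerate(terms)}

# One row per document, one column per term, each cell a raw count
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts.get(term, 0) for term in terms])

print(terms)
for row in matrix:
    print(row)
```

Most cells are zero even in this tiny example, which is why such matrices are normally stored in a sparse format.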

Calculating word frequency

```python
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words="english", decode_error='ignore')
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape
```

(11314, 129783)

Feature extraction using TF-IDF

```python
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape
```

(11314, 129783)

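The program above represents the text in two steps: first the `fit()` method fits the model to the data, then the `transform()` method re-expresses the word frequency matrix in weighted form. Note that with `use_idf = False`, as above, the result contains normalized term frequencies only; the IDF weighting is applied when `use_idf` keeps its default value of `True`.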

Both steps can also be combined into one call, as shown below.

```python
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
```

(11314, 129783)
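To show what the `fit()` and `transform()` steps compute, here is a pure-Python sketch on a made-up count matrix. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which should correspond to the default settings of `TfidfTransformer`; the function names are hypothetical.

```python
import math

# Made-up count matrix: 3 documents (rows) x 3 terms (columns)
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]

def fit_idf(counts):
    """'fit' step: learn one IDF weight per term from the corpus."""
    n_docs = len(counts)
    idf = []
    for j in range(len(counts[0])):
        df = sum(1 for row in counts if row[j] > 0)          # document frequency
        idf.append(math.log((1 + n_docs) / (1 + df)) + 1)    # smoothed IDF
    return idf

def transform(counts, idf):
    """'transform' step: weight counts by IDF, then L2-normalize each row."""
    out = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        out.append([v / norm for v in weighted])
    return out

idf = fit_idf(counts)
tfidf = transform(counts, idf)
print(idf)
for row in tfidf:
    print(row)
```

The term in column 0 occurs in every document and therefore gets the smallest weight, while the rarer terms in columns 1 and 2 are weighted up. Because the weights are learned in `fit_idf`, the same `idf` list can later be applied to new documents, which is what calling `transform()` on unseen data does.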

Classifier training

```python
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
```

```python
# Predict new samples
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print("%r => %s" % (doc, twenty_train.target_names[category]))
```

'God is love' => soc.religion.christian

'OpenGL on the GPU is fast' => comp.graphics

Classification effect evaluation

Building a pipeline

```python
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words="english", decode_error='ignore')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
```
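A pipeline chains the vectorizer, the transformer, and the classifier so that a single `fit()` or `predict()` call runs all steps in order. The toy sketch below illustrates that chaining idea only; `Lowercase`, `KeywordClassifier`, and `MiniPipeline` are hypothetical classes written for this example and are not part of scikit-learn.

```python
class Lowercase:
    """Toy transformer: has nothing to learn, just lowercases each document."""
    def fit_transform(self, docs, y=None):
        return self.transform(docs)

    def transform(self, docs):
        return [d.lower() for d in docs]


class KeywordClassifier:
    """Toy final estimator: remembers which label each training word came with."""
    def fit(self, docs, y):
        self.word_label = {w: label for d, label in zip(docs, y) for w in d.split()}
        return self

    def predict(self, docs):
        out = []
        for d in docs:
            votes = [self.word_label[w] for w in d.split() if w in self.word_label]
            out.append(max(set(votes), key=votes.count) if votes else None)
        return out


class MiniPipeline:
    """Chains transformers and a final estimator, in the spirit of a pipeline."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)     # fit each transformer, pass its output on
        self.steps[-1][1].fit(X, y)          # fit the final estimator on the result
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)            # apply the same transformations as in fit
        return self.steps[-1][1].predict(X)


pipe = MiniPipeline([("lower", Lowercase()), ("clf", KeywordClassifier())])
pipe.fit(["God is love", "OpenGL on the GPU"], ["religion", "graphics"])
result = pipe.predict(["LOVE", "gpu opengl"])
print(result)
```

The practical benefit is that new documents pass through exactly the same preprocessing as the training data, without repeating the intermediate calls by hand.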

Test set classification accuracy

```python
import numpy as np
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)
```

0.81691449814126393

With the Naive Bayes classifier, the classification accuracy on the test set is 81.7%, which is not bad!
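The expression `np.mean(predicted == twenty_test.target)` works because comparing two arrays element by element yields booleans, and the mean of booleans is the fraction that are `True`. The same idea in plain Python, with made-up labels:

```python
predicted_labels = [9, 4, 11, 4, 0]   # made-up predictions
true_labels = [9, 4, 11, 5, 0]        # made-up ground truth

# True counts as 1 and False as 0, so the sum is the number of correct predictions
matches = [p == t for p, t in zip(predicted_labels, true_labels)]
accuracy = sum(matches) / len(matches)
print(accuracy)
```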

Next, let's use a linear-kernel support vector machine and see how it performs.

```python
from sklearn.linear_model import SGDClassifier
text_clf_2 = Pipeline([
    ('vect', CountVectorizer(stop_words='english', decode_error='ignore')),
    ('tfidf', TfidfTransformer()),
    # max_iter replaces the n_iter argument of older scikit-learn releases
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                          max_iter=5, tol=None, random_state=42)),
])
_ = text_clf_2.fit(twenty_train.data, twenty_train.target)
predicted = text_clf_2.predict(docs_test)
np.mean(predicted == twenty_test.target)
```

0.82355284121083383

The support vector machine's classification accuracy is slightly higher, at 82.4%.

`scikit-learn` also provides more detailed evaluation metrics, such as the precision, recall, and F1 score of each category.

Below, let's see how these more detailed metrics look.
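`SGDClassifier` with `loss='hinge'` corresponds to a linear SVM because the hinge loss is zero for samples classified correctly with a margin of at least 1 and grows linearly otherwise. The sketch below shows the loss and one stochastic subgradient step with an L2 penalty. The sample, the learning rate `lr`, and the helper names are made up, and the update rule is a textbook simplification rather than scikit-learn's actual learning-rate schedule.

```python
def hinge_loss(w, x, y):
    """Hinge loss for one sample; y is +1 or -1, w and x are equal-length lists."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - y * score)

def sgd_step(w, x, y, alpha=1e-3, lr=0.1):
    """One subgradient step on hinge loss + (alpha / 2) * ||w||^2."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    # The penalty term always shrinks w; the data term only acts inside the margin
    grad = [alpha * wi for wi in w]
    if y * score < 1:
        grad = [g - y * xi for g, xi in zip(grad, x)]
    return [wi - lr * g for wi, g in zip(w, grad)]

w = [0.0, 0.0]
x, y = [1.0, 2.0], 1                 # made-up sample with label +1

loss_before = hinge_loss(w, x, y)
w = sgd_step(w, x, y)
loss_after = hinge_loss(w, x, y)
print(loss_before, loss_after)
```

The loss on the sample drops after the step, and once a sample sits outside the margin only the penalty term keeps acting on the weights.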

```python
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))
```

                             precision    recall  f1-score   support

                 alt.atheism      0.71      0.71      0.71       319
               comp.graphics      0.81      0.69      0.74       389
     comp.os.ms-windows.misc      0.72      0.79      0.75       394
    comp.sys.ibm.pc.hardware      0.73      0.66      0.69       392
       comp.sys.mac.hardware      0.82      0.83      0.82       385
              comp.windows.x      0.86      0.77      0.81       395
                misc.forsale      0.80      0.87      0.84       390
                   rec.autos      0.91      0.90      0.90       396
             rec.motorcycles      0.93      0.97      0.95       398
          rec.sport.baseball      0.88      0.91      0.90       397
            rec.sport.hockey      0.87      0.98      0.92       399
                   sci.crypt      0.85      0.96      0.90       396
             sci.electronics      0.80      0.62      0.70       393
                     sci.med      0.90      0.87      0.88       396
                   sci.space      0.84      0.96      0.90       394
      soc.religion.christian      0.75      0.93      0.83       398
          talk.politics.guns      0.70      0.93      0.80       364
       talk.politics.mideast      0.92      0.92      0.92       376
          talk.politics.misc      0.89      0.56      0.69       310
          talk.religion.misc      0.81      0.39      0.53       251

                 avg / total      0.83      0.82      0.82      7532

Overall, both precision and recall on the test set are good, although recall is low for a few categories such as `talk.religion.misc` (0.39).
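The per-class figures in such a report can all be derived from a table that counts how often each true class was predicted as each class. A minimal sketch with made-up labels for three classes:

```python
labels = ["a", "b", "c"]                                  # made-up class names
y_true = ["a", "a", "a", "b", "b", "c", "c", "c"]
y_pred = ["a", "a", "b", "b", "c", "c", "c", "a"]

# counts[t][p] = number of samples whose true class is t and predicted class is p
counts = {t: {p: 0 for p in labels} for t in labels}
for t, p in zip(y_true, y_pred):
    counts[t][p] += 1

report = {}
for c in labels:
    tp = counts[c][c]                                     # correctly predicted as c
    predicted_c = sum(counts[t][c] for t in labels)       # everything predicted as c
    actual_c = sum(counts[c].values())                    # everything truly c (support)
    precision = tp / predicted_c if predicted_c else 0.0
    recall = tp / actual_c if actual_c else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    report[c] = (precision, recall, f1, actual_c)

for c, (p, r, f, s) in report.items():
    print("%s  precision=%.2f  recall=%.2f  f1=%.2f  support=%d" % (c, p, r, f, s))
```

Precision answers "of everything predicted as this class, how much was right", while recall answers "of everything that truly belongs to this class, how much was found". A class can score well on one and poorly on the other.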

Let's take a look at the results of the "confusion matrix".

```python
metrics.confusion_matrix(twenty_test.target, predicted)
```

Use grid search for parameter optimization

When classifying text with these classifiers, some parameters have to be specified, for example `use_idf` in `TfidfTransformer()`, the smoothing parameter `alpha` in `MultinomialNB()`, and the penalty coefficient `alpha` in `SGDClassifier()`. These parameters should not simply be set on a hunch, because different settings can lead to different results.
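As one illustration of how much a single parameter can matter, the sketch below applies additive smoothing, the kind of smoothing controlled by `alpha` in multinomial Naive Bayes, to a word that never occurred in a class. The counts and the helper name `smoothed_prob` are made up for this example.

```python
def smoothed_prob(count, total, vocab_size, alpha):
    """Additive smoothing of a word probability within one class."""
    return (count + alpha) / (total + alpha * vocab_size)

# Made-up numbers: an unseen word, a class with 1000 tokens, a 5000-word vocabulary
estimates = {}
for alpha in (1.0, 1e-2, 1e-3):
    estimates[alpha] = smoothed_prob(0, 1000, 5000, alpha)
    print(alpha, estimates[alpha])
```

The estimated probability of the unseen word changes by orders of magnitude across these values of `alpha`, which in turn shifts how strongly such a word counts against a class.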

So that we do not end up tuning parameters by hand forever, let's look at how to use the brute-force "grid search" algorithm to let the computer optimize the parameters for us.

```python
# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search in old releases)
from sklearn.model_selection import GridSearchCV
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}
```
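A grid search tries every combination of the listed values. The sketch below enumerates the candidates for the same parameter grid with `itertools.product`; the number of cross-validation folds (`n_folds = 3`) is an assumption made for the arithmetic, not a value taken from the text.

```python
from itertools import product

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

# Cartesian product of all value lists: one dict per candidate setting
names = sorted(parameters)
combinations = [dict(zip(names, values))
                for values in product(*(parameters[n] for n in names))]

print(len(combinations))              # 2 * 2 * 2 candidate settings
for combo in combinations[:2]:
    print(combo)

n_folds = 3                           # assumed number of cross-validation folds
total_fits = len(combinations) * n_folds
print(total_fits)                     # number of models that have to be trained
```

Each additional parameter multiplies the number of candidates, so the cost grows quickly as the grid gets finer.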

Exhausting all the combinations of parameters can take a long time. Those with deep pockets may wonder: can I trade money for time?

The answer is yes. If you have an 8-core computer, use all the cores!

```python
gs_clf = GridSearchCV(text_clf_2, parameters, n_jobs=-1)
```

```python
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)
```

Set `n_jobs = -1` and the computer will automatically detect and use all your cores for parallel computing.

```python
# best_params_ and best_score_ replace the grid_scores_ attribute of older releases
best_parameters, score = gs_clf.best_params_, gs_clf.best_score_
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
```

clf__alpha: 0.01

tfidf__use_idf: True

vect__ngram_range: (1, 1)

```python
score
```

0.90516174650875025
