Scikit-learn is Python's most popular machine learning library. It has the following attractive features:
Simple, efficient, and exceptionally rich algorithms for data mining and data analysis;
Built on NumPy, SciPy, and matplotlib, it integrates the whole workflow, from exploratory data analysis and data visualization to algorithm implementation;
This integration is especially valuable when we want to compare and evaluate the performance of different algorithms.
Since the scikit-learn module is so important, let's skip the small talk and get started!
Project organization and file loading
Project organization
Working path: `D:\my_python_workfile\Thesis\sklearn_exercise`
|--data: used to store data
    |--20news-bydate: practice data set
        |--20news-bydate-train: training set
        |--20news-bydate-test: test set
File loading
Suppose the data we need to load is organized as follows:
container_folder/
    category_1_folder/
        file_1.txt
        file_2.txt
        ...
        file_42.txt
    category_2_folder/
        file_43.txt
        file_44.txt
        ...
You can use the following function to load the data:
sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
Parameter explanation:
`container_path`: the path to `container_folder`;
`load_content = True`: whether to load the contents of the files into memory;
`encoding = None`: the encoding. Text files today are generally encoded as "utf-8". If no encoding is specified (`encoding=None`), the file contents are handled as bytes rather than Unicode.
Return value: a `Bunch`, a dictionary-like object. Its main attributes are:
`data`: the raw data;
`filenames`: the name of each file;
`target`: the category labels (integer indices starting from 0);
`target_names`: the meaning of each category label (determined by the subfolder names `category_1_folder`, etc.).
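The layout and return value described above can be tried out on a throwaway directory. The following is a minimal sketch, assuming scikit-learn is installed; the folder names, file names, and file contents are made up for illustration, and `shuffle=False` is passed only to keep the order predictable.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Build a tiny container folder: one subfolder per category (hypothetical names).
container = tempfile.mkdtemp()
samples = {
    "category_1_folder": {"file_1.txt": "first document", "file_2.txt": "second document"},
    "category_2_folder": {"file_43.txt": "third document"},
}
for category, files in samples.items():
    os.makedirs(os.path.join(container, category))
    for name, text in files.items():
        with open(os.path.join(container, category, name), "w", encoding="utf-8") as fh:
            fh.write(text)

# encoding="utf-8" decodes the contents to str; with encoding=None they stay bytes.
bunch = load_files(container, encoding="utf-8", shuffle=False)
raw = load_files(container, shuffle=False)

print(bunch.target_names)                              # category names taken from the subfolders
print(bunch.target)                                    # integer label for each file
print([os.path.basename(f) for f in bunch.filenames])  # file behind each entry of data
print(type(bunch.data[0]), type(raw.data[0]))          # str versus bytes
```

With the default `shuffle=True`, the same files are returned in a shuffled order, and `data`, `filenames`, and `target` stay aligned with each other.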
Next, we demonstrate this with the test data set [The 20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/). First download the data set from the Internet, then load it locally.
```python
# Load libraries
import os
import sys

# Configure a UTF-8 output environment (Python 2 only)
# reload(sys)
# sys.setdefaultencoding("utf-8")

# Set the current working path
os.chdir("D:\\my_python_workfile\\Thesis\\sklearn_exercise")

# Load data
from sklearn import datasets
twenty_train = datasets.load_files("data/20news-bydate/20news-bydate-train")
twenty_test = datasets.load_files("data/20news-bydate/20news-bydate-test")
```
```python
len(twenty_train.target_names), len(twenty_train.data), len(twenty_train.filenames), len(twenty_test.data)
```
(20, 11314, 11314, 7532)
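```python
# Print the first three lines of the first document.
# The files were loaded with encoding=None, so each document is bytes and is decoded first.
print("\n".join(twenty_train.data[0].decode("latin-1").split("\n")[:3]))
```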
From: keley.edu ( )
Subject: Re: Cubs behind Marlins? How?
Article-ID: agate.1pt592$f9a
```python
print(twenty_train.target_names[twenty_train.target[0]])
```
rec.sport.baseball
```python
twenty_train.target[:10]
```
array([ 9, 4, 11, 4, 0, 4, 5, 5, 13, 12])
It can be seen that the file has been successfully loaded.
Of course, for getting started we can also practice with the example datasets that ship with `scikit-learn`. Below, I show how to load one of these built-in datasets.
```python
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
```
Text feature extraction
Text is unstructured data, so it is generally converted into structured data before machine learning algorithms can be applied to classify it.
A common practice is to convert the text into a "document-term matrix", whose elements can be word frequencies, TF-IDF values, and so on.
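To make the idea of a document-term matrix concrete, here is a small sketch on a two-sentence toy corpus (the sentences are made up for illustration). Each row is a document, each column is a term, and each cell holds the word frequency.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]  # toy corpus, made up for illustration

vect = CountVectorizer()
dtm = vect.fit_transform(docs)  # sparse matrix: one row per document, one column per term

# Column order follows the indices stored in vocabulary_.
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(terms)
print(dtm.toarray())
```

The matrix is stored in sparse form because, on a real corpus, most documents contain only a tiny fraction of the vocabulary.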
Calculating word frequency
```python
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words="english", decode_error='ignore')
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape
```
(11314, 129783)
Feature extraction using TF-IDF
```python
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape
```
(11314, 129783)
The program above represents the text in two steps: first the `fit()` method fits the transformer to the data; then the `transform()` method re-expresses the word frequency matrix as normalized term frequencies (with `use_idf=False`, the IDF weighting is switched off).
Both steps can also be done at once, this time with IDF weighting enabled, as shown below.
```python
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
```
(11314, 129783)
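To see what the TF-IDF step computes, the sketch below reproduces it by hand on a tiny count matrix (made-up numbers). It assumes the default settings of `TfidfTransformer` as I understand them: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy word frequency matrix: 2 documents x 5 terms (made up for illustration).
counts = np.array([[1, 0, 0, 1, 1],
                   [1, 1, 1, 1, 2]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed inverse document frequency
tfidf_manual = counts * idf                   # weight each count by its term's IDF
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)  # L2-normalize rows

tfidf_sklearn = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Terms that occur in every document get the smallest IDF, so they contribute less to the final vector than terms that are specific to a few documents.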
Classifier training
```python
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
```
```python
# Predict new samples
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print("%r => %s" % (doc, twenty_train.target_names[category]))
```
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
Classification effect evaluation
Building a pipeline
```python
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words="english", decode_error='ignore')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
```
Test set classification accuracy
```python
import numpy as np
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)
```
0.81691449814126393
With the Naive Bayes classifier, the classification accuracy on the test set is 81.7%, which is not bad!
Next, let's see how a linear support vector machine performs.
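The accuracy above comes from `np.mean(predicted == twenty_test.target)`. The following sketch shows why that expression is the fraction of correct predictions, using a handful of made-up labels.

```python
import numpy as np

y_true = np.array([0, 1, 2, 2, 1])   # made-up true labels
y_pred = np.array([0, 1, 1, 2, 1])   # made-up predictions

matches = (y_pred == y_true)          # element-wise comparison gives a boolean array
accuracy = np.mean(matches)           # True counts as 1 and False as 0
print(matches, accuracy)
```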
```python
from sklearn.linear_model import SGDClassifier
text_clf_2 = Pipeline([
    ('vect', CountVectorizer(stop_words='english', decode_error='ignore')),
    ('tfidf', TfidfTransformer()),
    # max_iter=5, tol=None replaces the older n_iter=5, which newer scikit-learn releases no longer accept
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42)),
])
_ = text_clf_2.fit(twenty_train.data, twenty_train.target)
predicted = text_clf_2.predict(docs_test)
np.mean(predicted == twenty_test.target)
```
0.82355284121083383
The support vector machine improves the classification accuracy slightly (82.4% versus 81.7%).
`scikit-learn` also provides more detailed evaluation metrics, such as the precision, recall, and F1 score of each category.
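As a reminder of what these metrics mean, here is a small sketch that computes precision, recall, and the F1 score for one category by hand. The labels are made up for illustration, and category `1` is treated as the positive class.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # made-up true labels
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 1])  # made-up predictions

positive = 1
tp = np.sum((y_pred == positive) & (y_true == positive))  # correctly predicted positives
fp = np.sum((y_pred == positive) & (y_true != positive))  # predicted positive, actually not
fn = np.sum((y_pred != positive) & (y_true == positive))  # positives that were missed

precision = tp / (tp + fp)   # how many predicted positives are right
recall = tp / (tp + fn)      # how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall, f1)
```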
Below, let's see how the classifier performs on these more detailed metrics.
```python
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))
```
                          precision    recall  f1-score   support
alt.atheism                    0.71      0.71      0.71       319
comp.graphics                  0.81      0.69      0.74       389
comp.os.ms-windows.misc        0.72      0.79      0.75       394
comp.sys.ibm.pc.hardware       0.73      0.66      0.69       392
comp.sys.mac.hardware          0.82      0.83      0.82       385
comp.windows.x                 0.86      0.77      0.81       395
misc.forsale                   0.80      0.87      0.84       390
rec.autos                      0.91      0.90      0.90       396
rec.motorcycles                0.93      0.97      0.95       398
rec.sport.baseball             0.88      0.91      0.90       397
rec.sport.hockey               0.87      0.98      0.92       399
sci.crypt                      0.85      0.96      0.90       396
sci.electronics                0.80      0.62      0.70       393
sci.med                        0.90      0.87      0.88       396
sci.space                      0.84      0.96      0.90       394
soc.religion.christian         0.75      0.93      0.83       398
talk.politics.guns             0.70      0.93      0.80       364
talk.politics.mideast          0.92      0.92      0.92       376
talk.politics.misc             0.89      0.56      0.69       310
talk.religion.misc             0.81      0.39      0.53       251
avg / total                    0.83      0.82      0.82      7532
Both precision and recall on the test set are good.
Let's take a look at the results of the "confusion matrix".
```python
metrics.confusion_matrix(twenty_test.target, predicted)
```
Use grid search for parameter optimization
When classifying text, some parameters of the classifier and the feature extractors have to be specified, for example `use_idf` in `TfidfTransformer()`, the smoothing parameter `alpha` in `MultinomialNB()`, and the penalty coefficient `alpha` in `SGDClassifier()`. These parameters cannot simply be chosen off the top of one's head, because different settings can lead to different results.
So that we don't degenerate into full-time parameter tweakers, let's see how to use a brute-force "grid search" to let the computer optimize the parameters for us.
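```python
# GridSearchCV lives in sklearn.model_selection since scikit-learn 0.18 (formerly sklearn.grid_search)
from sklearn.model_selection import GridSearchCV
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}
```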
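To get a feel for how many candidates such a grid produces, the sketch below enumerates every combination with `itertools.product`. This only illustrates the counting and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

names = sorted(parameters)
# Cartesian product of all value lists: one dict per candidate setting.
combinations = [dict(zip(names, values))
                for values in product(*(parameters[name] for name in names))]

print(len(combinations))
for combo in combinations:
    print(combo)
```

Two values for each of three parameters already give 8 candidates, and each candidate is fitted once per cross-validation fold, so the cost grows quickly as the grid gets larger.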
Exhausting every combination of parameters can take a long time. Those with deep pockets may wonder: can I trade money for time?
The answer is yes. If you have an 8-core computer, use all the cores!
```python
gs_clf = GridSearchCV(text_clf_2, parameters, n_jobs=-1)
```
```python
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)
```
With `n_jobs = -1`, the computer automatically detects and uses all of your cores for parallel computing.
```python
# best_params_ and best_score_ hold the best combination and its mean cross-validation score
best_parameters, score = gs_clf.best_params_, gs_clf.best_score_
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
```
clf__alpha: 0.01
tfidf__use_idf: True
vect__ngram_range: (1, 1)
```python
score
```
0.90516174650875025
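For readers who want to try the search without downloading the full data set, here is a self-contained sketch on a tiny made-up corpus. It assumes scikit-learn 0.19 or later (for `sklearn.model_selection` and the `max_iter` argument); the documents, labels, `cv=2`, and `n_jobs=1` are choices made only to keep the example small and quick.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration: label 0 = religion, label 1 = graphics.
docs = [
    "god is love", "faith and prayer in church",
    "the bible and the church", "prayer and faith in god",
    "opengl on the gpu is fast", "rendering graphics on the gpu",
    "fast image rendering with opengl", "graphics card and image shaders",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", max_iter=5, tol=None, random_state=42)),
])
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

# n_jobs=1 keeps the toy run simple; n_jobs=-1 would use all cores.
search = GridSearchCV(pipeline, parameters, cv=2, n_jobs=1)
search.fit(docs, labels)

for name in sorted(parameters):
    print("%s: %r" % (name, search.best_params_[name]))
print(search.best_score_)   # mean cross-validation accuracy of the best combination
```

On a corpus this small the scores are not meaningful; the point is only to show the workflow from pipeline to best parameters end to end.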