
I have noticed that the Microsoft.ML.Legacy.LearningPipeline Row count is always 10 in the SentimentAnalysis sample project, no matter how much data is in the test or training files.

https://github.com/dotnet/samples/blob/master/machine-learning/tutorials/SentimentAnalysis.sln

Can anyone explain the significance of the number 10 here?

        // LearningPipeline allows you to add steps in order to keep everything together 
        // during the learning process.  
        // <Snippet5>
        var pipeline = new LearningPipeline();
        // </Snippet5>

        // The TextLoader loads a dataset with comments and corresponding positive or negative sentiment. 
        // When you create a loader, you specify the schema by passing a class to the loader containing
        // all the column names and their types. This is used to create the model, and train it. 
        // <Snippet6>
        pipeline.Add(new TextLoader(_dataPath).CreateFrom<SentimentData>());
        // </Snippet6>

        // TextFeaturizer is a transform that is used to featurize an input column. 
        // This is used to format and clean the data.
        // <Snippet7>
        pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
        //</Snippet7>

        // Adds a FastTreeBinaryClassifier, the decision tree learner for this project, and 
        // three hyperparameters to be used for tuning decision tree performance.
        // <Snippet8>
        pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 50, NumTrees = 50, MinDocumentsInLeafs = 20 });
        // </Snippet8>



1 Answer


The debugger is showing only a preview of the data: the first 10 rows. The goal is to show a few example rows and how each transform operates on them, to make debugging easier.

Reading in the entire training dataset and running all the transformations on it is expensive, so it only happens when you reach .Train(). Because the preview transformations operate on only a few rows, their effect might differ when applied to the entire dataset (e.g. the text dictionary will likely be bigger), but hopefully the preview shown before running the full training process is helpful for debugging and for making sure transforms are applied to the correct columns.
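To make the lazy-evaluation point concrete, here is a sketch of the step from the same sample where the full dataset actually gets read (using the legacy Microsoft.ML v0.x LearningPipeline API; the SentimentPrediction type name is taken from the tutorial and assumed here):

```csharp
// Up to this point, pipeline.Add(...) calls only *declare* the work.
// The debugger's Row view on each step is a 10-row preview of that
// transform's output, regardless of the size of the input file.

// Only this call reads the entire training file and runs every
// transform over all rows to produce a trained model:
var model = pipeline.Train<SentimentData, SentimentPrediction>();
```

So the "10" is purely a debugger preview size, not a property of your data or of the trained model.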

If you have any ideas on how to make this clearer or more useful, it would be great if you can create an issue on GitHub!

answered 2018-10-21T17:42:50.430