Modeling(Training)
Handcraft Approach:
At the start, data size is small, we should start with simple methods or rule based system.
- start with heuristics method
I.e. by trial & error OR loosely defined rules
- E.g. An explicit blacklist in email spam task.
- E.g.2 In E-commerce, sorting & recommend by the top number of purchased items(or its category)
- Another example is regular expression
- lib:
Stanford NLP’s TokensRegex
,spaCy’s rule based matches
- lib:
When data grows, switch to/combine with ML or DL model.
There are 2 popular ways:
- Create features from heuristics, to train/inference model
- Use heuristics in particular case and bypass model (I.e. the heuristics decide it during pre-processing)
Using NLP service providers as approach:
- Google Cloud Natural Language
- Amazon Comprehend
- Microsoft Azure Cognitive Services
- IBM Watson Natural Language Understanding
These can also be used as a reference to see if your model performs well enough. If not, then you can use the NLP service directly.
Model ensembling & stacking
- Common way is to NOT do everything by 1 model. Instead, use multiple model to do each specific task.
- Ensembling: model in parallel
- Stacking: model in series
Apply Heuristic after model inference
- it is a good practice to apply rule based / heuristic test after model final output, as a safe test to ensure the model does not make huge mistake