The Add/Edit Text Transform dialog is launched from the Edit Build Text Node dialog.
The default values for the transform are illustrated in this graphic:
Source Column is the name of the column to be transformed.
Transform Type is either Token (the default) or Theme.
Output Column is the name of the new column; the default name is the source column name with either TOK (for Token) or THM (for Theme) appended, depending on the Transform Type. To specify the output column name yourself, deselect Automatic and type in a name.
Settings specify characteristics of the text and the transform:
Language: the language used by the document being indexed. By default a single language is specified, and that language is English, but you can select a different one. You can also specify multiple languages, for example French and English. To specify Single Byte languages (Arabic, Turkish, Thai, and European languages), select them from the Single Byte list. To specify Multi Byte languages (Chinese (Simplified or Traditional), Japanese, or Korean), select them from the Multi Byte list.
Stoplist: Oracle Text provides default stoplists for several single languages. If there is a Default stoplist, it is selected. For several languages, the default is No stop list.
You can select any stoplist that was previously created for this attribute from the pull down list.
You can edit the selected stoplist by clicking
You can add a stoplist by clicking
Either of these selections launches the Stoplist Editor.
If you selected Token, the default for the maximum number of tokens per document is 50 and the default for the maximum number of tokens across all documents is 3000. You can change these values.
The per-document and across-all-documents cutoffs apply to rankings, not to absolute counts of tokens. You could have more than 3000 tokens across all documents if there were ties.
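The effect of a rank-based cutoff can be sketched as follows. This is only an illustration, not Oracle Data Miner's implementation: the function name `tokens_within_rank`, the sample counts, and the rule that tied tokens share the cutoff rank are all assumptions made for the example.

```python
from collections import Counter


def tokens_within_rank(counts, max_rank):
    """Keep every token whose count ties or beats the count at position max_rank.

    Hypothetical helper: tokens tied at the cutoff are all kept, so the
    result can contain more than max_rank tokens.
    """
    if max_rank <= 0 or not counts:
        return []
    # Sort by descending count, then alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if len(ordered) <= max_rank:
        return [token for token, _ in ordered]
    cutoff = ordered[max_rank - 1][1]
    return [token for token, count in ordered if count >= cutoff]


# Made-up token counts; "theme" and "token" are tied at the cutoff.
counts = Counter({"oracle": 5, "text": 4, "token": 2, "theme": 2, "stoplist": 1})
kept = tokens_within_rank(counts, 3)
print(kept)
```

With a limit of 3, four tokens are kept here because two of them tie at the third position, which is the same reason a limit of 3000 can yield more than 3000 tokens.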
If you selected Theme, the default for the maximum number of themes per document is 50 and the default for the maximum number of themes across all documents is 3000.
The per-document and across-all-documents cutoffs apply to rankings, not to absolute counts of themes. You could have more than 3000 themes across all documents if there were ties.
Theme includes a Theme Type specification. The default is Single; you can select Full.
Frequency: Term Frequency is the default. You can select Term Frequency IDF. Note that Frequency is a sticky setting; if you change it, the changed value becomes the default.
Term Frequency uses the term frequency in the document itself. It does not take collection information into account.
TF-IDF (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the collection.
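The difference between the two Frequency settings can be illustrated with a small sketch. It uses one common textbook formulation, raw term count multiplied by log(N / document frequency); the exact weighting that Oracle Data Miner applies is not stated here, so the formula, the function name `tf_idf`, and the sample documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed formulation: weight = tf * log(N / df), where tf is the raw
    count of the term in the document, N is the number of documents, and
    df is the number of documents that contain the term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)  # plain Term Frequency for this document
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in term_freq.items()}
        )
    return weights


# Made-up, already tokenized documents.
docs = [
    ["oracle", "text", "text"],
    ["oracle", "miner"],
    ["oracle", "theme"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this example "oracle" appears in every document, so its TF-IDF weight is zero, while "text" appears twice in one document and nowhere else, so it receives the largest weight. Plain Term Frequency would instead score "oracle" as 1 in each document regardless of how common it is in the collection.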