Title: | Word Factor Vectors |
---|---|
Description: | A user-friendly factor-like interface for converting strings of text into numeric vectors and rectangular data structures. |
Authors: | Michael W. Kearney [aut, cre], Lingshu Hu [ctb] |
Maintainer: | Michael W. Kearney <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.1 |
Built: | 2024-11-20 03:04:14 UTC |
Source: | https://github.com/mkearney/wactor |
Convert data into object of type 'wactor'
as_wactor(.x, ...)
.x | Input text vector |
... | Other args passed to Wactr$new(...) |
An object of type wactor
Converts character vector into document term matrix (dtm)
dtm(object, .x = NULL)
object | Input object containing dictionary (column), e.g., wactor |
.x | Text from which the document term matrix will be created |
A c-style matrix
## create wactor
w <- wactor(letters)

## use wactor to create dtm of same vector
dtm(w, letters)

## using the initial data is the default, so you don't actually have to
## respecify it
dtm(w)

## use wactor to create dtm on new vector
dtm(w, c("a", "e", "i", "o", "u"))

## apply directly to character vector
dtm(letters)
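To make the idea of a document term matrix concrete, here is a minimal base R sketch. It is not wactor's implementation: `simple_dtm()` and its naive tokenizer are hypothetical, introduced only to show how a fixed vocabulary yields the same columns for any input text, with out-of-vocabulary terms dropped.

```r
## Hypothetical helper (not part of wactor): count how often each
## vocabulary term appears in each document.
simple_dtm <- function(docs, vocab) {
  ## naive tokenizer (an assumption): lowercase, split on non-letters
  tokens <- strsplit(tolower(docs), "[^a-z]+")
  ## one column of counts per document; factor() fixes the term order
  ## and turns terms outside the vocabulary into NA, which table() skips
  m <- t(vapply(
    tokens,
    function(tok) as.numeric(table(factor(tok, levels = vocab))),
    numeric(length(vocab))
  ))
  colnames(m) <- vocab
  m
}

vocab <- c("a", "b", "c")
m <- simple_dtm(c("a a b", "c", "d"), vocab)
m
```

Each row is a document and each column a vocabulary term. The third document contains only "d", which is not in the vocabulary, so its row is all zeros.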
Randomly partition input into a list of train and test data sets
split_test_train(.data, .p = 0.8, ...)
.data | Input data. If atomic (numeric, integer, character, etc.), the input is first converted to a data frame with a column named "x". |
.p | Proportion of data that should be used for the train data set. |
... | Optional. The response (outcome) variable. Uses tidy evaluation (quotes are not necessary). This is only relevant if the identified variable is categorical (i.e., character, factor, or logical), in which case it is used to ensure a uniform distribution for the train data set. |
A list with train and test tibbles (data.frames)
## example data frame
d <- data.frame(
  x = rnorm(100),
  y = rnorm(100),
  z = c(rep("a", 80), rep("b", 20))
)

## split using defaults
split_test_train(d)

## split 0.60/0.40
split_test_train(d, 0.60)

## split with equal response level obs
split_test_train(d, 0.80, label = z)

## apply to atomic data
split_test_train(letters)
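The following base R sketch shows one plausible reading of a split "with equal response level obs": every level of the response contributes the same number of rows to the training set. The helper `balanced_split()` and the rule for choosing that number are assumptions made for illustration, and the package's actual sampling logic may differ.

```r
## Hypothetical helper (not part of wactor): draw the same number of
## training rows from every level of a categorical label.
balanced_split <- function(data, label, p = 0.8) {
  idx_by_level <- split(seq_len(nrow(data)), data[[label]])
  ## assumption: the per-level count is bounded by the rarest level
  n_each <- floor(min(lengths(idx_by_level)) * p)
  train_idx <- unlist(lapply(idx_by_level, function(i) {
    i[sample.int(length(i), size = n_each)]
  }), use.names = FALSE)
  ## everything not sampled for training goes to the test set
  test_idx <- setdiff(seq_len(nrow(data)), train_idx)
  list(
    train = data[train_idx, , drop = FALSE],
    test = data[test_idx, , drop = FALSE]
  )
}

set.seed(1)
d <- data.frame(x = rnorm(100), z = c(rep("a", 80), rep("b", 20)))
s <- balanced_split(d, "z", 0.8)
table(s$train$z)
```

Under this reading the training set is balanced across levels, while the test set keeps the remaining rows and is therefore still skewed toward the majority level.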
Converts character vector into a term frequency inverse document frequency (TFIDF) matrix
tfidf(object, .x = NULL)
object | Input object containing dictionary (column), e.g., wactor |
.x | Text from which the tfidf matrix will be created |
A c-style matrix
## create wactor
w <- wactor(letters)

## use wactor to create tfidf of same vector
tfidf(w, letters)

## using the initial data is the default, so you don't actually have to
## respecify it
tfidf(w)

## use wactor to create tfidf on new vector
tfidf(w, c("a", "e", "i", "o", "u"))

## apply directly to character vector
tfidf(letters)
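For readers unfamiliar with TF-IDF weighting, the base R sketch below applies the textbook formula (term frequency times log inverse document frequency) to a small count matrix. The helper `simple_tfidf()` is hypothetical, and the exact weighting and normalization used by `tfidf()` are not stated here, so its output should not be expected to match numerically.

```r
## Hypothetical helper (not part of wactor): textbook TF-IDF weights
## computed from a document term matrix of raw counts.
simple_tfidf <- function(dtm) {
  ## term frequency: share of each term within its document (row)
  tf <- dtm / pmax(rowSums(dtm), 1)
  ## inverse document frequency: terms found in fewer documents weigh more
  idf <- log(nrow(dtm) / pmax(colSums(dtm > 0), 1))
  ## scale every column by its idf weight
  sweep(tf, 2, idf, `*`)
}

counts <- rbind(c(2, 1, 0), c(0, 0, 1), c(1, 0, 0))
colnames(counts) <- c("a", "b", "c")
w_tfidf <- simple_tfidf(counts)
w_tfidf
```

Term "a" appears in two of the three documents, so it receives a smaller weight than "b" or "c", each of which appears in only one.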
Create an object of type 'wactor'
wactor(.x, ...)
.x | Input text vector |
... | Other args passed to Wactr$new(...) |
An object of type wactor
## create
w <- wactor(c("a", "a", "a", "b", "b", "c"))

## summarize
summary(w)

## plot
plot(w)

## predict
predict(w)

## use on NEW data
dtm(w, letters[1:5])

## dtm() is the same as predict()
predict(w, letters[1:5])

## works if you specify 'newdata' too
predict(w, newdata = letters[1:5])
A factor-like class for word vectors
new()
Wactr$new(
  text = character(),
  tokenizer = NULL,
  max_words = 1000,
  doc_prop_max = 1,
  doc_prop_min = 0
)
max_words
Maximum number of words in vocabulary
doc_prop_max
Maximum proportion of docs for terms in dictionary.
doc_prop_min
Minimum proportion of docs for terms in dictionary.
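The three vocabulary arguments can be illustrated with a base R sketch. `build_vocab()` is a hypothetical helper, and the order of the steps (filter by document proportion, then keep the most common terms) and the tokenizer are assumptions, since the class documentation does not spell out how they interact.

```r
## Hypothetical helper (not part of wactor): build a vocabulary from
## document frequencies using the three thresholds described above.
build_vocab <- function(docs, max_words = 1000,
                        doc_prop_max = 1, doc_prop_min = 0) {
  ## naive tokenizer (an assumption); unique() counts each term once per doc
  tokens <- lapply(strsplit(tolower(docs), "[^a-z]+"), unique)
  tab <- table(unlist(tokens))
  doc_freq <- setNames(as.integer(tab), names(tab))
  doc_freq <- doc_freq[names(doc_freq) != ""]
  ## proportion of documents that contain each term
  doc_prop <- doc_freq / length(docs)
  keep <- doc_prop >= doc_prop_min & doc_prop <= doc_prop_max
  doc_freq <- doc_freq[keep]
  ## most widespread terms first, capped at max_words
  names(doc_freq)[head(order(-doc_freq), max_words)]
}

docs <- c("the cat sat", "the dog sat", "the bird flew")
v <- build_vocab(docs, max_words = 3, doc_prop_max = 0.9)
v
```

Here "the" occurs in every document, so a `doc_prop_max` of 0.9 removes it, which is the usual way to drop terms too common to be informative. Raising `doc_prop_min` would instead remove terms that occur in too few documents.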
clone()
The objects of this class are cloneable with this method.
Wactr$clone(deep = FALSE)
deep
Whether to make a deep clone.
Simple wrapper for creating an xgboost matrix
xgb_mat(x, ..., y = NULL, split = NULL)
x | Input data |
... | Other data to cbind |
y | Label vector |
split | Optional number between 0 and 1 indicating the desired split between train and test |
An xgb.DMatrix
xgb_mat(data.frame(x = rnorm(20), y = rnorm(20)))