r - Converting a text corpus of strings to a character vector before using the stringi package

Question

I have a corpus made up of two text files that I imported:

temp = list.files(pattern = ".txt")  
mydata = lapply(temp, read.delim, sep ="\t", quote = "")  
mydata

The class of the output is list, but I converted it to character:

class(mydata)  
list  
mydata <- as.character(mydata)

The texts are now of class character:

class(mydata)    
[1] "character"

But they still seem to be strings, because the output first shows:

[[1]]ï..We.give.the.observer.as.much.time.as.he.wants.to.make.his.response..we.simply.increase.the.number.of.alternative.stimuli.among.which.he.must.

(The line above is just a sample from one of the texts.) It then prints the actual texts, with each sentence on its own line, for example:

ï..this.is.just.a.bunch.of.crab.to.analyse.
1  I need to understand how this R package works.
2  lexical diversity needs to be analysed for two texts for now.
3  In this document I am typing each sentence on a separate line.

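I need to convert these texts to a character vector so that, as the next step of the analysis, I can convert them to ASCII with the help of the stringi package in R, for example: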

stri_enc_toascii(mydata)

--This package only converts character vectors to ASCII encoding. So the question is:

--How do I convert a corpus of strings to a vector?

PS: I have looked through all the other questions on StackOverflow to avoid posting a duplicate. Thanks for your help!

Thanks, everyone, for the help! I simply used as.vector to convert the strings to a character vector:

as.vector(mydata)
is.vector(mydata)
TRUE

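But the main problem remains: I want a character vector as input for the stringi package and its stri_enc_toascii(mydata) function, so that mydata is converted to ASCII encoding (see here), but the encoding still shows as unknown. Is there a direct way to convert an "unknown" encoding to "ascii"?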
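
A note on the "unknown" label: base R's Encoding() only ever reports "latin1", "UTF-8", "bytes", or "unknown", and pure ASCII strings are never given a declared encoding, so they always show as "unknown". The label by itself therefore does not mean the conversion failed. A minimal sketch of how this could be checked, assuming the stringi package is installed (the sample vector x is hypothetical and not taken from the corpus above):

```r
library(stringi)

# Hypothetical sample: one pure-ASCII string, one containing a non-ASCII letter
x <- c("plain ASCII text", "caf\u00e9")

Encoding(x)          # base R never marks pure-ASCII strings, so they show "unknown"
stri_enc_isascii(x)  # logical vector: does each string consist of ASCII only?
stri_enc_mark(x)     # stringi's own marking, which has a separate "ASCII" label

# Replace every non-ASCII code point with the ASCII substitute character
ascii_x <- stri_enc_toascii(x)
stri_enc_isascii(ascii_x)
```

In other words, stri_enc_isascii() or stri_enc_mark() is probably a more informative check than Encoding() once the text has been converted.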

score 0 · Accepted Answer

The question isn't very clear, but it sounds like you want to flatten a vector of strings that has also been converted to ASCII:

library(stringi)

string1 <- "Here's a random phrase."          # English, ASCII
string2 <- ".هنا عبارة عشوائية هناائية"     # Arabic, not ASCII
string3 <- "여기에 임의의 문구가 있습니다."      # Korean, not ASCII

strings <- c(string1, string2, string3)       # as a vector of strings of length 3

ascii_strings <- stri_enc_toascii(strings)    # convert to ASCII

stri_flatten(ascii_strings)           # as a flat, single element string

# other options...
stri_c(ascii_strings, collapse = " ")          # as a flat, single-element string
Reduce(paste, ascii_strings)                   # base::Reduce() / purrr::reduce() with paste() will do the same
stringr::str_c(ascii_strings, collapse = " ")  # stringr::str_c() just wraps stringi::stri_c()
stringr::str_flatten(ascii_strings, " ")       # stringr::str_flatten() just wraps stringi::stri_flatten()
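
If the goal is instead a character vector with one element per line of text, one possible alternative to read.delim() followed by as.character() is readLines(), which returns a character vector directly and does not treat the first line of a file as a header. The following is only a sketch under assumptions: the helper name read_corpus is hypothetical, the "UTF-8-BOM" encoding is a guess based on the ï.. prefix in the question's output, and the two temporary files are stand-ins for the real corpus.

```r
library(stringi)

# Hypothetical helper: read every .txt file in `dir` into a single
# character vector, one element per line of text.
read_corpus <- function(dir = ".") {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  lines <- lapply(files, function(f) {
    # Assumption: files are UTF-8 and may start with a byte-order mark
    con <- file(f, encoding = "UTF-8-BOM")
    on.exit(close(con))
    readLines(con, warn = FALSE)
  })
  # as.character() keeps the result a character vector even with no files
  as.character(unlist(lines, use.names = FALSE))
}

# Throwaway files standing in for the real corpus
tmp <- tempfile("corpus")
dir.create(tmp)
writeLines(c("I need to understand how this R package works.",
             "In this document I am typing each sentence on a separate line."),
           file.path(tmp, "a.txt"))
writeLines("lexical diversity needs to be analysed for two texts for now.",
           file.path(tmp, "b.txt"))

mydata <- read_corpus(tmp)
is.character(mydata)                    # a character vector, usable by stringi
ascii_data <- stri_enc_toascii(mydata)  # one ASCII string per input line
```

Because the result is already a character vector, no as.character() or as.vector() step on a list of data frames is needed before calling stri_enc_toascii().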
