
Another learner here, one clarification that I think is useful even for beginners:

> A token is a unique integer identifier for a piece of text.

A token is a word fragment that's common enough to be useful on its own - e.g., "writing", "written", and "writer" all share "writ", so "writ" would be an individual token, and "writer" might be tokenized as "writ" and "er".
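A minimal sketch of that idea (not a real BPE tokenizer - real vocabularies are learned from data; this one is hand-picked to illustrate): greedily match the longest known fragment against the start of the word.

```python
# Toy vocabulary of fragments; real tokenizers learn these from a corpus.
VOCAB = ["writ", "ing", "ten", "er", "w", "r", "i", "t", "e", "n", "g"]

def tokenize(word):
    tokens = []
    while word:
        # Greedily take the longest vocab entry that prefixes the remaining text.
        match = max((v for v in VOCAB if word.startswith(v)), key=len)
        tokens.append(match)
        word = word[len(match):]
    return tokens

print(tokenize("writer"))   # ['writ', 'er']
print(tokenize("writing"))  # ['writ', 'ing']
print(tokenize("written"))  # ['writ', 'ten']
```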

An embedding is where the tokens get turned into unique numeric identifiers.



Tokens are also numbers in practice, but they're indexes into a lookup table of character sequences, so yes, there's very little between the two definitions. Embeddings are in turn the result of looking up that index in a table, and the result is a vector. So:

character sequence (string) -> token (small integer) -> embedding (vector of floats)
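The whole pipeline can be sketched in a few lines with made-up values (the vocabulary and the float values here are invented; in a real model the embedding table is learned during training):

```python
# A tiny vocabulary: the list index of each fragment IS its token id.
vocab = ["writ", "er", "ing"]
stoi = {s: i for i, s in enumerate(vocab)}      # string -> token (small integer)

# One row of floats per token id (made-up numbers; real ones are learned).
embedding_table = [
    [0.12, -0.48, 0.33],   # embedding for token 0 ("writ")
    [0.91,  0.05, -0.70],  # embedding for token 1 ("er")
    [-0.20, 0.44, 0.08],   # embedding for token 2 ("ing")
]

fragments = ["writ", "er"]                   # "writer" after tokenization
ids = [stoi[f] for f in fragments]           # token ids: [0, 1]
vectors = [embedding_table[i] for i in ids]  # embeddings: vectors of floats
print(ids)
print(vectors)
```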


In this case the tokens are actually the individual characters:

    vocab = sorted(list(set(lines)))
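Fleshing that out a little (assuming `lines` holds the raw training text as one string, so `set()` collapses it to its unique characters): sorting makes the token ids stable, and the two dicts give you encode/decode lookups.

```python
# Assumed stand-in for the training text; the real `lines` comes from a file.
lines = "written by a writer"

vocab = sorted(list(set(lines)))               # unique characters, sorted
stoi = {ch: i for i, ch in enumerate(vocab)}   # char -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> char

ids = [stoi[ch] for ch in "writ"]              # encode
decoded = "".join(itos[i] for i in ids)        # decode round-trips the string
print(ids)
print(decoded)  # "writ"
```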



