Doc2vec: How to get document vectors

Python Problem Overview

How do I get document vectors for two text documents using Doc2Vec? I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial.

I am using gensim.

doc1=["This is a sentence","This is another sentence"]
documents1=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents1, size = 100, window = 300, min_count = 10, workers=4)

I get

> AttributeError: 'list' object has no attribute 'words'

whenever I run this.

Python Solutions

Solution 1 - Python

If you want to train a Doc2Vec model, your data set needs to contain lists of words (similar to the Word2Vec format) and tags (ids of documents). It can also contain some additional info (see <https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb> for more information).

# Import libraries

from gensim.models import doc2vec
from collections import namedtuple

# Load data

doc1 = ["This is a sentence", "This is another sentence"]

# Transform data (you can add more data preprocessing steps) 

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

# Train model (set min_count = 1 if you want the model to work with the provided example data set)
# Note: in gensim 4.x the size parameter is named vector_size

model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)

# Get the vectors (in gensim 4.x, use model.dv instead of model.docvecs)

model.docvecs[0]
model.docvecs[1]

UPDATE (how to train in epochs): This example became outdated, so I deleted it. For more information on training in epochs, see this answer or @gojomo's comment.

Solution 2 - Python

Gensim was updated. LabeledSentence no longer takes labels; it now takes tags. See the documentation for LabeledSentence: https://radimrehurek.com/gensim/models/doc2vec.html

However, @bee2502 was right with

docvec = model.docvecs[99]

It will return the 100th document's vector from the trained model. It works with both integer and string tags.

Solution 3 - Python

doc1=["This is a sentence","This is another sentence"]
documents=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents, size = 100, window = 300, min_count = 10, workers=4)

I got AttributeError: 'list' object has no attribute 'words' because the input documents passed to Doc2Vec() were not in the correct LabeledSentence format. I hope the example below helps you understand the format.

documents = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
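
To see why a plain list fails, here is a minimal sketch using only the standard library. It assumes, going by the error message, that training reads a words attribute (and a tags attribute) from every document. The Doc namedtuple and the describe helper are hypothetical stand-ins for illustration, not gensim's own types.

```python
from collections import namedtuple

# Hypothetical stand-in for a tagged document: any object with .words and .tags
Doc = namedtuple("Doc", "words tags")

plain = "This is a sentence".split()   # what the failing code passed in
tagged = Doc(words=plain, tags=[0])    # the shape the error message asks for


def describe(document):
    """Return True if the document exposes both .words and .tags."""
    return hasattr(document, "words") and hasattr(document, "tags")


print(describe(plain))   # a bare list has no .words attribute
print(describe(tagged))  # the namedtuple exposes both attributes
```

A list of words only carries the tokens, so there is nothing to identify which document a vector belongs to. Wrapping it with tags supplies that id.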

More details are here: http://rare-technologies.com/doc2vec-tutorial/ However, I solved the problem by reading the input data from a file using TaggedLineDocument().
File format: one document = one line = one TaggedDocument object. Words are expected to be already preprocessed and separated by whitespace, tags are constructed automatically from the document line number.

sentences=doc2vec.TaggedLineDocument(file_path)
model = doc2vec.Doc2Vec(sentences,size = 100, window = 300, min_count = 10, workers=4)
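
The file format described above can be sketched with the standard library alone. This only mimics the behaviour the answer describes (one line per document, whitespace-separated words, the zero-based line number as the tag). It is an illustration of the format, not gensim's implementation, and read_tagged_lines, TaggedLine and the corpus.txt file name are hypothetical.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for a tagged document
TaggedLine = namedtuple("TaggedLine", "words tags")


def read_tagged_lines(path):
    """Yield one document per line, tagged with its zero-based line number."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            yield TaggedLine(words=line.split(), tags=[line_number])


# Write a two-document corpus: already preprocessed, one document per line.
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as handle:
    handle.write("this is a sentence\nthis is another sentence\n")

documents = list(read_tagged_lines(corpus_path))
for document in documents:
    print(document.tags, document.words)
```

Because the tags are just line numbers, the order of lines in the file determines which integer id retrieves which document's vector later.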

To get a document vector, you can use docvecs. More details here: https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument

docvec = model.docvecs[99]

where 99 is the id of the document whose vector we want. If the tags are integers (the default when you load with TaggedLineDocument()), use the integer id directly, as I did. If the tags are strings, use the string tag, e.g. "SENT_99". This is similar to looking up a word in Word2Vec.

Solution 4 - Python

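from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# doc1 is the list of raw strings from the question; each document must be a list of words
documents = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(doc1)]
model = Doc2Vec(documents)  # add other parameters (e.g. min_count) as needed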

This should work fine. You need to tag your documents to train a Doc2Vec model.

Content Type	Original Author	Original Content on Stackoverflow
Question	bee2502	View Question on Stackoverflow
Solution 1 - Python	Lenka Vraná	View Answer on Stackoverflow
Solution 2 - Python	l.augustyniak	View Answer on Stackoverflow
Solution 3 - Python	bee2502	View Answer on Stackoverflow
Solution 4 - Python	MovingKyu	View Answer on Stackoverflow
