How to use Bert for long text classification?

Nlp, Text Classification, Bert Language-Model

Nlp Problem Overview


We know that BERT has a maximum token limit of 512. So if an article is much longer than 512 tokens, for example 10,000 tokens of text, how can BERT be used?

Nlp Solutions


Solution 1 - Nlp

You have basically three options:

  1. You cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient.
  2. You can split your text into multiple subtexts, classify each of them, and combine the results (for example, choose the class that was predicted for most of the subtexts). This option is obviously more expensive.
  3. You can even feed the output token for each subtext (as in option 2) to another network (but you won't be able to fine-tune), as described in this discussion.

I would suggest trying option 1 first, and only considering the other options if that is not good enough. A minimal sketch of options 1 and 2 is given below.
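For concreteness, here is a minimal sketch of options 1 and 2 using the Hugging Face tokenizer and a generic BERT classifier. The checkpoint name, num_labels=2, and averaging class probabilities over chunks are illustrative assumptions, not part of the original answer.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
long_text = "a very long article ..."  # placeholder for a document longer than 512 tokens

# Option 1: keep only the first 512 tokens (the tokenizer truncates for you).
enc = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")

# Option 2: split into overlapping 512-token windows, classify each window,
# then aggregate the per-window results (here: average class probabilities).
chunks = tokenizer(long_text, truncation=True, max_length=512, stride=64,
                   return_overflowing_tokens=True, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids=chunks["input_ids"],
                   attention_mask=chunks["attention_mask"]).logits
predicted_class = logits.softmax(dim=-1).mean(dim=0).argmax().item()
```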

Solution 2 - Nlp

This paper compared a few different strategies: How to Fine-Tune BERT for Text Classification?. On the IMDb movie review dataset, they actually found that cutting out the middle of the text (rather than truncating the beginning or the end) worked best! It even outperformed more complex "hierarchical" approaches involving breaking the article into chunks and then recombining the results.

As another anecdote, I applied BERT to the Wikipedia Personal Attacks dataset here, and found that simple truncation worked well enough that I wasn't motivated to try other approaches :)
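For reference, here is a small sketch of that "keep the head and tail, drop the middle" selection at the token level. The 128/382 split follows the setup reported in that paper; the helper function itself is an illustrative assumption, not code from the paper.

```python
def head_tail(tokens, max_len=512, head_len=128):
    """Keep the first head_len tokens and the last (max_len - 2 - head_len) tokens,
    leaving two slots for [CLS] and [SEP]; the middle of the document is dropped."""
    budget = max_len - 2
    if len(tokens) <= budget:
        return tokens
    tail_len = budget - head_len  # 382 with the defaults used in the paper
    return tokens[:head_len] + tokens[-tail_len:]

# Possible usage with a Hugging Face tokenizer (also an assumption):
# tokens = tokenizer.tokenize(long_text)
# ids = tokenizer.convert_tokens_to_ids(head_tail(tokens))
# enc = tokenizer.prepare_for_model(ids, return_tensors="pt")
```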

Solution 3 - Nlp

In addition to chunking data and passing it to BERT, check the following new approaches.

There is new research on long-document analysis. Since you asked about BERT: a similar pre-trained transformer, Longformer, has recently been released by Allen AI (https://arxiv.org/abs/2004.05150). Check out that link for the paper.

The related work section also mentions some previous work on long sequences; Google them too. I'd suggest at least going through Transformer-XL (https://arxiv.org/abs/1901.02860). As far as I know, it was one of the first models for long sequences, so it is a good foundation before moving on to Longformer.

Solution 4 - Nlp

You can leverage the Hugging Face Transformers library, which includes the following Transformers that work with long texts (more than 512 tokens):

  • Reformer: combines the modeling capacity of a Transformer with an architecture that can be executed efficiently on long sequences.
  • Longformer: uses an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.

Other recently proposed efficient Transformer models include Sparse Transformers (Child et al., 2019), Linformer (Wang et al., 2020), Sinkhorn Transformers (Tay et al., 2020b), Performers (Choromanski et al., 2020b), Synthesizers (Tay et al., 2020a), Linear Transformers (Katharopoulos et al., 2020), and BigBird (Zaheer et al., 2020).

The paper, by authors from Google Research and DeepMind, compares these Transformers based on Long-Range Arena "aggregated metrics":

[Figure: performance, speed, and memory footprint of the Transformers]

They also suggest that Longformer performs better than Reformer on the classification task.
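As a quick illustration, here is a minimal sketch of running Longformer as a classifier over texts of up to 4096 tokens with the Transformers library; the checkpoint name and num_labels=2 are assumptions made for the example.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2)

long_text = "a document of up to roughly 4096 tokens ..."  # placeholder
enc = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits
predicted_class = logits.argmax(dim=-1).item()
```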

Solution 5 - Nlp

I have recently (April 2021) published a paper regarding this topic that you can find on arXiv (https://arxiv.org/abs/2104.07225).

Table 1 there reviews previous approaches to the problem in question, and the whole manuscript is about long text classification, proposing a new method called Text Guide. This new method claims to improve performance over the naive and semi-naive text selection methods used in the paper (https://arxiv.org/abs/1905.05583) mentioned in one of the previous answers to this question.

Long story short about your options:

  1. Low computational cost: use naive/semi-naive approaches to select a part of the original text instance. Examples include choosing the first n tokens, or compiling a new text instance out of the beginning and end of the original text instance.

  2. Medium to high computational cost: use recent transformer models (like Longformer) that have a 4096-token limit instead of 512. In some cases this allows covering the whole text instance, and the modified attention mechanism decreases the computational cost.

  3. High computational cost: divide the text instance into chunks that fit a model like BERT with the 'standard' limit of 512 tokens per instance, deploy the model on each part separately, and join the resulting vector representations.

Text Guide, the new method proposed in my paper, is a text selection method that allows for improved performance compared to naive or semi-naive truncation methods. As a text selection method, Text Guide doesn't interfere with the language model, so it can be used to improve the performance of models with the 'standard' token limit (512 for transformer models) or an 'extended' limit (4096, as for instance for the Longformer model). In summary: Text Guide is a low-computational-cost method that improves performance over naive and semi-naive truncation methods, and if text instances exceed the limit of models deliberately developed for long text classification, such as Longformer (4096 tokens), it can improve their performance as well.

Solution 6 - Nlp

There are two main methods:

  • Concatenating the outputs of a 'short' BERT (512 tokens max per chunk)
  • Constructing a real long BERT (CogLTX, Blockwise BERT, Longformer, Big Bird)

I summarized some typical papers on BERT for long text in this post: https://lethienhoablog.wordpress.com/2020/11/19/paper-dissected-and-recap-4-which-bert-for-long-text/

You can have an overview of all methods there.

Solution 7 - Nlp

There is an approach used in the paper Defending Against Neural Fake News (https://arxiv.org/abs/1905.12616).

Their generative model produced outputs of 1024 tokens, and they wanted to use BERT to distinguish human from machine generations. They extended the sequence length that BERT uses simply by initializing 512 more position embeddings and training them while fine-tuning BERT on their dataset.
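Below is a rough sketch of that idea with the Hugging Face transformers library. The attribute names follow current versions of the library, and the 1024-position target, checkpoint, and num_labels are my assumptions; this is not the paper's code.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Extend the learned position embeddings from 512 to 1024 positions: copy the
# 512 pretrained vectors and randomly initialize the new ones, which then get
# trained while fine-tuning on the downstream dataset.
old_emb = model.bert.embeddings.position_embeddings          # Embedding(512, hidden)
new_emb = torch.nn.Embedding(1024, old_emb.embedding_dim)
new_emb.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
new_emb.weight.data[: old_emb.num_embeddings] = old_emb.weight.data
model.bert.embeddings.position_embeddings = new_emb

# The cached buffers and the config must also cover the new length.
model.bert.embeddings.position_ids = torch.arange(1024).expand((1, -1))
if hasattr(model.bert.embeddings, "token_type_ids"):
    model.bert.embeddings.token_type_ids = torch.zeros((1, 1024), dtype=torch.long)
model.config.max_position_embeddings = 1024
```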

Solution 8 - Nlp

You can use the max_position_embeddings argument in the configuration when loading the BERT model into your kernel. With this argument you can choose 512, 1024, or 2048 as the maximum sequence length.

max_position_embeddings (int, optional, defaults to 512) – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

https://huggingface.co/transformers/model_doc/bert.html
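A minimal sketch of what that looks like with BertConfig is below. Note that the pretrained checkpoint ships only 512 trained position embeddings, so positions beyond that start from random initialization; the ignore_mismatched_sizes flag and num_labels=2 are assumptions added for the example.

```python
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig.from_pretrained("bert-base-uncased",
                                    max_position_embeddings=1024,
                                    num_labels=2)
# The position-embedding matrix no longer matches the checkpoint, so ask
# from_pretrained to skip the mismatched weights and re-initialize them.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", config=config, ignore_mismatched_sizes=True)
```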

Solution 9 - Nlp

A relatively straightforward way to go is altering the input. For example, you can truncate the input or separately classify multiple parts of the input and aggregate the results. However, you would probably lose some useful information this way.

The main obstacle to applying BERT to long texts is that attention needs O(n^2) operations for n input tokens. Some newer methods subtly change BERT's architecture to make it compatible with longer texts. For instance, Longformer limits the attention span to a fixed window, so every token only attends to a set of nearby tokens. This table (Longformer, 2020, Iz Beltagy et al.) summarizes a set of attention-based models for long-text classification:

[Table: comparison of prior long-document models, from the Longformer paper]

LTR methods process the input in chunks from left to right and are suitable for auto-regressive applications. Sparse methods mostly reduce the computational order to O(n) by avoiding a full quadratic attention matrix calculation.
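To make the local-plus-global attention pattern concrete, here is a short sketch with the Longformer classifier from transformers (same checkpoint as in an earlier answer); marking only the [CLS] token for global attention is an assumption that matches the common classification setup.

```python
import torch
from transformers import AutoTokenizer, LongformerForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2)

long_text = "a long document ..."  # placeholder
enc = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")

# Every token attends only to a fixed window of nearby tokens; the [CLS]
# token is given global attention so it can aggregate the whole document.
global_attention_mask = torch.zeros_like(enc["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    logits = model(**enc, global_attention_mask=global_attention_mask).logits
```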

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        Original Author       Original Content on Stackoverflow
Question            user1337896           View Question on Stackoverflow
Solution 1 - Nlp    chefhose              View Answer on Stackoverflow
Solution 2 - Nlp    chrismcc              View Answer on Stackoverflow
Solution 3 - Nlp    Gandharv Suri         View Answer on Stackoverflow
Solution 4 - Nlp    SvGA                  View Answer on Stackoverflow
Solution 5 - Nlp    krzysztoffiok         View Answer on Stackoverflow
Solution 6 - Nlp    François - Hoa LE     View Answer on Stackoverflow
Solution 7 - Nlp    Zain Sarwar           View Answer on Stackoverflow
Solution 8 - Nlp    Siva                  View Answer on Stackoverflow
Solution 9 - Nlp    teshnizi              View Answer on Stackoverflow