Content Services GraphQL Support for Downloading and Uploading Large Content

By David Hanson, posted Fri January 26, 2024 02:16 PM

  

With the 5.5.12 release of FileNet Content Manager, the GraphQL API includes enhancements to better support uploading and downloading large content. There is no single definition of "large content", but in general issues start to arise when sizes reach several gigabytes. Depending on factors such as how much memory is allocated to the GraphQL server and how long connections are configured to stay open, the server can suffer from high memory usage, out-of-memory conditions, or errors from reset connections. The 5.5.12 enhancements make these issues easier to work around.

Large Content Download

Background

Content is downloaded from GraphQL using a content download endpoint. A GraphQL query can be executed to retrieve the content elements of a content-holding object such as a document. For example:

 {
  document(
    repositoryIdentifier:"OS1"
    identifier:"{<doc_id_guid>}"
  )
  {
    id
    name
    contentElements {
      contentType
      elementSequenceNumber
      ... on ContentTransfer {
        contentSize
        retrievalName 
        downloadUrl
      }
    }
  }
}

The downloadUrl field is returned in the response as a relative URL. When appended to the base GraphQL server URL, it forms a URL to the content download endpoint. For example:

https://example.server.ibm.com/content-services-graphql/content?repositoryIdentifier=OS1&documentId=...guid...&elementSequenceNumber=0

Downloading large content from the GraphQL server poses no problem as long as the connection stays open for the duration of the download. In some environments, however, it can be a challenge to configure the necessary network components to keep connections open long enough to avoid reset-connection errors.

Large Content Chunking

In environments or applications where it is not convenient to have connections remain open while downloading large content, a new feature allows large content to be downloaded in chunks. A new optional position argument can be added to the download URL. For example:

https://example.server.ibm.com/content-services-graphql/content?repositoryIdentifier=OS1&documentId=...guid...&elementSequenceNumber=0&position=104857600

Using this argument, you can download the full content of a large document in chunks: from a single request, read up to some maximum number of bytes from the response, then execute additional requests as needed, each time skipping ahead to the position where the previous request left off.

The Content Engine (CE) Java API has a similar feature for skipping ahead in the stream obtained from a ContentTransfer object when downloading content. This GraphQL feature closes that gap with the CE API.
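For reference, here is a minimal sketch of that skip-ahead using the CE Java API. It assumes an already-connected ObjectStore; the method name, document GUID argument, and position are illustrative placeholders.

import java.io.IOException;
import java.io.InputStream;

import com.filenet.api.core.ContentTransfer;
import com.filenet.api.core.Document;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.util.Id;

private InputStream openContentAtPosition(ObjectStore os, String docIdGuid, long position) throws IOException {
    // Fetch the document and take its first content element, assumed
    // here to be a ContentTransfer.
    Document doc = Factory.Document.fetchInstance(os, new Id(docIdGuid), null);
    ContentTransfer ct = (ContentTransfer) doc.get_ContentElements().iterator().next();
    // Skip ahead in the content stream. Production code should check the
    // number of bytes actually skipped, as returned by skip().
    InputStream is = ct.accessContentStream();
    is.skip(position);
    return is;
}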

Here is some sample Java code that downloads content in chunks using the GraphQL API.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

. . .
    final String authorizationHeader = "Basic XXX";
    final String downloadUrl = "https://example.server.ibm.com/content-services-graphql/content?repositoryIdentifier=OS1&documentId=...guid...&elementSequenceNumber=0";
    final String outputFileName = "download_all.zip";
    // Maximum number of bytes to read from the response of a single request.
    final long maxRequestRead = 100L * 1024L * 1024L;
    CloseableHttpClient httpClient = getCloseableHttpClient();
    CloseableHttpResponse httpResp = null;
    InputStream contentIs = null;
    File fout = new File(outputFileName);
    OutputStream downloadOs = new FileOutputStream(fout);
    try {
        boolean eofReached = false;
        long position = 0L;
        while (!eofReached) {
            // After the first request, use the position argument to skip
            // ahead to where the previous request left off.
            String getUrl = downloadUrl;
            if (position > 0) {
                getUrl = downloadUrl + "&position=" + position;
            }
            HttpGet get = new HttpGet(getUrl);
            get.setHeader("Authorization", authorizationHeader);
            httpResp = httpClient.execute(get);
            checkErrors(httpResp);
            contentIs = httpResp.getEntity().getContent();
            long sizeRead = readLargeGraphqlStream(contentIs, maxRequestRead, downloadOs);
            // Release this request's stream and response before issuing the next one.
            contentIs.close();
            contentIs = null;
            httpResp.close();
            httpResp = null;
            // A short read means the end of the content was reached.
            if (sizeRead < maxRequestRead) {
                eofReached = true;
            }
            else {
                position += sizeRead;
            }
        }
    }
    finally {
        // Close contentIs, httpResp, httpClient and downloadOs ...
    }
. . .

private long readLargeGraphqlStream(InputStream is, long maxRequestRead, OutputStream os) throws IOException {
    final int readSize = 1024 * 1024;
    byte[] readBuff = new byte[readSize];
    int readRtn;
    long totalRead = 0;
    // Never request more than the remaining per-request maximum.
    int len = (int) Math.min((long) readSize, maxRequestRead);
    while ((readRtn = is.read(readBuff, 0, len)) != -1) {
        os.write(readBuff, 0, readRtn);
        totalRead += readRtn;
        if (totalRead >= maxRequestRead)
            break;
        len = (int) Math.min((long) readSize, maxRequestRead - totalRead);
    }
    return totalRead;
}

private CloseableHttpClient getCloseableHttpClient() {
    // Construct an appropriate CloseableHttpClient object.
    return HttpClients.createDefault();
}

private void checkErrors(CloseableHttpResponse httpResp) {
    // Check for Unauthenticated and other error responses ...
}

The sample code uses the Apache HttpClient library. It executes multiple requests to read the content in chunks. The first request uses the downloadUrl without a position argument, so it starts reading the content from the beginning; subsequent requests append a position argument. The readLargeGraphqlStream method reads the stream for a single request, stopping either when the end of the content is reached or when the maximum read size for a single request is reached.

Large Content Upload

Background

Uploading content using GraphQL involves a POST with the multi-part form content type. The main GraphQL request text is one part of that multi-part content. The content for the content elements being uploaded is carried in one or more additional parts -- one part for each content element.

In the GraphQL request text, values that correspond to the content being input for a content element are represented as variables. The variables map by name to the parts in the multi-part form. For example:

GraphQL Request text:

mutation ($contvar:String) {
  createDocument(
    repositoryIdentifier:"FNOS1" 
    documentProperties: {
      name: "Test Doc" 
      contentElements:{
        replace: [
          {
            type: CONTENT_TRANSFER 
            subContentTransfer: {
              content:$contvar
            }
          }
          . . .

With variable values in JSON:

{
    "contvar": null
}

For variables representing the parts of the content being uploaded, the values don't matter; the name of the variable itself maps to the part. Typically null is passed as the value of these variables.

A complete curl example:

curl -k https://example.server.ibm.com/content-services-graphql/graphql \
--header 'ECM-CS-XSRF-Token:a251fb4a-88df-4d9d-b38f-5ce80e603e22' \
--header 'Cookie:ECM-CS-XSRF-Token=a251fb4a-88df-4d9d-b38f-5ce80e603e22' \
-u myuser:mypwd \
-F graphql='{"query":"mutation ($contvar:String) {createDocument(repositoryIdentifier:\"P8ObjectStore\" documentProperties: {name: \"Test Doc\" contentElements:{replace: [{type: CONTENT_TRANSFER subContentTransfer: {content:$contvar } } ]} } ) { id name } }", "variables":{"contvar":null} }' \
-F contvar=@SimpleDocument.txt

Two parts are transmitted in the multi-part form. The first part is the GraphQL request text itself along with the values of the defined variables. The second part is the content being uploaded for a content element.
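The same two-part request can also be issued from Java. Here is a minimal sketch using Apache HttpClient's MultipartEntityBuilder (from the httpmime library); the server URL, credentials, and XSRF token are placeholders, and basic authentication is assumed.

import java.io.File;
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.mime.MultipartEntityBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

private void uploadDocument() throws IOException {
    final String xsrfToken = "a251fb4a-88df-4d9d-b38f-5ce80e603e22";
    final String graphqlPart =
        "{\"query\":\"mutation ($contvar:String) {createDocument(" +
        "repositoryIdentifier:\\\"P8ObjectStore\\\" documentProperties: " +
        "{name: \\\"Test Doc\\\" contentElements:{replace: [{type: CONTENT_TRANSFER " +
        "subContentTransfer: {content:$contvar } } ]} } ) { id name } }\", " +
        "\"variables\":{\"contvar\":null} }";

    HttpPost post = new HttpPost("https://example.server.ibm.com/content-services-graphql/graphql");
    post.setHeader("Authorization", "Basic XXX");
    post.setHeader("ECM-CS-XSRF-Token", xsrfToken);
    post.setHeader("Cookie", "ECM-CS-XSRF-Token=" + xsrfToken);
    post.setEntity(MultipartEntityBuilder.create()
        // Part 1: the GraphQL request text plus the variable values.
        .addTextBody("graphql", graphqlPart, ContentType.APPLICATION_JSON)
        // Part 2: the content element content; the part name matches the
        // $contvar variable in the request text.
        .addBinaryBody("contvar", new File("SimpleDocument.txt"))
        .build());

    try (CloseableHttpClient httpClient = HttpClients.createDefault();
         CloseableHttpResponse httpResp = httpClient.execute(post)) {
        // Check for errors and read the GraphQL response ...
    }
}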

Large Content Streaming

In releases prior to 5.5.12, the parts of the multi-part form were loaded entirely into memory before being streamed to the backend CE server. This was partially due to the GraphQL server implementation. There is also an issue with the current release of the Liberty Application Server that causes this content to be loaded into memory under certain conditions.

In 5.5.12 there is an option that allows the upload content to be streamed to the GraphQL server and then to the backend CE server. This option should be used when uploading large content. There is no specific definition of "large", but the previous behavior -- still the default without the new option -- loads all of the content into memory. Uploading multi-gigabyte files, or running multiple parallel upload requests with reasonably large files, can therefore lead to high memory usage or out-of-memory conditions.

There are some behavior differences when using this new option. The plan is to eventually make it the default while taking steps to mitigate those differences.

The new option involves passing a special header value in the request:

ECM-CS-Use-Multipart-Streaming:true

With this option enabled, all parts must be in order, starting with the part that holds the GraphQL request text, followed by any parts for uploaded content elements. The content element parts must appear in the order in which they are referenced in the GraphQL request.

With this streaming option, it is still necessary that connections stay open for the duration of the upload. All of the network components involved in communicating with the GraphQL server and the backend CE server must be configured so that connections are not reset for a period sufficient for the largest uploads that are expected.

Here is an example of a request that supports upload streaming:

The GraphQL request text:

mutation ($contvar:String) {
  createDocument(
    repositoryIdentifier:"P8ObjectStore" 
    documentProperties: {
      name: "Large Doc" 
      contentElements:{
        replace: [
          {
            type: CONTENT_TRANSFER 
            contentType: "application/octet-stream" 
            subContentTransfer: {
              content:$contvar 
              retrievalName: "Large_content.zip"
            }
          }
          . . .

With variable values (JSON):

{
    "contvar": null
}

This request is similar to the simple example shown earlier. The difference, other than the assumption that the content being uploaded is some large file, is that the contentType and retrievalName input fields are specified. I will describe shortly why those are necessary.

A full example using curl:

curl -k https://example.server.ibm.com/content-services-graphql/graphql \
--header 'ECM-CS-XSRF-Token:a251fb4a-88df-4d9d-b38f-5ce80e603e22' \
--header 'Cookie:ECM-CS-XSRF-Token=a251fb4a-88df-4d9d-b38f-5ce80e603e22' \
--header 'ECM-CS-Use-Multipart-Streaming:true' \
-u myuser:mypwd \
-F graphql='{"query":"mutation ($contvar:String) {createDocument(repositoryIdentifier:\"P8ObjectStore\" documentProperties: {name: \"Large Doc\" contentElements:{replace: [{type: CONTENT_TRANSFER contentType: \"application/octet-stream\" subContentTransfer: {content:$contvar retrievalName: \"Large_content.zip\"} } ]} } ) { id name } }", "variables":{"contvar":null} }' \
-F contvar=@Large_content.zip

Note the addition of the ECM-CS-Use-Multipart-Streaming header.

As previously mentioned, one behavior difference when using the streaming option is that all of the parts must be in order -- the GraphQL request text followed by any content element upload parts. If the parts are not in order, a "skipped stream" error is returned.

Also, the contentType and retrievalName input fields should be specified for a ContentTransfer content element. With the non-streaming behavior, if those fields were not specified, their values were taken from the header values of the file part. With the streaming option, the multi-part content is parsed in a deferred manner, and those header values are not available when the property values are passed to the CE API. If they are not specified, the CE server generates default values -- a ContentType value of application/content-stream and a RetrievalName value of, for example, file0.ext.
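In Java, building on the Apache HttpClient sketch shown earlier, enabling streaming amounts to adding the header and keeping the parts in order; the graphqlPart value and file name are again placeholders.

// Enable streaming of the multi-part content (5.5.12 and later).
post.setHeader("ECM-CS-Use-Multipart-Streaming", "true");
post.setEntity(MultipartEntityBuilder.create()
    // The GraphQL request part must come first when streaming...
    .addTextBody("graphql", graphqlPart, ContentType.APPLICATION_JSON)
    // ...followed by the content element parts, in the order they are
    // referenced in the request text. Specify contentType and
    // retrievalName in the request text itself, since the part headers
    // are not available when streaming.
    .addBinaryBody("contvar", new File("Large_content.zip"),
        ContentType.APPLICATION_OCTET_STREAM, "Large_content.zip")
    .build());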

A JavaScript sample that demonstrates how to upload content using GraphQL, with an option to use the ECM-CS-Use-Multipart-Streaming header, can be found at this GitHub location:

https://github.com/ibm-ecm/ibm-content-platform-engine-samples/tree/master/CS-GraphQL-javascript-samples

Liberty App Server Multi-part Pre-load Issue

There is an issue with the current version of the Liberty App Server that defeats this streaming capability when OIDC is configured. With multi-part form content, Liberty consumes the content looking for OIDC provider information. Liberty does not find that information, since it is not in the multi-part form, but it may read the content into memory anyway, which can lead to an out-of-memory condition. The Liberty team is currently working on this issue.

When the streaming option is specified and this issue is encountered, GraphQL returns a Bad Request error, meaning that the content was already pre-loaded by the app server and there is nothing left to stream. A future enhancement will detect this condition and fall back to the default behavior, but that fallback will still be subject to high memory usage or out-of-memory conditions.

This issue can currently be worked around using LTPA tokens. Some settings are required in the OIDC configuration for the GraphQL server:

disableLtpaCookie="false"
accessTokenInLtpaCookie="true"
inboundPropagation="supported"
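
In a Liberty configuration, these settings typically appear as attributes on the openidConnectClient element in the GraphQL server's server.xml. A minimal sketch, in which every other attribute value is a placeholder for your own OIDC provider configuration:

<openidConnectClient id="oidcClient"
    clientId="placeholder-client-id"
    clientSecret="placeholder-client-secret"
    discoveryEndpointUrl="https://oidc.example.com/.well-known/openid-configuration"
    disableLtpaCookie="false"
    accessTokenInLtpaCookie="true"
    inboundPropagation="supported" />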

Make an initial GraphQL request -- without multi-part form content -- to establish an LTPA cookie. Here is an example with basic authentication enabled:

curl -k -v https://example.server.ibm.com/content-services-graphql/graphql \
--header 'ECM-CS-XSRF-Token:a251fb4a-88df-4d9d-b38f-5ce80e603e22' \
--header 'Cookie:ECM-CS-XSRF-Token=a251fb4a-88df-4d9d-b38f-5ce80e603e22' \
-u myuser:mypwd \
--header 'Content-Type:application/json' \
-d '{"query":"{ domain { objectStores { objectStores { symbolicName } } } }"}'

The request can be any GraphQL request at all, as long as it doesn't involve multi-part content with the streaming option. The server responds with a set-cookie for an LTPA token:

 set-cookie: FileNetLtpaToken=<token>; Path=/; Secure; HttpOnly

In a subsequent request to GraphQL that streams multi-part content, include the LTPA token:

curl -k -v https://example.server.ibm.com/content-services-graphql/graphql \
--header 'ECM-CS-XSRF-Token:a251fb4a-88df-4d9d-b38f-5ce80e603e22' \
--header 'Cookie: ECM-CS-XSRF-Token=a251fb4a-88df-4d9d-b38f-5ce80e603e22; FileNetLtpaToken=<ltpa_token>' \
--header 'ECM-CS-Use-Multipart-Streaming:true' \
-F graphql='{"query":"mutation ($contvar:String) {createDocument(repositoryIdentifier:\"FNOS1\" documentProperties: {name: \"Large Doc\" contentElements:{replace: [{type: CONTENT_TRANSFER subContentTransfer: {content:$contvar} } ]} } ) { id name } }", "variables":{"contvar":null} }' \
-F contvar=@LargeDoc.zip

With this workaround, Liberty does not pre-consume the multi-part content, and the GraphQL server can stream it.

With these OIDC configuration settings, a bearer token can also be used for authentication to establish an LTPA token. Make an initial GraphQL request using a bearer token:

curl -k -v --location 'https://example.server.ibm.com/content-services-graphql/graphql' \
--header 'ECM-CS-XSRF-Token: a251fb4a-88df-4d9d-b38f-5ce80e603e22' \
--header 'Cookie: ECM-CS-XSRF-Token=a251fb4a-88df-4d9d-b38f-5ce80e603e22' \
--header 'Authorization: Bearer <access_token>' \
--header 'Content-Type: application/json' \
--data '{"query":"{ domain { objectStores { objectStores { symbolicName } } } }","variables":{}}'

The server responds with a set-cookie for an LTPA token, as with basic authentication. Include that LTPA cookie in subsequent requests to stream the content.

