KINTO Tech Blog
Development

The story of how I struggled with the character code when I got OGP in Kotlin

Introduction

Hello! I'm Hasegawa, an Android engineer at KINTO Technologies! I usually work on developing an app called my route. Please check out the other articles written by members of my route's Android Team!

To be explained in this article

What is OGP?

OGP stands for "Open Graph Protocol". It is a set of HTML meta tags that lets other services correctly show the title, description, and image of a web page when the page is shared. Web pages configured with OGP carry meta tags that represent this information. The following is an excerpt of such meta tags. Services that want to get OG information can read it from these tags.

<meta property="og:title" content="page title" />
<meta property="og:description" content="page description" />
<meta property="og:image" content="thumbnail image URL" />

How to get OGP in Kotlin

This time, I will use OkHttp for communication and Jsoup for HTML parsing.
First, use OkHttp to access the web page whose OG information you want to get. I will omit error handling since it varies depending on the requirements.

val client = OkHttpClient.Builder().build()
val request =
    Request.Builder().apply {
        url("URL of the page whose OG information you want")
    }.build()
client.newCall(request).enqueue(
    object : okhttp3.Callback {
        override fun onFailure(call: okhttp3.Call, e: java.io.IOException) {
            // Error handling is omitted here
        }
        override fun onResponse(call: okhttp3.Call, response: okhttp3.Response) {
            // Called on an OkHttp background thread
            parseOgTag(response.body)
        }
    },
)

Then parse the contents using Jsoup.

/** Parses the response body and returns the OG properties, keyed without the "og:" prefix. */
private fun parseOgTag(body: ResponseBody?): Map<String, String> {
    val html = body?.string() ?: ""
    val doc = Jsoup.parse(html)
    val ogTags = mutableMapOf<String, String>()
    // Select every meta tag whose property attribute starts with "og:"
    val metaTags = doc.select("meta[property^=og:]")
    for (tag in metaTags) {
        // "og:title" -> "title"
        val ogType = tag.attr("property").removePrefix("og:")
        val content = tag.attr("content")
        if (ogType.isNotEmpty() && content.isNotBlank()) {
            ogTags[ogType] = content
        }
    }
    return ogTags
}

Now ogTags has the necessary OG information.
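
As a rough illustration of how the resulting map might be consumed, the sketch below copies a few well-known keys into a small data class. `OgInfo` and `toOgInfo` are hypothetical names introduced only for this example. They are not part of OkHttp, Jsoup, or the code above.

```kotlin
// Hypothetical holder for the OG fields shown in this article.
data class OgInfo(
    val title: String?,
    val description: String?,
    val imageUrl: String?,
)

// Converts a map keyed without the "og:" prefix (as parseOgTag returns) into OgInfo.
// Keys that are missing from the map simply become null.
fun Map<String, String>.toOgInfo(): OgInfo =
    OgInfo(
        title = this["title"],
        description = this["description"],
        imageUrl = this["image"],
    )

fun main() {
    val ogTags = mapOf(
        "title" to "page title",
        "image" to "https://example.com/thumbnail.png",
    )
    val info = ogTags.toOgInfo()
    println(info.title)       // page title
    println(info.description) // null
}
```

Nullable fields keep the example honest about the fact that a page may define only some of the OG properties.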

Why the text in the retrieved OG information gets corrupted

With the code so far, the OG information of most web pages can be retrieved correctly. For some web pages, however, the text comes out corrupted. This section explains the cause.
In the code above, the response body is read with string(), as shown below.

val html = response.body?.string() ?: ""

This function chooses the character encoding in the following order of precedence:

  1. BOM (Byte Order Mark) information
  2. The charset in the response's Content-Type header
  3. UTF-8, if neither 1 nor 2 is present

More information can be found in the comments in the OkHttp repository.
In other words, what happens if a web page has no BOM and no charset in the response header, and is encoded in something other than UTF-8, such as Shift_JIS?
...
The text is corrupted, because string() falls back to decoding it as UTF-8. So what can we do? The next section explains how to handle this.
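
The UTF-8 fallback can be reproduced without any network access. The following is a minimal sketch using only the JDK. It assumes the runtime provides the Shift_JIS charset (Android and full JDK distributions normally do), and `decode` is a hypothetical helper name.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: decode raw bytes with the given charset.
fun decode(bytes: ByteArray, charset: Charset): String = String(bytes, charset)

fun main() {
    // Assumption: Shift_JIS is available on this runtime.
    val shiftJis = Charset.forName("Shift_JIS")
    val original = "こんにちは"

    // The bytes a Shift_JIS-encoded page would send over the wire
    val bytes = original.toByteArray(shiftJis)

    // Roughly what happens when there is no BOM and no header charset
    val decodedAsUtf8 = decode(bytes, Charsets.UTF_8)
    // What happens when the correct charset is known
    val decodedAsShiftJis = decode(bytes, shiftJis)

    println(decodedAsUtf8 == original)     // false: the text is corrupted
    println(decodedAsShiftJis == original) // true
}
```

The bytes are the same in both cases. Only the charset used for decoding differs, which is why the fix has to happen before or during decoding rather than afterwards.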

How to deal with corrupted text

The previous section identified the cause of the corrupted text. In fact, the character encoding may also be declared inside the HTML of the web page, as follows. If there is no BOM and no charset in the response header, this declaration can be used instead.

<meta charset="UTF-8">  <!-- HTML5 -->
<meta http-equiv="content-type" content="text/html; charset=Shift_JIS"> <!-- before HTML5 -->

At first glance this looks like a chicken-and-egg problem: to read the meta tag, the HTML must first be decoded with the correct character encoding. In practice it is not a problem. Encodings such as UTF-8 and Shift_JIS encode the ASCII characters that make up the meta tag identically, so you can decode with UTF-8 once just to read the declaration. (This approach may parse the HTML twice. If you inspected the byte array for the meta tag beforehand, you might be able to determine the character encoding before parsing, but this time I prioritized keeping the code easy to understand.)
So, you can write code like the following.

import java.nio.charset.Charset
import okhttp3.ResponseBody
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import timber.log.Timber

/**
 * Gets the Jsoup Document from the response body.
 * If the charset declared in the HTML is not UTF-8, parses again with that charset.
 */
private fun getDocument(body: ResponseBody?): Document {
    val bytes = body?.bytes() ?: byteArrayOf()
    // If a charset is specified in the response header, decode with that charset
    val headerCharset = body?.contentType()?.charset()
    val html = String(bytes, headerCharset ?: Charsets.UTF_8)
    val doc = Jsoup.parse(html)
    // If headerCharset is specified, the document should have been decoded correctly,
    // so return it as is.
    if (headerCharset != null) {
        return doc
    }
    // Get the charset from the meta tag in the HTML.
    // If there is none, the encoding is unknown, so return the doc parsed as UTF-8.
    val charsetName = extractCharsetFromMetaTag(doc) ?: return doc
    val metaCharset =
        try {
            Charset.forName(charsetName)
        } catch (e: IllegalArgumentException) {
            // Thrown for both illegal and unsupported charset names
            Timber.w(e)
            return doc
        }
    // Parsing is relatively heavy, so parse again only when the charset
    // specified in the meta tag is not UTF-8.
    return if (metaCharset != Charsets.UTF_8) {
        Jsoup.parse(String(bytes, metaCharset))
    } else {
        doc
    }
}

/**
 * Gets the charset name from the HTML meta tags.
 *
 * Before HTML5 -> meta[http-equiv=content-type]
 * HTML5 or later -> meta[charset]
 *
 * @param doc the document parsed with the provisional (UTF-8) decoding
 * @return the charset name, e.g. "UTF-8" or "Shift_JIS", or null if no charset is found
 */
private fun extractCharsetFromMetaTag(doc: Document): String? {
    val metaTags = doc.select("meta[http-equiv=content-type], meta[charset]")
    for (metaTag in metaTags) {
        if (metaTag.hasAttr("charset")) {
            return metaTag.attr("charset")
        }
        val content = metaTag.attr("content")
        if (content.contains("charset=")) {
            return content.substringAfter("charset=").split(";")[0].trim()
        }
    }
    return null
}

Then change parseOgTag so that it creates the Jsoup Document with the function we just wrote.

- val html = body?.string() ?: ""
- val doc = Jsoup.parse(html)
+ val doc = getDocument(body)
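
The article mentions, without implementing it, the alternative of checking the byte array for the meta tag before parsing. The following is a rough sketch of that idea using only the JDK. It is not the approach taken above, and `sniffMetaCharset`, `META_CHARSET`, and the 1024-byte default are assumptions made for this example. It relies on the declaration consisting of ASCII characters, and it does not look at a BOM.

```kotlin
import java.nio.charset.Charset

// Hypothetical pattern: matches both <meta charset="..."> and
// <meta http-equiv="content-type" content="text/html; charset=...">.
val META_CHARSET = Regex(
    """<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9_\-:.]+)""",
    RegexOption.IGNORE_CASE,
)

// Hypothetical helper: looks for a charset declaration in the first `limit` bytes
// without running a full HTML parse. Returns null if none is found or the name is unusable.
fun sniffMetaCharset(bytes: ByteArray, limit: Int = 1024): Charset? {
    // ISO-8859-1 maps every byte to exactly one char, so ASCII markup survives unchanged.
    val head = String(bytes, 0, minOf(bytes.size, limit), Charsets.ISO_8859_1)
    val name = META_CHARSET.find(head)?.groupValues?.get(1) ?: return null
    return try {
        Charset.forName(name)
    } catch (e: IllegalArgumentException) {
        // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
        null
    }
}

fun main() {
    val page = """<html><head><meta charset="UTF-8"></head></html>""".toByteArray()
    println(sniffMetaCharset(page)) // UTF-8
}
```

With a helper like this, the charset could be chosen as `headerCharset ?: sniffMetaCharset(bytes) ?: Charsets.UTF_8` and the HTML parsed only once. The trade-off is an extra regular expression to maintain, which is the readability cost the article chose to avoid.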

Conclusion

Thank you for reading this far. Most web pages are encoded in UTF-8, and even when a page uses a different character encoding, the charset is usually indicated by a BOM or in the response header. Therefore, I do not think this kind of problem will occur very often. However, when you do run into such a site, it can be difficult to work out the cause and how to fix it.
I hope this article helps you in that situation.
