Memory Limitations in XmlProvider #1501

Open
ibrahim324 opened this issue Dec 30, 2023 · 10 comments

Comments

@ibrahim324

I tried to parse a dump of some Wikipedia pages with XmlProvider, but no matter what I try, I get a
System.OutOfMemoryException. Is there any guidance or a recommended pattern for parsing large files with type providers?
The file is almost exactly 2 GB in size.

My code:

#r "nuget: FSharp.Data"
open FSharp.Data

open System
open System.IO

type Wiki = XmlProvider<"""data/wikidata_sample.xml""">


let xmlFromFile = 
    task{
        let path = "data/wikidata.xml" 
        let! text = File.ReadAllTextAsync(path)
        
        Wiki.Parse(text).Pages
        |> Array.map (fun f -> f.Revision.Text)
        |> Array.iter (fun f -> printfn $"{f}")
    }

let xmlFromStream = 
    let options = 
        new FileStreamOptions(BufferSize=32)
    use stream = new FileStream("data/wikidata.xml", options)
    stream 
    |> Wiki.Load
    |> fun f -> f.Pages
    |> Array.map (fun f -> f.Revision.Text.Value)
    |> Array.iter (fun f -> printfn $"{f}")

xmlFromStream

// xmlFromFile 
// |> Async.AwaitTask
// |> Async.RunSynchronously
@cartermp
Collaborator

Can you post the stack trace when this happens?

@ibrahim324
Author

This is the complete error message (copied from fsi):

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ToString()
   at FSharp.Data.Runtime.BaseTypes.XmlElement.Create(TextReader reader) in D:\a\FSharp.Data\FSharp.Data\src\FSharp.Data.Xml.Core\XmlRuntime.fs:line 59
   at <StartupCode$FSI_0003>.$FSI_0003.main@() in /Users/halilibrahimozcan/source/projects/fsharp_xml_parsing/script.fsx:line 25
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)
Terminated due to an error

Is there any other way to retrieve information about the error?

@cartermp
Collaborator

Thanks! In this case it seems the string is simply too big: the failure occurs in the internal StringBuilder used by the XML reader. Are you running in a 32-bit process?
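
For anyone wanting to answer the 32-bit question from inside the script itself, a minimal sketch follows. It only uses standard .NET properties; the helper name `describeProcess` is made up for illustration and is not part of FSharp.Data or fsi.

```fsharp
open System
open System.Runtime.InteropServices

// Hypothetical helper: summarizes the bitness of the process that is
// currently executing this script (fsi, Rider, or a console app).
let describeProcess () =
    sprintf "64-bit process: %b, pointer size: %d bytes, architecture: %O"
        Environment.Is64BitProcess
        IntPtr.Size
        RuntimeInformation.ProcessArchitecture

printfn "%s" (describeProcess ())
```

Running this in the same session that throws the exception shows whether the process is actually 32-bit, rather than relying on how the host is assumed to be configured.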

@ibrahim324
Author

@cartermp No, I have not configured fsi in any way. I'm running on macOS, if that makes a difference.

@Thorium
Member

Thorium commented Feb 23, 2024

Does it matter whether the source file encoding is UTF-8 or UTF-16?

@ibrahim324
Author

@Thorium Can you point to where I should set the encoding? I tried the following:
let text = File.ReadAllText(path, Encoding.UTF32)
which didn't work, unfortunately. UTF16 wasn't available either.

@Thorium
Member

Thorium commented Feb 26, 2024

I meant the XML file itself: if it is encoded as UTF-16, consider converting it to UTF-8 to use less memory. Notepad++, for example, tells you the encoding:
[image: Notepad++ screenshot]
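
If a file did turn out to be UTF-16, the conversion suggested above could be done without loading the whole file into memory. The following is a rough sketch using only standard `System.IO` types; the function name `transcodeToUtf8` and the temporary demo files are illustrative assumptions, not anything from this thread.

```fsharp
open System.IO
open System.Text

// Hypothetical helper: copies a text file to UTF-8 (without BOM) in
// fixed-size chunks, so memory use does not grow with the file size.
let transcodeToUtf8 (sourcePath: string) (sourceEncoding: Encoding) (targetPath: string) =
    use reader = new StreamReader(sourcePath, sourceEncoding)
    use writer = new StreamWriter(targetPath, false, UTF8Encoding(false))
    let buffer = Array.zeroCreate<char> 65536
    let mutable read = reader.Read(buffer, 0, buffer.Length)
    while read > 0 do
        writer.Write(buffer, 0, read)
        read <- reader.Read(buffer, 0, buffer.Length)

// Demo on a tiny temporary file instead of a real dump.
let demoSource = Path.GetTempFileName()
let demoTarget = Path.GetTempFileName()
File.WriteAllText(demoSource, "<a>hello</a>", Encoding.Unicode) // Encoding.Unicode is UTF-16 LE
transcodeToUtf8 demoSource Encoding.Unicode demoTarget
printfn "UTF-16 bytes: %d, UTF-8 bytes: %d" (FileInfo(demoSource).Length) (FileInfo(demoTarget).Length)
```

Two caveats: an `encoding="utf-16"` attribute in the XML declaration would also need to be updated by hand, and the on-disk encoding only affects file size, since .NET strings are held as UTF-16 in memory either way.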

@ibrahim324
Author

@Thorium Hi, I just opened it in Notepad++. The file was encoded in UTF-8 to begin with.

@dsyme
Contributor

dsyme commented Mar 11, 2024

Try using fsiAnyCpu - fsi runs 32-bit by default

@ibrahim324
Author

@dsyme That doesn't seem to be the issue. I ran the script within Rider, which I checked is AnyCPU by default. I also ran it as a console program, but it still throws an OutOfMemoryException.
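
Since the stack trace points at the whole document being materialized as one string, and a single .NET string is capped at roughly one billion characters regardless of process bitness, one possible workaround is to bypass `Wiki.Load` for the full file and stream it one `page` element at a time. The sketch below uses `System.Xml.XmlReader` together with `XNode.ReadFrom`, both standard .NET APIs. The helper names (`streamElements`, `revisionText`, `printRevisionTexts`) and the tiny inline sample are made up for illustration, and the element names `page` and `text` are assumed from the code in the original post.

```fsharp
open System.IO
open System.Xml
open System.Xml.Linq

// Hypothetical helper: lazily yields every element with the given local
// name, so only one such element is held in memory at a time.
let streamElements (localName: string) (reader: XmlReader) : seq<XElement> =
    seq {
        reader.MoveToContent() |> ignore
        while not reader.EOF do
            if reader.NodeType = XmlNodeType.Element && reader.LocalName = localName then
                // ReadFrom consumes the whole element and leaves the reader
                // on the node that follows it, so no extra Read() is needed.
                yield (XNode.ReadFrom(reader) :?> XElement)
            else
                reader.Read() |> ignore
    }

// Hypothetical helper: first <text> descendant of a page, if any.
let revisionText (page: XElement) : string option =
    page.Descendants()
    |> Seq.tryFind (fun e -> e.Name.LocalName = "text")
    |> Option.map (fun e -> e.Value)

let printRevisionTexts (input: TextReader) =
    use reader = XmlReader.Create(input)
    streamElements "page" reader
    |> Seq.choose revisionText
    |> Seq.iter (printfn "%s")

// Tiny stand-in for the real dump; for a file, pass a StreamReader instead.
let sample =
    "<mediawiki><page><revision><text>alpha</text></revision></page>" +
    "<page><revision><text>beta</text></revision></page></mediawiki>"

printRevisionTexts (new StringReader(sample))
```

The sequence is lazy, so it has to be consumed while the reader is still open, which is why the iteration happens inside `printRevisionTexts`. This trades the typed access of XmlProvider for plain `XElement` values; whether that trade is acceptable depends on how much of the provided type is actually needed.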
